预测和去除单细胞转录组的环境游离RNA污染

参考：

使用DecontX预测和去除单细胞转录组的环境游离RNA污染

Decontamination of ambient RNA in single-cell genomic data with DecontX

SoupX 帮助文档

Note

如果你的UMAP可视化时候总是出现毛毛躁躁的边缘和大量散在细胞，还有很多细胞亚群之间有连续的细胞（maybe可能时真是存在的过渡态细胞），就需要考虑这个使用DecontX预测和去除单细胞转录组的环境游离RNA污染。

In droplet based, single cell RNA-seq experiments, there is always a certain amount of background mRNAs present in the dilution that gets distributed into the droplets with cells and sequenced along with them. The net effect of this is to produce a background contamination that represents expression not from the cell contained within a droplet, but the solution that contained the cells.

在单细胞测序中，由于在处理细胞样品的时候需要用到机械解离或者酶解的步骤，所以免不了会造成细胞的破裂同时产生环境游离RNA。细胞悬液中的环境 RNA 可能会与细胞的原生 mRNA 一起被异常计数，从而导致不同细胞群之间转录本的交叉污染。环境游离RNA污染对细胞测序质量的影响较大，因此，有效地计算和预测游离RNA污染，去除污染严重的低质量细胞对单细胞测序分析具有重要意义。

目前，R包SoupX就可以实现这样的目的，详见：SoupX——去除RNA污染。这篇笔记我们介绍另一个R包decontX (Yang et al. 2020)。这个包的主要优点是使用方便，步骤简便，且结果容易使用。DecontX 是一种新的贝叶斯方法，用于估计和去除单细胞数据中的环境污染，可以整合到 scRNA-seq 工作流中，以改进下游分析。DecontX既可以用于去除单细胞转录组数据中的环境游离RNA污染（decontX），也可以用于单细胞蛋白表达谱中的污染（deconPro）。

安装SoupX：

BiocManager::install("decontX")

1 加载包

library(Seurat)
library(decontX)
library(ggplot2)
library(cowplot)
library(beepr)

2 数据导入/质控

DecontX can take either a SingleCellExperiment object or a counts matrix as input. decontX will attempt to convert any input matrix to class dgCMatrix from package Matrix before starting the analysis. 在Seurat工作流中，评估并去除环境游离RNA的污染一般在质控之后进行。这里我们用scRNA-seq_online-质控中已经完成质控的Seurat对象作为演示。最后和scRNA-seq_online-细胞聚类中没有进行环境游离RNA去除的细胞分群结果进行比较。

filtered_seurat <- readRDS("output/scRNA-seq_online/seurat_filtered.rds")
filtered_seurat

An object of class Seurat 
14065 features across 29629 samples within 1 assay 
Active assay: RNA (14065 features, 0 variable features)
 1 layer present: counts

head(filtered_seurat, 5)

                      orig.ident nCount_RNA nFeature_RNA sample
ctrl_AAACATACAATGCC-1          1       2344          874   ctrl
ctrl_AAACATACATTTCC-1          1       3125          896   ctrl
ctrl_AAACATACCAGAAA-1          1       2578          725   ctrl
ctrl_AAACATACCAGCTA-1          1       3261          979   ctrl
ctrl_AAACATACCATGCA-1          1        746          362   ctrl
                      log10GenesPerUMI  mitoRatio                 cells
ctrl_AAACATACAATGCC-1        0.8728630 0.01962457 ctrl_AAACATACAATGCC-1
ctrl_AAACATACATTTCC-1        0.8447596 0.01792000 ctrl_AAACATACATTTCC-1
ctrl_AAACATACCAGAAA-1        0.8384933 0.01551590 ctrl_AAACATACCAGAAA-1
ctrl_AAACATACCAGCTA-1        0.8512622 0.01379945 ctrl_AAACATACCAGCTA-1
ctrl_AAACATACCATGCA-1        0.8906861 0.02144772 ctrl_AAACATACCATGCA-1

3 执行`decontX`

In this scenario, decontX will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in the colData(sce)$decontX_contamination and the decontaminated counts can be accessed with decontXcounts(sce).

decontx_result <- decontX(filtered_seurat[["RNA"]]$counts)

# 提取环境RNA污染的预测结果
decontx_result$contamination[1:5]

[1] 0.0005892543 0.0642801956 0.0057000246 0.1023643666 0.2893165532

length(decontx_result$contamination)

[1] 29629

summary(decontx_result$contamination)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0002503 0.0120741 0.0336210 0.0833928 0.0955089 0.9974462

可以看到每个细胞都有一个decontx_result$contamination结果，即预测的环境游离RNA的比例。因此，接下来我们可以将其直接写入Seurat对象的meta.data中，并根据这一信息进行细胞过滤。

# 将环境RNA污染的预测结果写入Seurat对象的meta.data中
filtered_seurat$Contamination <- decontx_result$contamination
head(filtered_seurat, 5)

                      orig.ident nCount_RNA nFeature_RNA sample
ctrl_AAACATACAATGCC-1          1       2344          874   ctrl
ctrl_AAACATACATTTCC-1          1       3125          896   ctrl
ctrl_AAACATACCAGAAA-1          1       2578          725   ctrl
ctrl_AAACATACCAGCTA-1          1       3261          979   ctrl
ctrl_AAACATACCATGCA-1          1        746          362   ctrl
                      log10GenesPerUMI  mitoRatio                 cells
ctrl_AAACATACAATGCC-1        0.8728630 0.01962457 ctrl_AAACATACAATGCC-1
ctrl_AAACATACATTTCC-1        0.8447596 0.01792000 ctrl_AAACATACATTTCC-1
ctrl_AAACATACCAGAAA-1        0.8384933 0.01551590 ctrl_AAACATACCAGAAA-1
ctrl_AAACATACCAGCTA-1        0.8512622 0.01379945 ctrl_AAACATACCAGCTA-1
ctrl_AAACATACCATGCA-1        0.8906861 0.02144772 ctrl_AAACATACCATGCA-1
                      Contamination
ctrl_AAACATACAATGCC-1  0.0005892543
ctrl_AAACATACATTTCC-1  0.0642801956
ctrl_AAACATACCAGAAA-1  0.0057000246
ctrl_AAACATACCAGCTA-1  0.1023643666
ctrl_AAACATACCATGCA-1  0.2893165532

绘图展示污染率的分布情况

p1 <- filtered_seurat@meta.data |> 
    ggplot(aes(color = sample, x = Contamination, fill = sample)) + 
    geom_density(alpha = 0.2) + 
    scale_x_log10() + 
    ylab("Cell density") +
    geom_vline(xintercept = 0.2) +
    theme_bw()

p2 <- filtered_seurat@meta.data |> 
  ggplot(aes(x = sample, y = Contamination, fill = sample)) +
    geom_violin(width = 1.4) +
    geom_boxplot(width = 0.1, color = "gray") +
    theme_bw()

4 过滤高环境RNA污染的细胞

对于污染率的过滤标准，DecontX官方并没有给予明确的建议，这里参考的了SoupX的官方文档，选择Contamination < 0.2。

low_con_seurat <- subset(filtered_seurat, Contamination < 0.2)
paste0("通过decontX过滤掉了", ncol(filtered_seurat) - ncol(low_con_seurat), "个细胞")

[1] "通过decontX过滤掉了3574个细胞"

重新评估污染率的分布情况

p3 <- low_con_seurat@meta.data |> 
    ggplot(aes(color = sample, x = Contamination, fill = sample)) + 
    geom_density(alpha = 0.2) + 
    scale_x_log10() + 
    ylab("Cell density") +
    geom_vline(xintercept = 0.2) +
    theme_bw()

p4 <- low_con_seurat@meta.data |> 
    ggplot(aes(x = sample, y = Contamination, fill = sample)) +
    geom_violin(width = 1.4) +
    geom_boxplot(width = 0.1, color = "gray") +
    theme_bw()

plot_grid(p1, p2, p3, p4,
          ncol = 2, 
          labels = "AUTO")

5 SCTranform、整合

这里省略细胞周期、线粒体基因等非期望变异来源的评估，省略整合前细胞分群情况的评估。

# 分割layer，执行SCTranform
low_con_seurat[["RNA"]] <- split(low_con_seurat[["RNA"]], 
                                 f = low_con_seurat$sample) 
low_con_seurat

An object of class Seurat 
14065 features across 26055 samples within 1 assay 
Active assay: RNA (14065 features, 0 variable features)
 2 layers present: counts.ctrl, counts.stim

# SCTranform
low_con_seurat <- SCTransform(
  low_con_seurat, 
  vars.to.regress = c("mitoRatio"),
  verbose = FALSE
  )
low_con_seurat

An object of class Seurat 
28044 features across 26055 samples within 2 assays 
Active assay: SCT (13979 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA

# Run PCA
low_con_seurat <- RunPCA(low_con_seurat)
names(low_con_seurat@reductions)

[1] "pca"

# 整合
seurat_integrated <- IntegrateLayers(
  object = low_con_seurat,
  method = CCAIntegration,
  normalization.method = "SCT",
  orig.reduction = "pca",
  verbose = FALSE
); beep()

# 整合后重新合并RNA layer
seurat_integrated[["RNA"]] <- JoinLayers(seurat_integrated[["RNA"]])
# 查看整合后的降维信息
names(seurat_integrated@reductions)

[1] "pca"           "integrated.dr"

6 非线性降维、分群

# Set seed
set.seed(123456)

# Run UMAP
seurat_integrated <- RunUMAP(
  seurat_integrated,
  dims = 1:40,
  reduction = "integrated.dr", # 更改降维来源为整合后的"integrated.dr"
  reduction.name = "umap.integrated"
)

# 分群
seurat_integrated <- FindNeighbors(
  seurat_integrated,
  dims = 1:40,
  reduction = "integrated.dr"
) 

seurat_integrated <- FindClusters(
  seurat_integrated,
  resolution = 0.8 # 这里省略探索分辨率的步骤，为了和此前的章节一致，将分辨率设为0.8
)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26055
Number of edges: 985955

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8805
Number of communities: 21
Elapsed time: 4 seconds

head(seurat_integrated, 5)

                      orig.ident nCount_RNA nFeature_RNA sample
ctrl_AAACATACAATGCC-1          1       2344          874   ctrl
ctrl_AAACATACATTTCC-1          1       3125          896   ctrl
ctrl_AAACATACCAGAAA-1          1       2578          725   ctrl
ctrl_AAACATACCAGCTA-1          1       3261          979   ctrl
ctrl_AAACATACCTCGCT-1          1       3519          866   ctrl
                      log10GenesPerUMI  mitoRatio                 cells
ctrl_AAACATACAATGCC-1        0.8728630 0.01962457 ctrl_AAACATACAATGCC-1
ctrl_AAACATACATTTCC-1        0.8447596 0.01792000 ctrl_AAACATACATTTCC-1
ctrl_AAACATACCAGAAA-1        0.8384933 0.01551590 ctrl_AAACATACCAGAAA-1
ctrl_AAACATACCAGCTA-1        0.8512622 0.01379945 ctrl_AAACATACCAGCTA-1
ctrl_AAACATACCTCGCT-1        0.8283053 0.01392441 ctrl_AAACATACCTCGCT-1
                      Contamination nCount_SCT nFeature_SCT SCT_snn_res.0.8
ctrl_AAACATACAATGCC-1  0.0005892543       1541          856               2
ctrl_AAACATACATTTCC-1  0.0642801956       1489          697               1
ctrl_AAACATACCAGAAA-1  0.0057000246       1519          662               5
ctrl_AAACATACCAGCTA-1  0.1023643666       1497          724               5
ctrl_AAACATACCTCGCT-1  0.0036639072       1411          573               1
                      seurat_clusters
ctrl_AAACATACAATGCC-1               2
ctrl_AAACATACATTTCC-1               1
ctrl_AAACATACCAGAAA-1               5
ctrl_AAACATACCAGCTA-1               5
ctrl_AAACATACCTCGCT-1               1

# Assign identity of clusters
Idents(seurat_integrated) <- "SCT_snn_res.0.8"
table(Idents(seurat_integrated))


   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
3525 3062 3034 2321 2203 1888 1775 1707 1181 1076 1074  701  489  464  460  336 
  16   17   18   19   20 
 330  248  109   46   26

# Plot the UMAP
DimPlot(seurat_integrated,
        reduction = "umap.integrated",
        label = FALSE,
        label.size = 6)

可以看到和来自scRNA-seq_online-细胞聚类（下图）的分群情况相比，去除了高环境RNA污染的细胞后，细胞群边缘的毛刺有所减少，细胞界线更加清晰。

Tip

decontX可以整合进标准的Seurat工作流中的质控步骤中。但是，也可以在先不进行decontX的情况下，进行常规的整合、降维、分群。如果分群效果不理想，即有明显的边缘毛躁和大量散在细胞的情况，可以再提取这时候的counts，执行decontX。这时候可以通过FeaturePlot()去可视化高游离RNA污染的细胞的分布情况：

FeaturePlot(
  seurat_integrated,
  features = "Contamination",
  raster = FALSE # 细胞过多时候需要加这个参数
) +
  scale_color_viridis_c() +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  ) +
  xlab("scVI_UMAP_1") +
  ylab("scVI_UMAP_2")

通过`FeaturePlot()`可视化游离RNA分布情况的示例。可以看到，通过可视化Contamination，这些边缘的毛躁就是Contamination较高地细胞。

然后选择合适的阈值过滤细胞。之后用过滤后的counts和meta.data重新构建Seurat对象：

low_con_seurat <- subset(filtered_seurat, Contamination < 0.2)

new_seurat <- CreateSeuratObject(
  counts = low_con_seurat[["RNA"]]$counts,
  meta.data = low_con_seurat@meta.data)

然后重新执行整合、降维、分群等步骤。最后和此前的细胞分群情况进行比较。

```{r}
#| eval: false
#| fig-width: 10
#| fig-cap: 过滤环境RNA污染前（A）后（B）的UMAP图
p2 <- DimPlot(
  low_con_seurat, 
  reduction = "umap",
  label = T
  )
p2

plot_grid(
  p1, p2, 
  labels = "AUTO")
```

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] beepr_1.3          cowplot_1.1.2      ggplot2_3.4.4      decontX_1.0.0     
[5] Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.21            splines_4.3.2              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] tibble_3.2.1                polyclip_1.10-6            
  [7] fastDummies_1.7.3           lifecycle_1.0.4            
  [9] StanHeaders_2.32.5          globals_0.16.2             
 [11] lattice_0.22-5              MASS_7.3-60.0.1            
 [13] magrittr_2.0.3              plotly_4.10.4              
 [15] rmarkdown_2.25              yaml_2.3.8                 
 [17] httpuv_1.6.13               glmGamPoi_1.14.0           
 [19] sctransform_0.4.1           spam_2.10-0                
 [21] pkgbuild_1.4.3              spatstat.sparse_3.0-3      
 [23] reticulate_1.34.0           pbapply_1.7-2              
 [25] RColorBrewer_1.1-3          abind_1.4-5                
 [27] zlibbioc_1.48.0             audio_0.1-11               
 [29] Rtsne_0.17                  GenomicRanges_1.54.1       
 [31] purrr_1.0.2                 BiocGenerics_0.48.1        
 [33] RCurl_1.98-1.14             GenomeInfoDbData_1.2.11    
 [35] IRanges_2.36.0              S4Vectors_0.40.2           
 [37] ggrepel_0.9.5               inline_0.3.19              
 [39] irlba_2.3.5.1               listenv_0.9.0              
 [41] spatstat.utils_3.0-4        goftest_1.2-3              
 [43] RSpectra_0.16-1             spatstat.random_3.2-2      
 [45] fitdistrplus_1.1-11         parallelly_1.36.0          
 [47] DelayedMatrixStats_1.24.0   leiden_0.4.3.1             
 [49] codetools_0.2-19            DelayedArray_0.28.0        
 [51] scuttle_1.12.0              tidyselect_1.2.0           
 [53] farver_2.1.1                viridis_0.6.4              
 [55] ScaledMatrix_1.10.0         matrixStats_1.2.0          
 [57] stats4_4.3.2                spatstat.explore_3.2-5     
 [59] jsonlite_1.8.8              BiocNeighbors_1.20.2       
 [61] ellipsis_0.3.2              progressr_0.14.0           
 [63] ggridges_0.5.5              survival_3.5-7             
 [65] scater_1.30.1               dbscan_1.1-12              
 [67] tools_4.3.2                 ica_1.0-3                  
 [69] Rcpp_1.0.12                 glue_1.7.0                 
 [71] gridExtra_2.3               SparseArray_1.2.3          
 [73] xfun_0.41                   MatrixGenerics_1.14.0      
 [75] GenomeInfoDb_1.38.5         dplyr_1.1.4                
 [77] loo_2.6.0                   withr_3.0.0                
 [79] combinat_0.0-8              fastmap_1.1.1              
 [81] MCMCprecision_0.4.0         fansi_1.0.6                
 [83] rsvd_1.0.5                  digest_0.6.34              
 [85] R6_2.5.1                    mime_0.12                  
 [87] colorspace_2.1-0            scattermore_1.2            
 [89] tensor_1.5                  spatstat.data_3.0-4        
 [91] utf8_1.2.4                  tidyr_1.3.0                
 [93] generics_0.1.3              data.table_1.14.10         
 [95] httr_1.4.7                  htmlwidgets_1.6.4          
 [97] S4Arrays_1.2.0              uwot_0.1.16                
 [99] pkgconfig_2.0.3             gtable_0.3.4               
[101] lmtest_0.9-40               SingleCellExperiment_1.24.0
[103] XVector_0.42.0              htmltools_0.5.7            
[105] dotCall64_1.1-1             scales_1.3.0               
[107] Biobase_2.62.0              png_0.1-8                  
[109] knitr_1.45                  rstudioapi_0.15.0          
[111] reshape2_1.4.4              nlme_3.1-164               
[113] curl_5.2.0                  zoo_1.8-12                 
[115] stringr_1.5.1               KernSmooth_2.23-22         
[117] vipor_0.4.7                 parallel_4.3.2             
[119] miniUI_0.1.1.1              pillar_1.9.0               
[121] grid_4.3.2                  vctrs_0.6.5                
[123] RANN_2.6.1                  promises_1.2.1             
[125] BiocSingular_1.18.0         beachmat_2.18.0            
[127] xtable_1.8-4                cluster_2.1.6              
[129] beeswarm_0.4.0              evaluate_0.23              
[131] cli_3.6.2                   compiler_4.3.2             
[133] rlang_1.1.3                 crayon_1.5.2               
[135] rstantools_2.3.1.1          future.apply_1.11.1        
[137] labeling_0.4.3              ggbeeswarm_0.7.2           
[139] plyr_1.8.9                  stringi_1.8.3              
[141] rstan_2.32.5                viridisLite_0.4.2          
[143] deldir_2.0-2                QuickJSR_1.0.9             
[145] BiocParallel_1.36.0         munsell_0.5.0              
[147] lazyeval_0.2.2              spatstat.geom_3.2-7        
[149] V8_4.4.1                    Matrix_1.6-5               
[151] RcppHNSW_0.5.0              patchwork_1.2.0            
[153] sparseMatrixStats_1.14.0    future_1.33.1              
[155] shiny_1.8.0                 SummarizedExperiment_1.32.0
[157] ROCR_1.0-11                 igraph_1.6.0               
[159] RcppParallel_5.1.7

References

Yang, Shiyi, Sean E. Corbett, Yusuke Koga, Zhe Wang, W Evan Johnson, Masanao Yajima, and Joshua D. Campbell. 2020. “Decontamination of Ambient RNA in Single-Cell RNA-Seq with DecontX.” Genome Biology 21 (1). https://doi.org/10.1186/s13059-020-1950-6.