基于SCTransform的单细胞数据标准化

原文：Using sctransform in Seurat

原文发布日期：2023年10月31日

Biological heterogeneity in single-cell RNA-seq data is often confounded by technical factors including sequencing depth. The number of molecules detected in each cell can vary significantly between cells, even within the same celltype. Interpretation of scRNA-seq data requires effective pre-processing and normalization to remove this technical variability.

In our manuscript we introduce a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. This procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. We named this methodsctransform.

Inspired by important and rigorous work from Lause et al (Lause, Berens, and Kobak 2021), we released an updated manuscript (Choudhary and Satija 2022) and updated the sctransform software to a v2 version, which is now the default in Seurat v5.

1 载入数据

library(Seurat)
pbmc_data <- Read10X(data.dir = "data/seurat_official/filtered_gene_bc_matrices/hg19")
pbmc <- CreateSeuratObject(counts = pbmc_data)
pbmc

An object of class Seurat 
32738 features across 2700 samples within 1 assay 
Active assay: RNA (32738 features, 0 variable features)
 1 layer present: counts

2 质控

这里的质控步骤只简单的计算了线粒体基因的比例。

pbmc <- PercentageFeatureSet(pbmc, pattern = "^MT-", col.name = "percent.mt")
head(pbmc@meta.data, 5)

                    orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1 SeuratProject       2421          781  3.0152829
AAACATTGAGCTAC-1 SeuratProject       4903         1352  3.7935958
AAACATTGATCAGC-1 SeuratProject       3149         1131  0.8891712
AAACCGTGCTTCCG-1 SeuratProject       2639          960  1.7430845
AAACCGTGTATGCG-1 SeuratProject        981          522  1.2232416

3 运行`SCTransform`

SCTransform()替代了传统单细胞数据分析流程中的NormalizeData()、ScaleData()和FindVariableFeatures()函数的功能，因此不再需要运行这些函数。
During normalization, we can also remove confounding sources of variation, for example, mitochondrial mapping percentage。SCTransform也可以移除一些非期望变异来源，如线粒体基因的比例。这在传统的单细胞数据分析流程中由ScaleData来完成（见Seurat细胞分群官方教程）。
In Seurat v5, SCT v2 is applied by default. You can revert to v1 by setting vst.flavor = 'v1'
SCTransform的运算调用了glmGamPoi包以显著提升运算速度。所以事先需要通过BiocManager安装该包。

# BiocManager::install("glmGamPoi")
pbmc <- SCTransform(pbmc, 
                    vars.to.regress = "percent.mt", 
                    verbose = FALSE)

Transformed data will be available in the SCT assay, which is set as the default after running SCTransform：

pbmc[["SCT"]]$scale.data contains the residuals (normalized values), and is used directly as input to PCA. Please note that this matrix is non-sparse, and can therefore take up a lot of memory if stored for all genes. To save memory, we store these values only for variable genes, by setting the return.only.var.genes = TRUE by default in the SCTransform() function call.
To assist with visualization and interpretation, we also convert Pearson residuals back to ‘corrected’ UMI counts. You can interpret these as the UMI counts we would expect to observe if all cells were sequenced to the same depth. If you want to see exactly how we do this, please look at the correct function here.
The ‘corrected’ UMI counts are stored in pbmc[["SCT"]]$counts. We store log-normalized versions of these corrected counts in pbmc[["SCT"]]$data, which are very helpful for visualization.

4 降维

pbmc <- RunPCA(pbmc, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:30, verbose = FALSE)

降维可视化：

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot(pbmc, reduction = "umap")

DimHeatmap(pbmc, dims = 1:15, cells = 1000, balanced = TRUE)

ElbowPlot(pbmc)

5 聚类

pbmc <- FindNeighbors(pbmc, dims = 1:30, verbose = FALSE)
pbmc <- FindClusters(pbmc, verbose = FALSE)
DimPlot(pbmc, label = TRUE)

根据Seurat细胞分群官方教程，这个数据集”we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs”。因此在FindNeighbors函数中指定了dims = 1:10。但是这里的FindNeighbors函数指定了更多的主成分（dims = 1:30）。下面的内容对此作出了解释：

Why can we choose more PCs when using sctransform?

In the Seurat细胞分群官方教程，we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using SCTransform, we often benefit by pushing this parameter even higher. We believe this is because the SCTransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In SCTransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

In addition, SCTransform returns 3,000 variable features by default, instead of 2,000. The rationale is similar, the additional variable features are less likely to be driven by technical differences across cells, and instead may represent more subtle biological fluctuations. In general, we find that results produced with SCTransform are less dependent on these parameters (indeed, we achieve nearly identical results when using all genes in the transcriptome, though this does reduce computational efficiency). This can help users generate more robust results, and in addition, enables the application of standard analysis pipelines with identical parameter settings that can quickly be applied to new datasets.

6 Marker基因可视化

Users can individually annotate clusters based on canonical markers. However, the SCTransform normalization reveals sharper biological distinctions compared to the standard Seurat workflow, in a few ways:

Clear separation of at least 3 CD8 T cell populations (naive, memory, effector), based on CD8A, GZMK, CCL5, CCR7 expression
Clear separation of three CD4 T cell populations (naive, memory, IFN-activated) based on S100A4, CCR7, IL32, and ISG15
Additional developmental sub-structure in B cell cluster, based on TCL1A, FCER2
Additional separation of NK cells into CD56dim vs. bright clusters, based on XCL1 and FCGR3A

6.1 小提琴图：

VlnPlot(pbmc, 
        features = c("CD8A", "GZMK", "CCL5", "S100A4", "ANXA1", "CCR7", "ISG15", "CD3D"),
    pt.size = 0.2, 
    ncol = 4)

6.2 UMAP图：

FeaturePlot(pbmc, 
            features = c("CD8A", "GZMK", "CCL5", "S100A4", "ANXA1", "CCR7"), 
            pt.size = 0.2,
            ncol = 3)

FeaturePlot(pbmc, 
            features = c("CD3D", "ISG15", "TCL1A", "FCER2", "XCL1", "FCGR3A"), 
            pt.size = 0.2,
            ncol = 3)

Session Info

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
  [3] jsonlite_1.8.8              magrittr_2.0.3             
  [5] ggbeeswarm_0.7.2            spatstat.utils_3.0-4       
  [7] farver_2.1.1                rmarkdown_2.25             
  [9] zlibbioc_1.48.0             vctrs_0.6.5                
 [11] ROCR_1.0-11                 DelayedMatrixStats_1.24.0  
 [13] spatstat.explore_3.2-5      RCurl_1.98-1.14            
 [15] S4Arrays_1.2.0              htmltools_0.5.7            
 [17] SparseArray_1.2.3           sctransform_0.4.1          
 [19] parallelly_1.36.0           KernSmooth_2.23-22         
 [21] htmlwidgets_1.6.4           ica_1.0-3                  
 [23] plyr_1.8.9                  plotly_4.10.4              
 [25] zoo_1.8-12                  igraph_1.6.0               
 [27] mime_0.12                   lifecycle_1.0.4            
 [29] pkgconfig_2.0.3             Matrix_1.6-5               
 [31] R6_2.5.1                    fastmap_1.1.1              
 [33] GenomeInfoDbData_1.2.11     MatrixGenerics_1.14.0      
 [35] fitdistrplus_1.1-11         future_1.33.1              
 [37] shiny_1.8.0                 digest_0.6.34              
 [39] colorspace_2.1-0            patchwork_1.2.0            
 [41] S4Vectors_0.40.2            tensor_1.5                 
 [43] RSpectra_0.16-1             irlba_2.3.5.1              
 [45] GenomicRanges_1.54.1        labeling_0.4.3             
 [47] progressr_0.14.0            fansi_1.0.6                
 [49] spatstat.sparse_3.0-3       httr_1.4.7                 
 [51] polyclip_1.10-6             abind_1.4-5                
 [53] compiler_4.3.2              withr_3.0.0                
 [55] fastDummies_1.7.3           R.utils_2.12.3             
 [57] MASS_7.3-60.0.1             DelayedArray_0.28.0        
 [59] tools_4.3.2                 vipor_0.4.7                
 [61] lmtest_0.9-40               beeswarm_0.4.0             
 [63] httpuv_1.6.13               future.apply_1.11.1        
 [65] goftest_1.2-3               R.oo_1.25.0                
 [67] glmGamPoi_1.14.0            glue_1.7.0                 
 [69] nlme_3.1-164                promises_1.2.1             
 [71] grid_4.3.2                  Rtsne_0.17                 
 [73] cluster_2.1.6               reshape2_1.4.4             
 [75] generics_0.1.3              gtable_0.3.4               
 [77] spatstat.data_3.0-4         R.methodsS3_1.8.2          
 [79] tidyr_1.3.0                 data.table_1.14.10         
 [81] XVector_0.42.0              utf8_1.2.4                 
 [83] BiocGenerics_0.48.1         spatstat.geom_3.2-7        
 [85] RcppAnnoy_0.0.21            ggrepel_0.9.5              
 [87] RANN_2.6.1                  pillar_1.9.0               
 [89] stringr_1.5.1               spam_2.10-0                
 [91] RcppHNSW_0.5.0              later_1.3.2                
 [93] splines_4.3.2               dplyr_1.1.4                
 [95] lattice_0.22-5              survival_3.5-7             
 [97] deldir_2.0-2                tidyselect_1.2.0           
 [99] miniUI_0.1.1.1              pbapply_1.7-2              
[101] knitr_1.45                  gridExtra_2.3              
[103] IRanges_2.36.0              SummarizedExperiment_1.32.0
[105] scattermore_1.2             stats4_4.3.2               
[107] xfun_0.41                   Biobase_2.62.0             
[109] matrixStats_1.2.0           stringi_1.8.3              
[111] lazyeval_0.2.2              yaml_2.3.8                 
[113] evaluate_0.23               codetools_0.2-19           
[115] tibble_3.2.1                cli_3.6.2                  
[117] uwot_0.1.16                 xtable_1.8-4               
[119] reticulate_1.34.0           munsell_0.5.0              
[121] Rcpp_1.0.12                 GenomeInfoDb_1.38.5        
[123] globals_0.16.2              spatstat.random_3.2-2      
[125] png_0.1-8                   ggrastr_1.0.2              
[127] parallel_4.3.2              ellipsis_0.3.2             
[129] ggplot2_3.4.4               dotCall64_1.1-1            
[131] sparseMatrixStats_1.14.0    bitops_1.0-7               
[133] listenv_0.9.0               viridisLite_0.4.2          
[135] scales_1.3.0                ggridges_0.5.5             
[137] crayon_1.5.2                leiden_0.4.3.1             
[139] purrr_1.0.2                 rlang_1.1.3                
[141] cowplot_1.1.2

References

Choudhary, Saket, and Rahul Satija. 2022. “Comparison and Evaluation of Statistical Error Models for scRNA-Seq.” Genome Biology 23 (1). https://doi.org/10.1186/s13059-021-02584-9.

Lause, Jan, Philipp Berens, and Dmitry Kobak. 2021. “Analytic Pearson Residuals for Normalization of Single-Cell RNA-Seq UMI Data.” Genome Biology 22 (1). https://doi.org/10.1186/s13059-021-02451-7.