消除细胞周期的影响

原文:Cell-Cycle Scoring and Regression

原文发布日期:2023年10月31日

We demonstrate how to mitigate the effects of cell cycle heterogeneity in scRNA-seq data by calculating cell cycle phase scores based on canonical markers, and regressing these out of the data during pre-processing. We demonstrate this on a dataset of murine hematopoietic progenitors (Nestorowa et al. 2016).You can download the files needed to run this vignette here.

1 数据读取和预处理

1.1 创建Seurat对象

# Read in the expression matrix 
exp.mat <- read.table(file = "data/seurat_official/cell_cycle_vignette_files/nestorawa_forcellcycle_expressionMatrix.txt",
                      header = TRUE, 
                      as.is = TRUE, #保留字符型变量
                      row.names = 1)
# Create our Seurat object and complete the initalization steps
library(Seurat)
marrow <- CreateSeuratObject(counts = Matrix::Matrix(as.matrix(exp.mat), 
                                                     sparse = T))
marrow
An object of class Seurat 
24193 features across 774 samples within 1 assay 
Active assay: RNA (24193 features, 0 variable features)
 1 layer present: counts
head(marrow@meta.data, 5)
         orig.ident nCount_RNA nFeature_RNA
Prog_013       Prog    2563089        10211
Prog_019       Prog    3030620         9991
Prog_031       Prog    1293487        10192
Prog_037       Prog    1357987         9599
Prog_008       Prog    4079891        10540
marrow <- NormalizeData(marrow)
marrow <- FindVariableFeatures(marrow, selection.method = "vst")
marrow <- ScaleData(marrow, features = rownames(marrow))

1.2 获取细胞周期marker基因列表

A list of cell cycle markers, from Tirosh et al, 2015, is loaded with Seurat. We can segregate this list into markers of G2/M phase and markers of S phase

s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

1.3 降维

If we run a PCA on our object, using the variable genes we found in FindVariableFeatures() above, we see that while most of the variance can be explained by lineage, PC8 and PC10 are split on cell-cycle genes including TOP2A and MKI67. We will attempt to regress this signal from the data, so that cell-cycle heterogeneity does not contribute to PCA or downstream analysis.

marrow <- RunPCA(marrow, 
                 features = VariableFeatures(marrow), 
                 ndims.print = 6:10, 
                 nfeatures.print = 10)
DimHeatmap(marrow, dims = c(8, 10))

2 消除细胞周期的影响

2.1 计算细胞周期评分

First, we assign each cell a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase.

细胞周期(cell cycle),是指能持续分裂的真核细胞从一次有丝分裂结束后,再到下一次分裂结束的循环过程(准确来说只要有DNA复制,不管是不是有丝分裂它都有细胞周期。生殖细胞无细胞周期。)。

细胞周期的划分

总的看来,细胞周期通常可划分为分裂间期(I期)和分裂期(M期),分裂间期是物质准备和积累阶段,分裂期则是细胞增殖的实施过程。整个周期表示为 I期→M期。

其中分裂间期(I期)又常常可以划分为DNA合成前期(G1)DNA合成期(S)DNA合成后期(G2)。在此期间的任务主要是完成染色质中的DNA复制和相关蛋白质的合成。将I期细分之后,整个细胞周期可以表示为:G1期→S期→G2期→M期。

细胞进入G1期可能出现三种情况,其中暂不继续增殖,如骨髓干细胞和处于不利状态下的癌细胞,但在某些刺激下,这些细胞又可以继续生长分裂,因此有人把这种非增殖状态的G1期细胞称为G0期细胞。以区别处于增殖状态的G1期细胞。 而分裂期通常分作分裂前期(Prophase)、前中期(Prometaphase)、中期(Metaphase)、后期(Anaphase)和末期(Telophase)5个阶段,在此期间进行细胞物质的平均分配并形成两个新的细胞。

  • G0: Quiescence or resting phase. The cell is not actively dividing, which is common for cells that are fully differentiated. Some types of cells enter G0 for long periods of time (many neuronal cells), while other cell types never enter G0 by continuously dividing (epithelial cells).
  • G1: Gap 1 phase represents the beginning of interphase. During G1 there is growth of the non-chromosomal components of the cells. From this phase, the cell may enter G0 or S phase.
  • S: Synthesis phase for the replication of the chromosomes (also part of interphase).
  • G2: Gap 2 phase represents the end of interphase, prior to entering the mitotic phase. During this phase th cell grows in preparation for mitosis and the spindle forms.
  • M: M phase is the nuclear division of the cell (consisting of prophase, metaphase, anaphase and telophase).
状态 阶段 缩写 描述
休息 G0 G0 细胞离开周期并停止分裂的阶段。
间期 G1 G1 G1检查点控制机制确保一切准备好进行DNA合成。
合成 S S DNA复制发生在这个阶段。
G2 G2 G2 在DNA合成和有丝分裂之间的差距期间,细胞将继续增长。 G2检查点控制机制确保一切准备好进入M(有丝分裂)阶段并分裂。
细胞分裂 有丝分裂 M 细胞生长停止,细胞能量集中在有序地分裂成两个子细胞。有丝分裂中期的检查点(Metaphase Checkpoint)确保细胞可以完成细胞分裂。

We assign scores in the CellCycleScoring() function, which stores S and G2/M scores in object meta data, along with the predicted classification of each cell in either G2M, S or G1 phaseCellCycleScoring() can also set the identity of the Seurat object to the cell-cycle phase by passing set.ident = TRUE (the original identities are stored as old.ident). Please note that Seurat does not use the discrete classifications (G2M/G1/S) in downstream cell cycle regression. Instead, it uses the quantitative scores for G2M and S phase. However, we provide our predicted classifications in case they are of interest.

We score single cells based on the scoring strategy described in (Tirosh et al. 2016). See ?AddModuleScore() in Seurat for more information, this function can be used to calculate supervised module scores for any gene list.

marrow <- CellCycleScoring(marrow, 
                           s.features = s.genes, 
                           g2m.features = g2m.genes, 
                           set.ident = TRUE)
table(Idents(marrow))

 G1 G2M   S 
287 168 319 
# view cell cycle scores and phase assignments
head(marrow@meta.data)
         orig.ident nCount_RNA nFeature_RNA     S.Score  G2M.Score Phase
Prog_013       Prog    2563089        10211 -0.14248691 -0.4680395    G1
Prog_019       Prog    3030620         9991 -0.16915786  0.5851766   G2M
Prog_031       Prog    1293487        10192 -0.34627038 -0.3971879    G1
Prog_037       Prog    1357987         9599 -0.44270212  0.6820229   G2M
Prog_008       Prog    4079891        10540  0.55854051  0.1284359     S
Prog_014       Prog    2569783        10788  0.07116218  0.3166073   G2M
         old.ident
Prog_013      Prog
Prog_019      Prog
Prog_031      Prog
Prog_037      Prog
Prog_008      Prog
Prog_014      Prog

2.2 可视化细胞周期的影响

可视化细胞周期marker的表达分布:

RidgePlot(marrow, 
          features = c("PCNA", "TOP2A", "MCM6", "MKI67"), 
          ncol = 2)

以细胞周期marker为计算依据运行PCA:

marrow <- RunPCA(marrow, features = c(s.genes, g2m.genes))
DimPlot(marrow)

The PCA running on cell cycle genes reveals, unsurprisingly, that cells separate entirely by phase.

这个PCA图也可以用ggplot2的语法进一步修改:

library(ggplot2)
DimPlot(marrow) + 
  theme(axis.title = element_text(size = 18), 
        legend.text = element_text(size = 18)) +
  guides(colour = guide_legend(override.aes = list(size = 10)))

2.3 回归(regress out)细胞周期评分

We now attempt to subtract (‘regress out’) this source of heterogeneity from the data. For users of Seurat v1.4, this was implemented in RegressOut. However, as the results of this procedure are stored in the scaled data slot (therefore overwriting the output of ScaleData()), we now merge this functionality into the ScaleData() function itself.

For each gene, Seurat models the relationship between gene expression and the S and G2M cell cycle scores. The scaled residuals of this model represent a ‘corrected’ expression matrix, that can be used downstream for dimensional reduction.

marrow <- ScaleData(marrow, 
                    vars.to.regress = c("S.Score", "G2M.Score"), 
                    features = rownames(marrow))

Now, a PCA on the variable genes no longer returns components associated with cell cycle:

marrow <- RunPCA(marrow, 
                 features = VariableFeatures(marrow), 
                 nfeatures.print = 10)

When running a PCA on only cell cycle genes, cells no longer separate by cell-cycle phase:

marrow <- RunPCA(marrow, 
                 features = c(s.genes, g2m.genes))
DimPlot(marrow)

As the best cell cycle markers are extremely well conserved across tissues and species, we have found this procedure to work robustly and reliably on diverse datasets.

3 消除细胞周期的影响同时保留增殖细胞与静止细胞的区分

The procedure above removes all signal associated with cell cycle. In some cases, we’ve found that this can negatively impact downstream analysis, particularly in differentiating processes (like murine hematopoiesis), where stem cells are quiescent and differentiated cells are proliferating (or vice versa). In this case, regressing out all cell cycle effects can blur the distinction between stem and progenitor cells as well.

As an alternative, we suggest regressing out the difference between the G2/M and S phase scores. This means that signals separating non-cycling cells and cycling cells will be maintained, but differences in cell cycle phase among proliferating cells (which are often uninteresting), will be regressed out of the data

marrow$CC.Difference <- marrow$S.Score - marrow$G2M.Score
marrow <- ScaleData(marrow, 
                    vars.to.regress = "CC.Difference", 
                    features = rownames(marrow))

Cell cycle effects strongly mitigated in PCA:

marrow <- RunPCA(marrow, 
                 features = VariableFeatures(marrow), 
                 nfeatures.print = 10)

When running a PCA on cell cycle genes, actively proliferating cells remain distinct from G1 cells however, within actively proliferating cells, G2M and S phase cells group together:

marrow <- RunPCA(marrow, features = c(s.genes, g2m.genes))
DimPlot(marrow)


R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.4      Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] deldir_2.0-2           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.3            magrittr_2.0.3         RcppAnnoy_0.0.21      
  [7] spatstat.geom_3.2-7    matrixStats_1.2.0      ggridges_0.5.5        
 [10] compiler_4.3.2         png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
 [16] fastmap_1.1.1          ellipsis_0.3.2         labeling_0.4.3        
 [19] utf8_1.2.4             promises_1.2.1         rmarkdown_2.25        
 [22] ggbeeswarm_0.7.2       purrr_1.0.2            xfun_0.41             
 [25] jsonlite_1.8.8         goftest_1.2-3          later_1.3.2           
 [28] spatstat.utils_3.0-4   irlba_2.3.5.1          parallel_4.3.2        
 [31] cluster_2.1.6          R6_2.5.1               ica_1.0-3             
 [34] stringi_1.8.3          RColorBrewer_1.1-3     spatstat.data_3.0-4   
 [37] reticulate_1.34.0      parallelly_1.36.0      lmtest_0.9-40         
 [40] scattermore_1.2        Rcpp_1.0.12            knitr_1.45            
 [43] tensor_1.5             future.apply_1.11.1    zoo_1.8-12            
 [46] sctransform_0.4.1      httpuv_1.6.13          Matrix_1.6-5          
 [49] splines_4.3.2          igraph_1.6.0           tidyselect_1.2.0      
 [52] abind_1.4-5            rstudioapi_0.15.0      yaml_2.3.8            
 [55] spatstat.random_3.2-2  codetools_0.2-19       miniUI_0.1.1.1        
 [58] spatstat.explore_3.2-5 listenv_0.9.0          lattice_0.22-5        
 [61] tibble_3.2.1           plyr_1.8.9             withr_3.0.0           
 [64] shiny_1.8.0            ROCR_1.0-11            ggrastr_1.0.2         
 [67] evaluate_0.23          Rtsne_0.17             future_1.33.1         
 [70] fastDummies_1.7.3      survival_3.5-7         polyclip_1.10-6       
 [73] fitdistrplus_1.1-11    pillar_1.9.0           KernSmooth_2.23-22    
 [76] plotly_4.10.4          generics_0.1.3         RcppHNSW_0.5.0        
 [79] munsell_0.5.0          scales_1.3.0           globals_0.16.2        
 [82] xtable_1.8-4           glue_1.7.0             lazyeval_0.2.2        
 [85] tools_4.3.2            data.table_1.14.10     RSpectra_0.16-1       
 [88] RANN_2.6.1             leiden_0.4.3.1         dotCall64_1.1-1       
 [91] cowplot_1.1.2          grid_4.3.2             tidyr_1.3.0           
 [94] colorspace_2.1-0       nlme_3.1-164           patchwork_1.2.0       
 [97] beeswarm_0.4.0         vipor_0.4.7            cli_3.6.2             
[100] spatstat.sparse_3.0-3  spam_2.10-0            fansi_1.0.6           
[103] viridisLite_0.4.2      dplyr_1.1.4            uwot_0.1.16           
[106] gtable_0.3.4           digest_0.6.34          progressr_0.14.0      
[109] ggrepel_0.9.5          farver_2.1.1           htmlwidgets_1.6.4     
[112] htmltools_0.5.7        lifecycle_1.0.4        httr_1.4.7            
[115] mime_0.12              MASS_7.3-60.0.1       

References

Nestorowa, Sonia, Fiona K. Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola K. Wilson, David G. Kent, and Berthold Göttgens. 2016. “A Single-Cell Resolution Map of Mouse Hematopoietic Stem and Progenitor Cell Differentiation.” Blood 128 (8): e20–31. https://doi.org/10.1182/blood-2016-05-716480.
Tirosh, Itay, Benjamin Izar, Sanjay M. Prakadan, Marc H. Wadsworth, Daniel Treacy, John J. Trombetta, Asaf Rotem, et al. 2016. “Dissecting the Multicellular Ecosystem of Metastatic Melanoma by Single-Cell RNA-Seq.” Science 352 (6282): 189–96. https://doi.org/10.1126/science.aad0501.