整合(integration)

原文:Introduction to scRNA-seq integration

原文发布日期:2023年10月31日

Tip

本篇主要介绍来自不同样本类型的单细胞数据的整合。对于如何整合不同测序技术的单细胞数据集,参考Seurat官方文档:Integrative analysis in Seurat v5

Integration of single-cell sequencing datasets, for example across experimental batches, donors, or conditions, is often an important step in scRNA-seq workflows. Integrative analysis can help to match shared cell types and states across datasets, which can boost statistical power, and most importantly, facilitate accurate comparative analysis across datasets. In previous versions of Seurat we introduced methods for integrative analysis, including our ‘anchor-based’ integration workflow. Many labs have also published powerful and pioneering methods, including Harmony and scVI, for integrative analysis.

数据整合的目标:

The following tutorial is designed to give you an overview of the kinds of comparative analyses on complex cell types that are possible using the Seurat integration procedure. Here, we address a few key goals:

1 数据读取和分层

```{r}
#| eval: false
library(Seurat)
library(SeuratData)
InstallData("ifnb")
ifnb <- LoadData("ifnb")
```

从本地下载好的数据读取:

library(Seurat)
ifnb <- readRDS("data/seurat_official/pbmc_ifnb.rds")
ifnb
An object of class Seurat 
14053 features across 13999 samples within 1 assay 
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data
head(ifnb@meta.data, 5)
                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
table(ifnb$stim)

CTRL STIM 
6548 7451 

The object contains data from human PBMC from two conditions, interferon-stimulated and control cells (stored in the stim column in the object metadata). We will aim to integrate the two conditions together, so that we can jointly identify cell subpopulations across datasets, and then explore how each group differs across conditions

In previous versions of Seurat, we would require the data to be represented as two different Seurat objects. In Seurat v5, we keep all the data in one object, but simply split it into multiple ‘layers’. To learn more about layers, check out our Seurat object interaction vignette.

Important

Seurat v5 assays store data in layers. These layers can store:

  • raw, un-normalized counts (layer='counts')

  • normalized data (layer='data')

  • z-scored/variance-stabilized data (layer='scale.data').

split the RNA measurements into two layers one for control cells, one for stimulated cells:

library(Seurat)
ifnb[["RNA"]] <- split(ifnb[["RNA"]], 
                       f = ifnb$stim) # 按照meta.data中的“stim”列进行分割
ifnb
An object of class Seurat 
14053 features across 13999 samples within 1 assay 
Active assay: RNA (14053 features, 0 variable features)
 4 layers present: counts.CTRL, counts.STIM, data.CTRL, data.STIM

现在可以发现ifnb被分为了4个layer,此前是2个layer(countsdata):

2 不进行整合的情况下的数据处理

进行标准的数据处理流程:

ifnb <- NormalizeData(ifnb)
ifnb <- FindVariableFeatures(ifnb)
ifnb <- ScaleData(ifnb)
ifnb <- RunPCA(ifnb)
ifnb <- FindNeighbors(ifnb, dims = 1:30, reduction = "pca")
ifnb <- FindClusters(ifnb, 
                     resolution = 2, 
                     cluster.name = "unintegrated_clusters")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13999
Number of edges: 555146

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8153
Number of communities: 26
Elapsed time: 1 seconds
ifnb <- RunUMAP(ifnb, 
                dims = 1:30, 
                reduction = "pca", 
                reduction.name = "umap.unintegrated")

分别按照样本分组(“stim”细胞聚类情况(“seurat_clusters”着色绘制UMAP图:

DimPlot(ifnb, 
        reduction = "umap.unintegrated", 
        group.by = c("stim", "seurat_clusters"))
Figure 1: 未整合时的细胞分群情况(左:按照刺激条件着色;右:按照细胞聚类情况着色)

可以发现:The resulting clusters are defined both by cell type and stimulation condition, which creates challenges for downstream analysis.

3 进行数据整合

We now aim to integrate data from the two conditions, so that cells from the same cell type/subpopulation will cluster together.

We often refer to this procedure as intergration/alignment. When aligning two genome sequences together, identification of shared/homologous regions can help to interpret differences between the sequences as well. Similarly for scRNA-seq integration, our goal is not to remove biological differences across conditions, but to learn shared cell types/states in an initial step-specifically because that will enable us to compare control stimulated and control profiles for these individual cell types.

The Seurat v5 integration procedure aims to return a single dimensional reduction that captures the shared sources of variance across multiple layers, so that cells in a similar biological state will cluster. The method returns a dimensional reduction (i.e. integrated.cca) which can be used for visualization and unsupervised clustering analysis. For evaluating performance, we can use cell type labels that are pre-loaded in the seurat_annotations metadata column.

# 整合,比较耗时间,进度条会一直显示0%直至运算完成
ifnb_integrated <- IntegrateLayers(ifnb, 
                                   method = CCAIntegration, 
                                   orig.reduction = "pca", 
                                   new.reduction = "integrated.cca", # 整合后新的降维数据的名称
                                   verbose = FALSE)

可以看到经过整合的Seurat对象的降维(“reduction”)中多出了整合后的降维(“integrated.cca”):

目前的RNA assay的counts和data数据仍然按照“CTRL”和“STIM”被分成了4个layer。因此,我们通过JoinLayers进一步合并这些layers:

# 目前RNA assay包含的layers
Layers(ifnb_integrated[["RNA"]])
[1] "counts.CTRL" "counts.STIM" "data.CTRL"   "data.STIM"   "scale.data" 
# 合并RNA assay的layers
ifnb_integrated[["RNA"]] <- JoinLayers(ifnb_integrated[["RNA"]])
# 再次检查RNA assay包含的layers
Layers(ifnb_integrated[["RNA"]])
[1] "data"       "counts"     "scale.data"

可以看到,目前的RNA assay只包含三个layers:data”、“counts”、“scale.data”。

Warning

Once integrative analysis is complete, you can rejoin the layers - which collapses the individual datasets together and recreates the original counts and data layers. You will need to do this before performing any differential expression analysis. However, you can always resplit the layers in case you would like to reperform integrative analysis.

3.1 整合后重新聚类、降维

# 重新聚类
ifnb_integrated <- FindNeighbors(ifnb_integrated, 
                                 reduction = "integrated.cca", #更改降维来源为"integrated.cca"
                                 dims = 1:30)
ifnb_integrated <- FindClusters(ifnb_integrated, resolution = 1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13999
Number of edges: 590406

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8448
Number of communities: 18
Elapsed time: 1 seconds
# 重新降维
ifnb_integrated <- RunUMAP(ifnb_integrated, 
                           dims = 1:30, 
                           reduction = "integrated.cca") #更改降维来源为"integrated.cca"

# Visualization:
DimPlot(ifnb_integrated, 
        reduction = "umap", 
        group.by = c("stim", "seurat_annotations"))
Figure 2: 整合后的细胞分群情况(左:按照刺激条件着色;右:按照细胞聚类情况着色)

可以看到和 Figure 1 相比,在整合后,细胞就只按照细胞类型进行聚类了。

也可以按照刺激条件(“stim”)绘制分面图,分别展示刺激组和对照组的细胞分群情况:

DimPlot(ifnb_integrated, reduction = "umap", split.by = "stim")

可以看到,和上面的结论一致,两种条件下的细胞分群基本一致。

4 执行SCTransform标准化流程之后的整合

As an alternative to log-normalization, Seurat also includes support for preprocessing of scRNA-seq using the SCTransform() workflow. The IntegrateLayers function also supports SCTransform-normalized data, by setting the normalization.method parameter, as shown below.

4.1 不进行整合的情况下的数据分析

rm(list = ls())

# 重新载入原始的Seurat对象ifnb
library(Seurat)
ifnb <- readRDS("data/seurat_official/pbmc_ifnb.rds")

# 同样先拆分数据集,然后进行无整合情况下的降维
ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
ifnb <- SCTransform(ifnb, verbose = FALSE)
ifnb <- RunPCA(ifnb)
ifnb <- RunUMAP(ifnb, dims = 1:30)
DimPlot(ifnb, 
        reduction = "umap", 
        group.by = c("stim", "seurat_annotations"))
Figure 3: 未整合时的细胞分群情况(左:按照刺激条件着色;右:按照细胞聚类情况着色)

可以看到,如果不进行整合,不同样本(STIM vs. STIM)的细胞类型差异很大。

4.2 进行整合

同样通过IntegrateLayers函数进行数据整合,只不过需要将默认的标准化方法由”LogNormalize”指定为”SCT”(normalization.method = "SCT"):

ifnb_integrated <- IntegrateLayers(ifnb, 
                                   method = CCAIntegration, 
                                   normalization.method = "SCT", 
                                   verbose = F)
# 整合后重新合并RNA的layers
Layers(ifnb_integrated[["RNA"]])
[1] "counts.CTRL" "counts.STIM" "data.CTRL"   "data.STIM"  
ifnb_integrated[["RNA"]] <- JoinLayers(ifnb_integrated[["RNA"]])
Layers(ifnb_integrated[["RNA"]])
[1] "data"   "counts"

可以看到经过整合的Seurat对象的降维(“reduction”)信息中多出了整合后的降维(“integrated.dr”)。

4.3 整合后聚类

ifnb_integrated <- FindNeighbors(ifnb_integrated, 
                                 reduction = "integrated.dr", #更改降维来源为"integrated.dr"
                                 dims = 1:30)
ifnb_integrated <- FindClusters(ifnb_integrated, 
                                resolution = 0.6)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13999
Number of edges: 527905

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9058
Number of communities: 19
Elapsed time: 1 seconds
ifnb_integrated <- RunUMAP(ifnb_integrated, 
                           dims = 1:30, 
                           reduction = "integrated.dr")
DimPlot(ifnb_integrated, 
        reduction = "umap", 
        group.by = c("stim", "seurat_annotations"))

可以看到和 Figure 3 相比,整合后在样本间的细胞类型基本均匀分布。

4.4 保存整合后的Seurat对象

saveRDS(ifnb_integrated, file = "output/seurat_official/ifnb_integrated.rds")

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
  [3] jsonlite_1.8.8              magrittr_2.0.3             
  [5] spatstat.utils_3.0-4        farver_2.1.1               
  [7] rmarkdown_2.25              zlibbioc_1.48.0            
  [9] vctrs_0.6.5                 ROCR_1.0-11                
 [11] DelayedMatrixStats_1.24.0   spatstat.explore_3.2-5     
 [13] RCurl_1.98-1.14             S4Arrays_1.2.0             
 [15] htmltools_0.5.7             SparseArray_1.2.3          
 [17] sctransform_0.4.1           parallelly_1.36.0          
 [19] KernSmooth_2.23-22          htmlwidgets_1.6.4          
 [21] ica_1.0-3                   plyr_1.8.9                 
 [23] plotly_4.10.4               zoo_1.8-12                 
 [25] igraph_1.6.0                mime_0.12                  
 [27] lifecycle_1.0.4             pkgconfig_2.0.3            
 [29] Matrix_1.6-5                R6_2.5.1                   
 [31] fastmap_1.1.1               GenomeInfoDbData_1.2.11    
 [33] MatrixGenerics_1.14.0       fitdistrplus_1.1-11        
 [35] future_1.33.1               shiny_1.8.0                
 [37] digest_0.6.34               colorspace_2.1-0           
 [39] patchwork_1.2.0             S4Vectors_0.40.2           
 [41] tensor_1.5                  RSpectra_0.16-1            
 [43] irlba_2.3.5.1               GenomicRanges_1.54.1       
 [45] labeling_0.4.3              progressr_0.14.0           
 [47] fansi_1.0.6                 spatstat.sparse_3.0-3      
 [49] httr_1.4.7                  polyclip_1.10-6            
 [51] abind_1.4-5                 compiler_4.3.2             
 [53] withr_3.0.0                 fastDummies_1.7.3          
 [55] MASS_7.3-60.0.1             DelayedArray_0.28.0        
 [57] tools_4.3.2                 lmtest_0.9-40              
 [59] httpuv_1.6.13               future.apply_1.11.1        
 [61] goftest_1.2-3               glmGamPoi_1.14.0           
 [63] glue_1.7.0                  nlme_3.1-164               
 [65] promises_1.2.1              grid_4.3.2                 
 [67] Rtsne_0.17                  cluster_2.1.6              
 [69] reshape2_1.4.4              generics_0.1.3             
 [71] gtable_0.3.4                spatstat.data_3.0-4        
 [73] tidyr_1.3.0                 data.table_1.14.10         
 [75] XVector_0.42.0              utf8_1.2.4                 
 [77] BiocGenerics_0.48.1         spatstat.geom_3.2-7        
 [79] RcppAnnoy_0.0.21            ggrepel_0.9.5              
 [81] RANN_2.6.1                  pillar_1.9.0               
 [83] stringr_1.5.1               spam_2.10-0                
 [85] RcppHNSW_0.5.0              later_1.3.2                
 [87] splines_4.3.2               dplyr_1.1.4                
 [89] lattice_0.22-5              survival_3.5-7             
 [91] deldir_2.0-2                tidyselect_1.2.0           
 [93] miniUI_0.1.1.1              pbapply_1.7-2              
 [95] knitr_1.45                  gridExtra_2.3              
 [97] IRanges_2.36.0              SummarizedExperiment_1.32.0
 [99] scattermore_1.2             stats4_4.3.2               
[101] xfun_0.41                   Biobase_2.62.0             
[103] matrixStats_1.2.0           stringi_1.8.3              
[105] lazyeval_0.2.2              yaml_2.3.8                 
[107] evaluate_0.23               codetools_0.2-19           
[109] tibble_3.2.1                cli_3.6.2                  
[111] uwot_0.1.16                 xtable_1.8-4               
[113] reticulate_1.34.0           munsell_0.5.0              
[115] Rcpp_1.0.12                 GenomeInfoDb_1.38.5        
[117] globals_0.16.2              spatstat.random_3.2-2      
[119] png_0.1-8                   parallel_4.3.2             
[121] ellipsis_0.3.2              ggplot2_3.4.4              
[123] dotCall64_1.1-1             sparseMatrixStats_1.14.0   
[125] bitops_1.0-7                listenv_0.9.0              
[127] viridisLite_0.4.2           scales_1.3.0               
[129] ggridges_0.5.5              crayon_1.5.2               
[131] leiden_0.4.3.1              purrr_1.0.2                
[133] rlang_1.1.3                 cowplot_1.1.2