Seurat v5单细胞数据整合分析

原文：Integrative analysis in Seurat v5

原文发布日期：2023年10月31日

Integration of single-cell sequencing datasets, for example across experimental batches, donors, or conditions, is often an important step in scRNA-seq workflows. Integrative analysis can help to match shared cell types and states across datasets, which can boost statistical power, and most importantly, facilitate accurate comparative analysis across datasets.

In previous versions of Seurat we introduced methods for integrative analysis, including our ‘anchor-based’ integration workflow. Many labs have also published powerful and pioneering methods, including Harmony and scVI, for integrative analysis. We recognize that while the goal of matching shared cell types across datasets may be important for many problems, users may also be concerned about which method to use, or that integration could result in a loss of biological resolution.

In Seurat v5, we introduce more flexible and streamlined infrastructure to run different integration algorithms with a single line of code. This makes it easier to explore the results of different integration methods, and to compare these results to a workflow that excludes integration steps.

For this vignette, we use a dataset of human PBMC profiled with seven different technologies (Ding et al. 2020), profiled as part of a systematic comparative analysis (pbmcsca). The data is available as part of our SeuratData package.

1 加载数据

在线读取（可能需要全局代理）

```{r}
#| eval: false
library(SeuratData)
InstallData("pbmcsca")
library(Seurat)
obj <- LoadData("pbmcsca")
```

从本地下载好的数据读取：

library(Seurat)
obj <- readRDS("data/seurat_official/pbmcsca.rds")
obj

An object of class Seurat 
33694 features across 31021 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)
 2 layers present: counts, data

colnames(obj)[1:8]

[1] "pbmc1_SM2_Cell_108" "pbmc1_SM2_Cell_115" "pbmc1_SM2_Cell_133"
[4] "pbmc1_SM2_Cell_142" "pbmc1_SM2_Cell_143" "pbmc1_SM2_Cell_144"
[7] "pbmc1_SM2_Cell_146" "pbmc1_SM2_Cell_148"

rownames(obj)[1:5]

[1] "TSPAN6"   "TNMD"     "DPM1"     "SCYL3"    "C1orf112"

head(obj@meta.data, 5)

                   orig.ident nCount_RNA nFeature_RNA nGene   nUMI
pbmc1_SM2_Cell_108      pbmc1     437125         2200  2200 437125
pbmc1_SM2_Cell_115      pbmc1     335596         2438  2438 335596
pbmc1_SM2_Cell_133      pbmc1     302204         1874  1874 302204
pbmc1_SM2_Cell_142      pbmc1     377420         2480  2480 377420
pbmc1_SM2_Cell_143      pbmc1     385514         2196  2196 385514
                         percent.mito Cluster         CellType Experiment
pbmc1_SM2_Cell_108 0.0297434465355702       0 Cytotoxic T cell      pbmc1
pbmc1_SM2_Cell_115 0.0311521658159055       0 Cytotoxic T cell      pbmc1
pbmc1_SM2_Cell_133 0.0431128105727693       0 Cytotoxic T cell      pbmc1
pbmc1_SM2_Cell_142 0.0260323569927476       0 Cytotoxic T cell      pbmc1
pbmc1_SM2_Cell_143 0.0404759383962183       0 Cytotoxic T cell      pbmc1
                       Method
pbmc1_SM2_Cell_108 Smart-seq2
pbmc1_SM2_Cell_115 Smart-seq2
pbmc1_SM2_Cell_133 Smart-seq2
pbmc1_SM2_Cell_142 Smart-seq2
pbmc1_SM2_Cell_143 Smart-seq2

table(obj$Method)


  10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B   10x Chromium (v3) 
               3362                3222                3222                3222 
           CEL-Seq2            Drop-seq             inDrops            Seq-Well 
                526                6584                6584                3773 
         Smart-seq2 
                526

The object contains data from nine different batches (stored in the Method column in the object metadata), representing seven different technologies. We will aim to integrate the different batches together.

2 数据质控

过滤低质量细胞：

obj <- subset(obj, nFeature_RNA > 1000)

这个地方为了验证后面整合的效果对细胞类型提前进行了注释。

table(obj$CellType)


                     B cell              CD14+ monocyte 
                       1525                        1557 
             CD16+ monocyte                 CD4+ T cell 
                        404                        3018 
           Cytotoxic T cell              Dendritic cell 
                       2791                         281 
              Megakaryocyte         Natural killer cell 
                          9                         763 
Plasmacytoid dendritic cell                  Unassigned 
                         78                           8

3 分割数据

In previous versions of Seurat, if we want to integrate this data, we would require the data to be represented as nine different Seurat objects. When using Seurat v5 assays, we can instead keep all the data in one object, but simply split the layers:

obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)
obj

An object of class Seurat 
33694 features across 10434 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)
 18 layers present: counts.Smart-seq2, counts.CEL-Seq2, counts.10x_Chromium_v2_A, counts.10x_Chromium_v2_B, counts.10x_Chromium_v3, counts.Drop-seq, counts.Seq-Well, counts.inDrops, counts.10x_Chromium_v2, data.Smart-seq2, data.CEL-Seq2, data.10x_Chromium_v2_A, data.10x_Chromium_v2_B, data.10x_Chromium_v3, data.Drop-seq, data.Seq-Well, data.inDrops, data.10x_Chromium_v2

After splitting, there are now 18 layers (a counts and data layer for each batch).

4 未整合情况下的标准scRNA-seq分析流程

4.1 标准化、找高变基因、归一化、降维

We can now run a standard scRNA-seq analysis (i.e. without integration). Note that since the data is split into layers, normalization and variable feature identification is performed for each batch independently (a consensus set of variable features is automatically identified).

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)
obj <- RunUMAP(obj, 
               dims = 1:30, 
               reduction = "pca", 
               reduction.name = "umap.unintegrated") # name to store dimensional reduction in the Seurat object

4.2 聚类、可视化

We can now visualize the results of a standard analysis without integration. Note that cells are grouping both by cell type and by underlying method. While a UMAP analysis is just a visualization of this, clustering this dataset would return predominantly batch-specific clusters. Especially if previous cell-type annotations were not available, this would make downstream analysis extremely challenging.

obj <- FindNeighbors(obj, dims = 1:30, reduction = "pca")
obj <- FindClusters(obj, resolution = 2, cluster.name = "unintegrated_clusters")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10434
Number of edges: 412660

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8981
Number of communities: 48
Elapsed time: 0 seconds

DimPlot(obj, 
        reduction = "umap.unintegrated", 
        group.by = c("Method", "CellType"))

可以看到，不同的测序技术间的细胞类型差异较大。因此需要对数据进行整合。

5 数据整合

Seurat v5 enables streamlined integrative analysis using the IntegrateLayers function. The method currently supports five integration methods. Each of these methods performs integration in low-dimensional space, and returns a dimensional reduction (i.e. integrated.rpca) that aims to co-embed shared cell types across batches:

Anchor-based CCA integration (method=CCAIntegration)
Anchor-based RPCA integration (method=RPCAIntegration)
Harmony (method=HarmonyIntegration)
FastMNN (method= FastMNNIntegration)
scVI (method=scVIIntegration)

Note that our anchor-based RPCA integration represents a faster and more conservative (less correction) method for integration. For interested users, we discuss this method in more detail in our previous RPCA vignette.

Tip

当存在以下情况之一时，最好使用RPCA进行整合：

一个数据集中的很大一部分细胞在另一个数据集中没有匹配类型；
数据集来自同一平台（即10x的多个通道）；
有大量的数据集或细胞要整合。

You can find more detail on each method, and any installation prerequisites, in Seurat’s documentation (for example, ?HarmonyIntegration). For example, harmony整合需要先安装harmony包（install.packages("harmony")）；scVI integration requires reticulate which can be installed from CRAN (install.packages("reticulate")) as well as scvi-tools and its dependencies installed in a conda environment. Please see scVI installation instructions here.

Each of the following lines perform a new integration using a single line of code:

（这里我们选择其中的CCAIntegration和HarmonyIntegration两种方式分别对数据进行整合，整合后后的降维信息分别储存在”integrated.cca”和”harmony”中）

obj <- IntegrateLayers(
  object = obj,
  method = CCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.cca",
  verbose = FALSE
)
names(obj@reductions)

[1] "pca"               "umap.unintegrated" "integrated.cca"

```{r}
#| eval: false
obj <- IntegrateLayers(
  object = obj, 
  method = RPCAIntegration,
  orig.reduction = "pca", 
  new.reduction = "integrated.rpca",
  verbose = FALSE
)
names(obj@reductions)
```

obj <- IntegrateLayers(
  object = obj, 
  method = HarmonyIntegration,
  orig.reduction = "pca", 
  new.reduction = "harmony",
  verbose = FALSE
)
names(obj@reductions)

[1] "pca"               "umap.unintegrated" "integrated.cca"   
[4] "harmony"

```{r}
#| eval: false
obj <- IntegrateLayers(
  object = obj, 
  method = FastMNNIntegration,
  new.reduction = "integrated.mnn",
  verbose = FALSE
)
names(obj@reductions)
```

```{r}
#| eval: false
obj <- IntegrateLayers(
  object = obj, 
  method = scVIIntegration,
  new.reduction = "integrated.scvi",
  conda_env = "../miniconda3/envs/scvi-env", 
  verbose = FALSE
)
names(obj@reductions)
```

6 重新降维、聚类、可视化

CCAIntegration：

obj <- RunUMAP(obj, 
               reduction = "integrated.cca", 
               dims = 1:30, 
               reduction.name = "umap.cca")
obj <- FindNeighbors(obj, reduction = "integrated.cca", dims = 1:30)
obj <- FindClusters(obj, resolution = 2, cluster.name = "cca_clusters")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10434
Number of edges: 617481

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8039
Number of communities: 26
Elapsed time: 1 seconds

colnames(obj@meta.data)

 [1] "orig.ident"            "nCount_RNA"            "nFeature_RNA"         
 [4] "nGene"                 "nUMI"                  "percent.mito"         
 [7] "Cluster"               "CellType"              "Experiment"           
[10] "Method"                "unintegrated_clusters" "seurat_clusters"      
[13] "cca_clusters"

p1 <- DimPlot(obj,
              reduction = "umap.cca",
              group.by = c("Method", "CellType", "cca_clusters"),
              combine = FALSE, 
              label.size = 2)

HarmonyIntegration：

obj <- RunUMAP(obj, 
               reduction = "harmony", 
               dims = 1:30, 
               reduction.name = "umap.harmony")
obj <- FindNeighbors(obj, reduction = "harmony", dims = 1:30)
obj <- FindClusters(obj, resolution = 2, cluster.name = "harmony_clusters")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10434
Number of edges: 459472

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.7900
Number of communities: 24
Elapsed time: 1 seconds

colnames(obj@meta.data)

 [1] "orig.ident"            "nCount_RNA"            "nFeature_RNA"         
 [4] "nGene"                 "nUMI"                  "percent.mito"         
 [7] "Cluster"               "CellType"              "Experiment"           
[10] "Method"                "unintegrated_clusters" "seurat_clusters"      
[13] "cca_clusters"          "harmony_clusters"

p2 <- DimPlot(obj,
              reduction = "umap.harmony",
              group.by = c("Method", "CellType", "harmony_clusters"),
              combine = FALSE, 
              label.size = 2)

合并UMAP图：

library(patchwork)
wrap_plots(c(p1, p2), ncol = 2, byrow = F)

7 检验细胞类型marker基因的表达

We hope that by simplifying the process of performing integrative analysis, users can more carefully evaluate the biological information retained in the integrated dataset. For example, users can compare the expression of biological markers based on different clustering solutions, or visualize one method’s clustering solution on different UMAP visualizations.

library(ggplot2)
p1 <- VlnPlot(obj,
              features = "rna_CD8A", 
              group.by = "unintegrated_clusters",
              pt.size = 0) + 
  NoLegend() + 
  ggtitle("CD8A - Unintegrated Clusters")
p2 <- VlnPlot(obj, 
              "rna_CD8A",
              group.by = "cca_clusters",
              pt.size = 0) + 
  NoLegend() + 
  ggtitle("CD8A - CCA Clusters")
p3 <- VlnPlot(obj, 
              "rna_CD8A",
              group.by = "harmony_clusters",
              pt.size = 0) + 
  NoLegend() + 
  ggtitle("CD8A - harmony Clusters")
p1 | p2 | p3

8 重新合并layers

Once integrative analysis is complete, you can rejoin the layers - which collapses the individual datasets together and recreates the original counts and data layers. You will need to do this before performing any differential expression analysis. However, you can always resplit the layers in case you would like to reperform integrative analysis.

obj <- JoinLayers(obj)
obj

An object of class Seurat 
33694 features across 10434 samples within 1 assay 
Active assay: RNA (33694 features, 2000 variable features)
 3 layers present: data, counts, scale.data
 6 dimensional reductions calculated: pca, umap.unintegrated, integrated.cca, harmony, umap.cca, umap.harmony

9 对`SCTransform`处理后的数据的整合

Users can also perform integration using sctransform-normalized data (see 基于SCTransform的单细胞数据标准化 for more information), by first running SCTransform normalization, and then setting the normalization.method argument in IntegrateLayers（和执行SCTransform标准化流程之后的整合中一样）。

rm(list = ls())
#重新载入数据、质控、分割
obj <- readRDS("data/seurat_official/pbmcsca.rds")
obj <- subset(obj, nFeature_RNA > 1000)
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)
# 执行SCTransform
obj <- SCTransform(obj)
# 降维
obj <- RunPCA(obj, npcs = 30, verbose = F)
# 整合
#options(future.globals.maxSize = 3e+09)
obj <- IntegrateLayers(object = obj,
                       method = CCAIntegration,
                       normalization.method = "SCT",
                       orig.reduction = "pca",
                       new.reduction = "integrated.cca",
                       verbose = F)
# 重新降维、聚类、可视化
obj <- RunUMAP(obj, 
               dims = 1:30, 
               reduction = "integrated.cca", 
               reduction.name = "umap.cca")
obj <- FindNeighbors(obj, dims = 1:30, reduction = "integrated.cca")
obj <- FindClusters(obj, resolution = 2, cluster.name = "cca_clusters")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10434
Number of edges: 499367

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8342
Number of communities: 26
Elapsed time: 1 seconds

DimPlot(obj,
        reduction = "umap.cca",
        group.by = c("Method", "cca_clusters"),
        label.size = 1)

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.4      patchwork_1.2.0    Seurat_5.0.1       SeuratObject_5.0.1
[5] sp_2.1-2          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
  [3] jsonlite_1.8.8              magrittr_2.0.3             
  [5] spatstat.utils_3.0-4        ggbeeswarm_0.7.2           
  [7] farver_2.1.1                rmarkdown_2.25             
  [9] zlibbioc_1.48.0             vctrs_0.6.5                
 [11] ROCR_1.0-11                 DelayedMatrixStats_1.24.0  
 [13] spatstat.explore_3.2-5      RCurl_1.98-1.14            
 [15] S4Arrays_1.2.0              htmltools_0.5.7            
 [17] SparseArray_1.2.3           sctransform_0.4.1          
 [19] parallelly_1.36.0           KernSmooth_2.23-22         
 [21] htmlwidgets_1.6.4           ica_1.0-3                  
 [23] plyr_1.8.9                  plotly_4.10.4              
 [25] zoo_1.8-12                  igraph_1.6.0               
 [27] mime_0.12                   lifecycle_1.0.4            
 [29] pkgconfig_2.0.3             Matrix_1.6-5               
 [31] R6_2.5.1                    fastmap_1.1.1              
 [33] GenomeInfoDbData_1.2.11     MatrixGenerics_1.14.0      
 [35] fitdistrplus_1.1-11         future_1.33.1              
 [37] shiny_1.8.0                 digest_0.6.34              
 [39] colorspace_2.1-0            S4Vectors_0.40.2           
 [41] tensor_1.5                  RSpectra_0.16-1            
 [43] irlba_2.3.5.1               GenomicRanges_1.54.1       
 [45] labeling_0.4.3              progressr_0.14.0           
 [47] fansi_1.0.6                 spatstat.sparse_3.0-3      
 [49] httr_1.4.7                  polyclip_1.10-6            
 [51] abind_1.4-5                 compiler_4.3.2             
 [53] withr_3.0.0                 fastDummies_1.7.3          
 [55] MASS_7.3-60.0.1             DelayedArray_0.28.0        
 [57] tools_4.3.2                 vipor_0.4.7                
 [59] lmtest_0.9-40               beeswarm_0.4.0             
 [61] httpuv_1.6.13               future.apply_1.11.1        
 [63] goftest_1.2-3               glmGamPoi_1.14.0           
 [65] glue_1.7.0                  nlme_3.1-164               
 [67] promises_1.2.1              grid_4.3.2                 
 [69] Rtsne_0.17                  cluster_2.1.6              
 [71] reshape2_1.4.4              generics_0.1.3             
 [73] gtable_0.3.4                spatstat.data_3.0-4        
 [75] tidyr_1.3.0                 data.table_1.14.10         
 [77] XVector_0.42.0              utf8_1.2.4                 
 [79] BiocGenerics_0.48.1         spatstat.geom_3.2-7        
 [81] RcppAnnoy_0.0.21            ggrepel_0.9.5              
 [83] RANN_2.6.1                  pillar_1.9.0               
 [85] stringr_1.5.1               spam_2.10-0                
 [87] RcppHNSW_0.5.0              later_1.3.2                
 [89] splines_4.3.2               dplyr_1.1.4                
 [91] lattice_0.22-5              survival_3.5-7             
 [93] deldir_2.0-2                tidyselect_1.2.0           
 [95] miniUI_0.1.1.1              pbapply_1.7-2              
 [97] knitr_1.45                  gridExtra_2.3              
 [99] IRanges_2.36.0              SummarizedExperiment_1.32.0
[101] scattermore_1.2             stats4_4.3.2               
[103] RhpcBLASctl_0.23-42         xfun_0.41                  
[105] Biobase_2.62.0              matrixStats_1.2.0          
[107] stringi_1.8.3               lazyeval_0.2.2             
[109] yaml_2.3.8                  evaluate_0.23              
[111] codetools_0.2-19            tibble_3.2.1               
[113] cli_3.6.2                   uwot_0.1.16                
[115] xtable_1.8-4                reticulate_1.34.0          
[117] munsell_0.5.0               harmony_1.2.0              
[119] GenomeInfoDb_1.38.5         Rcpp_1.0.12                
[121] globals_0.16.2              spatstat.random_3.2-2      
[123] png_0.1-8                   ggrastr_1.0.2              
[125] parallel_4.3.2              ellipsis_0.3.2             
[127] dotCall64_1.1-1             sparseMatrixStats_1.14.0   
[129] bitops_1.0-7                listenv_0.9.0              
[131] viridisLite_0.4.2           scales_1.3.0               
[133] ggridges_0.5.5              crayon_1.5.2               
[135] leiden_0.4.3.1              purrr_1.0.2                
[137] rlang_1.1.3                 cowplot_1.1.2

References

Ding, Jiarui, Xian Adiconis, Sean K. Simmons, Monika S. Kowalczyk, Cynthia C. Hession, Nemanja D. Marjanovic, Travis K. Hughes, et al. 2020. “Systematic Comparison of Single-Cell and Single-Nucleus RNA-Sequencing Methods.” Nature Biotechnology 38 (6): 737–46. https://doi.org/10.1038/s41587-020-0465-8.