单细胞数据整合(Integration)

Learning Objectives:
  • Perform integration of cells across conditions to identify cells that are similar to each other
  • Describe complex integration tasks and alternative tools for integration


Goals:

Challenges:

Recommendations:


1 不进行整合时检验细胞分群情况

Generally, we always look at our clustering without integration before deciding whether we need to perform any alignment. Do not just always perform integration because you think there might be differences - explore the data. If we had performed the normalization on both conditions together in a Seurat object and visualized the similarity between cells, we would have seen condition-specific clustering.

1.1 数据导入

We use the split_seurat object from the previous lesson.

library(Seurat)
split_seurat <- readRDS("output/scRNA-seq_online/split_seurat.rds")
split_seurat
An object of class Seurat 
28130 features across 29629 samples within 2 assays 
Active assay: SCT (14065 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 1 dimensional reduction calculated: pca
# Check which assays are stored in objects
split_seurat@assays
$RNA
Assay (v5) data with 14065 features for 29629 cells
First 10 features:
 AL627309.1, AL669831.5, LINC00115, FAM41C, NOC2L, KLHL17, PLEKHN1,
HES4, ISG15, AGRN 
Layers:
 counts.ctrl, counts.stim 

$SCT
SCTAssay data with 14065 features for 29629 cells, and 2 SCTModel(s) 
Top 10 variable features:
 FTL, IGKC, CCL2, GNLY, IGLC2, CCL3, CCL4, CXCL10, CCL7, TIMP1 
# 查看目前默认的assay
DefaultAssay(split_seurat)
[1] "SCT"
# 查看默认assay的layers
Layers(split_seurat)
[1] "counts"     "data"       "scale.data"
# 查看每种样本包含多少细胞
table(split_seurat$sample)

 ctrl  stim 
14847 14782 
# 查看降维信息
names(split_seurat@reductions)
[1] "pca"

1.2 降维、分群

# Run PCA
split_seurat <- RunPCA(split_seurat)
# Run UMAP
split_seurat <- RunUMAP(split_seurat, 
                        dims = 1:40, 
                        reduction = "pca", 
                        reduction.name = "umap.unintegrated")

# 分群
split_seurat <- FindNeighbors(split_seurat, 
                              dims = 1:40, 
                              reduction = "pca")
split_seurat <- FindClusters(split_seurat, 
                             cluster.name = "unintegrated_clusters")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 29629
Number of edges: 1104454

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8974
Number of communities: 22
Elapsed time: 4 seconds
# Plot UMAP
p1 <- DimPlot(split_seurat, reduction = "umap.unintegrated", group.by = "sample")
p2 <- DimPlot(split_seurat, reduction = "umap.unintegrated", split.by = "sample")
library(cowplot)
plot_grid(p1, p2, ncol = 2, labels = "AUTO")
Figure 1: 整合前的细胞分群情况

可以看到,如果不进行整合,不同样本(ctrl vs. stim)间的细胞类型差异很大。The resulting clusters are defined both by cell type and stimulation condition, which creates challenges for downstream analysis. Condition-specific clustering of the cells indicates that we need to integrate the cells across conditions to ensure that cells of the same cell type cluster together.

Tip

关于FindNeighbors()FindClusters()的详细说明见下一节

Why is it important the cells of the same cell type cluster together?

We want to identify cell types which are present in all samples/conditions/modalities within our dataset, and therefore would like to observe a representation of cells from both samples/conditions/modalities in every cluster. This will enable more interpretable results downstream (i.e. DE analysis, ligand-receptor analysis, differential abundance analysis…).

In this lesson, we will cover the integration of our samples across conditions, which is adapted from Seurat-整合.

Note

在此前的Seurat-基于SCTransform中我们学习了how to run through the workflow from normalization to clustering without integration. Other steps in the workflow remain fairly similar, but the samples would not necessarily be split in the beginning and integration would not be performed.

It can help to first run conditions individually if unsure what clusters to expect or expecting some different cell types between conditions (e.g. tumor and control samples), then run them together to see whether there are condition-specific clusters for cell types present in both conditions. Oftentimes, when clustering cells from multiple conditions there are condition-specific clusters and integration can help ensure the same cell types cluster together.

2 整合

If cells cluster by sample, condition, batch, dataset, modality, this integration step can greatly improve the clustering and the downstream analyses.

To integrate, we will use the shared highly variable genes (identified using SCTransform) from each group, then, we will “integrate” or “harmonize” the groups to overlay cells that are similar or have a “common set of biological features” between groups. For example, we could integrate across:

  • Different conditions (e.g. control and stimulated)

  • Different datasets (e.g. scRNA-seq from datasets generated using different library preparation methods on the same samples)

  • Different modalities (e.g. scRNA-seq and scATAC-seq)

  • Different batches (e.g. when experimental conditions make batch processing of samples necessary)

The goal of integration is to ensure that the cell types of one condition/dataset align with the same celltypes of the other conditions/datasets (e.g. control macrophages align with stimulated macrophages).

2.1 Integration using CCA

Seurat v5 enables streamlined integrative analysis using the IntegrateLayers function. The method currently supports five integration methods. Each of these methods performs integration in low-dimensional space, and returns a dimensional reduction (i.e. integrated.rpca) that aims to co-embed shared cell types across batches:

The anchor-based CCA integration (method=CCAIntegration) utilizes the canonical correlation analysis (CCA). This method expects “correspondences” or shared biological states among at least a subset of single cells across the groups. The steps in the Seurat integration workflow are outlined in the figure below:

Image credit: (Stuart et al. 2019)

The different steps applied are as follows:

  1. Perform canonical correlation analysis (CCA):

    CCA identifies shared sources of variation between the conditions/groups. It is a form of PCA, in that it identifies the greatest sources of variation in the data, but only if it is shared or conserved across the conditions/groups (using the 3000 most variant genes from each sample).

    This step roughly aligns the cells using the greatest shared sources of variation.

    Note

    The shared highly variable genes are used because they are the most likely to represent those genes distinguishing the different cell types present.

  2. Identify anchors or mutual nearest neighbors (MNNs) across datasets (sometimes incorrect anchors are identified):

    MNNs can be thought of as ‘best buddies’. For each cell in one condition:

    • The cell’s closest neighbor in the other condition is identified based on gene expression values - its ‘best buddy’.
    • The reciprocal analysis is performed, and if the two cells are ‘best buddies’ in both directions, then those cells will be marked as anchors to ‘anchor’ the two datasets together.

    “The difference in expression values between cells in an MNN pair provides an estimate of the batch effect, which is made more precise by averaging across many such pairs. A correction vector is obtained and applied to the expression values to perform batch correction.” (Stuart et al. 2019)

  3. Filter anchors to remove incorrect anchors:

    Assess the similarity between anchor pairs by the overlap in their local neighborhoods (incorrect anchors will have low scores) - do the adjacent cells have ‘best buddies’ that are adjacent to each other?

  4. Integrate the conditions/datasets:

    Use anchors and corresponding scores to transform the cell expression values, allowing for the integration of the conditions/datasets (different samples, conditions, datasets, modalities)

    Note

    Transformation of each cell uses a weighted average of the two cells of each anchor across anchors of the datasets. Weights determined by cell similarity score (distance between cell and k nearest anchors) and anchor scores, so cells in the same neighborhood should have similar correction values.

    If cell types are present in one dataset, but not the other, then the cells will still appear as a separate sample-specific cluster.

Now, using our SCTransform object as input, let’s perform the integration across conditions.

# 可以看到目前的降维信息包括"pca"和"umap.unintegrated"
names(split_seurat@reductions)
[1] "pca"               "umap.unintegrated"
# 整合,比较耗时间(约13min),进度条会一直显示0%直至运算完成
library(beepr)
seurat_integrated <- IntegrateLayers(object = split_seurat,
                                     method = CCAIntegration,
                                     normalization.method = "SCT", # 指定使用的标准化方法为SCTransform
                                     orig.reduction = "pca",
                                     new.reduction = "integrated.cca", #  整合后新的降维数据的名称
                                     verbose = FALSE); beep()

# 整合后重新合并RNA layer
seurat_integrated[["RNA"]] <- JoinLayers(seurat_integrated[["RNA"]])

# 查看整合后的降维信息
names(seurat_integrated@reductions)
[1] "pca"               "umap.unintegrated" "integrated.cca"   
Tip

如何调用Seurat嵌入的其他整合算法进行整合,参考:Seurat v5单细胞数据整合分析

3 整合后检验细胞分群情况

After integration, to visualize the integrated data we can use dimensionality reduction techniques, such as PCA and Uniform Manifold Approximation and Projection (UMAP). While PCA will determine all PCs, we can only plot two at a time. In contrast, UMAP will take the information from any number of top PCs to arrange the cells in this multidimensional space. It will take those distances in multidimensional space and plot them in two dimensions working to preserve local and global structure. In this way, the distances between cells represent similarity in expression. If you wish to explore UMAP in more detail, this post is a nice introduction to UMAP theory.

Here, we visualize with UMAP. UMAP is a stochastic algorithm – this means that it makes use of randomness both to speed up approximation steps, and to aid in solving hard optimization problems. Due to the stochastic nature, different runs of UMAP can produce different results. We can set the seed to a specific (but random) number, and this avoids the creation of a slightly different UMAP each time re-run our code.

# Set seed
set.seed(123456)
# Run UMAP
seurat_integrated <- RunUMAP(seurat_integrated, 
                             dims = 1:40,
                             reduction = "integrated.cca", # 更改降维来源为整合后的"integrated.cca"
                             reduction.name = "umap.integrated") 

# 分群
seurat_integrated <- FindNeighbors(seurat_integrated, 
                                   dims = 1:40, 
                                   reduction = "integrated.cca") #更改降维来源为"integrated.cca"
                                   
seurat_integrated <- FindClusters(seurat_integrated, 
                                  cluster.name = "integrated_clusters")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 29629
Number of edges: 1128935

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8864
Number of communities: 22
Elapsed time: 4 seconds
# Plot UMAP                             
p3 <- DimPlot(seurat_integrated, 
              reduction = "umap.integrated", 
              group.by = "sample")
p4 <- DimPlot(seurat_integrated, 
              reduction = "umap.integrated", 
              split.by = "sample")
plot_grid(p1, p2, p3, p4, 
          ncol = 2, labels = "AUTO")
Figure 2: 整合前后细胞分群情况(A, B: 整合前;C, D: 整合后)

When we compare the similarity between the ctrl and stim clusters in the above plot with what we see using the the unintegrated dataset, it is clear that this dataset benefitted from the integration!

3.1 Save the “integrated” object!

# Save integrated seurat object
saveRDS(seurat_integrated, "output/scRNA-seq_online/integrated_seurat.rds")

4 Complex Integration Tasks

In the section above, we’ve presented the Seurat integration workflow, which uses canonical correlation analysis (CCA) and multiple nearest neighbors (MNN) to find “anchors” and integrate across samples, conditions, modalities, etc. While the Seurat integration approach is widely used and several benchmarking studies support its great performance in many cases, it is important to recognize that alternative integration algorithms exist and may work better for more complex integration tasks (see (Luecken et al. 2021) for a comprehensive review).

Not all integration algorithms rely on the same methodology, and they do not always provide the same type of corrected output (embeddings, count matrix…). Their performance is also affected by preliminary data processing steps, including which normalization method was used and how highly variable genes (HVGs) were determined. All those considerations are important to keep in mind when selecting a data integration approach for your study.

What do we mean by a “complex” integration task?

In their benchmarking study (Luecken et al. 2021) compared the performance of different scRNA-seq integration tools when confronted to different “complex” tasks. The “complexity” of integrating a dataset may relate to the number of samples (perhaps generated using different protocols) but also to the biological question the study seeks to address (e.g. comparing cell types across tissues, species…). In these contexts, you may need to integrate across multiple confounding factors before you can start exploring the biology of your system.

In these more complex scenarios, you want to select a data integration approach that successfully balances out the following challenges:

  • Correcting for inter-sample variability due to source samples from different donors
  • Correcting for variability across protocols/technologies (10X, SMART-Seq2, inDrop…; single-cell vs. single nucleus; variable number of input cells and sequencing depth; different sample preparation steps…)
  • Identifying consistent cell types across different tissues (peripheral blood, bone marrow, lung…) and/or different locations (e.g. areas of the brain)
  • Keeping apart cell subtypes (or even cell states) that show similar transcriptomes (CD4 naive vs. memory, NK vs NKT)
  • Keeping apart cell subtypes that are unique to a tissue/condition
  • Conserving the developmental trajectory, if applicable

Not all tools may perform as well on every task, and complex datasets may require testing several data integration approaches. You might want to analyze independently each of the batches you consider to integrate across, in order to define cell identities at this level before integrating and checking that the initially annotated cell types are mixed as expected.

4.1 Harmonizing as a method of integration

Harmony (Korsunsky et al. 2019) was devleoped in 2019, and is an example of a tool that can work with complex integration tasks. It is available as an GitHub and CRAN, and it has functions for standalone and Seurat pipeline analyses. It has been shown to perform incredibly well from recent benchmarking studies (Tran et al. 2020).

在Seurat工作流的基础上实现基于Harmony的单细胞数据整合,可以使用Harmony包(available on GitHub and CRAN),详见该教程。同时,Seurat V5中的IntegrateLayers函数集成了Harmony整合算法,可以直接调用,详见:Seurat v5单细胞数据整合分析。Compared to other algorithms, Harmony notably presents the following advantages (Korsunsky et al. 2019) (Tran et al. 2020):

  1. Possibility to integrate data across several variables (for example, by experimental batch and by condition)

  2. Significant gain in speed and lower memory requirements for integration of large datasets

  3. Interoperability with the Seurat workflow


R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] beepr_1.3          cowplot_1.1.2      Seurat_5.0.1       SeuratObject_5.0.1
[5] sp_2.1-2          

loaded via a namespace (and not attached):
  [1] deldir_2.0-2           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.3            magrittr_2.0.3         RcppAnnoy_0.0.21      
  [7] spatstat.geom_3.2-7    matrixStats_1.2.0      ggridges_0.5.5        
 [10] compiler_4.3.2         png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
 [16] fastmap_1.1.1          ellipsis_0.3.2         labeling_0.4.3        
 [19] utf8_1.2.4             promises_1.2.1         rmarkdown_2.25        
 [22] purrr_1.0.2            xfun_0.41              jsonlite_1.8.8        
 [25] goftest_1.2-3          later_1.3.2            spatstat.utils_3.0-4  
 [28] irlba_2.3.5.1          parallel_4.3.2         cluster_2.1.6         
 [31] R6_2.5.1               ica_1.0-3              stringi_1.8.3         
 [34] RColorBrewer_1.1-3     spatstat.data_3.0-4    reticulate_1.34.0     
 [37] parallelly_1.36.0      lmtest_0.9-40          scattermore_1.2       
 [40] Rcpp_1.0.12            knitr_1.45             tensor_1.5            
 [43] future.apply_1.11.1    audio_0.1-11           zoo_1.8-12            
 [46] sctransform_0.4.1      httpuv_1.6.13          Matrix_1.6-5          
 [49] splines_4.3.2          igraph_1.6.0           tidyselect_1.2.0      
 [52] abind_1.4-5            rstudioapi_0.15.0      yaml_2.3.8            
 [55] spatstat.random_3.2-2  codetools_0.2-19       miniUI_0.1.1.1        
 [58] spatstat.explore_3.2-5 listenv_0.9.0          lattice_0.22-5        
 [61] tibble_3.2.1           plyr_1.8.9             withr_3.0.0           
 [64] shiny_1.8.0            ROCR_1.0-11            evaluate_0.23         
 [67] Rtsne_0.17             future_1.33.1          fastDummies_1.7.3     
 [70] survival_3.5-7         polyclip_1.10-6        fitdistrplus_1.1-11   
 [73] pillar_1.9.0           KernSmooth_2.23-22     plotly_4.10.4         
 [76] generics_0.1.3         RcppHNSW_0.5.0         ggplot2_3.4.4         
 [79] munsell_0.5.0          scales_1.3.0           globals_0.16.2        
 [82] xtable_1.8-4           glue_1.7.0             lazyeval_0.2.2        
 [85] tools_4.3.2            data.table_1.14.10     RSpectra_0.16-1       
 [88] RANN_2.6.1             leiden_0.4.3.1         dotCall64_1.1-1       
 [91] grid_4.3.2             tidyr_1.3.0            colorspace_2.1-0      
 [94] nlme_3.1-164           patchwork_1.2.0        cli_3.6.2             
 [97] spatstat.sparse_3.0-3  spam_2.10-0            fansi_1.0.6           
[100] viridisLite_0.4.2      dplyr_1.1.4            uwot_0.1.16           
[103] gtable_0.3.4           digest_0.6.34          progressr_0.14.0      
[106] ggrepel_0.9.5          farver_2.1.1           htmlwidgets_1.6.4     
[109] htmltools_0.5.7        lifecycle_1.0.4        httr_1.4.7            
[112] mime_0.12              MASS_7.3-60.0.1       

References

Korsunsky, Ilya, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. 2019. “Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony.” Nature Methods 16 (12): 1289–96. https://doi.org/10.1038/s41592-019-0619-0.
Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2021. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.
Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M. Mauck, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.e21. https://doi.org/10.1016/j.cell.2019.05.031.
Tran, Hoa Thi Nhu, Kok Siong Ang, Marion Chevrier, Xiaomeng Zhang, Nicole Yee Shin Lee, Michelle Goh, and Jinmiao Chen. 2020. “A Benchmark of Batch-Effect Correction Methods for Single-Cell RNA Sequencing Data.” Genome Biology 21 (1). https://doi.org/10.1186/s13059-019-1850-9.