基于参考集的细胞注释

原文：Mapping and annotating query datasets

原文发布日期：2023年10月31日

In this vignette, we first build an integrated reference and then demonstrate how to leverage this reference to annotate new query datasets. Generating an integrated reference follows the same workflow described in more detail in the integration introduction. Once generated, this reference can be used to analyze additional query datasets through tasks like cell type label transfer and projecting query cells onto reference UMAPs. Notably, this does not require correction of the underlying raw query data and can therefore be an efficient strategy if a high quality reference is available.

1 构建参考数据集

For the purposes of this example, we’ve chosen human pancreatic islet cell datasets produced across four technologies, CelSeq (GSE81076) CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). For convenience, we distribute this dataset through our SeuratData package. The metadata contains the technology (tech column) and cell type annotations (celltype column) for each cell in the four datasets.

1.1 数据读取

在线读取（可能需要全局代理）

```{r}
#| eval: false
library(SeuratData)
InstallData("panc8")
library(Seurat)
panc8 <- LoadData("panc8")
```

从本地下载好的数据读取：

panc8 <- readRDS("data/seurat_official/panc8.rds")
panc8

An object of class Seurat 
34363 features across 14890 samples within 1 assay 
Active assay: RNA (34363 features, 0 variable features)
 2 layers present: counts, data

colnames(panc8)[1:5]

[1] "D101_5"  "D101_7"  "D101_10" "D101_13" "D101_14"

rownames(panc8)[1:5]

[1] "A1BG-AS1" "A1BG"     "A1CF"     "A2M-AS1"  "A2ML1"

head(panc8@meta.data, 5)

        orig.ident nCount_RNA nFeature_RNA   tech replicate assigned_cluster
D101_5        D101   4615.810         1986 celseq    celseq             <NA>
D101_7        D101  29001.563         4209 celseq    celseq             <NA>
D101_10       D101   6707.857         2408 celseq    celseq             <NA>
D101_13       D101   8797.224         2964 celseq    celseq             <NA>
D101_14       D101   5032.558         2264 celseq    celseq             <NA>
        celltype dataset
D101_5     gamma  celseq
D101_7    acinar  celseq
D101_10    alpha  celseq
D101_13    delta  celseq
D101_14     beta  celseq

table(panc8$tech)


    celseq    celseq2 fluidigmc1     indrop  smartseq2 
      1004       2285        638       8569       2394

可以看到，该数据包含了5种单细胞转录组测序技术获得的单细胞数据。

As a demonstration, we will use a subset of technologies to construct a reference. We will then map the remaining datasets onto this reference. we will use data from 2 technologies (celseq2和smartseq2) for the reference。

library(Seurat)
pancreas.ref <- subset(panc8, tech %in% c("celseq2", "smartseq2"))
# 按照不同的测序技术将表达矩阵分为不同的layer
pancreas.ref[["RNA"]] <- split(pancreas.ref[["RNA"]], 
                               f = pancreas.ref$tech)

1.2 数据预处理

标准化、找高变基因、归一化、降维、聚类、可视化：

pancreas.ref <- NormalizeData(pancreas.ref)
pancreas.ref <- FindVariableFeatures(pancreas.ref)
pancreas.ref <- ScaleData(pancreas.ref)
pancreas.ref <- RunPCA(pancreas.ref)
pancreas.ref <- FindNeighbors(pancreas.ref, dims = 1:30)
pancreas.ref <- FindClusters(pancreas.ref)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 4679
Number of edges: 174953

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9180
Number of communities: 19
Elapsed time: 0 seconds

pancreas.ref <- RunUMAP(pancreas.ref, dims = 1:30)
DimPlot(pancreas.ref, group.by = c("celltytpe", "tech"))

DimPlot(pancreas.ref, split.by = "tech")

可以看到，不同的测序技术间的细胞类型差异较大。因此需要对数据进行整合，方法同此前的章节一致。

pancreas.ref <- IntegrateLayers(object = pancreas.ref, 
                                method = CCAIntegration, 
                                orig.reduction = "pca",
                                new.reduction = "integrated.cca", 
                                verbose = FALSE)
# 重新聚类
pancreas.ref <- FindNeighbors(pancreas.ref, 
                              reduction = "integrated.cca",#更改降维来源为"integrated.cca"
                              dims = 1:30)
pancreas.ref <- FindClusters(pancreas.ref)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 4679
Number of edges: 190152

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8680
Number of communities: 15
Elapsed time: 0 seconds

# 重新降维
pancreas.ref <- RunUMAP(pancreas.ref, 
                        reduction = "integrated.cca", #更改降维来源为"integrated.cca"
                        dims = 1:30)
DimPlot(pancreas.ref, group.by = c("tech", "celltype"))

DimPlot(pancreas.ref, split.by = "tech")

可以看到，和此前相比，整合后不再有不同测序技术间细胞类型的差异。

2 基于参考集的细胞注释

Seurat also supports the projection of reference data (or meta data) onto a query object. While many of the methods are conserved (both procedures begin by identifying anchors), there are two important distinctions between data transfer and integration:

In data transfer, Seurat does not correct or modify the query expression data.
In data transfer, Seurat has an option (set by default) to project the PCA structure of a reference onto the query, instead of learning a joint structure with CCA. We generally suggest using this option when projecting data between scRNA-seq datasets.

Seurat支持将参考数据集的注释信息（meta.data）映射到查询数据集上。基于Seurat的数据注释映射（data transfer）和上面的数据整合（integration）之间许多步骤都是类似的（例如这两个过程都是从识别锚点开始的），但它们之间有两个重要的区别:

在data transfer中，Seurat不矫正或修改待查询数据集的表达矩阵。

在 data transfer中，Seurat有一个选项（默认设置），可以将参考基因集的PCA结构投影到查询对象上，而不是使用CCA学习联合结构。

After finding anchors, we use the TransferData() function to classify the query cells based on reference data (a vector of reference cell type labels). TransferData() returns a matrix with predicted IDs and prediction scores, which we can add to the query metadata.

这里为了演示，还是选取panc8中的两个测序技术（“fluidigmc1”和”celseq”）的数据作为查询数据集。

pancreas.query <- subset(panc8, tech %in% c("fluidigmc1", "celseq"))
# 标准化查询数据集
pancreas.query <- NormalizeData(pancreas.query)
# 寻找transfer锚点
pancreas.anchors <- FindTransferAnchors(reference = pancreas.ref, 
                                        query = pancreas.query, 
                                        dims = 1:30,
                                        reference.reduction = "pca")
# 映射数据
predictions <- TransferData(anchorset = pancreas.anchors, 
                            refdata = pancreas.ref$celltype, 
                            dims = 1:30)
dim(predictions)

[1] 1642   15

predictions[1:5, 1:3]

        predicted.id prediction.score.alpha prediction.score.endothelial
D101_5         gamma                      0                            0
D101_7        acinar                      0                            0
D101_10        alpha                      1                            0
D101_13        delta                      0                            0
D101_14         beta                      0                            0

可以看到映射之后生成的predictions是一个数据框，将查询数据集中的每一个细胞和预测的细胞类型（predicted.id）一一对应，并给出了这种预测的分数。

接下来，只需将predictions作为metadata添加到查询数据集：

pancreas.query <- AddMetaData(pancreas.query, 
                              metadata = predictions)
colnames(pancreas.query@meta.data)

 [1] "orig.ident"                          "nCount_RNA"                         
 [3] "nFeature_RNA"                        "tech"                               
 [5] "replicate"                           "assigned_cluster"                   
 [7] "celltype"                            "dataset"                            
 [9] "predicted.id"                        "prediction.score.alpha"             
[11] "prediction.score.endothelial"        "prediction.score.delta"             
[13] "prediction.score.beta"               "prediction.score.ductal"            
[15] "prediction.score.acinar"             "prediction.score.mast"              
[17] "prediction.score.gamma"              "prediction.score.activated_stellate"
[19] "prediction.score.macrophage"         "prediction.score.quiescent_stellate"
[21] "prediction.score.epsilon"            "prediction.score.schwann"           
[23] "prediction.score.max"

table(pancreas.query$predicted.id)


            acinar activated_stellate              alpha               beta 
               262                 39                436                419 
             delta             ductal        endothelial              gamma 
                73                330                 19                 41 
        macrophage               mast            schwann 
                15                  2                  6

现在的查询数据集中就多出了映射后的细胞注释信息。

Because we have the original label annotations from our full integrated analysis, we can evaluate how well our predicted cell type annotations match the full reference.

table(pancreas.query$predicted.id == pancreas.query$celltype)


FALSE  TRUE 
   63  1579

In this example, we find that there is a high agreement in cell type classification, with over 96%（1579/1642） of cells being labeled correctly.

To verify this further, we can examine some canonical cell type markers for specific pancreatic islet cell populations.

VlnPlot(pancreas.query, 
        c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), 
        group.by = "predicted.id")

Note that even though some of these cell types are only represented by one or two cells (e.g. epsilon cells), we are still able to classify them correctly.

2.1 UMAP映射

We also enable projection of a query onto the reference UMAP structure. This can be achieved by computing the reference UMAP model and then calling MapQuery() instead of TransferData().

pancreas.ref <- RunUMAP(pancreas.ref, 
                        dims = 1:30, 
                        reduction = "integrated.cca", 
                        return.model = TRUE)
pancreas.query <- MapQuery(anchorset = pancreas.anchors, 
                           query = pancreas.query, 
                           reference = pancreas.ref, 
                           refdata = list(celltype = "celltype"), #需要transfer的参考数据集的列
                           reference.reduction = "pca", 
                           reduction.model = "umap")

可以看到现在的查询数据集pancreas.query中有了降维信息（reduction）。这些信息实际上是映射的参考数据集pancreas.ref的降维信息：

What is MapQuery doing?

MapQuery()打包了三个函数的功能: TransferData(), IntegrateEmbeddings(), and ProjectUMAP(). TransferData() is used to transfer cell type labels and impute the ADT values; IntegrateEmbeddings() is used to integrate reference with query by correcting the query’s projected low-dimensional embeddings; and finally ProjectUMAP() is used to project the query data onto the UMAP structure of the reference. 所以，运行MapQuery()的效果和运行下面的脚本一样:

pancreas.query <- TransferData(anchorset = pancreas.anchors, 
                               reference = pancreas.ref, 
                               query = pancreas.query,
                               refdata = list(celltype = "celltype"))
pancreas.query <- IntegrateEmbeddings(anchorset = pancreas.anchors, 
                                      reference = pancreas.ref, 
                                      query = pancreas.query,
                                      new.reduction.name = "ref.pca")
pancreas.query <- ProjectUMAP(query = pancreas.query, 
                              query.reduction = "ref.pca", 
                              reference = pancreas.ref,
                              reference.reduction = "pca", 
                              reduction.model = "umap")

We can now visualize the query cells alongside our reference.

library(ggplot2)
DimPlot(pancreas.ref, 
        reduction = "umap", 
        group.by = "celltype", 
        label = TRUE, 
        label.size = 6,
        repel = TRUE) + 
  NoLegend() + 
  ggtitle("Reference annotations")
DimPlot(pancreas.query, 
        reduction = "ref.umap", 
        group.by = "predicted.celltype", 
        label = TRUE,
        label.size = 6, 
        repel = TRUE) + 
  NoLegend() + 
  ggtitle("Query transferred labels")

Tip

本篇主要介绍了基于单细胞转录组测序（scRNA-seq）数据的参考数据集的制作和映射。Seurat现在还提供了一种’bridge integration’的方法，可以将其他单细胞组学数据（如scATAC-seq、scDNAme、CyTOF）映射到scRNA-seq参考数据集上。详见：Dictionary Learning for cross-modality integration。

Session Info

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.4      Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] deldir_2.0-2           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.3            magrittr_2.0.3         RcppAnnoy_0.0.21      
  [7] spatstat.geom_3.2-7    matrixStats_1.2.0      ggridges_0.5.5        
 [10] compiler_4.3.2         png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
 [16] fastmap_1.1.1          ellipsis_0.3.2         labeling_0.4.3        
 [19] utf8_1.2.4             promises_1.2.1         rmarkdown_2.25        
 [22] ggbeeswarm_0.7.2       purrr_1.0.2            xfun_0.41             
 [25] jsonlite_1.8.8         goftest_1.2-3          later_1.3.2           
 [28] spatstat.utils_3.0-4   irlba_2.3.5.1          parallel_4.3.2        
 [31] cluster_2.1.6          R6_2.5.1               ica_1.0-3             
 [34] stringi_1.8.3          RColorBrewer_1.1-3     spatstat.data_3.0-4   
 [37] reticulate_1.34.0      parallelly_1.36.0      lmtest_0.9-40         
 [40] scattermore_1.2        Rcpp_1.0.12            knitr_1.45            
 [43] tensor_1.5             future.apply_1.11.1    zoo_1.8-12            
 [46] sctransform_0.4.1      httpuv_1.6.13          Matrix_1.6-5          
 [49] splines_4.3.2          igraph_1.6.0           tidyselect_1.2.0      
 [52] abind_1.4-5            rstudioapi_0.15.0      yaml_2.3.8            
 [55] spatstat.random_3.2-2  codetools_0.2-19       miniUI_0.1.1.1        
 [58] spatstat.explore_3.2-5 listenv_0.9.0          lattice_0.22-5        
 [61] tibble_3.2.1           plyr_1.8.9             withr_3.0.0           
 [64] shiny_1.8.0            ROCR_1.0-11            ggrastr_1.0.2         
 [67] evaluate_0.23          Rtsne_0.17             future_1.33.1         
 [70] fastDummies_1.7.3      survival_3.5-7         polyclip_1.10-6       
 [73] fitdistrplus_1.1-11    pillar_1.9.0           KernSmooth_2.23-22    
 [76] plotly_4.10.4          generics_0.1.3         RcppHNSW_0.5.0        
 [79] munsell_0.5.0          scales_1.3.0           globals_0.16.2        
 [82] xtable_1.8-4           glue_1.7.0             lazyeval_0.2.2        
 [85] tools_4.3.2            data.table_1.14.10     RSpectra_0.16-1       
 [88] RANN_2.6.1             leiden_0.4.3.1         dotCall64_1.1-1       
 [91] cowplot_1.1.2          grid_4.3.2             tidyr_1.3.0           
 [94] colorspace_2.1-0       nlme_3.1-164           patchwork_1.2.0       
 [97] beeswarm_0.4.0         vipor_0.4.7            cli_3.6.2             
[100] spatstat.sparse_3.0-3  spam_2.10-0            fansi_1.0.6           
[103] viridisLite_0.4.2      dplyr_1.1.4            uwot_0.1.16           
[106] gtable_0.3.4           digest_0.6.34          progressr_0.14.0      
[109] ggrepel_0.9.5          farver_2.1.1           htmlwidgets_1.6.4     
[112] htmltools_0.5.7        lifecycle_1.0.4        httr_1.4.7            
[115] mime_0.12              MASS_7.3-60.0.1