多个单细胞数据集整合分析（上）

参考：单细胞多数据集整合示例

有时候，为了扩大数据量和得到更加可靠的结论，我们可能获取和下载了多个单细胞数据集。如果我们只关注某一细胞类型，如T cells或B cells或髓系细胞，那么就需要合并分析这些数据集，提取感兴趣的细胞亚群。这时候有两种方法可以选择：

对各个数据集分别降维分群，然后提取各个数据集的感兴趣细胞亚群并整合到一起
在一开始分析的时候就对这些数据集进行整合，然后分群，提起感兴趣细胞群

这个部分怎么分析并没有一个特定的答案或者规则，具体还是得看数据集适合以上提出的哪种分析顺序。

例如(Gong et al. 2023)就用了第二种方法。选取了三个鼻咽癌单细胞转录组数据： GSE150825, GSE150430, GSE162025。首先对三组数据进行整合，然后选取其中的一部分T细胞进行进一步分析。

这里我们介绍第一种方式，即对各数据集分别降维分群，然后提取并整合各个数据集的感兴趣细胞亚群。

1 加载包

library(Seurat)
library(ggplot2)
library(beepr)
library(cowplot)
library(dplyr)
library(clustree)

2 数据导入

这里我们直接载入读取非标准大体积文本文件单细胞数据中构建好的Seurat对象，该案例数据的介绍及读取过程参见该章节。

merged_seurat <- readRDS("output/sc_supplementary/GSE150430_merged_seurat.rds")
merged_seurat

An object of class Seurat 
24720 features across 48584 samples within 1 assay 
Active assay: RNA (24720 features, 0 variable features)
 16 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, counts.11, counts.12, counts.13, counts.14, counts.15, counts.16

head(merged_seurat, 4)

                       orig.ident nCount_RNA nFeature_RNA samples
N01_CAGAATCAGTAATCCC.1        N01   3470.970         4173     N01
N01_TGCGGGTAGTACGTAA.1        N01   3155.725         2994     N01
N01_AAACCTGAGGGTTTCT.1        N01   2347.034         1946     N01
N01_AAACCTGAGTTCCACA.1        N01   2362.495         1993     N01

unique(merged_seurat$samples)

 [1] "N01" "P01" "P02" "P03" "P04" "P05" "P06" "P07" "P08" "P09" "P10" "P11"
[13] "P12" "P13" "P14" "P15"

3 质控

# Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)
summary(merged_seurat$log10GenesPerUMI)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8897  0.9488  0.9670  0.9773  0.9971  1.1429

# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, 
                                                pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat$mitoRatio / 100
summary(merged_seurat$mitoRatio)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0006859 0.0106660 0.0127552 0.0129526 0.0150420 0.0407058

boxplot(merged_seurat$mitoRatio)

# Add cell IDs to metadata
merged_seurat$cells <- rownames(merged_seurat@meta.data)

# Visualize the number of cell counts per sample
merged_seurat@meta.data  |> 
    ggplot(aes(x = samples, fill = samples)) + 
    geom_bar() +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
    ggtitle("NCells")
# Visualize the number UMIs/transcripts per cell
merged_seurat@meta.data |> 
    ggplot(aes(color = samples, x = nCount_RNA, fill = samples)) + 
    geom_density(alpha = 0.2) + 
    scale_x_log10() + 
    theme_classic() +
    ylab("Cell density") +
    geom_vline(xintercept = 500)
# Visualize the distribution of genes detected per cell via histogram
merged_seurat@meta.data |>
    ggplot(aes(color = samples, x = nFeature_RNA, fill= samples)) + 
    geom_density(alpha = 0.2) + 
    theme_classic() +
    scale_x_log10() + 
    geom_vline(xintercept = 200)
# Visualize the overall complexity of the gene expression by visualizing the genes detected per UMI (novelty score)
merged_seurat@meta.data |>
    ggplot(aes(x = log10GenesPerUMI, color = samples, fill=samples)) +
    geom_density(alpha = 0.2) +
    theme_classic() +
    geom_vline(xintercept = 0.8)
# Visualize the distribution of mitochondrial gene expression detected per cell
merged_seurat@meta.data |>
    ggplot(aes(color = samples, x = mitoRatio, fill = samples)) + 
    geom_density(alpha = 0.2) + 
    scale_x_log10() + 
    theme_classic() +
    geom_vline(xintercept = 0.2)

# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs
merged_seurat@meta.data |> 
    ggplot(aes(x = nCount_RNA, y = nFeature_RNA, color = mitoRatio)) + 
    geom_point() + 
    scale_colour_gradient(low = "gray90", high = "black") +
    stat_smooth(method = lm) +
    scale_x_log10() + 
    scale_y_log10() + 
    theme_classic() +
    geom_vline(xintercept = 500) +
    geom_hline(yintercept = 250) +
    facet_wrap(~samples)

可以看到该数据集已经进行了质控，各项指标均在正常范围内，因此，我们可以跳过细胞/基因过滤，直接进入下面的环节。

4 归一化及消除非期望变异来源

Caution

由于这里SCTransform的数据量太大，在我的16GB MacBook Pro上超出了R内存分配上限，出现了“Error: vector memory exhausted (limit reached?)”报错。因此，这里在运行下面的脚本之前通过提高R内存分配上限（macOS）的方法进行了处理。

# SCTranform
merged_seurat <- SCTransform(merged_seurat, verbose = FALSE); beep()
gc() # 释放未使用内存

            used   (Mb) gc trigger    (Mb) limit (Mb)   max used    (Mb)
Ncells   8268868  441.7   15695064   838.3         NA   14114318   753.8
Vcells 988416251 7541.1 3232710770 24663.7     102400 3232709930 24663.7

merged_seurat

An object of class Seurat 
46357 features across 48584 samples within 2 assays 
Active assay: SCT (21637 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA

# Check which assays are stored in objects
merged_seurat@assays

$RNA
Assay (v5) data with 24720 features for 48584 cells
First 10 features:
 RP11-34P13.7, RP11-34P13.8, AL627309.1, AP006222.2, RP4-669L17.10,
RP4-669L17.2, RP5-857K21.4, RP11-206L10.3, RP11-206L10.5, RP11-206L10.2 
Layers:
 counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7,
counts.8, counts.9, counts.10, counts.11, counts.12, counts.13,
counts.14, counts.15, counts.16 

$SCT
SCTAssay data with 21637 features for 48584 cells, and 16 SCTModel(s) 
Top 10 variable features:
 RNASE6, PHACTR1, NR4A3, C1orf162, HMOX1, IFIT2, CD7, CD86, CLNK, CRIP1

# 查看目前默认的assay
DefaultAssay(merged_seurat)

[1] "SCT"

# 查看默认assay的layers
Layers(merged_seurat)

[1] "counts"     "data"       "scale.data"

4.1 评估细胞周期的影响

# Load cell cycle markers
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

# Score cells for cell cycle
merged_seurat <- CellCycleScoring(merged_seurat, 
                                  g2m.features = g2m.genes, 
                                  s.features = s.genes)

# 现在的meta.data中多出了细胞周期评分“S.Score”和“G2M.Score”，以及推断的细胞所处的周期“Phase”
head(merged_seurat@meta.data)

                       orig.ident nCount_RNA nFeature_RNA samples
N01_CAGAATCAGTAATCCC.1        N01   3470.970         4173     N01
N01_TGCGGGTAGTACGTAA.1        N01   3155.725         2994     N01
N01_AAACCTGAGGGTTTCT.1        N01   2347.034         1946     N01
N01_AAACCTGAGTTCCACA.1        N01   2362.495         1993     N01
N01_AAACCTGGTTCTCATT.1        N01   2130.465         1459     N01
N01_AAACGGGGTCCTCCAT.1        N01   2772.010         2573     N01
                       log10GenesPerUMI   mitoRatio                  cells
N01_CAGAATCAGTAATCCC.1        1.0225953 0.008644269 N01_CAGAATCAGTAATCCC.1
N01_TGCGGGTAGTACGTAA.1        0.9934705 0.009895666 N01_TGCGGGTAGTACGTAA.1
N01_AAACCTGAGGGTTTCT.1        0.9758564 0.015183419 N01_AAACCTGAGGGTTTCT.1
N01_AAACCTGAGTTCCACA.1        0.9781039 0.014965958 N01_AAACCTGAGTTCCACA.1
N01_AAACCTGGTTCTCATT.1        0.9506023 0.017437508 N01_AAACCTGGTTCTCATT.1
N01_AAACGGGGTCCTCCAT.1        0.9906021 0.011233365 N01_AAACGGGGTCCTCCAT.1
                       nCount_SCT nFeature_SCT     S.Score    G2M.Score Phase
N01_CAGAATCAGTAATCCC.1       1840         1674 -0.05179250 -0.070444626    G1
N01_TGCGGGTAGTACGTAA.1       2840         2604 -0.04341940 -0.074302336    G1
N01_AAACCTGAGGGTTTCT.1       2443         1931 -0.02396942 -0.007239586    G1
N01_AAACCTGAGTTCCACA.1       2488         1972 -0.01294260 -0.021167805    G1
N01_AAACCTGGTTCTCATT.1       2104         1451 -0.05440456 -0.049531592    G1
N01_AAACGGGGTCCTCCAT.1       2892         2520 -0.10189203 -0.087537749    G1

# 查看一下细胞周期的分布情况
table(merged_seurat$Phase)


   G1   G2M     S 
33941  5147  9496

# 执行PCA
merged_seurat <- RunPCA(merged_seurat)

# Plot the PCA colored by cell cycle phase
p1 <- DimPlot(merged_seurat,
              reduction = "pca",
              group.by= "Phase")
p2 <- DimPlot(merged_seurat,
              reduction = "pca",
              group.by= "Phase",
              split.by = "Phase")
plot_grid(p1, p2, ncol = 2, labels = "AUTO")

可以看到细胞周期不是变异来源。

4.2 评估线粒体基因的影响

# Check quartile values
mito_sum <- summary(merged_seurat$mitoRatio)
mito_sum

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0006859 0.0106660 0.0127552 0.0129526 0.0150420 0.0407058

# Turn mitoRatio into categorical factor vector based on quartile values
merged_seurat$mitoFr <- cut(merged_seurat$mitoRatio, 
                            breaks=c(-Inf, mito_sum[2], mito_sum[3], mito_sum[5], Inf),
                            labels=c("Low", "Medium", "Medium high", "High"))
plot(merged_seurat$mitoFr)

# Plot the PCA colored by cell cycle phase
p1 <- DimPlot(merged_seurat,
              reduction = "pca",
              group.by= "mitoFr")
p2 <- DimPlot(merged_seurat,
              reduction = "pca",
              group.by= "mitoFr",
              split.by = "mitoFr")
plot_grid(p1, p2, ncol = 2, labels = "AUTO")

可以看到线粒体基因比例不是变异来源。

由于细胞周期和线粒体基因比例都不是非期望变异来源，所以我们这里不需要再次运行SCTransform来回归这些变量。接下来，直接进入数据整合环节。

5 数据整合

5.1 不进行整合时检验细胞分群情况

# 查看降维信息
names(merged_seurat@reductions)

[1] "pca"

# Run UMAP
merged_seurat <- RunUMAP(merged_seurat, 
                         dims = 1:40, 
                         reduction = "pca",
                         reduction.name = "umap.unintegrated"); beep()

# 分群
# Determine the K-nearest neighbor graph
merged_seurat <- FindNeighbors(merged_seurat, 
                               dims = 1:40, 
                               reduction = "pca")
merged_seurat <- FindClusters(merged_seurat, 
                              cluster.name = "unintegrated_clusters")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 48584
Number of edges: 1937300

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9298
Number of communities: 38
Elapsed time: 7 seconds

# Plot UMAP
p1 <- DimPlot(merged_seurat, 
              reduction = "umap.unintegrated",
              group.by = "samples")
p2 <- DimPlot(merged_seurat, 
              reduction = "umap.unintegrated",
              split.by = "samples")
plot_grid(p1, p2, 
          ncol = 1, labels = "AUTO")

# 由于样本数较多，我们再按照样本类型“Normal” vs. “Cancer”画一下UMAP图
merged_seurat$groups <- ifelse(merged_seurat$samples == "N01", "Normal", "Cancer")
p1a <- DimPlot(merged_seurat, 
               reduction = "umap.unintegrated",
               group.by = "groups")
p2a <- DimPlot(merged_seurat, 
               reduction = "umap.unintegrated",
               split.by = "groups")
plot_grid(p1a, p2a, 
          ncol = 1, labels = "AUTO")

5.2 整合

这里我们用Harmony整合算法。

# 整合，比较耗时间，进度条会一直显示0%直至运算完成
seurat_integrated <- IntegrateLayers(object = merged_seurat,
                                     method = HarmonyIntegration,
                                     assay = "SCT", # Integrating SCTransformed data
                                     orig.reduction = "pca",
                                     verbose = FALSE); beep()


# 整合后合并RNA layer
seurat_integrated[["RNA"]] <- JoinLayers(seurat_integrated[["RNA"]])

# 查看整合后的降维信息
names(seurat_integrated@reductions)

[1] "pca"               "umap.unintegrated" "harmony"

5.3 整合后检验细胞分群情况

set.seed(123456)
# Run UMAP
seurat_integrated <- RunUMAP(seurat_integrated, 
                             dims = 1:40,
                             reduction = "harmony", # 更改降维来源为整合后的"harmony"
                             reduction.name = "umap.integrated"); beep()
names(seurat_integrated@reductions)

[1] "pca"               "umap.unintegrated" "harmony"          
[4] "umap.integrated"

# 分群
seurat_integrated <- FindNeighbors(seurat_integrated, 
                                   dims = 1:40, 
                                   reduction = "harmony") #更改降维来源为"harmony"
seurat_integrated <- FindClusters(seurat_integrated, 
                                  cluster.name = "integrated_clusters")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 48584
Number of edges: 2161498

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9121
Number of communities: 24
Elapsed time: 8 seconds

colnames(seurat_integrated@meta.data)

 [1] "orig.ident"            "nCount_RNA"            "nFeature_RNA"         
 [4] "samples"               "log10GenesPerUMI"      "mitoRatio"            
 [7] "cells"                 "nCount_SCT"            "nFeature_SCT"         
[10] "S.Score"               "G2M.Score"             "Phase"                
[13] "mitoFr"                "unintegrated_clusters" "seurat_clusters"      
[16] "groups"                "integrated_clusters"

# Plot UMAP                             
p3 <- DimPlot(seurat_integrated, 
              reduction = "umap.integrated", 
              group.by = "samples")
p4 <- DimPlot(seurat_integrated, 
              reduction = "umap.integrated", 
              split.by = "samples")
plot_grid(p1, p3, p2, p4, 
          ncol = 1, 
          labels = c("Before Harmony", "After Harmony", 
                     "Before Harmony", "After Harmony"))

# 保存
saveRDS(seurat_integrated, 
        file = "output/sc_supplementary/GSE150430_seurat_integrated.RDS"); beep()

6 聚类

rm(list = ls())
gc()

             used   (Mb) gc trigger    (Mb) limit (Mb)   max used    (Mb)
Ncells    8422866  449.9   15695064   838.3         NA   15695064   838.3
Vcells 1189105980 9072.2 3232710770 24663.7     102400 3232709930 24663.7

seurat_integrated <- readRDS("output/sc_supplementary/GSE150430_seurat_integrated.RDS")

# Determine the clusters for various resolutions                                
seurat_integrated <- FindClusters(seurat_integrated,
                                  resolution = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1),
                                  verbose = F); beep()
# Explore resolutions
head(seurat_integrated@meta.data, 5)

                       orig.ident nCount_RNA nFeature_RNA samples
N01_CAGAATCAGTAATCCC.1        N01   3470.970         4173     N01
N01_TGCGGGTAGTACGTAA.1        N01   3155.725         2994     N01
N01_AAACCTGAGGGTTTCT.1        N01   2347.034         1946     N01
N01_AAACCTGAGTTCCACA.1        N01   2362.495         1993     N01
N01_AAACCTGGTTCTCATT.1        N01   2130.465         1459     N01
                       log10GenesPerUMI   mitoRatio                  cells
N01_CAGAATCAGTAATCCC.1        1.0225953 0.008644269 N01_CAGAATCAGTAATCCC.1
N01_TGCGGGTAGTACGTAA.1        0.9934705 0.009895666 N01_TGCGGGTAGTACGTAA.1
N01_AAACCTGAGGGTTTCT.1        0.9758564 0.015183419 N01_AAACCTGAGGGTTTCT.1
N01_AAACCTGAGTTCCACA.1        0.9781039 0.014965958 N01_AAACCTGAGTTCCACA.1
N01_AAACCTGGTTCTCATT.1        0.9506023 0.017437508 N01_AAACCTGGTTCTCATT.1
                       nCount_SCT nFeature_SCT     S.Score    G2M.Score Phase
N01_CAGAATCAGTAATCCC.1       1840         1674 -0.05179250 -0.070444626    G1
N01_TGCGGGTAGTACGTAA.1       2840         2604 -0.04341940 -0.074302336    G1
N01_AAACCTGAGGGTTTCT.1       2443         1931 -0.02396942 -0.007239586    G1
N01_AAACCTGAGTTCCACA.1       2488         1972 -0.01294260 -0.021167805    G1
N01_AAACCTGGTTCTCATT.1       2104         1451 -0.05440456 -0.049531592    G1
                            mitoFr unintegrated_clusters seurat_clusters groups
N01_CAGAATCAGTAATCCC.1         Low                    20              20 Normal
N01_TGCGGGTAGTACGTAA.1         Low                    32              20 Normal
N01_AAACCTGAGGGTTTCT.1        High                     0               0 Normal
N01_AAACCTGAGTTCCACA.1 Medium high                     1               3 Normal
N01_AAACCTGGTTCTCATT.1        High                     1               8 Normal
                       integrated_clusters SCT_snn_res.0.01 SCT_snn_res.0.05
N01_CAGAATCAGTAATCCC.1                  18                2                2
N01_TGCGGGTAGTACGTAA.1                  18                2                2
N01_AAACCTGAGGGTTTCT.1                   0                1                1
N01_AAACCTGAGTTCCACA.1                   0                1                1
N01_AAACCTGGTTCTCATT.1                   7                1                1
                       SCT_snn_res.0.1 SCT_snn_res.0.2 SCT_snn_res.0.3
N01_CAGAATCAGTAATCCC.1               2               2               1
N01_TGCGGGTAGTACGTAA.1               2               2               1
N01_AAACCTGAGGGTTTCT.1               0               0               0
N01_AAACCTGAGTTCCACA.1               0               0               0
N01_AAACCTGGTTCTCATT.1               0               0               0
                       SCT_snn_res.0.4 SCT_snn_res.0.5 SCT_snn_res.0.8
N01_CAGAATCAGTAATCCC.1               1              15              18
N01_TGCGGGTAGTACGTAA.1               1              15              18
N01_AAACCTGAGGGTTTCT.1               0               0               0
N01_AAACCTGAGTTCCACA.1               0               0               0
N01_AAACCTGGTTCTCATT.1               0               8               7
                       SCT_snn_res.1
N01_CAGAATCAGTAATCCC.1            20
N01_TGCGGGTAGTACGTAA.1            20
N01_AAACCTGAGGGTTTCT.1             0
N01_AAACCTGAGTTCCACA.1             3
N01_AAACCTGGTTCTCATT.1             8

# 查看各个分辨率下的细胞分群情况
select(seurat_integrated@meta.data, starts_with(match = "SCT_snn_res.")) %>%  
  lapply(levels)

$SCT_snn_res.0.01
[1] "0" "1" "2" "3" "4" "5"

$SCT_snn_res.0.05
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

$SCT_snn_res.0.1
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

$SCT_snn_res.0.2
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

$SCT_snn_res.0.3
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[16] "15" "16"

$SCT_snn_res.0.4
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[16] "15" "16" "17"

$SCT_snn_res.0.5
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[16] "15" "16" "17" "18" "19" "20"

$SCT_snn_res.0.8
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[16] "15" "16" "17" "18" "19" "20" "21" "22" "23"

$SCT_snn_res.1
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[16] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25"

绘制聚类树展示不同分辨率下的细胞分群情况及相互关系

tree <- clustree(seurat_integrated@meta.data, 
                 prefix = "SCT_snn_res.")#指定包含聚类信息的列
tree

通常情况下，为了决定合适的聚类分辨率，可使用以下两种策略：

选择透明箭头出现较少的分辨率聚类结果
基于marker基因表达选择有生物学意义的分辨率聚类结果

接下来分析，按照分辨率为0.5进行

Idents(seurat_integrated) <- "SCT_snn_res.0.5"

聚类可视化

# Plot the UMAP
DimPlot(seurat_integrated,
        reduction = "umap.integrated",
        label = FALSE)

7 细胞分群质量评估

7.1 分析样本类型是否影响细胞分群

# 先简单查看不同cluster的细胞数
table(seurat_integrated@active.ident)


    0     1     2     3     4     5     6     7     8     9    10    11    12 
10830  5756  5526  5321  4161  3765  2388  2365  1952  1478   986   792   759 
   13    14    15    16    17    18    19    20 
  707   398   388   232   214   198   189   179

# 查看不同样本类型中的细胞分群情况
DimPlot(seurat_integrated, 
        reduction = "umap.integrated",
        label = TRUE, 
        split.by = "groups") + 
  NoLegend()

7.2 分析细胞周期是否影响细胞分群

# Explore whether clusters segregate by cell cycle phase
DimPlot(seurat_integrated,
        reduction = "umap.integrated",
        label = TRUE, 
        split.by = "Phase") + 
  NoLegend()

7.3 分析其他非期望变异来源是否会影响细胞分群

# Determine metrics to plot present in seurat_clustered@meta.data
metrics <-  c("nCount_RNA", "nFeature_RNA", "S.Score", "G2M.Score", "mitoRatio")

FeaturePlot(seurat_integrated, 
            reduction = "umap.integrated", 
            features = metrics,
            pt.size = 0.4, 
            order = TRUE,
            min.cutoff = 'q10', 
            label = TRUE)

7.4 保存

saveRDS(seurat_integrated, 
        file = "output/sc_supplementary/GSE150430_seurat_clustered.RDS"); beep()

8 细胞注释

8.1 探索cell type markers的表达

这里的marker基因选用此前章节中的细胞初步分群通用marker。

rm(list = ls())
gc()

             used   (Mb) gc trigger    (Mb) limit (Mb)   max used    (Mb)
Ncells    8508819  454.5   15695064   838.3         NA   15695064   838.3
Vcells 1006102702 7676.0 3232710770 24663.7     102400 3232709930 24663.7

seurat_clustered <- readRDS("output/sc_supplementary/GSE150430_seurat_clustered.RDS")

genes_to_check = c('PTPRC', 
                   "CD163","AIF1", 
                   'CD3D', 'CD3E', 'CD4', 'CD8A', 
                   'CD19', 'CD79A', 'MS4A1', "SDC1", "CD27", "CD38",
                   'IGHG1', 'MZB1', 'SDC1', "JCHAIN", 
                   'CD68', 'CD163', 'CD14', 
                   'S100A9', 'S100A8', 'MMP19',
                   'C1QA',  'C1QB', 
                   'TPSAB1', 'TPSB2', 
                   'KLRB1', "KLRD1", 'NCR1', "GNLY", "NKG7", 
                   'FGF7', 'MME', 'ACTA2', "COL3A1", 
                   'PECAM1', 'VWF', 
                   'EPCAM', 'KRT19', 'PROM1', 'ALDH1A1') |> unique()

DotPlot(seurat_clustered, 
        features = genes_to_check) + 
  coord_flip()

8.2 手动注释

seurat_clustered <- RenameIdents(seurat_clustered, 
                                 "0" = "B cells",
                                 "1" = "T cells",
                                 "2" = "Epithelial/cancer",
                                 "3" = "T cells",
                                 "4" = "T cells",
                                 "5" = "Macrophages",
                                 "6" = "T cells",
                                 "7" = "Plasma cells",
                                 "8" = "B cells",
                                 "9" = "Epithelial/cancer",
                                 "10" = "B cells",
                                 "11" = "B cells",
                                 "12" = "Unknown",
                                 "13" = "NK cells",
                                 "14" = "Unknown",
                                 "15" = "Epithelial/cancer",
                                 "16" = "Myeloid (unspecific)",
                                 "17" = "Epithelial/cancer",
                                 "18" = "Epithelial/cancer",
                                 "19" = "Epithelial/cancer",
                                 "20" = "B cells")
table(Idents(seurat_clustered))


             B cells              T cells    Epithelial/cancer 
               14739                17626                 7993 
         Macrophages         Plasma cells              Unknown 
                3765                 2365                 1157 
            NK cells Myeloid (unspecific) 
                 707                  232

# Plot the UMAP
DimPlot(seurat_clustered, 
        reduction = "umap.integrated", 
        label = T)

在注释好的数据中再次检查marker基因的表达情况：

DotPlot(seurat_clustered, 
        features = genes_to_check) + 
  coord_flip() +
  theme(axis.text.x = element_text(angle = 30,vjust = 0.85,hjust = 0.75)
)

因为接下来我们要提取髓系细胞进一步分析，所以通过几种可视化方法检查一下髓系细胞通用marker（CD163和AIF1）的表达情况：

RidgePlot(seurat_clustered, features = c("CD163", "AIF1"), ncol = 1)

VlnPlot(seurat_clustered, features = c("CD163", "AIF1"))

FeaturePlot(seurat_clustered, features = c("CD163", "AIF1"), label = T)

保存

saveRDS(seurat_clustered, 
        file = "output/sc_supplementary/GSE150430_seurat_annotated.RDS"); beep()

9 提取髓系细胞

sub_seurat <- subset(seurat_clustered, 
                     idents = c("Macrophages", 
                                "Myeloid (unspecific)"))
sub_seurat

An object of class Seurat 
46357 features across 3997 samples within 2 assays 
Active assay: SCT (21637 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 4 dimensional reductions calculated: pca, umap.unintegrated, harmony, umap.integrated

colnames(sub_seurat@meta.data)

 [1] "orig.ident"            "nCount_RNA"            "nFeature_RNA"         
 [4] "samples"               "log10GenesPerUMI"      "mitoRatio"            
 [7] "cells"                 "nCount_SCT"            "nFeature_SCT"         
[10] "S.Score"               "G2M.Score"             "Phase"                
[13] "mitoFr"                "unintegrated_clusters" "seurat_clusters"      
[16] "groups"                "integrated_clusters"   "SCT_snn_res.0.01"     
[19] "SCT_snn_res.0.05"      "SCT_snn_res.0.1"       "SCT_snn_res.0.2"      
[22] "SCT_snn_res.0.3"       "SCT_snn_res.0.4"       "SCT_snn_res.0.5"      
[25] "SCT_snn_res.0.8"       "SCT_snn_res.1"

myeloid_seurat <- CreateSeuratObject(counts = sub_seurat@assays[["RNA"]],
                                     meta.data = sub_seurat@meta.data[,c(1:7, 10:13, 16)])
myeloid_seurat

An object of class Seurat 
24720 features across 3997 samples within 1 assay 
Active assay: RNA (24720 features, 0 variable features)
 1 layer present: counts

Warning

可以看到在重新创建Seurat之后，默认的“RNA” assay之中只有一个layer。为了后面执行SCTransform及整合，这里需要将其按照样本拆分成不同的layers。

myeloid_seurat[["RNA"]] <- split(myeloid_seurat[["RNA"]], 
                                 f = myeloid_seurat$samples)
myeloid_seurat

An object of class Seurat 
24720 features across 3997 samples within 1 assay 
Active assay: RNA (24720 features, 0 variable features)
 16 layers present: counts.N01, counts.P01, counts.P02, counts.P03, counts.P04, counts.P05, counts.P06, counts.P07, counts.P08, counts.P09, counts.P10, counts.P11, counts.P12, counts.P13, counts.P14, counts.P15

# 补充原细胞注释信息
myeloid_seurat$old_clusters <- Idents(sub_seurat)
table(myeloid_seurat$old_clusters)


         Macrophages Myeloid (unspecific) 
                3765                  232

# 保存
saveRDS(myeloid_seurat, file = "output/sc_supplementary/GSE150430_myeloid_seurat.RDS")

如何提取并处理目标细胞亚群

After we have completed the scRNA-seq workflow and identified the various cell types present in our samples, we might decide that for a particular cell type, we would like to identify subtypes. For example, if we have a large cluster of CD4+ Helper T cells, we may want to identify subsets of Th1, Th2, Th17, Th9, and Tfh cells. To subset the dataset, Seurat has a handy subset()function（见Seurat常用函数清单）.

To perform the subclustering, there are a couple of different methods you could try. The easiest would be to run the FindNeighbors() and FindClusters() on the subsetted cells, adjusting the resolution to give you the optimal clustering. However, with this approach you are not redefining the most variable genes used to find clusters, so it might not work if the genes delineating these subsets are not those driving any of the top PCs used for the clustering.

Alternatively, we could start over with the raw counts for this subset of cells and run SCTransform()to determine the greatest sources of variation present. This would allow us to focus our clustering on the most variant genes present among our subset of cells. Hopefully, the most variant genes are those driving the various desired subsets (e.g. Th1, Th2, Th17, Th9, and Tfh cells). If integration is necessary, then this step would still need to be performed.

Since subsetting the dataset can result in a much smaller number of cells, it is important to consider the total number of cells you are looking to cluster and some of the parameters that might be affected by the small numbers. For example, if integrating, there is a ‘K’ number of cells used for determining the neighborhoods for identifying and filtering anchors. Therefore, if your integration isn’t very good for a small dataset, you might want to consider lowering the ‘K’ parameter (k.filter、k.weight) inIntegrateLayers. However, if ‘K’ is too small, it could also lead to poor integration.

（来自：Subclustering scRNA-seq datasets）

下一节，针对髓系细胞进一步分群。

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustree_0.5.1     ggraph_2.1.0       dplyr_1.1.4        cowplot_1.1.2     
[5] beepr_1.3          ggplot2_3.4.4      Seurat_5.0.1       SeuratObject_5.0.1
[9] sp_2.1-2          

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.21            splines_4.3.2              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] tibble_3.2.1                polyclip_1.10-6            
  [7] fastDummies_1.7.3           lifecycle_1.0.4            
  [9] globals_0.16.2              lattice_0.22-5             
 [11] MASS_7.3-60.0.1             backports_1.4.1            
 [13] magrittr_2.0.3              plotly_4.10.4              
 [15] rmarkdown_2.25              yaml_2.3.8                 
 [17] httpuv_1.6.13               glmGamPoi_1.14.0           
 [19] sctransform_0.4.1           spam_2.10-0                
 [21] spatstat.sparse_3.0-3       reticulate_1.34.0          
 [23] pbapply_1.7-2               RColorBrewer_1.1-3         
 [25] abind_1.4-5                 zlibbioc_1.48.0            
 [27] audio_0.1-11                Rtsne_0.17                 
 [29] GenomicRanges_1.54.1        purrr_1.0.2                
 [31] BiocGenerics_0.48.1         RCurl_1.98-1.14            
 [33] tweenr_2.0.2                GenomeInfoDbData_1.2.11    
 [35] IRanges_2.36.0              S4Vectors_0.40.2           
 [37] ggrepel_0.9.5               irlba_2.3.5.1              
 [39] listenv_0.9.0               spatstat.utils_3.0-4       
 [41] goftest_1.2-3               RSpectra_0.16-1            
 [43] spatstat.random_3.2-2       fitdistrplus_1.1-11        
 [45] parallelly_1.36.0           DelayedMatrixStats_1.24.0  
 [47] leiden_0.4.3.1              codetools_0.2-19           
 [49] DelayedArray_0.28.0         ggforce_0.4.1              
 [51] tidyselect_1.2.0            farver_2.1.1               
 [53] viridis_0.6.4               matrixStats_1.2.0          
 [55] stats4_4.3.2                spatstat.explore_3.2-5     
 [57] jsonlite_1.8.8              ellipsis_0.3.2             
 [59] tidygraph_1.3.0             progressr_0.14.0           
 [61] ggridges_0.5.5              survival_3.5-7             
 [63] tools_4.3.2                 ica_1.0-3                  
 [65] Rcpp_1.0.12                 glue_1.7.0                 
 [67] gridExtra_2.3               SparseArray_1.2.3          
 [69] xfun_0.41                   mgcv_1.9-1                 
 [71] MatrixGenerics_1.14.0       GenomeInfoDb_1.38.5        
 [73] withr_3.0.0                 fastmap_1.1.1              
 [75] fansi_1.0.6                 digest_0.6.34              
 [77] R6_2.5.1                    mime_0.12                  
 [79] colorspace_2.1-0            scattermore_1.2            
 [81] tensor_1.5                  spatstat.data_3.0-4        
 [83] RhpcBLASctl_0.23-42         utf8_1.2.4                 
 [85] tidyr_1.3.0                 generics_0.1.3             
 [87] data.table_1.14.10          graphlayouts_1.0.2         
 [89] httr_1.4.7                  htmlwidgets_1.6.4          
 [91] S4Arrays_1.2.0              uwot_0.1.16                
 [93] pkgconfig_2.0.3             gtable_0.3.4               
 [95] lmtest_0.9-40               XVector_0.42.0             
 [97] htmltools_0.5.7             dotCall64_1.1-1            
 [99] scales_1.3.0                Biobase_2.62.0             
[101] png_0.1-8                   harmony_1.2.0              
[103] knitr_1.45                  rstudioapi_0.15.0          
[105] reshape2_1.4.4              checkmate_2.3.1            
[107] nlme_3.1-164                zoo_1.8-12                 
[109] stringr_1.5.1               KernSmooth_2.23-22         
[111] vipor_0.4.7                 parallel_4.3.2             
[113] miniUI_0.1.1.1              ggrastr_1.0.2              
[115] pillar_1.9.0                grid_4.3.2                 
[117] vctrs_0.6.5                 RANN_2.6.1                 
[119] promises_1.2.1              xtable_1.8-4               
[121] cluster_2.1.6               beeswarm_0.4.0             
[123] evaluate_0.23               cli_3.6.2                  
[125] compiler_4.3.2              rlang_1.1.3                
[127] crayon_1.5.2                future.apply_1.11.1        
[129] labeling_0.4.3              ggbeeswarm_0.7.2           
[131] plyr_1.8.9                  stringi_1.8.3              
[133] viridisLite_0.4.2           deldir_2.0-2               
[135] munsell_0.5.0               lazyeval_0.2.2             
[137] spatstat.geom_3.2-7         Matrix_1.6-5               
[139] RcppHNSW_0.5.0              patchwork_1.2.0            
[141] sparseMatrixStats_1.14.0    future_1.33.1              
[143] shiny_1.8.0                 SummarizedExperiment_1.32.0
[145] ROCR_1.0-11                 igraph_1.6.0

References

Gong, Lanqi, Jie Luo, Yu Zhang, Yuma Yang, Shanshan Li, Xiaona Fang, Baifeng Zhang, et al. 2023. “Nasopharyngeal Carcinoma Cells Promote Regulatory T Cell Development and Suppressive Activity via CD70-CD27 Interaction.” Nature Communications 14 (1). https://doi.org/10.1038/s41467-023-37614-6.