寻找不同样本类型间同一细胞类型内的差异基因

参考原文:Introduction to scRNA-seq integration Differential expression testing

原文发布日期:2023年10月31日

1 直接通过FindMarkers寻找差异基因

1.1 数据导入

导入我们在整合(integration)中完成SCTransform和整合的单细胞数据。这里的meta.data已经提前注释好了细胞类型(储存在”seurat_annotations”列中)。

library(Seurat)
ifnb_integrated <- readRDS("output/seurat_official/ifnb_integrated.rds")
ifnb_integrated
An object of class Seurat 
27359 features across 13999 samples within 2 assays 
Active assay: SCT (13306 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 3 dimensional reductions calculated: pca, umap, integrated.dr
head(ifnb_integrated, 3)
                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
                 nCount_SCT nFeature_SCT SCT_snn_res.0.6 seurat_clusters
AAACATACATTTCC.1       1944          852               1               1
AAACATACCAGAAA.1       1888          699               9               9
AAACATACCTCGCT.1       1951          804               1               1

1.2 运行PrepSCTFindMarkers(),预处理SCT assay

ifnb_integrated <- PrepSCTFindMarkers(ifnb_integrated)
Important

对于经过SCTransform归一化处理后的单细胞数据,在进行差异分析之前,需要先运行PrepSCTFindMarkers(),来预处理SCT assay。详细解释见此链接

如果是基于NormalizeData标准化的单细胞数据,需要使用”RNA” assay进行差异分析(见下面的数据读取和预处理),如果不是,需要通过DefaultAssay(ifnb) <- "RNA"进行设定。Note that the raw and normalized counts are stored in the counts and data layers of RNA assay. By default, the functions for finding markers will use normalized data. 关于FindMarkers为什么要使用”RNA” assay,参阅:https://github.com/hbctraining/scRNA-seq_online/issues/58

1.3 准备差异表达分析所需变量

We create a column in the meta.data slot to hold both the cell type and treatment information and switch the current Idents to that column.

ifnb_integrated$celltype.stim <- paste(ifnb_integrated$seurat_annotations, 
                                       ifnb_integrated$stim, 
                                       sep = "_")
unique(ifnb_integrated$celltype.stim)
 [1] "CD14 Mono_CTRL"    "pDC_CTRL"          "CD4 Memory T_CTRL"
 [4] "T activated_CTRL"  "CD4 Naive T_CTRL"  "CD8 T_CTRL"       
 [7] "Mk_CTRL"           "B Activated_CTRL"  "B_CTRL"           
[10] "DC_CTRL"           "CD16 Mono_CTRL"    "NK_CTRL"          
[13] "Eryth_CTRL"        "CD8 T_STIM"        "pDC_STIM"         
[16] "CD4 Naive T_STIM"  "B_STIM"            "CD14 Mono_STIM"   
[19] "T activated_STIM"  "CD4 Memory T_STIM" "B Activated_STIM" 
[22] "NK_STIM"           "DC_STIM"           "CD16 Mono_STIM"   
[25] "Mk_STIM"           "Eryth_STIM"       
Idents(ifnb_integrated) <- "celltype.stim"

1.4 执行差异表达分析

Then we use FindMarkers() to find the genes that are different between control and stimulated B cells.

# 寻找对照组和刺激组之间在B细胞中的差异基因

b.interferon.response <- FindMarkers(ifnb_integrated, 
                                     ident.1 = "B_STIM", 
                                     ident.2 = "B_CTRL", 
                                     verbose = FALSE)
head(b.interferon.response, n = 15)
                p_val avg_log2FC pct.1 pct.2     p_val_adj
ISG15   1.505650e-159  5.1597242 0.998 0.229 2.003417e-155
IFIT3   4.128835e-154  6.2281506 0.961 0.052 5.493827e-150
IFI6    2.493139e-153  5.6458391 0.965 0.076 3.317371e-149
ISG20   9.385626e-152  3.1715979 1.000 0.666 1.248851e-147
IFIT1   2.447118e-139  6.2614957 0.904 0.029 3.256136e-135
MX1     2.111961e-124  4.1490465 0.900 0.115 2.810175e-120
LY6E    2.930420e-122  3.9278136 0.898 0.150 3.899217e-118
TNFSF10 1.104024e-112  6.8254288 0.785 0.020 1.469014e-108
IFIT2   3.491368e-108  5.5668205 0.783 0.037 4.645615e-104
B2M      3.405403e-98  0.6074989 1.000 1.000  4.531229e-94
IRF7     1.114291e-96  3.3721350 0.834 0.187  1.482675e-92
PLSCR1   3.364901e-96  4.0217697 0.783 0.111  4.477338e-92
UBE2L6   1.155610e-85  2.6732717 0.849 0.295  1.537655e-81
CXCL10   5.689834e-84  8.0923962 0.639 0.010  7.570893e-80
PSMB9    2.304426e-81  1.8684260 0.937 0.568  3.066269e-77
Warning

Please note that p-values obtained from this analysis should be interpreted with caution, as these tests treat each cell as an independent replicate, and ignore inherent correlations between cells originating from the same sample. Such analyses have been shown to find a large number of false positive associations, as has been demonstrated by (Squair et al. 2021)(Zimmerman, Espeland, and Langefeld 2021)(Junttila, Smolander, and Elo 2022), and others. As discussed here (Crowell et al. 2020), DE tests across multiple conditions should expressly utilize multiple samples/replicates, and can be performed after aggregating (‘pseudobulking’) cells from the same sample and subpopulation together. Below, we show how pseudobulking can be used to account for such within-sample correlation.

Warning

这里没有使用进行pseudobulking的原因是,本例中“ctrl”和“sim”组都分别只有一个重复:We do not perform this analysis here, as there is a single replicate in the data.

1.5 可视化差异基因表达

Another useful way to visualize these changes in gene expression is with the split.by option to the FeaturePlot() or VlnPlot() function. This will display FeaturePlots of the list of given genes, split by a grouping variable (stimulation condition here).

FeaturePlot(ifnb_integrated, 
            features = c("CD3D", "GNLY", "IFI6", "ISG15", "CD14", "CXCL10"), 
            split.by = "stim", 
            max.cutoff = 3, 
            cols = c("grey", "red"), 
            reduction = "umap")

plots <- VlnPlot(ifnb_integrated,
                 features = c("CD3D", "GNLY", "IFI6", "ISG15", "CD14", "CXCL10", "LYZ"),
                 split.by = "stim",
                 group.by = "seurat_annotations",
                 pt.size = 0,
                 combine = FALSE) # 由于VlnPlot绘制组图时没有图例,所以这里取消绘制组图
library(patchwork)
wrap_plots(plots = plots, ncol = 2) # 将plots列表组合成组图

  • Genes such as CD3D and GNLY are canonical cell type markers (for T cells and NK/CD8 T cells) that are virtually unaffected by interferon stimulation and display similar gene expression patterns in the control and stimulated group.
  • IFI6 and ISG15, on the other hand, are core interferon response genes and are upregulated accordingly in all cell types.
  • CD14 and CXCL10 are genes that show a cell type specific interferon response.
    • CD14 expression decreases after stimulation in CD14 monocyteswhich could lead to misclassification in a supervised analysis framework, underscoring the value of integrated analysis.如果用于识别细胞类型的marker本身在不同的样本类型(处理 vs. 对照、恶性组织 vs. 正常组织)中存在表达量的差异,那么就会导致对细胞类型判断的错误。而本篇的数据整合则能够避免出现这种情况。

    • CXCL10 shows a distinct upregulation in monocytes and B cells after interferon stimulation but not in other cell types.


2 pseudobulking后的差异分析

这一节为了演示pseudobulking,采用的仍然是“ifnb”数据集,仍然包括了“ctrl”和“stim”两个conditions,但是每个条件下都有多个重复,即多个来自不同患者的样本。同时,为了和Seurat官方教程一致,采用了NormalizeData对数据进行标准化。

2.1 数据读取和预处理

For demonstration purposes, we will be using the interferon-beta stimulated human PBMCs dataset (Kang et al. 2017) that is available via the SeuratData package.

rm(list = ls())
library(Seurat)
library(SeuratData)
InstallData("ifnb")
ifnb <- LoadData("ifnb")

从本地下载好的数据读取:

rm(list = ls())

library(Seurat)
ifnb <- readRDS("data/seurat_official/pbmc_ifnb.rds")
ifnb
An object of class Seurat 
14053 features across 13999 samples within 1 assay 
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data
head(ifnb@meta.data, 5)
                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
unique(ifnb$seurat_annotations) # 这里的数据已经提前注释好了细胞类型
 [1] CD14 Mono    pDC          CD4 Memory T T activated  CD4 Naive T 
 [6] CD8 T        Mk           B Activated  B            DC          
[11] CD16 Mono    NK           Eryth       
13 Levels: CD14 Mono CD4 Naive T CD4 Memory T CD16 Mono B CD8 T ... Eryth
# 标准化
ifnb <- NormalizeData(ifnb)
# 核对目前的默认assay,保证是RNA assay
DefaultAssay(ifnb)
[1] "RNA"

2.2 直接通过FindMarkers寻找差异基因

为了和pseudobulking后的差异分析结果进行比价,这里仍然先进行基于FindMarkers的简单差异表达分析,寻找ctrl和stim之间在CD14单核细胞中的差异基因。基本步骤和上面的准备差异表达分析所需变量一致。

ifnb$celltype.stim <- paste(ifnb$seurat_annotations, ifnb$stim, sep = "_")
unique(ifnb$celltype.stim)
 [1] "CD14 Mono_CTRL"    "pDC_CTRL"          "CD4 Memory T_CTRL"
 [4] "T activated_CTRL"  "CD4 Naive T_CTRL"  "CD8 T_CTRL"       
 [7] "Mk_CTRL"           "B Activated_CTRL"  "B_CTRL"           
[10] "DC_CTRL"           "CD16 Mono_CTRL"    "NK_CTRL"          
[13] "Eryth_CTRL"        "CD8 T_STIM"        "pDC_STIM"         
[16] "CD4 Naive T_STIM"  "B_STIM"            "CD14 Mono_STIM"   
[19] "T activated_STIM"  "CD4 Memory T_STIM" "B Activated_STIM" 
[22] "NK_STIM"           "DC_STIM"           "CD16 Mono_STIM"   
[25] "Mk_STIM"           "Eryth_STIM"       
Idents(ifnb) <- "celltype.stim"

mono.de <- FindMarkers(ifnb, 
                       ident.1 = "CD14 Mono_STIM", 
                       ident.2 = "CD14 Mono_CTRL", 
                       verbose = FALSE)
nrow(mono.de)
[1] 6956
head(mono.de, n = 10)
        p_val avg_log2FC pct.1 pct.2 p_val_adj
IFIT1       0   7.319139 0.985 0.033         0
CXCL10      0   8.036564 0.984 0.035         0
RSAD2       0   6.741673 0.988 0.045         0
TNFSF10     0   6.991279 0.989 0.047         0
IFIT3       0   6.883785 0.992 0.056         0
IFIT2       0   7.179929 0.961 0.039         0
CXCL11      0   8.624208 0.932 0.012         0
CCL8        0   9.134191 0.918 0.017         0
IDO1        0   5.455898 0.965 0.089         0
MX1         0   5.059052 0.960 0.093         0

2.3 执行pseudobulking

To pseudobulk, we will use AggregateExpression() to sum together gene counts of all the cells from the same sample for each cell type. This results in one gene expression profile per sample and cell type. We can then perform DE analysis using DESeq2 on the sample level. This treats the samples, rather than the individual cells, as independent observations.

First, we need to retrieve the sample information for each cell. This is not loaded in the metadata, so we will load it from the Github repo of the source data for the original paper.

```{r}
#| eval: false

# 从GitHub仓库读取(可能需要代理)
# load the inferred sample IDs of each cell
ctrl <- read.table(url("https://raw.githubusercontent.com/yelabucsf/demuxlet_paper_code/master/fig3/ye1.ctrl.8.10.sm.best"), head = T, stringsAsFactors = F)
stim <- read.table(url("https://raw.githubusercontent.com/yelabucsf/demuxlet_paper_code/master/fig3/ye2.stim.8.10.sm.best"), head = T, stringsAsFactors = F)
```
# 这里提前下载好了两个样本信息文件,所以直接从本地读取
ctrl <- readRDS("data/seurat_official/inferred_sample_ids_ctrl.rds")
stim <- readRDS("data/seurat_official/inferred_sample_ids_stim.rds")
info <- rbind(ctrl, stim)
info$BARCODE[1:5]
[1] "AAACATACAATGCC-1" "AAACATACATTTCC-1" "AAACATACCAGAAA-1" "AAACATACCAGCTA-1"
[5] "AAACATACCATGCA-1"
colnames(ifnb)[1:5]
[1] "AAACATACATTTCC.1" "AAACATACCAGAAA.1" "AAACATACCTCGCT.1" "AAACATACCTGGTA.1"
[5] "AAACATACGATGAA.1"
# 可以看到两者的barcode形式不一致
# rename the cell IDs by substituting the '-' into '.'
info$BARCODE <- gsub(pattern = "\\-", replacement = "\\.", info$BARCODE)
info$BARCODE[1:5]
[1] "AAACATACAATGCC.1" "AAACATACATTTCC.1" "AAACATACCAGAAA.1" "AAACATACCAGCTA.1"
[5] "AAACATACCATGCA.1"
# only keep the cells with high-confidence sample ID
info <- info[grep(pattern = "SNG", x = info$BEST), ]

# remove cells with duplicated IDs in both ctrl and stim groups
info <- info[!duplicated(info$BARCODE) & !duplicated(info$BARCODE, fromLast = T), ]

# now add the sample IDs to ifnb 
rownames(info) <- info$BARCODE
info <- info[, c("BEST"), drop = F]
names(info) <- c("donor_id")
ifnb <- AddMetaData(ifnb, metadata = info)

# remove cells without donor IDs
ifnb$donor_id[is.na(ifnb$donor_id)] <- "unknown"
ifnb <- subset(ifnb, subset = donor_id != "unknown")

可以看到,现在的meta.dat中多了样本信息列(donor_id),记录了每个细胞来自哪个患者:

head(ifnb@meta.data, 5)
                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
                     celltype.stim donor_id
AAACATACATTTCC.1    CD14 Mono_CTRL SNG-1016
AAACATACCAGAAA.1    CD14 Mono_CTRL SNG-1256
AAACATACCTCGCT.1    CD14 Mono_CTRL SNG-1256
AAACATACCTGGTA.1          pDC_CTRL SNG-1039
AAACATACGATGAA.1 CD4 Memory T_CTRL SNG-1488
table(ifnb$donor_id, ifnb$stim)
          
           CTRL STIM
  SNG-101   491  706
  SNG-1015 1583 1533
  SNG-1016  697  741
  SNG-1039  270  393
  SNG-107   331  321
  SNG-1244 1023  975
  SNG-1256 1160 1203
  SNG-1488  865 1376

按照治疗分组(STIM vs. CTRL)、患者IDs、细胞类型(seurat_annotations)3个条件,执行pseudobulking (AggregateExpression)。通过AggregateExpression命令将同一类型的细胞按照不同的处理条件合并起来,形成一个假的组织水平的测序数据。本例中,有2个治疗分组、13种细胞类型和8个患者,总共被合并成206个类别,将每一个类别看作是一个样本,这样就形成了一个所谓的假的组织水平的测序数据。

pseudo_ifnb <- AggregateExpression(ifnb, 
                                   assays = "RNA", 
                                   return.seurat = T, 
                                   group.by = c("stim", "donor_id", "seurat_annotations"))
pseudo_ifnb
An object of class Seurat 
14053 features across 206 samples within 1 assay 
Active assay: RNA (14053 features, 0 variable features)
 3 layers present: counts, data, scale.data
head(pseudo_ifnb@meta.data) # 可以看到现在的表达矩阵的barcode变成了治疗分组+患者IDs+细胞类型
                                         orig.ident stim donor_id
CTRL_SNG-101_CD14 Mono       CTRL_SNG-101_CD14 Mono CTRL  SNG-101
CTRL_SNG-101_CD4 Naive T   CTRL_SNG-101_CD4 Naive T CTRL  SNG-101
CTRL_SNG-101_CD4 Memory T CTRL_SNG-101_CD4 Memory T CTRL  SNG-101
CTRL_SNG-101_CD16 Mono       CTRL_SNG-101_CD16 Mono CTRL  SNG-101
CTRL_SNG-101_B                       CTRL_SNG-101_B CTRL  SNG-101
CTRL_SNG-101_CD8 T               CTRL_SNG-101_CD8 T CTRL  SNG-101
                          seurat_annotations
CTRL_SNG-101_CD14 Mono             CD14 Mono
CTRL_SNG-101_CD4 Naive T         CD4 Naive T
CTRL_SNG-101_CD4 Memory T       CD4 Memory T
CTRL_SNG-101_CD16 Mono             CD16 Mono
CTRL_SNG-101_B                             B
CTRL_SNG-101_CD8 T                     CD8 T

然后和此前一样,我们在meta.data中增加一列,记录治疗分组(STIM vs. CTRL)+ 细胞类型,这是用于差异分析的分组依据。

pseudo_ifnb$celltype.stim <- paste(pseudo_ifnb$seurat_annotations, 
                                   pseudo_ifnb$stim, 
                                   sep = "_")
unique(pseudo_ifnb$celltype.stim)
 [1] "CD14 Mono_CTRL"    "CD4 Naive T_CTRL"  "CD4 Memory T_CTRL"
 [4] "CD16 Mono_CTRL"    "B_CTRL"            "CD8 T_CTRL"       
 [7] "T activated_CTRL"  "NK_CTRL"           "DC_CTRL"          
[10] "B Activated_CTRL"  "Mk_CTRL"           "pDC_CTRL"         
[13] "Eryth_CTRL"        "CD14 Mono_STIM"    "CD4 Naive T_STIM" 
[16] "CD4 Memory T_STIM" "CD16 Mono_STIM"    "B_STIM"           
[19] "CD8 T_STIM"        "T activated_STIM"  "NK_STIM"          
[22] "DC_STIM"           "B Activated_STIM"  "Mk_STIM"          
[25] "pDC_STIM"          "Eryth_STIM"       

2.4 执行差异分析

Next, we perform DE testing on the pseudobulk level for CD14 monocytes, and compare it against the previous single-cell-level DE results.

安装DESeq2

由于pseudobulking后的FindMarkers差异分析需要采用DESeq2包提供的方法,所以需要提前安装DESeq2包:

```{r}
#| eval: false
BiocManager::install("DESeq2")
```

注意,如果是用FindMarkers来寻找细胞类型的marker基因,则一般采用默认的Wilcoxon Rank Sum Test方法,这时需要调用的是presto 包(见FindMarkers-在特定cluster之间寻找marker基因)。

Idents(pseudo_ifnb) <- "celltype.stim"

bulk.mono.de <- FindMarkers(pseudo_ifnb, 
                            ident.1 = "CD14 Mono_STIM", 
                            ident.2 = "CD14 Mono_CTRL",
                            test.use = "DESeq2") # 指定差异分析方法为"DESeq2"
head(bulk.mono.de, n = 15)
                 p_val avg_log2FC pct.1 pct.2     p_val_adj
IL1RN    3.701542e-275   6.160156     1     1 5.201777e-271
IFITM2   1.955626e-250   4.318976     1     1 2.748242e-246
SSB      2.699554e-203   3.066647     1     1 3.793683e-199
NT5C3A   2.239898e-198   5.412972     1     1 3.147729e-194
RTCB     5.700554e-162   3.133362     1     1 8.010989e-158
RABGAP1L 4.743010e-161   5.562364     1     1 6.665352e-157
DYNLT1   9.735640e-159   2.402726     1     1 1.368150e-154
PLSCR1   3.191691e-146   2.676047     1     1 4.485284e-142
ISG20    9.664488e-145   5.443114     1     1 1.358151e-140
NAPA     2.858013e-144   1.977719     1     1 4.016365e-140
DDX58    5.957026e-142   4.640111     1     1 8.371409e-138
HERC5    6.333722e-133   5.266515     1     1 8.900780e-129
OASL     3.892853e-130   3.946745     1     1 5.470627e-126
EIF2AK2  6.636434e-128   3.940167     1     1 9.326180e-124
TMEM50A  6.731955e-117   1.355947     1     1 9.460416e-113

We also support many other DE tests using other methods. For completeness, the following tests are currently supported:

  • “wilcox” : Wilcoxon rank sum test (default, using ‘presto’ package)

  • “wilcox_limma” : Wilcoxon rank sum test (using ‘limma’ package)

  • “bimod” : Likelihood-ratio test for single cell feature expression, (McDavid et al., Bioinformatics, 2013)

  • “roc” : Standard AUC classifier

  • “t” : Student’s t-test

  • “poisson” : Likelihood ratio test assuming an underlying negative binomial distribution. Use only for UMI-based datasets

  • “negbinom” : Likelihood ratio test assuming an underlying negative binomial distribution. Use only for UMI-based datasets

  • “LR” : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

  • “MAST” : GLM-framework that treates cellular detection rate as a covariate (Finak et al, Genome Biology, 2015) (Installation instructions)

  • “DESeq2” : DE based on a model using the negative binomial distribution (Love et al, Genome Biology, 2014) (Installation instructions) For MAST and DESeq2, please ensure that these packages are installed separately in order to use them as part of Seurat. Once installed, use the test.use parameter can be used to specify which DE test to use.

# Test for DE features using the MAST package
# BiocManager::install('limma')
Idents(ifnb) <- "seurat_annotations"
head(FindMarkers(ifnb, 
                 ident.1 = "CD14 Mono", 
                 ident.2 = "CD16 Mono", 
                 test.use = "wilcox_limma"))
               p_val avg_log2FC pct.1 pct.2     p_val_adj
VMO1    0.000000e+00  -5.689802 0.084 0.777  0.000000e+00
MS4A4A  0.000000e+00  -3.356037 0.141 0.747  0.000000e+00
FCGR3A  0.000000e+00  -3.279465 0.418 0.982  0.000000e+00
MS4A7   0.000000e+00  -2.390652 0.557 0.978  0.000000e+00
RPS19   0.000000e+00  -1.321132 0.965 0.999  0.000000e+00
FTL    1.636254e-307   1.318127 1.000 1.000 2.299427e-303

2.5 比较单细胞水平和pseudobulk水平的差异表达分析

接下来,我们可以比较一下单细胞水平的差异表达分析的P值和pseudobulk水平的P值:

names(bulk.mono.de) <- paste0(names(bulk.mono.de), ".bulk") # 重命名列
bulk.mono.de$gene <- rownames(bulk.mono.de)

names(mono.de) <- paste0(names(mono.de), ".sc")
mono.de$gene <- rownames(mono.de)

merge_dat <- merge(mono.de, bulk.mono.de, by = "gene")
merge_dat <- merge_dat[order(merge_dat$p_val.bulk), ]

# 查看在两种差异分析方法中P值都有意义的基因名
common <- merge_dat$gene[which(merge_dat$p_val.bulk < 0.05 & 
                                merge_dat$p_val.sc < 0.05)]
# 查看在pseudobulk水平P>0.05但是在单细胞水平P<0.05的基因名:
only_sc <- merge_dat$gene[which(merge_dat$p_val.bulk > 0.05 & 
                                  merge_dat$p_val.sc < 0.05)]
# 查看在pseudobulk水平P<0.05但是在单细胞水平P>0.05的基因名:
only_bulk <- merge_dat$gene[which(merge_dat$p_val.bulk < 0.05 & 
                                    merge_dat$p_val.sc > 0.05)]
print(paste0('# 在两种差异分析方法中P值都<0.05的基因有: ',length(common), "个"))
[1] "# 在两种差异分析方法中P值都<0.05的基因有: 3519个"
print(paste0('# 仅在细胞水平差异分析中P值<0.05的基因有: ',length(only_sc), "个"))
[1] "# 仅在细胞水平差异分析中P值<0.05的基因有: 1649个"
print(paste0('# 仅在pseudobulk差异分析中P值<0.05的基因有: ',length(only_bulk), "个"))
[1] "# 仅在pseudobulk差异分析中P值<0.05的基因有: 204个"

We can see that while the p-values are correlated between the single-cell and pseudobulk data, the single-cell p-values are often smaller and suggest higher levels of significance. In particular, there are 3,519 genes with evidence of differential expression (prior to multiple hypothesis testing) in both analyses, 1,649 genes that only appear to be differentially expressed in the single-cell analysis, and just 204 genes that only appear to be differentially expressed in the bulk analysis.

接下来,我们通过小提琴图来检查两种方法中的Top共同差异基因在在刺激组和对照组的表达水平:

# create a new column to annotate sample-condition-celltype in the single-cell dataset
ifnb$donor_id.stim <- paste0(ifnb$stim, "-", ifnb$donor_id)
head(ifnb@meta.data)
                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono
                     celltype.stim donor_id donor_id.stim
AAACATACATTTCC.1    CD14 Mono_CTRL SNG-1016 CTRL-SNG-1016
AAACATACCAGAAA.1    CD14 Mono_CTRL SNG-1256 CTRL-SNG-1256
AAACATACCTCGCT.1    CD14 Mono_CTRL SNG-1256 CTRL-SNG-1256
AAACATACCTGGTA.1          pDC_CTRL SNG-1039 CTRL-SNG-1039
AAACATACGATGAA.1 CD4 Memory T_CTRL SNG-1488 CTRL-SNG-1488
AAACATACGGCATT.1    CD14 Mono_CTRL SNG-1015 CTRL-SNG-1015
unique(ifnb$celltype.stim)
 [1] "CD14 Mono_CTRL"    "pDC_CTRL"          "CD4 Memory T_CTRL"
 [4] "T activated_CTRL"  "CD4 Naive T_CTRL"  "CD8 T_CTRL"       
 [7] "Mk_CTRL"           "B Activated_CTRL"  "B_CTRL"           
[10] "DC_CTRL"           "NK_CTRL"           "CD16 Mono_CTRL"   
[13] "Eryth_CTRL"        "CD8 T_STIM"        "pDC_STIM"         
[16] "CD4 Naive T_STIM"  "B_STIM"            "CD14 Mono_STIM"   
[19] "T activated_STIM"  "CD4 Memory T_STIM" "B Activated_STIM" 
[22] "NK_STIM"           "DC_STIM"           "CD16 Mono_STIM"   
[25] "Mk_STIM"           "Eryth_STIM"       
Idents(ifnb) <- "celltype.stim"

# 这里我们检查p_val.bulk最小的前两个Top差异基因
print(merge_dat[merge_dat$gene %in% common[1:2], c('gene','p_val.sc','p_val.bulk')])
       gene p_val.sc    p_val.bulk
2785  IL1RN        0 3.701542e-275
2739 IFITM2        0 1.955626e-250
# 在细胞类型水平(CD14 Mono)查看这两个Top差异基因在刺激组和对照组的表达水平
VlnPlot(ifnb, 
        features = common[1:2], 
        idents = c("CD14 Mono_CTRL", "CD14 Mono_STIM"), 
        group.by = "stim") 

# 在样本(患者)水平查看这两个Top差异基因在刺激组和对照组的表达水平
VlnPlot(ifnb, 
        features = common[1:2], 
        idents = c("CD14 Mono_CTRL", "CD14 Mono_STIM"), 
        group.by = "donor_id.stim", 
        ncol = 1) 

In both the pseudobulk and single-cell analyses, the p-values for these two genes are astronomically small. For both of these genes, when just comparing all stimulated CD4 monocytes to all control CD4 monocytes across samples, we see much higher expression in the stimulated cells.

When breaking down these cells by sample, we continue to see consistently higher expression levels in the stimulated samples compared to the control samples; in other words, this finding is not driven by just one or two samples. Because of this consistency, we find this signal in both analyses.

By contrast, we can examine examples of genes that are only DE under the single-cell analysis.

print(merge_dat[merge_dat$gene %in% c('SRGN','HLA-DRA'), 
                c('gene','p_val.sc','p_val.bulk')])
        gene     p_val.sc p_val.bulk
5710    SRGN 4.025076e-21  0.1823188
2603 HLA-DRA 4.989863e-09  0.1851302
VlnPlot(ifnb, 
        features = c('SRGN','HLA-DRA'), 
        idents = c("CD14 Mono_CTRL", "CD14 Mono_STIM"), 
        group.by = "stim") 

VlnPlot(ifnb, 
        features = c('SRGN','HLA-DRA'), 
        idents = c("CD14 Mono_CTRL", "CD14 Mono_STIM"), 
        group.by = "donor_id.stim", 
        ncol = 1) 

Here, SRGN and HLA-DRA both have very small p-values in the single-cell analysis (on the orders of 10−21 and 10−9), but much larger p-values around 0.18 in the pseudobulk analysis. While there appears to be a difference between control and simulated cells when ignoring sample information, the signal is much weaker on the sample level, and we can see notable variability from sample to sample.

所以,从这个例子中可以看出pseudobulk后的差异分析的结果更加准确


R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] patchwork_1.2.0    Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
  [3] jsonlite_1.8.8              magrittr_2.0.3             
  [5] spatstat.utils_3.0-4        ggbeeswarm_0.7.2           
  [7] farver_2.1.1                rmarkdown_2.25             
  [9] zlibbioc_1.48.0             vctrs_0.6.5                
 [11] ROCR_1.0-11                 spatstat.explore_3.2-5     
 [13] RCurl_1.98-1.14             S4Arrays_1.2.0             
 [15] htmltools_0.5.7             SparseArray_1.2.3          
 [17] sctransform_0.4.1           parallelly_1.36.0          
 [19] KernSmooth_2.23-22          htmlwidgets_1.6.4          
 [21] ica_1.0-3                   plyr_1.8.9                 
 [23] plotly_4.10.4               zoo_1.8-12                 
 [25] igraph_1.6.0                mime_0.12                  
 [27] lifecycle_1.0.4             pkgconfig_2.0.3            
 [29] Matrix_1.6-5                R6_2.5.1                   
 [31] fastmap_1.1.1               MatrixGenerics_1.14.0      
 [33] GenomeInfoDbData_1.2.11     fitdistrplus_1.1-11        
 [35] future_1.33.1               shiny_1.8.0                
 [37] digest_0.6.34               colorspace_2.1-0           
 [39] S4Vectors_0.40.2            DESeq2_1.42.0              
 [41] tensor_1.5                  RSpectra_0.16-1            
 [43] irlba_2.3.5.1               GenomicRanges_1.54.1       
 [45] labeling_0.4.3              progressr_0.14.0           
 [47] fansi_1.0.6                 spatstat.sparse_3.0-3      
 [49] httr_1.4.7                  polyclip_1.10-6            
 [51] abind_1.4-5                 compiler_4.3.2             
 [53] withr_3.0.0                 BiocParallel_1.36.0        
 [55] fastDummies_1.7.3           MASS_7.3-60.0.1            
 [57] DelayedArray_0.28.0         tools_4.3.2                
 [59] vipor_0.4.7                 lmtest_0.9-40              
 [61] beeswarm_0.4.0              httpuv_1.6.13              
 [63] future.apply_1.11.1         goftest_1.2-3              
 [65] glue_1.7.0                  nlme_3.1-164               
 [67] promises_1.2.1              grid_4.3.2                 
 [69] Rtsne_0.17                  cluster_2.1.6              
 [71] reshape2_1.4.4              generics_0.1.3             
 [73] gtable_0.3.4                spatstat.data_3.0-4        
 [75] tidyr_1.3.0                 data.table_1.14.10         
 [77] XVector_0.42.0              utf8_1.2.4                 
 [79] BiocGenerics_0.48.1         spatstat.geom_3.2-7        
 [81] RcppAnnoy_0.0.21            ggrepel_0.9.5              
 [83] RANN_2.6.1                  pillar_1.9.0               
 [85] stringr_1.5.1               spam_2.10-0                
 [87] RcppHNSW_0.5.0              limma_3.58.1               
 [89] later_1.3.2                 splines_4.3.2              
 [91] dplyr_1.1.4                 lattice_0.22-5             
 [93] survival_3.5-7              deldir_2.0-2               
 [95] tidyselect_1.2.0            locfit_1.5-9.8             
 [97] miniUI_0.1.1.1              pbapply_1.7-2              
 [99] knitr_1.45                  gridExtra_2.3              
[101] IRanges_2.36.0              SummarizedExperiment_1.32.0
[103] scattermore_1.2             stats4_4.3.2               
[105] xfun_0.41                   Biobase_2.62.0             
[107] statmod_1.5.0               matrixStats_1.2.0          
[109] stringi_1.8.3               lazyeval_0.2.2             
[111] yaml_2.3.8                  evaluate_0.23              
[113] codetools_0.2-19            tibble_3.2.1               
[115] cli_3.6.2                   uwot_0.1.16                
[117] xtable_1.8-4                reticulate_1.34.0          
[119] munsell_0.5.0               GenomeInfoDb_1.38.5        
[121] Rcpp_1.0.12                 globals_0.16.2             
[123] spatstat.random_3.2-2       png_0.1-8                  
[125] ggrastr_1.0.2               parallel_4.3.2             
[127] ellipsis_0.3.2              ggplot2_3.4.4              
[129] presto_1.0.0                dotCall64_1.1-1            
[131] bitops_1.0-7                listenv_0.9.0              
[133] viridisLite_0.4.2           scales_1.3.0               
[135] ggridges_0.5.5              crayon_1.5.2               
[137] leiden_0.4.3.1              purrr_1.0.2                
[139] rlang_1.1.3                 cowplot_1.1.2              

References

Crowell, Helena L., Charlotte Soneson, Pierre-Luc Germain, Daniela Calini, Ludovic Collin, Catarina Raposo, Dheeraj Malhotra, and Mark D. Robinson. 2020. “Muscat Detects Subpopulation-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell Transcriptomics Data.” Nature Communications 11 (1). https://doi.org/10.1038/s41467-020-19894-4.
Junttila, Sini, Johannes Smolander, and Laura L Elo. 2022. “Benchmarking Methods for Detecting Differential States Between Conditions from Multi-Subject Single-Cell RNA-Seq Data.” Briefings in Bioinformatics 23 (5). https://doi.org/10.1093/bib/bbac286.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2017. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.
Squair, Jordan W., Matthieu Gautier, Claudia Kathe, Mark A. Anderson, Nicholas D. James, Thomas H. Hutson, Rémi Hudelle, et al. 2021. “Confronting False Discoveries in Single-Cell Differential Expression.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-021-25960-2.
Zimmerman, Kip D., Mark A. Espeland, and Carl D. Langefeld. 2021. “A Practical Solution to Pseudoreplication Bias in Single-Cell Studies.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-021-21038-1.