寻找marker gene

参考原文:Introduction to scRNA-seq integration

原文发布日期:2023年10月31日

1 数据导入

这里我们导入整合(integration)中完成SCTransform和整合的单细胞数据。这里的meta.data已经提前注释好了细胞类型(储存在”seurat_annotations”列中)。

library(Seurat)
ifnb <- readRDS("output/seurat_official/ifnb_integrated.rds")
ifnb
An object of class Seurat 
27359 features across 13999 samples within 2 assays 
Active assay: SCT (13306 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 3 dimensional reductions calculated: pca, umap, integrated.dr
head(ifnb, 3)
                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
                 nCount_SCT nFeature_SCT SCT_snn_res.0.6 seurat_clusters
AAACATACATTTCC.1       1944          852               1               1
AAACATACCAGAAA.1       1888          699               9               9
AAACATACCTCGCT.1       1951          804               1               1
table(ifnb$seurat_annotations)

   CD14 Mono  CD4 Naive T CD4 Memory T    CD16 Mono            B        CD8 T 
        4362         2504         1762         1044          978          814 
 T activated           NK           DC  B Activated           Mk          pDC 
         633          619          472          388          236          132 
       Eryth 
          55 
DimPlot(ifnb, 
        reduction = "umap", 
        group.by = c("stim", "seurat_annotations"))

2 FindMarkers-在特定cluster之间寻找marker基因

Caution

For a much faster implementation of the Wilcoxon Rank Sum Test,(default method for FindMarkers) please install the presto package:

devtools::install_github('immunogenomics/presto')

作为演示,下面我们通过FindMarkers寻找CD16 Mono和CD14 Mono之间的差异基因(marker gene)。

Important

对于经过SCTransform归一化处理后的单细胞数据,在进行差异分析之前,需要先运行PrepSCTFindMarkers(),来预处理SCT assay。详细解释见此链接

如果是基于NormalizeData标准化的单细胞数据,需要使用”RNA” assay进行差异分析,如果不是,需要通过DefaultAssay(ifnb) <- "RNA"进行设定。

# 预处理SCT assay
ifnb <- PrepSCTFindMarkers(ifnb)
# 指定细胞idents为注释信息
Idents(ifnb) <- "seurat_annotations"
# 执行FindMarkers
monocyte.de.markers <- FindMarkers(ifnb, 
                                   ident.1 = "CD16 Mono", 
                                   ident.2 = "CD14 Mono")
# view results
nrow(monocyte.de.markers)
[1] 6869
head(monocyte.de.markers)
       p_val avg_log2FC pct.1 pct.2 p_val_adj
VMO1       0   5.641894 0.767 0.078         0
MS4A4A     0   3.238324 0.739 0.134         0
FCGR3A     0   3.305623 0.976 0.382         0
PLAC8      0   3.311271 0.620 0.107         0
CXCL16     0   2.030972 0.918 0.439         0
MS4A7      0   2.435979 0.973 0.519         0

The results data frame has the following columns :

  • p_val : p-value (unadjusted)

  • avg_log2FC : log fold-change of the average expression between the two groups. Positive values indicate that the feature is more highly expressed in the first group.

  • pct.1 : The percentage of cells where the feature is detected in the first group

  • pct.2 : The percentage of cells where the feature is detected in the second group

  • p_val_adj : Adjusted p-value, based on Bonferroni correction using all features in the dataset.

If the ident.2 parameter is omitted or set to NULLFindMarkers() will test for differentially expressed features between the group specified by ident.1 and all other cells. Additionally, the parameter only.pos can be set to TRUE to only search for positive markers, i.e. features that are more highly expressed in the ident.1 group.

monocyte.de.markers <- FindMarkers(ifnb, 
                                   ident.1 = "CD16 Mono", 
                                   ident.2 = NULL, 
                                   only.pos = TRUE)
nrow(monocyte.de.markers)
[1] 4259
head(monocyte.de.markers)
       p_val avg_log2FC pct.1 pct.2 p_val_adj
FCGR3A     0   4.616088 0.976 0.152         0
MS4A7      0   3.867491 0.973 0.196         0
CXCL16     0   3.309239 0.918 0.178         0
VMO1       0   6.207904 0.767 0.040         0
MS4A4A     0   4.661394 0.739 0.051         0
LST1       0   2.988332 0.887 0.205         0

3 FindConservedMarkers-鉴定在所有conditions下保守的cell marker

To identify canonical cell type marker genes that are conserved across conditions, we provide the FindConservedMarkers() function. This function performs differential gene expression testing for each dataset/group and combines the p-values using meta-analysis methods from the MetaDE R package.

在实际分析中,鉴定这些保守的cell marker主要用来辅助对cluster的注释:you can perform these same analysis on the unsupervised clustering results (stored in seurat_clusters), and use these conserved markers to annotate cell types in your dataset.

We can calculated the genes that are conserved markers irrespective of stimulation condition in cluster 6 (NK cells).

Tip

FindConservedMarkers函数会调用metap包,metap包需要multtest包,所以需要先安装这两个依赖包:

BiocManager::install('multtest')
install.packages('metap')

FindConservedMarkers中的grouping.var参数用来指定meta.data中表示样本类型或者condition的列名,其他参数及其含义基本和FindMarkers一致。

nk.markers <- FindConservedMarkers(ifnb, 
                                   ident.1 = "NK", 
                                   grouping.var = "stim", 
                                   only.pos = TRUE)
head(nk.markers)
          CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
GNLY    0.000000e+00       85.120959      0.943      0.046   0.000000e+00
NKG7    0.000000e+00       14.508633      0.953      0.085   0.000000e+00
CTSW    0.000000e+00        7.597542      0.537      0.030   0.000000e+00
FGFBP2  0.000000e+00        2.972433      0.500      0.021   0.000000e+00
KLRC1   0.000000e+00        8.995502      0.379      0.004   0.000000e+00
PRF1   3.646682e-300       11.866746      0.423      0.017  5.124682e-296
          STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
GNLY    0.000000e+00      20.3429036      0.956      0.059   0.000000e+00
NKG7    0.000000e+00       6.0306403      0.950      0.081   0.000000e+00
CTSW    0.000000e+00       8.2243914      0.592      0.035   0.000000e+00
FGFBP2 4.578442e-158       0.8925926      0.259      0.016  6.434085e-154
KLRC1   0.000000e+00       7.0700921      0.374      0.006   0.000000e+00
PRF1    0.000000e+00       8.0777458      0.863      0.057   0.000000e+00
            max_pval minimump_p_val
GNLY    0.000000e+00              0
NKG7    0.000000e+00              0
CTSW    0.000000e+00              0
FGFBP2 4.578442e-158              0
KLRC1   0.000000e+00              0
PRF1   3.646682e-300              0

4 可视化cell markers的表达

The DotPlot() function with the split.by parameter can be useful for viewing conserved cell type markers across conditions, showing both the expression level and the percentage of cells in a cluster expressing any given gene. Here we plot 2-3 strong marker genes for each of our 14 clusters.

markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", "NKG7",
                     "CCL5", "CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", 
                     "VMO1", "CCL2", "S100A9", "HLA-DQA1", "GPR183", "PPBP", "GNG11",
                     "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ", "PRSS57")
DotPlot(ifnb, 
        features = markers.to.plot, 
        cols = c("blue", "red"), 
        dot.scale = 8, 
        split.by = "stim") +
  RotatedAxis()


R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] mathjaxr_1.6-0         RColorBrewer_1.1-3     rstudioapi_0.15.0     
  [4] jsonlite_1.8.8         magrittr_2.0.3         TH.data_1.1-2         
  [7] spatstat.utils_3.0-4   farver_2.1.1           rmarkdown_2.25        
 [10] vctrs_0.6.5            multtest_2.58.0        ROCR_1.0-11           
 [13] spatstat.explore_3.2-5 htmltools_0.5.7        plotrix_3.8-4         
 [16] sctransform_0.4.1      parallelly_1.36.0      KernSmooth_2.23-22    
 [19] htmlwidgets_1.6.4      ica_1.0-3              sandwich_3.1-0        
 [22] plyr_1.8.9             plotly_4.10.4          zoo_1.8-12            
 [25] igraph_1.6.0           mime_0.12              lifecycle_1.0.4       
 [28] pkgconfig_2.0.3        Matrix_1.6-5           R6_2.5.1              
 [31] fastmap_1.1.1          rbibutils_2.2.16       fitdistrplus_1.1-11   
 [34] future_1.33.1          shiny_1.8.0            numDeriv_2016.8-1.1   
 [37] digest_0.6.34          colorspace_2.1-0       patchwork_1.2.0       
 [40] tensor_1.5             RSpectra_0.16-1        irlba_2.3.5.1         
 [43] labeling_0.4.3         progressr_0.14.0       fansi_1.0.6           
 [46] spatstat.sparse_3.0-3  httr_1.4.7             TFisher_0.2.0         
 [49] polyclip_1.10-6        abind_1.4-5            compiler_4.3.2        
 [52] withr_3.0.0            mutoss_0.1-13          fastDummies_1.7.3     
 [55] MASS_7.3-60.0.1        tools_4.3.2            lmtest_0.9-40         
 [58] metap_1.9              httpuv_1.6.13          future.apply_1.11.1   
 [61] qqconf_1.3.2           goftest_1.2-3          glue_1.7.0            
 [64] nlme_3.1-164           promises_1.2.1         grid_4.3.2            
 [67] Rtsne_0.17             cluster_2.1.6          reshape2_1.4.4        
 [70] generics_0.1.3         gtable_0.3.4           spatstat.data_3.0-4   
 [73] tidyr_1.3.0            sn_2.1.1               data.table_1.14.10    
 [76] utf8_1.2.4             BiocGenerics_0.48.1    spatstat.geom_3.2-7   
 [79] RcppAnnoy_0.0.21       ggrepel_0.9.5          RANN_2.6.1            
 [82] pillar_1.9.0           stringr_1.5.1          spam_2.10-0           
 [85] RcppHNSW_0.5.0         limma_3.58.1           later_1.3.2           
 [88] splines_4.3.2          dplyr_1.1.4            lattice_0.22-5        
 [91] survival_3.5-7         deldir_2.0-2           tidyselect_1.2.0      
 [94] miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.45            
 [97] gridExtra_2.3          scattermore_1.2        stats4_4.3.2          
[100] xfun_0.41              Biobase_2.62.0         statmod_1.5.0         
[103] matrixStats_1.2.0      stringi_1.8.3          lazyeval_0.2.2        
[106] yaml_2.3.8             evaluate_0.23          codetools_0.2-19      
[109] tibble_3.2.1           cli_3.6.2              uwot_0.1.16           
[112] xtable_1.8-4           reticulate_1.34.0      Rdpack_2.6            
[115] munsell_0.5.0          Rcpp_1.0.12            globals_0.16.2        
[118] spatstat.random_3.2-2  png_0.1-8              parallel_4.3.2        
[121] ellipsis_0.3.2         ggplot2_3.4.4          presto_1.0.0          
[124] dotCall64_1.1-1        listenv_0.9.0          viridisLite_0.4.2     
[127] mvtnorm_1.2-4          scales_1.3.0           ggridges_0.5.5        
[130] leiden_0.4.3.1         purrr_1.0.2            rlang_1.1.3           
[133] multcomp_1.4-25        mnormt_2.1.1           cowplot_1.1.2