数据导入与Seurat对象构建

After quantifying gene expression we need to bring this data into R to generate metrics for performing QC. In this lesson we will talk about the format(s) count data can be expected in, and how to read it into R so we can move on to the QC step in the workflow. We will also discuss the dataset we will be using and the associated metadata.


1 Exploring the example dataset

For this workshop we will be working with a single-cell RNA-seq dataset which is part of a larger study from (Kang et al. 2017). In this paper, the authors present a computational algorithm that harnesses genetic variation (eQTL) to determine the genetic identity of each droplet containing a single cell (singlet) and identify droplets containing two cells from different individuals (doublets).

The data used to test their algorithm is comprised of pooled Peripheral Blood Mononuclear Cells (PBMCs) taken from eight lupus patients, split into control and interferon beta-treated (stimulated) conditions.

Image credit: (Kang et al. 2017)

1.1 Raw data

This dataset is available on GEO (GSE96583), however the available counts matrix lacked mitochondrial reads, so we downloaded the BAM files from the SRA (SRP102802). These BAM files were converted back to FASTQ files, then run through Cell Ranger to obtain the count data that we will be using.

NOTE: The count data for this dataset is also freely available from 10X Genomics and is used in the Seurat tutorial.

1.2 Metadata

In addition to the raw data, we also need to collect information about the data; this is known as metadata. There is often a temptation to just start exploring the data, but it is not very meaningful if we know nothing about the samples that this data originated from.

Some relevant metadata for our dataset is provided below:

  • The libraries were prepared using 10X Genomics version 2 chemistry

  • The samples were sequenced on the Illumina NextSeq 500

  • PBMC samples from eight individual lupus patients were separated into two aliquots each.

    • One aliquot of PBMCs was activated by 100 U/mL of recombinant IFN-β for 6 hours.
    • The second aliquot was left untreated.
    • After 6 hours, the eight samples for each condition were pooled together in two final pools (stimulated cells and control cells). We will be working with these two, pooled samples. (We did not demultiplex the samples because SNP genotype information was used to demultiplex in the paper and the barcodes/sample IDs were not readily available for this data. Generally, you would demultiplex and perform QC on each individual sample rather than pooling the samples.)
  • 12,138 and 12,167 cells were identified (after removing doublets) for control and stimulated pooled samples, respectively.

  • Since the samples are PBMCs, we will expect immune cells, such as:

    • B cells
    • T cells
    • NK cells
    • monocytes
    • macrophages
    • possibly megakaryocytes

It is recommended that you have some expectation regarding the cell types you expect to see in a dataset prior to performing the QC. This will inform you if you have any cell types with low complexity (lots of transcripts from a few genes) or cells with higher levels of mitochondrial expression. This will enable us to account for these biological factors during the analysis workflow.

None of the above cell types are expected to be low complexity or anticipated to have high mitochondrial content.

2 Loading single-cell RNA-seq count data

After processing 10X data using its proprietary software Cell Ranger, you will have an outs directory (always). Within this directory you will find a number of different files including the files listed below:

  • web_summary.html: report that explores different QC metrics, including the mapping metrics, filtering thresholds, estimated number of cells after filtering, and information on the number of reads and genes per cell after filtering.
  • BAM alignment files: files used for visualization of the mapped reads and for re-creation of FASTQ files, if needed
  • filtered_feature_bc_matrix: folder containing all files needed to construct the count matrix using data filtered by Cell Ranger
  • raw_feature_bc_matrix: folder containing all files needed to construct the count matrix using the raw unfiltered data

While Cell Ranger performs filtering on the expression counts (see note below), we wish to perform our own QC and filtering because we want to account for the biology of our experiment/biological system. Given this we are only interested in the raw_feature_bc_matrix folder in the Cell Ranger output.

The filtered_feature_bc_matrix uses internal filtering criteria by Cell Ranger, and we do not have control of what cells to keep or abandon.

The filtering performed by Cell Ranger when generating the filtered_feature_bc_matrix is often good; however, sometimes data can be of very high quality and the Cell Ranger filtering process can remove high quality cells. In addition, it is generally preferable to explore your own data while taking into account the biology of the experiment for applying thresholds during filtering. For example, if you expect a particular cell type in your dataset to be smaller and/or not as transcriptionally active as other cell types in your dataset, these cells have the potential to be filtered out. However, with Cell Ranger v3 they have tried to account for cells of different sizes (for example, tumor vs infiltrating lymphocytes), and now may not filter as many low quality cells as needed.

Regardless of the technology or pipeline used to process your raw single-cell RNA-seq sequence data, the output with quantified expression will generally be the same. That is, for each individual sample you will have the following three files:

  1. a file with the cell IDs, representing all cells quantified
  2. a file with the gene IDs, representing all genes quantified
  3. a matrix of counts per gene for every cell

案例文件以10X格式数据储存。总文件夹为“original_10x”,其中每个样本有一个单独的文件夹,分别是“ctrl_raw_feature_bc_matrix”和“stim_raw_feature_bc_matrix”,每个样本文件夹由以下三个标准的10X文件组成:

数据文件夹的结构

1. barcodes.tsv

This is a text file which contains all cellular barcodes present for that sample. Barcodes are listed in the order of data presented in the matrix file (i.e. these are the column names).

2. features.tsv

This is a text file which contains the identifiers of the quantified genes. The source of the identifier can vary depending on what reference (i.e. Ensembl, NCBI, UCSC) you use in the quantification methods, but most often these are official gene symbols. The order of these genes corresponds to the order of the rows in the matrix file (i.e. these are the row names).

3. matrix.mtx

This is a text file which contains a matrix of count values. The rows are associated with the gene IDs above and columns correspond to the cellular barcodes. Note that there are many zero values in this matrix.

Loading this data into R requires us to use functions that allow us to efficiently combine these three files into a single count matrix. However, instead of creating a regular matrix data structure, the functions we will use create a sparse matrix to reduce the amount of memory (RAM), processing capacity (CPU) and storage required to work with our huge count matrix.

Different methods for reading in data include:

  1. readMM(): This function is from the Matrix package and will convert our standard matrix into a sparse matrix. The features.tsv file and barcodes.tsv must first be individually loaded into R and then they can be combined. For specific code and instructions on how to do this please see these additional material.
  2. Read10X(): This function is from the Seurat package and will use the Cell Ranger output directory as input, directly. With this method individual files do not need to be loaded in, instead the function will load and combine them into a sparse matrix. We will be using this function to load in our data!

2.1 首先尝试读取单个样本

If we had a single sample, we could generate the count matrix and then subsequently create a Seurat object:

library(Seurat)
# Read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/scRNA-seq_online/original_10x/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           project = "pbmc_ctrl",
                           min.features = 100)
ctrl
An object of class Seurat 
33538 features across 15688 samples within 1 assay 
Active assay: RNA (33538 features, 0 variable features)
 1 layer present: counts
Note

The min.features argument specifies the minimum number of genes that need to be detected per cell. This argument will filter out poor quality cells that likely just have random barcodes encapsulated without any cell present. Usually, cells with less than 100 genes detected are not considered for analysis.

Seurat automatically creates some metadata for each of the cells when you use the Read10X() function to read in data. This information is stored in the meta.data slot within the Seurat object.

# Explore the metadata
head(ctrl@meta.data)
                 orig.ident nCount_RNA nFeature_RNA
AAACATACAATGCC-1  pbmc_ctrl       2344          874
AAACATACATTTCC-1  pbmc_ctrl       3125          896
AAACATACCAGAAA-1  pbmc_ctrl       2578          725
AAACATACCAGCTA-1  pbmc_ctrl       3261          979
AAACATACCATGCA-1  pbmc_ctrl        746          362
AAACATACCTCGCT-1  pbmc_ctrl       3519          866

What do the columns of metadata mean?

  • orig.ident: this often contains the sample identity if known。通过CreateSeuratObject中的project参数可以指定,默认是”SeuratProject”
  • nCount_RNA: number of UMIs per cell
  • nFeature_RNA: number of genes detected per cell

2.2 读取所有样本

  1. 首先通过list.dirs()函数列出数据文件夹“original_10x”下所有子文件夹的相对路径,并形成一个包含了所有子文件夹相对路径的字符向量“files”

  2. 然后通过Read10X()读取每个子文件夹。Read10X函数支持批量读取多个文件夹。

  3. 最后,通过CreateSeuratObject构建Seurat对象。

Caution

这里利用了Read10X函数支持批量读取多个文件夹的特性,直接创建一个合并后的Seurat对象“merged_seurat”。但是,这只适用于读取以标准10X格式保存的数据,即每个样本为一个文件夹,每个文件夹内有三个文件,文件名为:“barcodes.tsv.gz”、“features.tsv.gz”、“matrix.mtx.gz”。

另外一种更通用的方法是通过循环依次读取每个文件夹内的3个10X文件,得到一个Seurat对象列表,然后通过merge()函数合并这些Seurat对象,详见:读取非标准10X格式文件

# 列出数据文件夹下所有子文件夹的相对路径
list.dirs("data/scRNA-seq_online/original_10x")
[1] "data/scRNA-seq_online/original_10x"                           
[2] "data/scRNA-seq_online/original_10x/ctrl_raw_feature_bc_matrix"
[3] "data/scRNA-seq_online/original_10x/stim_raw_feature_bc_matrix"
files <- list.dirs("data/scRNA-seq_online/original_10x")[-1]

# 构建Seurat对象
merged_seurat <- CreateSeuratObject(Read10X(files),
                                    min.features = 100, 
                                    project = "GSE96583")
merged_seurat
An object of class Seurat 
33538 features across 31444 samples within 1 assay 
Active assay: RNA (33538 features, 0 variable features)
 1 layer present: counts
# Check that the merged object has the appropriate sample-specific prefixes
head(merged_seurat)
                   orig.ident nCount_RNA nFeature_RNA
1_AAACATACAATGCC-1          1       2344          874
1_AAACATACATTTCC-1          1       3125          896
1_AAACATACCAGAAA-1          1       2578          725
1_AAACATACCAGCTA-1          1       3261          979
1_AAACATACCATGCA-1          1        746          362
1_AAACATACCTCGCT-1          1       3519          866
1_AAACATACCTGGTA-1          1       3328         1137
1_AAACATACCTGTAG-1          1        484          281
1_AAACATACGATGAA-1          1       1991          650
1_AAACATACGCCAAT-1          1       1186          447
tail(merged_seurat)
                   orig.ident nCount_RNA nFeature_RNA
2_TTTGCATGCATGAC-1          2       1395          492
2_TTTGCATGCCTGAA-1          2       1102          483
2_TTTGCATGCCTGTC-1          2       2334          841
2_TTTGCATGCCTTAT-1          2       2766          856
2_TTTGCATGCGACAT-1          2        620          295
2_TTTGCATGCTAAGC-1          2       1641          545
2_TTTGCATGGGACGA-1          2       1233          518
2_TTTGCATGGTGAGG-1          2       1084          469
2_TTTGCATGGTTTGG-1          2        818          432
2_TTTGCATGTCTTAC-1          2       1104          438

可以发现,此时的细胞barcode按照我们读取10X文件的顺序,被分别赋予“1, 2, 3…”的前缀。为了后续便于识别细胞来自哪个样本,我们需要通过gsub()将这些前缀更改为样本名称(“ctrl” 和 “stim”)。

colnames(merged_seurat) <- gsub(pattern = "^1_", 
                                x = colnames(merged_seurat), 
                                replacement = "ctrl_")
colnames(merged_seurat) <- gsub(pattern = "^2_", 
                                x = colnames(merged_seurat), 
                                replacement = "stim_")

# Re-check that the merged object has the appropriate sample-specific prefixes
head(merged_seurat)
                      orig.ident nCount_RNA nFeature_RNA
ctrl_AAACATACAATGCC-1          1       2344          874
ctrl_AAACATACATTTCC-1          1       3125          896
ctrl_AAACATACCAGAAA-1          1       2578          725
ctrl_AAACATACCAGCTA-1          1       3261          979
ctrl_AAACATACCATGCA-1          1        746          362
ctrl_AAACATACCTCGCT-1          1       3519          866
ctrl_AAACATACCTGGTA-1          1       3328         1137
ctrl_AAACATACCTGTAG-1          1        484          281
ctrl_AAACATACGATGAA-1          1       1991          650
ctrl_AAACATACGCCAAT-1          1       1186          447
tail(merged_seurat)
                      orig.ident nCount_RNA nFeature_RNA
stim_TTTGCATGCATGAC-1          2       1395          492
stim_TTTGCATGCCTGAA-1          2       1102          483
stim_TTTGCATGCCTGTC-1          2       2334          841
stim_TTTGCATGCCTTAT-1          2       2766          856
stim_TTTGCATGCGACAT-1          2        620          295
stim_TTTGCATGCTAAGC-1          2       1641          545
stim_TTTGCATGGGACGA-1          2       1233          518
stim_TTTGCATGGTGAGG-1          2       1084          469
stim_TTTGCATGGTTTGG-1          2        818          432
stim_TTTGCATGTCTTAC-1          2       1104          438

现在可以看到细胞barcode的前缀已被更改成了样本名称。接下来我们还可以进一步将样本信息添加到meta.data的新的一列“sample”中:

# Create sample column
library(stringr)
merged_seurat$sample <- str_split(rownames(merged_seurat@meta.data),
                                  "_",
                                  simplify = TRUE)
head(merged_seurat)
                      orig.ident nCount_RNA nFeature_RNA sample
ctrl_AAACATACAATGCC-1          1       2344          874   ctrl
ctrl_AAACATACATTTCC-1          1       3125          896   ctrl
ctrl_AAACATACCAGAAA-1          1       2578          725   ctrl
ctrl_AAACATACCAGCTA-1          1       3261          979   ctrl
ctrl_AAACATACCATGCA-1          1        746          362   ctrl
ctrl_AAACATACCTCGCT-1          1       3519          866   ctrl
ctrl_AAACATACCTGGTA-1          1       3328         1137   ctrl
ctrl_AAACATACCTGTAG-1          1        484          281   ctrl
ctrl_AAACATACGATGAA-1          1       1991          650   ctrl
ctrl_AAACATACGCCAAT-1          1       1186          447   ctrl
table(merged_seurat$sample)

 ctrl  stim 
15688 15756 

3 保存Seurat对象

# 保存
saveRDS(merged_seurat, file = "output/scRNA-seq_online/merged_seurat.rds")

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.5.1      Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-2          

loaded via a namespace (and not attached):
  [1] deldir_2.0-2           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.3            magrittr_2.0.3         RcppAnnoy_0.0.21      
  [7] spatstat.geom_3.2-7    matrixStats_1.2.0      ggridges_0.5.5        
 [10] compiler_4.3.2         png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         pkgconfig_2.0.3        fastmap_1.1.1         
 [16] ellipsis_0.3.2         utf8_1.2.4             promises_1.2.1        
 [19] rmarkdown_2.25         purrr_1.0.2            xfun_0.41             
 [22] jsonlite_1.8.8         goftest_1.2-3          later_1.3.2           
 [25] spatstat.utils_3.0-4   irlba_2.3.5.1          parallel_4.3.2        
 [28] cluster_2.1.6          R6_2.5.1               ica_1.0-3             
 [31] stringi_1.8.3          RColorBrewer_1.1-3     spatstat.data_3.0-4   
 [34] reticulate_1.34.0      parallelly_1.36.0      lmtest_0.9-40         
 [37] scattermore_1.2        Rcpp_1.0.12            knitr_1.45            
 [40] tensor_1.5             future.apply_1.11.1    zoo_1.8-12            
 [43] R.utils_2.12.3         sctransform_0.4.1      httpuv_1.6.13         
 [46] Matrix_1.6-5           splines_4.3.2          igraph_1.6.0          
 [49] tidyselect_1.2.0       abind_1.4-5            rstudioapi_0.15.0     
 [52] yaml_2.3.8             spatstat.random_3.2-2  codetools_0.2-19      
 [55] miniUI_0.1.1.1         spatstat.explore_3.2-5 listenv_0.9.0         
 [58] lattice_0.22-5         tibble_3.2.1           plyr_1.8.9            
 [61] shiny_1.8.0            ROCR_1.0-11            evaluate_0.23         
 [64] Rtsne_0.17             future_1.33.1          fastDummies_1.7.3     
 [67] survival_3.5-7         polyclip_1.10-6        fitdistrplus_1.1-11   
 [70] pillar_1.9.0           KernSmooth_2.23-22     plotly_4.10.4         
 [73] generics_0.1.3         RcppHNSW_0.5.0         ggplot2_3.4.4         
 [76] munsell_0.5.0          scales_1.3.0           globals_0.16.2        
 [79] xtable_1.8-4           glue_1.7.0             lazyeval_0.2.2        
 [82] tools_4.3.2            data.table_1.14.10     RSpectra_0.16-1       
 [85] RANN_2.6.1             leiden_0.4.3.1         dotCall64_1.1-1       
 [88] cowplot_1.1.2          grid_4.3.2             tidyr_1.3.0           
 [91] colorspace_2.1-0       nlme_3.1-164           patchwork_1.2.0       
 [94] cli_3.6.2              spatstat.sparse_3.0-3  spam_2.10-0           
 [97] fansi_1.0.6            viridisLite_0.4.2      dplyr_1.1.4           
[100] uwot_0.1.16            gtable_0.3.4           R.methodsS3_1.8.2     
[103] digest_0.6.34          progressr_0.14.0       ggrepel_0.9.5         
[106] htmlwidgets_1.6.4      R.oo_1.25.0            htmltools_0.5.7       
[109] lifecycle_1.0.4        httr_1.4.7             mime_0.12             
[112] MASS_7.3-60.0.1       

References

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2017. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.