RNA sequencing (RNA-seq) is an experimental method for determining the sequences of the mRNA transcripts in a cell. By mapping these sequences back to a reference genome or transcriptome, we can identify the transcripts and quantify their abundance in a given sample. These mRNA transcripts matter because they give us insight into the functional state of the cell: mRNAs are ultimately translated into proteins that carry out important tasks, including constitutive functions such as cell cycle control and metabolism, and cell-type-specific functions such as the production of specific cytokines by lymphocytes.
A brief overview of experimental methodology
In bulk RNA-seq, RNA is first extracted from a sample containing a large number of cells, often representing a heterogeneous population of different cell types. The mRNAs from these cells are pooled together and converted into cDNA using reverse transcriptase, an enzyme that synthesizes complementary DNA strands from an RNA template. This step is usually necessary because DNA is more stable than RNA and can be amplified more easily using DNA polymerase, which gives us more material to work with for sequencing. Additionally, the vast majority of next-generation sequencing (NGS) platforms, such as Illumina and PacBio, require DNA, not RNA, as input.
For single-cell RNA-seq (scRNA-seq), there are several experimental strategies, including platforms like Drop-seq, SORT-seq, 10x Genomics Chromium, and Smart-seq. In Drop-seq and 10x Genomics Chromium, individual cells are encapsulated in aqueous droplets in oil together with beads coated in barcoded primers. Each primer carries a cell barcode, shared by all primers on the same bead, which allows the RNA from each cell to be tagged and later identified, and a unique molecular identifier (UMI), which distinguishes individual transcript molecules so that amplification duplicates can be collapsed. In SORT-seq, fluorescence-activated cell sorting (FACS) is used to sort cells based on specific markers into wells, where they are individually captured for RNA sequencing.
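To make the role of the barcodes concrete, here is a minimal sketch of UMI counting in plain Python. It assumes each aligned read has already been reduced to a (cell barcode, UMI, gene) tuple; the function name count_umis and the example barcodes are hypothetical, and real pipelines such as Cell Ranger also correct sequencing errors in barcodes and UMIs, which this sketch does not attempt.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse reads into per-cell, per-gene UMI counts.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples, a simplified
    stand-in for aligned, barcode-tagged reads (an assumption of this sketch).
    """
    seen = defaultdict(set)  # (cell_barcode, gene) -> set of distinct UMIs
    for cell_barcode, umi, gene in reads:
        seen[(cell_barcode, gene)].add(umi)
    # Each distinct UMI is counted once, however many reads carry it.
    return {key: len(umis) for key, umis in seen.items()}


# Hypothetical toy reads, not real data.
reads = [
    ("AAAC", "TTG", "CD3E"),
    ("AAAC", "TTG", "CD3E"),  # same cell, gene, and UMI: an amplification duplicate
    ("AAAC", "GCA", "CD3E"),  # same cell and gene, new UMI: a second molecule
    ("CCGT", "TTG", "CD3E"),  # same UMI but a different cell: counted separately
    ("CCGT", "AAT", "LYZ"),
]
counts = count_umis(reads)
```

Five reads collapse to four molecules here, because the duplicated read in the first cell is counted only once.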
In both bulk RNA-seq and scRNA-seq, the sequenced reads are stored in FASTQ files. These are processed by first assessing read quality (e.g. with FastQC and MultiQC) and then aligning the reads to a reference genome or transcriptome (e.g. with STAR, HISAT2, or Bowtie2 for bulk RNA-seq, or Cell Ranger for 10x Genomics scRNA-seq) to obtain raw read counts per gene or transcript, each annotated by a unique Ensembl ID.
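For readers who have not opened a FASTQ file before, the sketch below shows roughly what a quality check starts from. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the helper names read_fastq and mean_phred and the two example reads are made up for illustration. Tools like FastQC compute far richer statistics than this.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_phred(quality, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


# Two made-up records: 'I' encodes Phred 40, '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))
scores = {read_id: mean_phred(quality) for read_id, _, quality in records}
```

In this toy input, read1 has a mean quality of 40 and read2 a mean of 20, the kind of per-read summary a quality filter would act on.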
You have the raw read counts. Now what?
RNA sequencing enables the high-throughput quantification of mRNA transcript levels, which can be used downstream for transcriptome assembly, differential expression analysis, biomarker identification, functional enrichment, and characterization of cell phenotype. Although generally cheaper, bulk RNA-seq only provides cell-averaged expression profiles for a given sample. These are easier to analyze but hide important details such as cell heterogeneity: for example, a drug may affect only certain cell types, or the way those cells communicate with each other, and an averaged profile cannot show which. Nonetheless, bulk RNA-seq is still useful in experimental designs where the cell type has already been selected for on the basis of surface protein markers, as detected by flow cytometry, or when we just want to measure and compare average gene expression across different cell populations or experimental conditions.
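A small numerical illustration of how averaging hides heterogeneity, using made-up expression values for a single gene in ten cells. The cell-type labels A and B and all of the numbers are hypothetical.

```python
from statistics import mean

# Hypothetical expression of one gene (arbitrary units) in ten cells:
# eight cells of type A do not express it, two cells of type B express it strongly.
cells = [("A", 0.0)] * 8 + [("B", 50.0)] * 2

# What a bulk measurement would report: one average over all cells.
bulk_signal = mean(value for _, value in cells)

# What single-cell resolution recovers: one average per cell type.
per_type = {}
for cell_type, value in cells:
    per_type.setdefault(cell_type, []).append(value)
per_type_mean = {cell_type: mean(values) for cell_type, values in per_type.items()}
```

The bulk value of 10 is the same number we would get if every cell expressed the gene weakly, whereas the per-type means (0 for A, 50 for B) show that only a small subpopulation expresses it.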
In the following tutorial, we will focus on scRNA-seq because it enables transcriptomic profiling at single-cell resolution, permitting the identification and characterization of the different cell types in a bulk tissue sample, as well as the calculation of their relative abundance. What follows is a step-by-step guide to analyzing human peripheral blood mononuclear cells (PBMCs) using the scanpy package and the pbmc8k dataset provided by 10x Genomics. I will cover everything from preprocessing the raw counts to transcript expression analysis, clustering, and visualization. Note: all of these analyses can also be performed using the Seurat R package. For bulk RNA-seq analysis, use DESeq2 or edgeR instead.