This tutorial introduces a standard analysis precedures for differential expression (DE) analyses with RNA-Seq data. We aim to (i) identify differentially expressed genes (DEGs) from RNA-Seq data (also called as read count data), and (ii) unravel the biological functions of these DEGs. Since there are two types of DEG identification procedures, count-based and FPKM-based, we will introduces both procedures in this tutorial.
In this tutorial, we use a simply designed experimental dataset to study DE analyses. The dataset, obtained from Arabidopsis thaliana, consists of two groups, control and mutation groups. The control group consists of three biological replicates, and the mutation group also consists of three biological replicates. We will compare the two groups, and identify DEGs between the two groups.
RNA-Seq data analysis requires some specific packages, for example, edgeR and DESeq2 for DEG identification, clusterProfiler for gene ontology (GO) enrichment anlaysis. It does not like common R packages, these packages are usually reposited on Bioconductor. Here, we start R, and install the required packages from Bioconductor. In addition, for convenience, we also install some useful packages for data manipulation from CRAN.
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("edgeR", "DESeq2", "clusterProfiler", "org.At.tair.db", "ballgown", "genefilter"))
# CRAN packages
install.packages(c("tidyverse", "ggsci", "gplots"))After the installation of R/Bioconductor packages, we restart R, and load the packages.
library(tidyverse)
library(ggsci)
library(gplots)
library(edgeR)              # DE analysis
library(DESeq2)             # DE analysis
library(ballgown)           # DE analysis 
library(genefilter)         # DE analysis
library(clusterProfiler)    # GO analysis
library(DOSE)               # GO analysis
library(enrichplot)         # GO analysis
library(org.At.tair.db)     # GO analysisTo enable R find the count data, we set R working directory.
setwd('~/projects/atrnaseq')The count-based procedures use raw read counts directly for DEG identification. The procedures takes three steps to identify DEGs: (i) normalize counts (only adjust the library size), (ii) estimate dispersions, and (iii) perfrom statistical test (LRT, Wald test). There are two packages named edgeR and DESeq famouse for DEG identification.
Overviews of whole datasets are essentkial steps of data anlaysis. In this section, we will summarise the statistics of this dataset and visualize them with charts, to check whether the data is suitable for analyses or not. First, we load expression data with read.table function.
d <- read.table('count/counts.txt', sep = '\t', header = TRUE)
head(d)Since most packages requires count data as a matrix type object, we format the loaded data into a matrix type, and set up row name and column name.
x <- as.matrix(d[, 7:ncol(d)])
rownames(x) <- sapply(strsplit(d[, 1], ':'), '[', 2)
colnames(x) <- c('ctrl_1', 'ctrl_2', 'ctrl_3', 'cold_1', 'cold_2', 'cold_3')
head(x)##           ctrl_1 ctrl_2 ctrl_3 cold_1 cold_2 cold_3
## AT1G01010    233    200    287    273    270    292
## AT1G01020    374    351    417    474    496    493
## AT1G01030     80     71     90     98    123    114
## AT1G01040   1729   1604   1962   1684   1866   1967
## AT1G01050   1946   1755   2083   2297   2451   2440
## AT1G01060    315    275    401    213    272    222The total number of reads, library size, in a library reflects the sequencing performance. We use bar chart to visualize the library size. From the figure, we can see that the six libraries are similar to each other, and there is no outlier among these libraries.
# barplot(colSums(x))
fig_libsize <- data.frame(libname = colnames(x), libsize = colSums(x)) %>%
                  ggplot(aes(x = libname, y = libsize)) +
                  scale_y_log10() +
                  geom_bar(stat = 'identity')
print(fig_libsize)Next, we are going to check the distribution of read count of each library. We can see that each library has a sharp peak at zero of x-axis and a gentle peak between 1e2 and 1e3. The x-axis of two peaks among the six libraries are similar to each other.
# hist(log10(x[, 1]+ 1))
fig_countdist <- as.data.frame(x) %>%
                    pivot_longer(everything(), names_to = 'library') %>%
                    mutate(value = value + 1) %>%
                    ggplot(aes(x = value)) +
                    geom_histogram(bins = 50) +
                    scale_x_log10() +
                    facet_wrap(~ library, nrow = 2)
print(fig_countdist)Sample (library) clustering gives us insights for discussion and chances to explore outliers in the dataset. Hierarchical clustering and PCA are commonly used for this purpose. Hierarchical clustering can be anlayzed by calculating spearman correlation coefficients between two libraries across all combinations. To avoid sequencing errors, usually, we only use the highly expressed genes for clustering.
From the clustering result, we can see that there were two clusters: (i) one contains three replicates of control group, and (ii) the other contains three replicates of mutation group. The clustering result reflects the design of this experiments.
x_log10 <- log10(sweep(x, 2, 1e6 / colSums(x), '*') + 1)
x_log10_highexp <- x_log10[rowSums(x_log10 > 0) > 3, ]
# only use high expressed genes
x_rho <- cor(x_log10_highexp, method = 'spearman')
x_hist <- hclust(as.dist(1 - x_rho), method = 'ward.D')
plot(x_hist, h = -1)We then use highly expressed genes to perform PCA and plot the first and second principal components. We can see that replicates in the control group are extreamly different from replicates in the mutation group.
x_log10 <- log10(sweep(x, 2, 1e6 / colSums(x), '*') + 1)
x_log10_highexp <- x_log10[rowSums(x_log10 > 0) > 3, ]
# only use high expressed genes
x_pca <- prcomp(t(x_log10_highexp), scale = FALSE)
# plot(x_pca$x[, c('PC1', 'PC2')])
# text(x_pca$x[, c('PC1', 'PC2')], labels = rownames(x_pca$x))
fig_pca <- as.data.frame(x_pca$x) %>%
            mutate(libname = rownames(x_pca$x)) %>%
            ggplot(aes(x = PC1, y = PC2, label = libname)) +
            geom_label()
print(fig_pca)Differential expression analysis aims to idenitfy the genes that are differentially expressed between two groups. To use edgeR for DEG identification, we prepare the group information and experimental design for the count data.
x_group <- factor(c(1, 1, 1, 2, 2, 2))
x_design <- model.matrix(~ x_group)
print(x_design)##   (Intercept) x_group2
## 1           1        0
## 2           1        0
## 3           1        0
## 4           1        1
## 5           1        1
## 6           1        1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x_group
## [1] "contr.treatment"Then, we set up count data and group information to edgeR class object. Since genes with low expression affects accuracy of statistical test, we use filterByExpr function with default parameters to remove the low expressed genes.
y <- DGEList(counts = x, group = x_group)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]We use MDS plot to check the distance among the replicates. MDS reuslt should be similar to the reuslts of PCA and hierachical clustering.
plotMDS(y)After we check that the data is suitable for DE anlaysis, we normalize data and estimate the dispersion for each gene. estimateDisp function estiamtes three types of dispersion, common, trended, and tagwise-dispersion. The tagwise-dispersion will be used for DEG identification.
y <- calcNormFactors(y)
y <- estimateDisp(y, x_design)We can use plotBCV function to visualize the distribution of the estimated dispersions.
plotBCV(y)We then use likelihood ratio test (LRT), a method to test the difference between full model and reduced model in GLM, to identify DEGs between control and mutation groups.
fit <- glmFit(y, x_design)
lrt <- glmLRT(fit, coef = 2)
deg_table <- as.data.frame(topTags(lrt, n = nrow(y), sort.by = 'PValue'))
head(deg_table)DEG is generally defined as gene with FDR is less than 0.1 or 0.05. In addition to FDR, foldchange is also used as cutoff for DEG identification. Since usually we extract important genes from DEG list and validate gene expression with additional experiment, it allows us to use the loose threshold as a cutoff. Here we define DEG as a gene which FDR is less than 0.05 and log2-foldchange is larger than 1 or smaller than -1. After we defined DEGs, we save the results to TSV file, and plot MA-plot to visualize log2-foldchange and average of gene expression.
deg_table <- deg_table %>%
                dplyr::mutate(gene = rownames(deg_table),
                              type = ifelse(FDR < 0.05 & abs(logFC) > 1.0, 'significant', 'not significant')) %>%
                dplyr::select(gene, logFC, logCPM, PValue, FDR, type)
write.table(deg_table, file = 'deg_table.tsv', col.names = TRUE, row.names = FALSE, sep = '\t', quote = FALSE)
fig_ma <- deg_table %>%
            ggplot(aes(x = logCPM, y = logFC, color = type)) +
            geom_point() +
            scale_color_manual(values = c('#666666', '#d62728'))
print(fig_ma)DESeq2 is another package to perform DE analysis. Similar to edgeR, we set up data into DESeq2 object and filter out genes with low counts.
x_group <- data.frame(treatment = c('ctrl', 'ctrl', 'ctrl', 'cold', 'cold', 'cold'))
dds <- DESeqDataSetFromMatrix(countData = x, colData = x_group,  design = ~ treatment)
keep <- (rowSums(counts(dds) > 0) > 3)
dds <- dds[keep,]Normalization, dispersion estimation, and statistical test can be done with the following functions in DESeq2. By default, DESEeq2 uses Wald test to identify DEGs, whereas edgeR uses LRT.
# dds <- DESeq(dds)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)The estimated dispersions can be visualized with plotDisEsts function.
plotDispEsts(dds)We convert DEGs identification result to data frame from DESeq2 class object. Since p-value can not be calculated for some genes, p-value and adjusted p-value (FDR) of these genes are represented by NA. For convenience, we replace NA to 1.0 in p-value and adjusted p-value of DESeq2 result.
resultsNames(dds)## [1] "Intercept"              "treatment_ctrl_vs_cold"res <- results(dds, name="treatment_ctrl_vs_cold")
deg_table <- as.data.frame(res)
deg_table$padj[is.na(deg_table$pvalue)] <- 1
deg_table$padj[is.na(deg_table$padj)] <- 1
head(deg_table)We define DEG as a gene which FDR is less than 0.05 and log2-foldchange is larger than 1 or smaller than -1. Then, we save the results to TSV file, and plot MA-plot to visualize log2-foldchange and average of gene expression.
deg_table <- deg_table %>%
                dplyr::mutate(gene = rownames(deg_table),
                              type = ifelse(padj < 0.05 & abs(log2FoldChange) > 1, 'significant', 'not significant')) %>%
                dplyr::select(gene, log2FoldChange, baseMean, pvalue, padj, type)
write.table(deg_table, file = 'deg_table.tsv', col.names = TRUE, row.names = FALSE, sep = '\t', quote = FALSE)
fig_ma <- deg_table %>%
            ggplot(aes(x = baseMean, y = log2FoldChange, color = type)) +
            geom_point() +
            scale_x_log10() +
            scale_color_manual(values = c('#666666', '#d62728'))
print(fig_ma)Usually, DEGs identification in RNA-Seq data outputs more than several hundreds of DEGs. To overview the biological functions of these DEGs, we can engage functional enrichment analysis for these DEGs. There are two types of the analyses, enrichment analysis and GSEA (gene set enrichment analysis). Both analysis can be down with clusterProfiler package.
To prepare enrichment analysis, for convenience, we create some object to record whole gene names, DEG names, and so on.
gene_names <- deg_table$gene
gene_pval <- deg_table$pvalue             # deg_table$PValue
gene_qval <- deg_table$padj               # deg_table$FDR
gene_fc   <- deg_table$log2FoldChange     # deg_table$logFC
names(gene_pval) <- names(gene_qval) <- gene_names
deg_names <- gene_names[deg_table$type == 'significant']Gene ontology has three main category, biological process (BP), cellular component (CC), and molecular function (MF). In this tutorial, we perform GO enrichment analysis only for BP. Since nearly all of genes are annotated with the categories in the top several levels, it is unnecessary to check GO terms in the higehr levels.
ego <- enrichGO(gene = deg_names,         # interested genes
                universe = gene_names,    # background genes
                OrgDb = org.At.tair.db,
                keyType = 'TAIR',
                ont = 'BP',
                pAdjustMethod = 'BH',
                pvalueCutoff = 1.0,
                qvalueCutoff = 1.0,
                readable = FALSE)
# remove some categories
ego <- dropGO(ego, level = 1)
ego <- dropGO(ego, level = 2)
ego <- dropGO(ego, level = 3)
go_table <- summary(ego)
bg_annotated <- as.integer(sapply(strsplit(go_table$BgRatio, '/'), '[', 1))
go_table <- go_table[100 < bg_annotated & bg_annotated < 500, ]
go_table$p.adjust <- p.adjust(go_table$pvalue)
go_table[go_table$p.adjust < 0.1, ]To overview the significant GO terms and the annotated genes, we use cnetplot to visualize the result of enrichment analysis.
edox <- setReadable(ego, 'org.At.tair.db', 'TAIR')
cnetplot(edox, showCategory = 10, foldChange = gene_fc, categorySize = 'p.adjust')goplot function can visualize hierarchical trees of enriched GO terms.
goplot(ego, showCategory = 10, color = 'p.adjust')GSEA can be done with gseGO analysis. Since GSEA should considering all genes with the rank, we input p-value of genes to gseGO function.
ggo <- gseGO(geneList = sort(gene_pval, decreasing = TRUE),
             OrgDb = org.At.tair.db,
             keyType = 'TAIR',
             ont = 'BP',
             pvalueCutoff = 1.0)
gogsea_table <- summary(ggo)
head(gogsea_table, 50)We use gseaplot2 function to visualize the GSEA result.
gseaplot2(ggo, geneSetID = 6)gseaplot2(ggo, geneSetID = 1:5)FPKM-based DEG identification is a part of StringTie-Ballgown pipeline. We use ballgown package load the normalized count data, FPKM, outputted by StringTie, and perform statistical test (F-test).
In this section, we will plot some figures to visualize the statistics of RNA-Seq data. First of all, we load expression data outputted from StringTie with ballgown package.
b <- ballgown(dataDir = 'ballgown', samplePattern = 'SRR')
pData(b) <- data.frame(id =  c('ctrl_1', 'ctrl_2', 'ctrl_3', 'cold_1', 'cold_2', 'cold_3'),
                       group = c('ctrl', 'ctrl', 'ctrl', 'cold', 'cold', 'cold'))After loading expression, we can retrieve expresion of transcripts and genes.
# transcript expression
head(texpr(b))##   FPKM.SRR1792924 FPKM.SRR1792925 FPKM.SRR1792926 FPKM.SRR1792927
## 1        4.125814        3.844283        4.593955        4.123420
## 2        2.796588        2.798470        3.924363        3.454246
## 3        0.534306        0.282695        0.668188        0.766483
## 4        0.725205        1.565021        1.152122        1.377866
## 5        0.619181        0.664552        0.453376        0.846209
## 6        1.171823        1.892226        0.772719        0.946749
##   FPKM.SRR1792928 FPKM.SRR1792929
## 1        3.694850        3.848086
## 2        3.479400        3.589532
## 3        0.444744        0.635831
## 4        1.048007        0.984064
## 5        0.533180        0.154207
## 6        1.959617        1.625064# gene expression
head(gexpr(b))##           FPKM.SRR1792924 FPKM.SRR1792925 FPKM.SRR1792926 FPKM.SRR1792927
## AT1G01030      1.25360604      1.22689400      1.29295800      1.31943000
## AT1G01110      1.66181793      1.92547586      1.80818711      1.90685011
## AT1G01130      2.27273100      4.40685400      3.13476900      3.65375300
## AT1G01150      0.06257279      0.04701473      0.08466496      0.08402109
## AT1G01270      0.00000000      0.00000000      0.00000000      0.00000000
## AT1G01280      0.00000000      0.00000000      0.00000000      0.00000000
##           FPKM.SRR1792928 FPKM.SRR1792929
## AT1G01030      1.48861482       1.3497740
## AT1G01110      1.79535574       1.6217058
## AT1G01130      3.83300400       3.7215440
## AT1G01150      0.08348642       0.0971909
## AT1G01270      0.00000000       0.0000000
## AT1G01280      0.00000000       0.0000000We extract normalized count data, FPKM, and plot the distribution of FPKM for each library. We can see that the shape of the distributions are similar to each other.
fig_countdist <- as.data.frame(texpr(b)) %>%
                    pivot_longer(everything(), names_to = 'library') %>%
                    mutate(value = value + 1) %>%
                    ggplot(aes(x = value)) +
                    geom_histogram(bins = 50) +
                    scale_x_log10() +
                    facet_wrap(~ library, nrow = 2)
print(fig_countdist)Then, we extract highly expressed transcripts for sample (library) clustering. From the clustering result, we can see that there were two clusters: (i) one contains three replicates of control group, and (ii) the other contains three replicates of cold treatment group. The clustering result reflects the design of this experiments.
b_highexp <- subset(b, 'rowVars(texpr(b)) > 1', genomesubset = TRUE)
tx_highexp <- texpr(b_highexp)
x_rho <- cor(tx_highexp, method = 'spearman')
x_hist <- hclust(as.dist(1 - x_rho), method = 'ward.D')
plot(x_hist, h = -1)We then use highly expressed genes to perform PCA and plot the first and second principal components. We can see that replicates in the control group are extreamly different from replicates in the cold group.
b_highexp <- subset(b, 'rowVars(texpr(b)) > 1', genomesubset = TRUE)
tx_highexp <- texpr(b_highexp)
x_pca <- prcomp(t(log10(tx_highexp + 1)), scale = FALSE)
fig_pca <- as.data.frame(x_pca$x) %>%
            mutate(libname = rownames(x_pca$x)) %>%
            ggplot(aes(x = PC1, y = PC2, label = libname)) +
            geom_label()
print(fig_pca)To identify differentially expressed transcripts, we use stattest function implemented in ballgown package. By changing feature argument to gene, we can identify DEGs.
det_table <- stattest(b_highexp, feature = 'transcript', covariate = 'group', getFC = TRUE, meas = 'FPKM')Then, we adjust the test results.
det_table <- data.frame(gene = geneNames(b_highexp),
                        gene_id = geneIDs(b_highexp),
                        log_mean = log10(rowMeans(texpr(b_highexp))),
                        det_table)
head(det_table)MA plot can be generated by the following scripts.
fig_ma <- ggplot(det_table, aes(x = log_mean, y = log2(fc), color = qval < 0.1)) +
              geom_point() +
              scale_color_manual(values = c('#666666', '#d62728'))
        
print(fig_ma)