This tutorial introduces a standard analysis precedures for differential expression (DE) analyses with RNA-Seq data. We aim to (i) identify differentially expressed genes (DEGs) from RNA-Seq data (also called as read count data), and (ii) unravel the biological functions of these DEGs. Since there are two types of DEG identification procedures, count-based and FPKM-based, we will introduces both procedures in this tutorial.

In this tutorial, we use a simply designed experimental dataset to study DE analyses. The dataset, obtained from Arabidopsis thaliana, consists of two groups, control and mutation groups. The control group consists of three biological replicates, and the mutation group also consists of three biological replicates. We will compare the two groups, and identify DEGs between the two groups.

Analysis environment

RNA-Seq data analysis requires some specific packages, for example, edgeR and DESeq2 for DEG identification, clusterProfiler for gene ontology (GO) enrichment anlaysis. It does not like common R packages, these packages are usually reposited on Bioconductor. Here, we start R, and install the required packages from Bioconductor. In addition, for convenience, we also install some useful packages for data manipulation from CRAN.

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))

BiocManager::install(c("edgeR", "DESeq2", "clusterProfiler", "org.At.tair.db", "ballgown", "genefilter"))

# CRAN packages
install.packages(c("tidyverse", "ggsci", "gplots"))

After the installation of R/Bioconductor packages, we restart R, and load the packages.

library(edgeR)              # DE analysis
library(DESeq2)             # DE analysis
library(ballgown)           # DE analysis 
library(genefilter)         # DE analysis
library(clusterProfiler)    # GO analysis
library(DOSE)               # GO analysis
library(enrichplot)         # GO analysis
library(org.At.tair.db)     # GO analysis

To enable R find the count data, we set R working directory.


Count-based DEG identification

The count-based procedures use raw read counts directly for DEG identification. The procedures takes three steps to identify DEGs: (i) normalize counts (only adjust the library size), (ii) estimate dispersions, and (iii) perfrom statistical test (LRT, Wald test). There are two packages named edgeR and DESeq famouse for DEG identification.


Overviews of whole datasets are essentkial steps of data anlaysis. In this section, we will summarise the statistics of this dataset and visualize them with charts, to check whether the data is suitable for analyses or not. First, we load expression data with read.table function.

d <- read.table('count/counts.txt', sep = '\t', header = TRUE)

Since most packages requires count data as a matrix type object, we format the loaded data into a matrix type, and set up row name and column name.

x <- as.matrix(d[, 7:ncol(d)])
rownames(x) <- sapply(strsplit(d[, 1], ':'), '[', 2)
colnames(x) <- c('ctrl_1', 'ctrl_2', 'ctrl_3', 'cold_1', 'cold_2', 'cold_3')
##           ctrl_1 ctrl_2 ctrl_3 cold_1 cold_2 cold_3
## AT1G01010    233    200    287    273    270    292
## AT1G01020    374    351    417    474    496    493
## AT1G01030     80     71     90     98    123    114
## AT1G01040   1729   1604   1962   1684   1866   1967
## AT1G01050   1946   1755   2083   2297   2451   2440
## AT1G01060    315    275    401    213    272    222

The total number of reads, library size, in a library reflects the sequencing performance. We use bar chart to visualize the library size. From the figure, we can see that the six libraries are similar to each other, and there is no outlier among these libraries.

# barplot(colSums(x))

fig_libsize <- data.frame(libname = colnames(x), libsize = colSums(x)) %>%
                  ggplot(aes(x = libname, y = libsize)) +
                  scale_y_log10() +
                  geom_bar(stat = 'identity')

Next, we are going to check the distribution of read count of each library. We can see that each library has a sharp peak at zero of x-axis and a gentle peak between 1e2 and 1e3. The x-axis of two peaks among the six libraries are similar to each other.

# hist(log10(x[, 1]+ 1))

fig_countdist <- %>%
                    pivot_longer(everything(), names_to = 'library') %>%
                    mutate(value = value + 1) %>%
                    ggplot(aes(x = value)) +
                    geom_histogram(bins = 50) +
                    scale_x_log10() +
                    facet_wrap(~ library, nrow = 2)

Sample (library) clustering gives us insights for discussion and chances to explore outliers in the dataset. Hierarchical clustering and PCA are commonly used for this purpose. Hierarchical clustering can be anlayzed by calculating spearman correlation coefficients between two libraries across all combinations. To avoid sequencing errors, usually, we only use the highly expressed genes for clustering.

From the clustering result, we can see that there were two clusters: (i) one contains three replicates of control group, and (ii) the other contains three replicates of mutation group. The clustering result reflects the design of this experiments.

x_log10 <- log10(sweep(x, 2, 1e6 / colSums(x), '*') + 1)
x_log10_highexp <- x_log10[rowSums(x_log10 > 0) > 3, ]

# only use high expressed genes
x_rho <- cor(x_log10_highexp, method = 'spearman')
x_hist <- hclust(as.dist(1 - x_rho), method = 'ward.D')
plot(x_hist, h = -1)

We then use highly expressed genes to perform PCA and plot the first and second principal components. We can see that replicates in the control group are extreamly different from replicates in the mutation group.

x_log10 <- log10(sweep(x, 2, 1e6 / colSums(x), '*') + 1)
x_log10_highexp <- x_log10[rowSums(x_log10 > 0) > 3, ]

# only use high expressed genes
x_pca <- prcomp(t(x_log10_highexp), scale = FALSE)

# plot(x_pca$x[, c('PC1', 'PC2')])
# text(x_pca$x[, c('PC1', 'PC2')], labels = rownames(x_pca$x))

fig_pca <-$x) %>%
            mutate(libname = rownames(x_pca$x)) %>%
            ggplot(aes(x = PC1, y = PC2, label = libname)) +

DEG identification (edgeR)

Differential expression analysis aims to idenitfy the genes that are differentially expressed between two groups. To use edgeR for DEG identification, we prepare the group information and experimental design for the count data.

x_group <- factor(c(1, 1, 1, 2, 2, 2))
x_design <- model.matrix(~ x_group)
##   (Intercept) x_group2
## 1           1        0
## 2           1        0
## 3           1        0
## 4           1        1
## 5           1        1
## 6           1        1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x_group
## [1] "contr.treatment"

Then, we set up count data and group information to edgeR class object. Since genes with low expression affects accuracy of statistical test, we use filterByExpr function with default parameters to remove the low expressed genes.

y <- DGEList(counts = x, group = x_group)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

We use MDS plot to check the distance among the replicates. MDS reuslt should be similar to the reuslts of PCA and hierachical clustering.