In this workshop, we will learn how to quantify gene expression and how to identify differentially expressed genes (DEGs) of RNA-Seq data. We will use paired-end RNA-Seq data that was from Arabidopsis thaliana as an example.
To prevent contamination of the system environment, we will build a virtual analysis environment with Anaconda. First of all, we use conda
command to build an environment named atenv, and then install several bioinformatics tools.
x# build an analysis environment
conda create --name atenv
# activate the analysis environment
conda activate atenv
# install bioinformatics tools
# conda install wget # for macOS
conda install -c bioconda fastqc trimmomatic hisat2 star kallisto samtools subread stringtie
We next create directories to save the RNA-Seq data, genome data, intermediate results, and the final results. We set an environment variable PROJECTT_PATHT
to remember the root path of this project, just for convenience. Note that you should change lecturer01
to your user name. Also, note that, DO NOT ADD SPACES NEAR BY =
.
xxxxxxxxxx
# project path
PROJECT_PATH=/home/lecturer01/projects/atrnaseq
mkdir -p ${PROJECT_PATH}
cd ${PROJECT_PATH}
pwd
# /home/lecturer01/projects/at
mkdir genome
mkdir fastq
mkdir cleaned_fastq
mkdir bam
mkdir count
mkdir ballgown
We will use six RNA-Seq datasets that sequenced from A. thaliana in Bazin et al. study. The RNA-Seq data, i.e., FASTQ files can be downloaded from EMBL-EBI with accession PRJNA480638. This dataset has six replicates. The first three replicates are the control sample, and the other three are nsra/b double mutant sample.
xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
# WT_1; Col0_RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/004/SRR1792924/SRR1792924_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/004/SRR1792924/SRR1792924_2.fastq.gz
# WT_2; Col0_RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1792925/SRR1792925_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1792925/SRR1792925_2.fastq.gz
# WT_3; Col0_RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/006/SRR1792926/SRR1792926_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/006/SRR1792926/SRR1792926_2.fastq.gz
# NSR_1; nsra/b RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/007/SRR1792927/SRR1792927_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/007/SRR1792927/SRR1792927_2.fastq.gz
# NSR_2; nsra/b RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/008/SRR1792928/SRR1792928_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/008/SRR1792928/SRR1792928_2.fastq.gz
# NSR_3; nsra/b RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/009/SRR1792929/SRR1792929_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/009/SRR1792929/SRR1792929_2.fastq.gz
Note that if you are using AI supercomputer, you can copy the FASTQ files from /data/workshop1/ws_rnaseq/fastq/at
with the following command.
xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
cp /data/workshop1/hts/fastq/atrnaseq/*.fastq.gz .
To quantify gene expression of RNA-Seq data, we need the reference genome and the gene annotations. Reference genome and annotations are available from the TAIR project website or the Ensembl website. Here, we download them from the Ensembl using the following scripts.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
# reference sequence
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
# transcript sequence
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
# gene annotation
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gz
# decompress
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.40.gff3.gz
Note that if you are using AI supercomputer, you can copy these files from /data/workshop1/ws_rnaseq/genome/TAIR10
with the following commands.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa .
cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.40.gff3 .
The sequencing data in FASTQ files may contain low-quality parts, such as artificial contaminations, poly-A tails, and so on. To determine detailed errors, we use FastQC (fastqc
) to check all the FASTQ files.
xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
fastqc *.gz
FastQC generates a quality report for each library. Based on the reports, we then determine the thresholds for quality control (QC) against all FASTQ files using Trimmomatic (trimmomatic
). Since we will perform the same QC processes against all FASTQ files, we will use for
syntax to process all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY =
, and DO NOT ADD ANY CHARACTERS AFTER \
including space.
xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
# SRR7508939_*.fastq.gz
trimmomatic PE -threads 4 \
SRR1792924_1.fastq.gz SRR1792924_2.fastq.gz \
SRR1792924_cleaned_1.fastq.gz SRR1792924_unpaired_1.fastq.gz \
SRR1792924_cleaned_2.fastq.gz SRR1792924_unpaired_2.fastq.gz \
LEADING:30 TRAILING:30 HEADCROP:13 MINLEN:60
# SRR1792925_*.fastq.gz to SRR1792929_*.fastq.gz
for libid in 25 26 27 28 29
do
trimmomatic PE -threads 4 \
SRR17929${libid}_1.fastq.gz SRR17929${libid}_2.fastq.gz \
SRR17929${libid}_cleaned_1.fastq.gz SRR17929${libid}_unpaired_1.fastq.gz \
SRR17929${libid}_cleaned_2.fastq.gz SRR17929${libid}_unpaired_2.fastq.gz \
LEADING:30 TRAILING:30 HEADCROP:13 MINLEN:60
done
mv *_cleaned_*.gz ${PROJECT_PATH}/cleaned_fastq/
To make sure the QC was succesfully performed, we check qualities of all the cleaned FASTQ files with FastQC.
xxxxxxxxxx
cd ${PROJECT_PATH}/cleaned_fastq
fastqc *.gz
Mapping is a main process in expression quantification of RNA-Seq data analysis. There are several mapping tools available for mapping. Choose one of the tools and try it.
We use hisat2-build
to build an index and name the index as TAIR10_hisat2.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
hisat2-build -p 4 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa TAIR10_hisat2
Then, we can use hisat2
command to map reads onto the index TAIR10_hisat2.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
hisat2 -x ${PROJECT_PATH}/genome/TAIR10_hisat2 \
-1 ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \
-2 ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz \
-p 16 -S SRR1792924.sam
for libid in 25 26 27 28 29
do
hisat2 -x ${PROJECT_PATH}/genome/TAIR10_hisat2 \
-1 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \
-2 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz \
-p 16 -S SRR17929${libid}.sam
done
Reference sequence index for STAR can be built by STAR
command with genomeGenerate
mode. We name the index as TAIR10_star. Note that STAR requires an amount of memory to build the index.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
mkdir TAIR10_star
STAR --runMode genomeGenerate \
--runThreadN 4 \
--genomeDir TAIR10_star \
--genomeSAindexNbases 12 \
--genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
--limitGenomeGenerateRAM 3400000000
Mapping can be done by STAR
command.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
STAR --runThreadN 4 \
--genomeDir ${PROJECT_PATH}/genome/TAIR10_star \
--readFilesIn ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \
${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz \
--readFilesCommand gunzip \
--outFileNamePrefix SRR1792924_
# Jul 29 14:01:48 ..... started STAR run
# Jul 29 14:01:48 ..... loading genome
# Jul 29 14:01:50 ..... started mapping
# Jul 29 14:02:34 ..... finished mapping
# Jul 29 14:02:34 ..... finished successfully
mv SRR1792924_Aligned.out.sam SRR1792924.sam
for libid in 25 26 27 28 29
do
STAR --runThreadN 4 \
--genomeDir ${PROJECT_PATH}/genome/TAIR10_star \
--readFilesIn ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \
${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz \
--readFilesCommand gunzip \
--outFileNamePrefix SRR17929${libid}_
mv SRR17929${libid}_Aligned.out.sam SRR17929${libid}.sam
done
kallisto is a pseudo-alignment tool. It requires transcript sequences for generating k-mer based index. Here, we use cDNA sequeneces (not the whole genome sequence) to generate the kallisto index.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
kallisto index -i TAIR10_kallisto Arabidopsis_thaliana.TAIR10.cdna.all.fa
Then, we use quant
mode to perform pseudo-alignment.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
kallisto quant -i ${PROJECT_PATH}/genome/TAIR10_kallisto \
-o SRR1792924 -b 100 \
${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \
${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz
# [quant] fragment length distribution will be estimated from the data
# [index] k-mer length: 31
# [index] number of targets: 48,359
# [index] number of k-mers: 45,567,218
# [index] number of equivalence classes: 93,973
# [quant] running in paired-end mode
# [quant] will process pair 1: /home/lecturer01/project/at/cleaned_fastq/SRR7508939_cleaned_1.fastq.gz
# /home/lecturer01/project/at/cleaned_fastq/SRR7508939_cleaned_2.fastq.gz
# [quant] finding pseudoalignments for the reads ... done
# [quant] processed 26,691,682 reads, 26,228,045 reads pseudoaligned
# [quant] estimated average fragment length: 145.869
# [ em] quantifying the abundances ... done
# [ em] the Expectation-Maximization algorithm ran for 1,274 rounds
# [bstrp] running EM for the bootstrap: 100
for libid in 25 26 27 28 29
do
kallisto quant -i ${PROJECT_PATH}/genome/TAIR10_kallisto \
-o SRR17929${libid} -b 100 \
${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \
${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz
done
The mapping results are usually saved in SAM files, and the order of the reads are random. For convenience, we use samtools
to convert SAM to BAM, and sort the reads according to chromosome coordinates. BAM format equals to SAM format, and the file size is smaller than SAM format.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
# samtools view -bS SRR1792924.sam | samtools sort -@ 4 > SRR1792924.bam
samtools sort -@ 4 SRR1792924.sam > SRR1792924.bam
samtools index SRR1792924.bam
for libid in 25 26 27 28 29
do
samtools sort -@ 4 SRR17929${libid}.sam > SRR17929${libid}.bam
samtools index SRR17929${libid}.bam
done
featureCounts
is used for counting the number of reads at specific gene regions. Here, we use featureCounts
to count the number of reads at the exon region and summarize it according to gene IDs. The exon region and gene IDs are defined in the annotation file in GTF or GFF format.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
featureCounts -p -t gene -g ID \
-a ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
-o counts.txt \
*.bam
mv counts.txt* ${PROJECT_PATH}/count
After executing featureCounts
, we will retrieve the count.txt file that stores the gene expression information. In general, we call this data as an raw count data or raw counts. In the next step, we will use R to perform expression analysis.
StringTie (and Ballgown) are tools for DE analysis, aligning reads to genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts. Use StringTie pipeline for DE analysis, after the mapping and BAM sorting, we perform transcript reassemble.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
stringtie -p 4 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
-o SRR1792924_tari10.gtf \
-l SRR1792924 \
SRR1792924.bam
for libid in 25 26 27 28 29
do
stringtie -p 4 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
-o SRR17929${libid}_tari10.gtf \
-l SRR17929${libid} \
SRR17929${libid}.bam
done
Then we merged the reassmbled transcript annotation to one file.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
echo -e "" > merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
stringtie --merge -p 16 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
-o tair10_stringtie_merged.gtf \
merge_list.txt
gffcompare -r ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 -G \
-o stringtie_merge tair10_stringtie_merged.gtf
# # gffcompare v0.11.2 | Command line was:
#
# = Summary for dataset: tair10_stringtie_merged.gtf
# Query mRNAs : 58352 in 32246 loci (46241 multi-exon transcripts)
# (11591 multi-transcript loci, ~1.8 transcripts per locus)
# Reference mRNAs : 54235 in 32619 loci (42815 multi-exon)
# Super-loci w/ reference transcripts: 32070
# -----------------| Sensitivity | Precision |
# Base level: 100.0 | 98.8 |
# Exon level: 97.8 | 96.5 |
# Intron level: 100.0 | 97.8 |
# Intron chain level: 100.0 | 92.6 |
# Transcript level: 99.5 | 92.5 |
# Locus level: 99.7 | 99.3 |
# Matching intron chains: 42815
# Matching transcripts: 53983
# Matching loci: 32536
# Missed exons: 0/193121 ( 0.0%)
# Novel exons: 900/192441 ( 0.5%)
# Missed introns: 1/132525 ( 0.0%)
# Novel introns: 931/135466 ( 0.7%)
# Missed loci: 0/32619 ( 0.0%)
# Novel loci: 176/32246 ( 0.5%)
#
# Total union super-loci across all input datasets: 32246
# 58352 out of 58352 consensus transcripts written in stringtie_merge.annotated.gtf (0 discarded as redundant)
Then we use StringTie to estimate the gene expression with new the annotations.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
stringtie -e -B -p 4 -G tair10_stringtie_merged.gtf \
-o ${PROJECT_PATH}/ballgonwn/SRR1792924/SRR1792924_stringtie.gtf \
SRR1792924.bam
for libid in 25 26 27 28 29
do
stringtie -e -B -p 4 -G tair10_stringtie_merged.gtf \
-o ${PROJECT_PATH}/ballgown/SRR17929${libid}/SRR17929${libid}_stringtie.gtf \
SRR17929${libid}.bam
done
The result are save in ballgown directory. There were 6 files generated by StringTie. *.ctab files are the text file for id mapping, e.g. exon id to transcript id. The GTF file is a text file in GTF format which consits of gene annotations and transcript expression, i.e., FPKM and TPM.
xxxxxxxxxx
cd ${PROJECT_PATH}/ballgown/SRR1792924
# total 84M
# -rw-r--r-- 1 lecturer01 lecturer01 4.0M Aug 4 07:24 e2t.ctab
# -rw-r--r-- 1 lecturer01 lecturer01 14M Aug 4 07:24 e_data.ctab
# -rw-r--r-- 1 lecturer01 lecturer01 3.3M Aug 4 07:24 i2t.ctab
# -rw-r--r-- 1 lecturer01 lecturer01 5.1M Aug 4 07:24 i_data.ctab
# -rw-r--r-- 1 lecturer01 lecturer01 54M Aug 4 07:24 SRR1792924_stringtie.gtf
# -rw-r--r-- 1 lecturer01 lecturer01 5.0M Aug 4 07:24 t_data.ctab
# drwx------ 2 lecturer01 lecturer01 4.0K Aug 4 07:21 tmp.XXehOX9J
head SRR1792924_stringtie.gtf
# 1 StringTie transcript 3631 5899 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; cov "19.784952"; FPKM "4.125814"; TPM "8.077307";
# 1 StringTie exon 3631 3913 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "1"; cov "19.091873";
# 1 StringTie exon 3996 4276 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "2"; cov "24.274023";
# 1 StringTie exon 4486 4605 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "3"; cov "30.108334";
# 1 StringTie exon 4706 5095 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "4"; cov "20.966667";
# 1 StringTie exon 5174 5326 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "5"; cov "20.751635";
# 1 StringTie exon 5439 5899 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "6"; cov "13.466377";
# 1 StringTie transcript 6788 9130 1000 - . gene_id "MSTRG.2"; transcript_id "transcript:AT1G01020.1"; cov "13.410775"; FPKM "2.796588"; TPM "5.475017";
After estimation of transcript expression, we can use ballgown to perform DE anlaysis in R platform.