RNA-Seq

In this workshop, we will learn how to quantify gene expression and how to identify differentially expressed genes (DEGs) of RNA-Seq data. We will use paired-end RNA-Seq data that was from Arabidopsis thaliana as an example.

Analysis environment

To prevent contamination of the system environment, we will build a virtual analysis environment with Anaconda. First of all, we use conda command to build an environment named atenv, and then install several bioinformatics tools.


x
# build an analysis environment
conda create --name atenv
# activate the analysis environment
conda activate atenv
# install bioinformatics tools
# conda install wget # for macOS
conda install -c bioconda fastqc trimmomatic hisat2 star kallisto samtools subread stringtie

We next create directories to save the RNA-Seq data, genome data, intermediate results, and the final results. We set an environment variable PROJECTT_PATHT to remember the root path of this project, just for convenience. Note that you should change lecturer01 to your user name. Also, note that, DO NOT ADD SPACES NEAR BY =.


xxxxxxxxxx
# project path
PROJECT_PATH=/home/lecturer01/projects/atrnaseq
mkdir -p ${PROJECT_PATH}
cd ${PROJECT_PATH}
pwd
# /home/lecturer01/projects/at
mkdir genome
mkdir fastq
mkdir cleaned_fastq
mkdir bam
mkdir count
mkdir ballgown

Dataset

RNA-Seq data

We will use six RNA-Seq datasets that sequenced from A. thaliana in Bazin et al. study. The RNA-Seq data, i.e., FASTQ files can be downloaded from EMBL-EBI with accession PRJNA480638. This dataset has six replicates. The first three replicates are the control sample, and the other three are nsra/b double mutant sample.


xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
# WT_1; Col0_RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/004/SRR1792924/SRR1792924_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/004/SRR1792924/SRR1792924_2.fastq.gz
# WT_2; Col0_RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1792925/SRR1792925_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1792925/SRR1792925_2.fastq.gz
# WT_3; Col0_RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/006/SRR1792926/SRR1792926_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/006/SRR1792926/SRR1792926_2.fastq.gz
# NSR_1; nsra/b RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/007/SRR1792927/SRR1792927_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/007/SRR1792927/SRR1792927_2.fastq.gz
# NSR_2; nsra/b RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/008/SRR1792928/SRR1792928_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/008/SRR1792928/SRR1792928_2.fastq.gz
# NSR_3; nsra/b RNAseq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/009/SRR1792929/SRR1792929_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/009/SRR1792929/SRR1792929_2.fastq.gz

Note that if you are using AI supercomputer, you can copy the FASTQ files from /data/workshop1/ws_rnaseq/fastq/at with the following command.


xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
cp /data/workshop1/hts/fastq/atrnaseq/*.fastq.gz .

Genome sequences

To quantify gene expression of RNA-Seq data, we need the reference genome and the gene annotations. Reference genome and annotations are available from the TAIR project website or the Ensembl website. Here, we download them from the Ensembl using the following scripts.


xxxxxxxxxx
cd ${PROJECT_PATH}/genome
# reference sequence
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
# transcript sequence
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
# gene annotation
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gz
# decompress
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.40.gff3.gz

Note that if you are using AI supercomputer, you can copy these files from /data/workshop1/ws_rnaseq/genome/TAIR10 with the following commands.


xxxxxxxxxx
cd ${PROJECT_PATH}/genome
cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa .
cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.40.gff3 .

QC

The sequencing data in FASTQ files may contain low-quality parts, such as artificial contaminations, poly-A tails, and so on. To determine detailed errors, we use FastQC (fastqc) to check all the FASTQ files.


xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
fastqc *.gz

FastQC generates a quality report for each library. Based on the reports, we then determine the thresholds for quality control (QC) against all FASTQ files using Trimmomatic (trimmomatic). Since we will perform the same QC processes against all FASTQ files, we will use for syntax to process all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY =, and DO NOT ADD ANY CHARACTERS AFTER \ including space.


xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
# SRR7508939_*.fastq.gz
trimmomatic PE -threads 4 \
    SRR1792924_1.fastq.gz SRR1792924_2.fastq.gz \
    SRR1792924_cleaned_1.fastq.gz SRR1792924_unpaired_1.fastq.gz \
    SRR1792924_cleaned_2.fastq.gz SRR1792924_unpaired_2.fastq.gz \
    LEADING:30 TRAILING:30 HEADCROP:13 MINLEN:60
# SRR1792925_*.fastq.gz to SRR1792929_*.fastq.gz
for libid in 25 26 27 28 29
do
  trimmomatic PE -threads 4 \
    SRR17929${libid}_1.fastq.gz SRR17929${libid}_2.fastq.gz \
    SRR17929${libid}_cleaned_1.fastq.gz SRR17929${libid}_unpaired_1.fastq.gz \
    SRR17929${libid}_cleaned_2.fastq.gz SRR17929${libid}_unpaired_2.fastq.gz \
    LEADING:30 TRAILING:30 HEADCROP:13 MINLEN:60
done
mv *_cleaned_*.gz ${PROJECT_PATH}/cleaned_fastq/

To make sure the QC was succesfully performed, we check qualities of all the cleaned FASTQ files with FastQC.


xxxxxxxxxx
cd ${PROJECT_PATH}/cleaned_fastq
fastqc *.gz

Mapping

Mapping is a main process in expression quantification of RNA-Seq data analysis. There are several mapping tools available for mapping. Choose one of the tools and try it.

HISAT2

We use hisat2-build to build an index and name the index as TAIR10_hisat2.


xxxxxxxxxx
cd ${PROJECT_PATH}/genome
hisat2-build -p 4 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa TAIR10_hisat2

Then, we can use hisat2 command to map reads onto the index TAIR10_hisat2.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
hisat2 -x ${PROJECT_PATH}/genome/TAIR10_hisat2 \
       -1 ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \
       -2 ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz \
       -p 16 -S SRR1792924.sam
for libid in 25 26 27 28 29
do
  hisat2 -x ${PROJECT_PATH}/genome/TAIR10_hisat2 \
         -1 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \
         -2 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz \
         -p 16 -S SRR17929${libid}.sam
done

STAR

Reference sequence index for STAR can be built by STAR command with genomeGenerate mode. We name the index as TAIR10_star. Note that STAR requires an amount of memory to build the index.


xxxxxxxxxx
cd ${PROJECT_PATH}/genome
mkdir TAIR10_star
STAR --runMode genomeGenerate \
     --runThreadN 4 \
     --genomeDir TAIR10_star \
     --genomeSAindexNbases 12 \
     --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
     --limitGenomeGenerateRAM 3400000000

Mapping can be done by STAR command.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
STAR --runThreadN 4 \
     --genomeDir ${PROJECT_PATH}/genome/TAIR10_star \
     --readFilesIn ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \
                   ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz \
     --readFilesCommand gunzip \
     --outFileNamePrefix SRR1792924_
# Jul 29 14:01:48 ..... started STAR run
# Jul 29 14:01:48 ..... loading genome
# Jul 29 14:01:50 ..... started mapping
# Jul 29 14:02:34 ..... finished mapping
# Jul 29 14:02:34 ..... finished successfully
mv SRR1792924_Aligned.out.sam SRR1792924.sam
for libid in 25 26 27 28 29
do
  STAR --runThreadN 4 \
       --genomeDir ${PROJECT_PATH}/genome/TAIR10_star \
       --readFilesIn ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \
                     ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz \
       --readFilesCommand gunzip \
       --outFileNamePrefix SRR17929${libid}_
  mv SRR17929${libid}_Aligned.out.sam SRR17929${libid}.sam
done

kallisto

kallisto is a pseudo-alignment tool. It requires transcript sequences for generating k-mer based index. Here, we use cDNA sequeneces (not the whole genome sequence) to generate the kallisto index.


xxxxxxxxxx
cd ${PROJECT_PATH}/genome
kallisto index -i TAIR10_kallisto Arabidopsis_thaliana.TAIR10.cdna.all.fa

Then, we use quant mode to perform pseudo-alignment.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
kallisto quant -i ${PROJECT_PATH}/genome/TAIR10_kallisto \
               -o SRR1792924 -b 100 \
               ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \
               ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz
# [quant] fragment length distribution will be estimated from the data
# [index] k-mer length: 31
# [index] number of targets: 48,359
# [index] number of k-mers: 45,567,218
# [index] number of equivalence classes: 93,973
# [quant] running in paired-end mode
# [quant] will process pair 1: /home/lecturer01/project/at/cleaned_fastq/SRR7508939_cleaned_1.fastq.gz
#                              /home/lecturer01/project/at/cleaned_fastq/SRR7508939_cleaned_2.fastq.gz
# [quant] finding pseudoalignments for the reads ... done
# [quant] processed 26,691,682 reads, 26,228,045 reads pseudoaligned
# [quant] estimated average fragment length: 145.869
# [   em] quantifying the abundances ... done
# [   em] the Expectation-Maximization algorithm ran for 1,274 rounds
# [bstrp] running EM for the bootstrap: 100
for libid in 25 26 27 28 29
do
  kallisto quant -i ${PROJECT_PATH}/genome/TAIR10_kallisto \
                 -o SRR17929${libid} -b 100 \
                 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \
                 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz
done

Alignment sorting

The mapping results are usually saved in SAM files, and the order of the reads are random. For convenience, we use samtools to convert SAM to BAM, and sort the reads according to chromosome coordinates. BAM format equals to SAM format, and the file size is smaller than SAM format.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
# samtools view -bS SRR1792924.sam | samtools sort -@ 4 > SRR1792924.bam
samtools sort -@ 4 SRR1792924.sam > SRR1792924.bam
samtools index SRR1792924.bam
for libid in 25 26 27 28 29
do
  samtools sort -@ 4 SRR17929${libid}.sam > SRR17929${libid}.bam
  samtools index SRR17929${libid}.bam
done

Quantification

featureCounts is used for counting the number of reads at specific gene regions. Here, we use featureCounts to count the number of reads at the exon region and summarize it according to gene IDs. The exon region and gene IDs are defined in the annotation file in GTF or GFF format.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
featureCounts -p -t gene -g ID \
              -a ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
              -o counts.txt \
              *.bam
mv counts.txt* ${PROJECT_PATH}/count

After executing featureCounts, we will retrieve the count.txt file that stores the gene expression information. In general, we call this data as an raw count data or raw counts. In the next step, we will use R to perform expression analysis.

Quantification with StringTie-Ballgown pipeline

StringTie (and Ballgown) are tools for DE analysis, aligning reads to genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts. Use StringTie pipeline for DE analysis, after the mapping and BAM sorting, we perform transcript reassemble.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
stringtie -p 4 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
          -o SRR1792924_tari10.gtf \
          -l SRR1792924 \
          SRR1792924.bam
          
for libid in 25 26 27 28 29
do
  stringtie -p 4 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
          -o SRR17929${libid}_tari10.gtf \
          -l SRR17929${libid} \
          SRR17929${libid}.bam
done

Then we merged the reassmbled transcript annotation to one file.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
echo -e "" > merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
echo "SRR1792924_tari10.gtf" >> merge_list.txt
stringtie --merge -p 16 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \
                  -o tair10_stringtie_merged.gtf \
                  merge_list.txt
gffcompare -r ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 -G \
           -o stringtie_merge tair10_stringtie_merged.gtf
# # gffcompare v0.11.2 | Command line was:
#
# = Summary for dataset: tair10_stringtie_merged.gtf
#     Query mRNAs :   58352 in   32246 loci  (46241 multi-exon transcripts)
#            (11591 multi-transcript loci, ~1.8 transcripts per locus)
#  Reference mRNAs :   54235 in   32619 loci  (42815 multi-exon)
#  Super-loci w/ reference transcripts:    32070
#  -----------------| Sensitivity | Precision  |
#         Base level:   100.0     |    98.8    |
#         Exon level:    97.8     |    96.5    |
#       Intron level:   100.0     |    97.8    |
# Intron chain level:   100.0     |    92.6    |
#   Transcript level:    99.5     |    92.5    |
#        Locus level:    99.7     |    99.3    |
#      Matching intron chains:   42815
#        Matching transcripts:   53983
#               Matching loci:   32536
#           Missed exons:       0/193121  (  0.0%)
#            Novel exons:     900/192441  (  0.5%)
#         Missed introns:       1/132525  (  0.0%)
#        Novel introns:     931/135466  (  0.7%)
#           Missed loci:       0/32619  (  0.0%)
#            Novel loci:     176/32246  (  0.5%)
#
# Total union super-loci across all input datasets: 32246
# 58352 out of 58352 consensus transcripts written in stringtie_merge.annotated.gtf (0 discarded as redundant)

Then we use StringTie to estimate the gene expression with new the annotations.


xxxxxxxxxx
cd ${PROJECT_PATH}/bam
stringtie -e -B -p 4 -G tair10_stringtie_merged.gtf \
          -o ${PROJECT_PATH}/ballgonwn/SRR1792924/SRR1792924_stringtie.gtf \
          SRR1792924.bam
          
          
for libid in 25 26 27 28 29
do
  stringtie -e -B -p 4 -G tair10_stringtie_merged.gtf \
            -o ${PROJECT_PATH}/ballgown/SRR17929${libid}/SRR17929${libid}_stringtie.gtf \
            SRR17929${libid}.bam
done

The result are save in ballgown directory. There were 6 files generated by StringTie. *.ctab files are the text file for id mapping, e.g. exon id to transcript id. The GTF file is a text file in GTF format which consits of gene annotations and transcript expression, i.e., FPKM and TPM.


xxxxxxxxxx
cd  ${PROJECT_PATH}/ballgown/SRR1792924
# total 84M
# -rw-r--r-- 1 lecturer01 lecturer01 4.0M Aug  4 07:24 e2t.ctab
# -rw-r--r-- 1 lecturer01 lecturer01  14M Aug  4 07:24 e_data.ctab
# -rw-r--r-- 1 lecturer01 lecturer01 3.3M Aug  4 07:24 i2t.ctab
# -rw-r--r-- 1 lecturer01 lecturer01 5.1M Aug  4 07:24 i_data.ctab
# -rw-r--r-- 1 lecturer01 lecturer01  54M Aug  4 07:24 SRR1792924_stringtie.gtf
# -rw-r--r-- 1 lecturer01 lecturer01 5.0M Aug  4 07:24 t_data.ctab
# drwx------ 2 lecturer01 lecturer01 4.0K Aug  4 07:21 tmp.XXehOX9J
head SRR1792924_stringtie.gtf
# 1 StringTie transcript  3631  5899  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; cov "19.784952"; FPKM "4.125814"; TPM "8.077307";
# 1 StringTie exon  3631  3913  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "1"; cov "19.091873";
# 1 StringTie exon  3996  4276  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "2"; cov "24.274023";
# 1 StringTie exon  4486  4605  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "3"; cov "30.108334";
# 1 StringTie exon  4706  5095  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "4"; cov "20.966667";
# 1 StringTie exon  5174  5326  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "5"; cov "20.751635";
# 1 StringTie exon  5439  5899  1000  + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "6"; cov "13.466377";
# 1 StringTie transcript  6788  9130  1000  - . gene_id "MSTRG.2"; transcript_id "transcript:AT1G01020.1"; cov "13.410775"; FPKM "2.796588"; TPM "5.475017";

After estimation of transcript expression, we can use ballgown to perform DE anlaysis in R platform.