In this workshop, we will learn how to quantify gene expression and how to identify differentially expressed genes (DEGs) of RNA-Seq data. We will use paired-end RNA-Seq data that was from Arabidopsis thaliana as an example.
To prevent contamination of the system environment, we will build a virtual analysis environment with Anaconda. First of all, we use conda command to build an environment named atenv, and then install several bioinformatics tools.
x# build an analysis environmentconda create --name atenv# activate the analysis environmentconda activate atenv# install bioinformatics tools# conda install wget # for macOSconda install -c bioconda fastqc trimmomatic hisat2 star kallisto samtools subread stringtieWe next create directories to save the RNA-Seq data, genome data, intermediate results, and the final results. We set an environment variable PROJECTT_PATHT to remember the root path of this project, just for convenience. Note that you should change lecturer01 to your user name. Also, note that, DO NOT ADD SPACES NEAR BY =.
xxxxxxxxxx# project pathPROJECT_PATH=/home/lecturer01/projects/atrnaseqmkdir -p ${PROJECT_PATH}cd ${PROJECT_PATH}pwd# /home/lecturer01/projects/atmkdir genomemkdir fastqmkdir cleaned_fastqmkdir bammkdir countmkdir ballgown
We will use six RNA-Seq datasets that sequenced from A. thaliana in Bazin et al. study. The RNA-Seq data, i.e., FASTQ files can be downloaded from EMBL-EBI with accession PRJNA480638. This dataset has six replicates. The first three replicates are the control sample, and the other three are nsra/b double mutant sample.
xxxxxxxxxxcd ${PROJECT_PATH}/fastq# WT_1; Col0_RNAseqwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/004/SRR1792924/SRR1792924_1.fastq.gzwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/004/SRR1792924/SRR1792924_2.fastq.gz# WT_2; Col0_RNAseqwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1792925/SRR1792925_1.fastq.gzwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1792925/SRR1792925_2.fastq.gz# WT_3; Col0_RNAseqwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/006/SRR1792926/SRR1792926_1.fastq.gzwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/006/SRR1792926/SRR1792926_2.fastq.gz# NSR_1; nsra/b RNAseqwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/007/SRR1792927/SRR1792927_1.fastq.gzwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/007/SRR1792927/SRR1792927_2.fastq.gz# NSR_2; nsra/b RNAseqwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/008/SRR1792928/SRR1792928_1.fastq.gzwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/008/SRR1792928/SRR1792928_2.fastq.gz# NSR_3; nsra/b RNAseqwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/009/SRR1792929/SRR1792929_1.fastq.gzwget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/009/SRR1792929/SRR1792929_2.fastq.gzNote that if you are using AI supercomputer, you can copy the FASTQ files from /data/workshop1/ws_rnaseq/fastq/at with the following command.
xxxxxxxxxxcd ${PROJECT_PATH}/fastqcp /data/workshop1/hts/fastq/atrnaseq/*.fastq.gz .
To quantify gene expression of RNA-Seq data, we need the reference genome and the gene annotations. Reference genome and annotations are available from the TAIR project website or the Ensembl website. Here, we download them from the Ensembl using the following scripts.
xxxxxxxxxxcd ${PROJECT_PATH}/genome# reference sequencewget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz# transcript sequencewget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz# gene annotationwget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gz# decompressgunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gzgunzip Arabidopsis_thaliana.TAIR10.cdna.all.fa.gzgunzip Arabidopsis_thaliana.TAIR10.40.gff3.gzNote that if you are using AI supercomputer, you can copy these files from /data/workshop1/ws_rnaseq/genome/TAIR10 with the following commands.
xxxxxxxxxxcd ${PROJECT_PATH}/genomecp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa .cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.40.gff3 .
The sequencing data in FASTQ files may contain low-quality parts, such as artificial contaminations, poly-A tails, and so on. To determine detailed errors, we use FastQC (fastqc) to check all the FASTQ files.
xxxxxxxxxxcd ${PROJECT_PATH}/fastqfastqc *.gzFastQC generates a quality report for each library. Based on the reports, we then determine the thresholds for quality control (QC) against all FASTQ files using Trimmomatic (trimmomatic). Since we will perform the same QC processes against all FASTQ files, we will use for syntax to process all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY =, and DO NOT ADD ANY CHARACTERS AFTER \ including space.
xxxxxxxxxxcd ${PROJECT_PATH}/fastq# SRR7508939_*.fastq.gztrimmomatic PE -threads 4 \ SRR1792924_1.fastq.gz SRR1792924_2.fastq.gz \ SRR1792924_cleaned_1.fastq.gz SRR1792924_unpaired_1.fastq.gz \ SRR1792924_cleaned_2.fastq.gz SRR1792924_unpaired_2.fastq.gz \ LEADING:30 TRAILING:30 HEADCROP:13 MINLEN:60# SRR1792925_*.fastq.gz to SRR1792929_*.fastq.gzfor libid in 25 26 27 28 29do trimmomatic PE -threads 4 \ SRR17929${libid}_1.fastq.gz SRR17929${libid}_2.fastq.gz \ SRR17929${libid}_cleaned_1.fastq.gz SRR17929${libid}_unpaired_1.fastq.gz \ SRR17929${libid}_cleaned_2.fastq.gz SRR17929${libid}_unpaired_2.fastq.gz \ LEADING:30 TRAILING:30 HEADCROP:13 MINLEN:60donemv *_cleaned_*.gz ${PROJECT_PATH}/cleaned_fastq/To make sure the QC was succesfully performed, we check qualities of all the cleaned FASTQ files with FastQC.
xxxxxxxxxxcd ${PROJECT_PATH}/cleaned_fastqfastqc *.gz
Mapping is a main process in expression quantification of RNA-Seq data analysis. There are several mapping tools available for mapping. Choose one of the tools and try it.
We use hisat2-build to build an index and name the index as TAIR10_hisat2.
xxxxxxxxxxcd ${PROJECT_PATH}/genomehisat2-build -p 4 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa TAIR10_hisat2Then, we can use hisat2 command to map reads onto the index TAIR10_hisat2.
xxxxxxxxxxcd ${PROJECT_PATH}/bamhisat2 -x ${PROJECT_PATH}/genome/TAIR10_hisat2 \ -1 ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \ -2 ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz \ -p 16 -S SRR1792924.samfor libid in 25 26 27 28 29do hisat2 -x ${PROJECT_PATH}/genome/TAIR10_hisat2 \ -1 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \ -2 ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz \ -p 16 -S SRR17929${libid}.samdone
Reference sequence index for STAR can be built by STAR command with genomeGenerate mode. We name the index as TAIR10_star. Note that STAR requires an amount of memory to build the index.
xxxxxxxxxxcd ${PROJECT_PATH}/genomemkdir TAIR10_starSTAR --runMode genomeGenerate \ --runThreadN 4 \ --genomeDir TAIR10_star \ --genomeSAindexNbases 12 \ --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \ --limitGenomeGenerateRAM 3400000000Mapping can be done by STAR command.
xxxxxxxxxxcd ${PROJECT_PATH}/bamSTAR --runThreadN 4 \ --genomeDir ${PROJECT_PATH}/genome/TAIR10_star \ --readFilesIn ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \ ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz \ --readFilesCommand gunzip \ --outFileNamePrefix SRR1792924_# Jul 29 14:01:48 ..... started STAR run# Jul 29 14:01:48 ..... loading genome# Jul 29 14:01:50 ..... started mapping# Jul 29 14:02:34 ..... finished mapping# Jul 29 14:02:34 ..... finished successfullymv SRR1792924_Aligned.out.sam SRR1792924.samfor libid in 25 26 27 28 29do STAR --runThreadN 4 \ --genomeDir ${PROJECT_PATH}/genome/TAIR10_star \ --readFilesIn ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \ ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gz \ --readFilesCommand gunzip \ --outFileNamePrefix SRR17929${libid}_ mv SRR17929${libid}_Aligned.out.sam SRR17929${libid}.samdonekallisto is a pseudo-alignment tool. It requires transcript sequences for generating k-mer based index. Here, we use cDNA sequeneces (not the whole genome sequence) to generate the kallisto index.
xxxxxxxxxxcd ${PROJECT_PATH}/genomekallisto index -i TAIR10_kallisto Arabidopsis_thaliana.TAIR10.cdna.all.faThen, we use quant mode to perform pseudo-alignment.
xxxxxxxxxxcd ${PROJECT_PATH}/bamkallisto quant -i ${PROJECT_PATH}/genome/TAIR10_kallisto \ -o SRR1792924 -b 100 \ ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_1.fastq.gz \ ${PROJECT_PATH}/cleaned_fastq/SRR1792924_cleaned_2.fastq.gz# [quant] fragment length distribution will be estimated from the data# [index] k-mer length: 31# [index] number of targets: 48,359# [index] number of k-mers: 45,567,218# [index] number of equivalence classes: 93,973# [quant] running in paired-end mode# [quant] will process pair 1: /home/lecturer01/project/at/cleaned_fastq/SRR7508939_cleaned_1.fastq.gz# /home/lecturer01/project/at/cleaned_fastq/SRR7508939_cleaned_2.fastq.gz# [quant] finding pseudoalignments for the reads ... done# [quant] processed 26,691,682 reads, 26,228,045 reads pseudoaligned# [quant] estimated average fragment length: 145.869# [ em] quantifying the abundances ... done# [ em] the Expectation-Maximization algorithm ran for 1,274 rounds# [bstrp] running EM for the bootstrap: 100for libid in 25 26 27 28 29do kallisto quant -i ${PROJECT_PATH}/genome/TAIR10_kallisto \ -o SRR17929${libid} -b 100 \ ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_1.fastq.gz \ ${PROJECT_PATH}/cleaned_fastq/SRR17929${libid}_cleaned_2.fastq.gzdone
The mapping results are usually saved in SAM files, and the order of the reads are random. For convenience, we use samtools to convert SAM to BAM, and sort the reads according to chromosome coordinates. BAM format equals to SAM format, and the file size is smaller than SAM format.
xxxxxxxxxxcd ${PROJECT_PATH}/bam# samtools view -bS SRR1792924.sam | samtools sort -@ 4 > SRR1792924.bamsamtools sort -@ 4 SRR1792924.sam > SRR1792924.bamsamtools index SRR1792924.bamfor libid in 25 26 27 28 29do samtools sort -@ 4 SRR17929${libid}.sam > SRR17929${libid}.bam samtools index SRR17929${libid}.bamdone
featureCounts is used for counting the number of reads at specific gene regions. Here, we use featureCounts to count the number of reads at the exon region and summarize it according to gene IDs. The exon region and gene IDs are defined in the annotation file in GTF or GFF format.
xxxxxxxxxxcd ${PROJECT_PATH}/bamfeatureCounts -p -t gene -g ID \ -a ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \ -o counts.txt \ *.bammv counts.txt* ${PROJECT_PATH}/countAfter executing featureCounts, we will retrieve the count.txt file that stores the gene expression information. In general, we call this data as an raw count data or raw counts. In the next step, we will use R to perform expression analysis.
StringTie (and Ballgown) are tools for DE analysis, aligning reads to genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts. Use StringTie pipeline for DE analysis, after the mapping and BAM sorting, we perform transcript reassemble.
xxxxxxxxxxcd ${PROJECT_PATH}/bamstringtie -p 4 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \ -o SRR1792924_tari10.gtf \ -l SRR1792924 \ SRR1792924.bam for libid in 25 26 27 28 29do stringtie -p 4 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \ -o SRR17929${libid}_tari10.gtf \ -l SRR17929${libid} \ SRR17929${libid}.bamdoneThen we merged the reassmbled transcript annotation to one file.
xxxxxxxxxxcd ${PROJECT_PATH}/bamecho -e "" > merge_list.txtecho "SRR1792924_tari10.gtf" >> merge_list.txtecho "SRR1792924_tari10.gtf" >> merge_list.txtecho "SRR1792924_tari10.gtf" >> merge_list.txtecho "SRR1792924_tari10.gtf" >> merge_list.txtecho "SRR1792924_tari10.gtf" >> merge_list.txtecho "SRR1792924_tari10.gtf" >> merge_list.txtstringtie --merge -p 16 -G ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 \ -o tair10_stringtie_merged.gtf \ merge_list.txtgffcompare -r ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 -G \ -o stringtie_merge tair10_stringtie_merged.gtf# # gffcompare v0.11.2 | Command line was:## = Summary for dataset: tair10_stringtie_merged.gtf# Query mRNAs : 58352 in 32246 loci (46241 multi-exon transcripts)# (11591 multi-transcript loci, ~1.8 transcripts per locus)# Reference mRNAs : 54235 in 32619 loci (42815 multi-exon)# Super-loci w/ reference transcripts: 32070# -----------------| Sensitivity | Precision |# Base level: 100.0 | 98.8 |# Exon level: 97.8 | 96.5 |# Intron level: 100.0 | 97.8 |# Intron chain level: 100.0 | 92.6 |# Transcript level: 99.5 | 92.5 |# Locus level: 99.7 | 99.3 |# Matching intron chains: 42815# Matching transcripts: 53983# Matching loci: 32536# Missed exons: 0/193121 ( 0.0%)# Novel exons: 900/192441 ( 0.5%)# Missed introns: 1/132525 ( 0.0%)# Novel introns: 931/135466 ( 0.7%)# Missed loci: 0/32619 ( 0.0%)# Novel loci: 176/32246 ( 0.5%)## Total union super-loci across all input datasets: 32246# 58352 out of 58352 consensus transcripts written in stringtie_merge.annotated.gtf (0 discarded as redundant)Then we use StringTie to estimate the gene expression with new the annotations.
xxxxxxxxxxcd ${PROJECT_PATH}/bamstringtie -e -B -p 4 -G tair10_stringtie_merged.gtf \ -o ${PROJECT_PATH}/ballgonwn/SRR1792924/SRR1792924_stringtie.gtf \ SRR1792924.bam for libid in 25 26 27 28 29do stringtie -e -B -p 4 -G tair10_stringtie_merged.gtf \ -o ${PROJECT_PATH}/ballgown/SRR17929${libid}/SRR17929${libid}_stringtie.gtf \ SRR17929${libid}.bamdoneThe result are save in ballgown directory. There were 6 files generated by StringTie. *.ctab files are the text file for id mapping, e.g. exon id to transcript id. The GTF file is a text file in GTF format which consits of gene annotations and transcript expression, i.e., FPKM and TPM.
xxxxxxxxxxcd ${PROJECT_PATH}/ballgown/SRR1792924# total 84M# -rw-r--r-- 1 lecturer01 lecturer01 4.0M Aug 4 07:24 e2t.ctab# -rw-r--r-- 1 lecturer01 lecturer01 14M Aug 4 07:24 e_data.ctab# -rw-r--r-- 1 lecturer01 lecturer01 3.3M Aug 4 07:24 i2t.ctab# -rw-r--r-- 1 lecturer01 lecturer01 5.1M Aug 4 07:24 i_data.ctab# -rw-r--r-- 1 lecturer01 lecturer01 54M Aug 4 07:24 SRR1792924_stringtie.gtf# -rw-r--r-- 1 lecturer01 lecturer01 5.0M Aug 4 07:24 t_data.ctab# drwx------ 2 lecturer01 lecturer01 4.0K Aug 4 07:21 tmp.XXehOX9Jhead SRR1792924_stringtie.gtf# 1 StringTie transcript 3631 5899 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; cov "19.784952"; FPKM "4.125814"; TPM "8.077307";# 1 StringTie exon 3631 3913 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "1"; cov "19.091873";# 1 StringTie exon 3996 4276 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "2"; cov "24.274023";# 1 StringTie exon 4486 4605 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "3"; cov "30.108334";# 1 StringTie exon 4706 5095 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "4"; cov "20.966667";# 1 StringTie exon 5174 5326 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "5"; cov "20.751635";# 1 StringTie exon 5439 5899 1000 + . gene_id "MSTRG.1"; transcript_id "transcript:AT1G01010.1"; exon_number "6"; cov "13.466377";# 1 StringTie transcript 6788 9130 1000 - . gene_id "MSTRG.2"; transcript_id "transcript:AT1G01020.1"; cov "13.410775"; FPKM "2.796588"; TPM "5.475017";After estimation of transcript expression, we can use ballgown to perform DE anlaysis in R platform.