In this workshop, we will learn how to quantify gene expression and how to identify differentially expressed genes (DEGs) of RNA-Seq data. We will use paired-end RNA-Seq data that was from Arabidopsis thaliana as an example.
To prevent contamination of the system environment, we will build a virtual analysis environment with Anaconda. First of all, we use
conda command to build an environment named atenv, and then install several bioinformatics tools.
We next create directories to save the RNA-Seq data, genome data, intermediate results, and the final results. We set an environment variable
PROJECTT_PATHT to remember the root path of this project, just for convenience. Note that you should change
lecturer01 to your user name. Also, note that, DO NOT ADD SPACES NEAR BY
We will use six RNA-Seq datasets that sequenced from A. thaliana in Bazin et al. study. The RNA-Seq data, i.e., FASTQ files can be downloaded from EMBL-EBI with accession PRJNA480638. This dataset has six replicates. The first three replicates are the control sample, and the other three are nsra/b double mutant sample.
Note that if you are using AI supercomputer, you can copy the FASTQ files from
/data/workshop1/ws_rnaseq/fastq/at with the following command.
To quantify gene expression of RNA-Seq data, we need the reference genome and the gene annotations. Reference genome and annotations are available from the TAIR project website or the Ensembl website. Here, we download them from the Ensembl using the following scripts.
Note that if you are using AI supercomputer, you can copy these files from
/data/workshop1/ws_rnaseq/genome/TAIR10 with the following commands.
The sequencing data in FASTQ files may contain low-quality parts, such as artificial contaminations, poly-A tails, and so on. To determine detailed errors, we use FastQC (
fastqc) to check all the FASTQ files.
FastQC generates a quality report for each library. Based on the reports, we then determine the thresholds for quality control (QC) against all FASTQ files using Trimmomatic (
trimmomatic). Since we will perform the same QC processes against all FASTQ files, we will use
for syntax to process all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY
=, and DO NOT ADD ANY CHARACTERS AFTER
\ including space.
To make sure the QC was succesfully performed, we check qualities of all the cleaned FASTQ files with FastQC.
Mapping is a main process in expression quantification of RNA-Seq data analysis. There are several mapping tools available for mapping. Choose one of the tools and try it.
hisat2-build to build an index and name the index as TAIR10_hisat2.
Then, we can use
hisat2 command to map reads onto the index TAIR10_hisat2.
Reference sequence index for STAR can be built by
STAR command with
genomeGenerate mode. We name the index as TAIR10_star. Note that STAR requires an amount of memory to build the index.
Mapping can be done by
kallisto is a pseudo-alignment tool. It requires transcript sequences for generating k-mer based index. Here, we use cDNA sequeneces (not the whole genome sequence) to generate the kallisto index.
Then, we use
quant mode to perform pseudo-alignment.
The mapping results are usually saved in SAM files, and the order of the reads are random. For convenience, we use
samtools to convert SAM to BAM, and sort the reads according to chromosome coordinates. BAM format equals to SAM format, and the file size is smaller than SAM format.
featureCounts is used for counting the number of reads at specific gene regions. Here, we use
featureCounts to count the number of reads at the exon region and summarize it according to gene IDs. The exon region and gene IDs are defined in the annotation file in GTF or GFF format.
featureCounts, we will retrieve the count.txt file that stores the gene expression information. In general, we call this data as an raw count data or raw counts. In the next step, we will use R to perform expression analysis.
StringTie (and Ballgown) are tools for DE analysis, aligning reads to genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts. Use StringTie pipeline for DE analysis, after the mapping and BAM sorting, we perform transcript reassemble.
Then we merged the reassmbled transcript annotation to one file.
Then we use StringTie to estimate the gene expression with new the annotations.
The result are save in ballgown directory. There were 6 files generated by StringTie. *.ctab files are the text file for id mapping, e.g. exon id to transcript id. The GTF file is a text file in GTF format which consits of gene annotations and transcript expression, i.e., FPKM and TPM.
After estimation of transcript expression, we can use ballgown to perform DE anlaysis in R platform.