GWAS

In this tutorial, we will learn how to identify short variants (e.g. SNPs and indels) with GATK version 4, following the New York University pipeline. We will use DNA-Seq data that sequenced from five strains of Arabidopsis thaliana in Cao et al study, and detect the short variants among these strains.

Analysis environment

In order to prevent the comtamination of system environemnt, we build a virtual analysis environment with Anaconda. We use conda command to build an environment named atgwasenv, and then install some bioinformatics tools.

Then, we create some directories to save DNA-Seq data, gneome data, intermediate results, and the final results. In addititon, we set an environment variable PROJECTT_PATHT to remember the root path of this project, just for convenience. Note that you should change lecturer01 to your user name.

Dataset

Resequencing data

The resequencing data by Cao et al. can be downloaded from EMBL-EBI with accession SRA029270. There are more than 50 strains used in this study, while we only use the 5 strains in this workshop. Here, we use wget command to download FASTQ datasets from EMBL-EBI directly.

Note that if you use AI super computer, you can copy the FASTQ files from /data/workshop1/hts/fastq/atgwas with the following commands.

Genome sequences

We will use TAIR10 genome of A. Thaliana as a reference for variant calling. The TAIR10 genome sequences and annotations can be downloaded from Ensembl website. Here, we use wget command to download both from Ensembl directly.

If you use AI super computer, you can copy the FASTQ files from /data/workshop1/hts/genome/TAIR10 with the following commands.

 

QC

The resequencing data in FASTQ files may contain low quality parts, such as artificial contaminations, poly-A tails, and so on. To determine the detailed these erros, we use FastQC (fastqc) to check all FASTQ files.

FastQC generates a quality report for each library. Considering the reports, we then determine the thresholds to perform quality control (QC) against all FASTQ files by using Trimmomatic (trimmomatic). Since we will perfrom the same QC processes agains all FASTQ files, we will use for syntax to processs all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY =, and DO NOT ADD ANY CHARACTERS AFTER \ including a space.

To make sure the QC was succesfully performed, we check qualities of the all cleaned FASTQ files with FastQC.

 

Mapping

For variant calling, STAR is recommended for mapping of RNA-Seq reads and BWA is recommended for DNA-Seq reads. In this tutorial, we use BWA to map DNA-Seq reads to TAIR10 genome sequences. To generate the index for mapping, we use bwa command with index option. We name the index as TAIR10_bwa here.

Then, we use bwa command to map all sequencing data to TAIR10_bwa.

Since the reads are randomly stored in SAM file, we use samtools command with sort opition to convert SAM to BAM, and then sort reads according to chromosome positions.

 

Variant calling

GATK analyzes alignments in BAM files to call variants. Like to other tools, STAR and BWA, GATK also requires index file for data analysis. We use CreateSequenceDictionary function to build a index. The index name should be same as FASTA file, but the extension should be .dict. In addition, we will use the reference and index many times in the downstream analysis, we here create an environment variable REF to remeber the path to the reference file.

Note that, if GATK command dose not work, add --java-options to upgrade the memory for Java.

 

BAM preprocessing

Resequencing reads may contains PCR duplication. The duplicated reads affect the accuracy of variant calling. We use GATK to mark up the duplicated reads in BAM file. The new BAM file generated in this step (*_markdup.bam) still contains the duplicated reads, but the duplicated reads are marked up which enable GATK to ignore these reads.

Then, we use GATK to check the quality of mapping results, such as alignment scores and the insertion sizes of the paired-end reads.

 

BQSR

The original read quality may not reflect the errors of base calling in HTS platform and PCR errors. These errors may affect the accuracy during variant calling. To reduce the errors, we recalculate the base scores and update the quality scores in BAM files. This process is called as Base Quality Score Recalibration (BQSR) in GATK. To perform BQSR, we need tell GATK the prior-information of variants. If we can obtain these information from public databases or previous studies, we can use that. If there is no prior-information, we can perform variant calling and use that as prior-information.

BQSR with prior-information

If we have prior-information of variants, we can use the information for BQSR. We perform BaseRecalibrator against input.bam and get the updated BAM input_sqrt.bam.

 

BQSR without prior-information

If we do not have prior-information of variants, we can perform variant calling to generate the temporal variant information. We will run GATK with HaplotypeCaller option to call short variants. This step will generate the intermediate results. To distinguish between the intermediate and final results, we use .g.gvf as an suffix for these intermedaite results, and usually we call these files as gVCF files. The process takes more than 1 hour to process 1 GB BAM file.

After we run HaplotypeCaller, i.e., perform haplotype calling and generate the itermediate result for each sample, we will merge these intermediate results into one file (merged.g.vcf). Then, we perform genotyping with the merged file.

Next, we extract SNPs from the merged VCF file and perform filteration to tag high confidence SNPs. Finally, we will retrive SNPs in *_snps_filtered.vcf file. This file contain all SNPs, SNPs passed the filteration are tagged with PASS, otherwise tagged with filteration name which causes failure of filteration.

Same to SNPs, we additionaly perform the same processses to indels to get the final result.

So far, we get merged_snps_filtered.vcf and merged_indels_filtered.vcf files that contain SNPs and indels information, respectively. We use these information as prior-information for BQSR.

 

BQSR effections

To compare the effections between before and after BQSR, we need run BaseRecalibrator again with new BAM file and then use AnalyzeCovariantes to comparison. This step is not required for variant calling, just for checking the effections of BQSR.

 

Variant calling

After BQSR, We will call variant again with GATK HaplotypeCaller against each updated BAM file.

Then, we merge all gVCF files into one gVCF file with CombineGVCFs function and convert gVCF to VCF file.

Next, we extract SNPs from the merged VCF file and perform filteration to tag high confidence SNPs. Finally, we will retrive SNPs in *_snps_filtered.vcf file. This file contain all SNPs, SNPs passed the filteration are tagged with PASS, otherwise tagged with filteration name which causes failure of filteration.

Same to SNPs, we additionaly perform the same processses to indels to get the final result.

 

Annotations

Variants in VCF files are recorded by chromosome name and positions. We use gene annotation to anontate variants to clear that which variant are belong to what genes. We will use snpEff here. To use snpEff, we need to generate index file. Here we use the downloaded annotation file to generate index.

Then, we use the index to annotate the VCF files of the high confidence SNPs and indels.