In this tutorial, we will learn how to identify short variants (e.g. SNPs and indels) with GATK version 4, following the New York University pipeline. We will use DNA-Seq data that sequenced from five strains of Arabidopsis thaliana in Cao et al study, and detect the short variants among these strains.
In order to prevent the comtamination of system environemnt, we build a virtual analysis environment with Anaconda. We use conda command to build an environment named atgwasenv, and then install some bioinformatics tools.
x
# build an analysis environmentconda create --name atgwasenv# activate the analysis environemntconda activate atgwasenv# install bioinformatics toolsconda install -c bioconda fastqc trimmomatic bwa samtools vcftools conda install -c bioconda perl-vcftools-vcf gatk4 tabix snpeffThen, we create some directories to save DNA-Seq data, gneome data, intermediate results, and the final results. In addititon, we set an environment variable PROJECTT_PATHT to remember the root path of this project, just for convenience. Note that you should change lecturer01 to your user name.
xxxxxxxxxx# project pathPROJECT_PATH=/home/lecturer01/projects/atgwasmkdir -p ${PROJECT_PATH}cd ${PROJECT_PATH}pwd## /home/lecturer01/projects/atgwasmkdir genomemkdir fastqmkdir cleaned_fastqmkdir bammkdir bqsrmkdir vcfThe resequencing data by Cao et al. can be downloaded from EMBL-EBI with accession SRA029270. There are more than 50 strains used in this study, while we only use the 5 strains in this workshop. Here, we use wget command to download FASTQ datasets from EMBL-EBI directly.
xxxxxxxxxxwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094979/SRR094979_1.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094979/SRR094979_2.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095680/SRR095680_1.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095680/SRR095680_2.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095786/SRR095786_1.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095786/SRR095786_2.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095851/SRR095851_1.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095851/SRR095851_2.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095698/SRR095698_1.fastq.gzwget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095698/SRR095698_2.fastq.gzNote that if you use AI super computer, you can copy the FASTQ files from /data/workshop1/hts/fastq/atgwas with the following commands.
x
cd ${PROJECT_PATH}/fastq#cp /data/workshop1/hts/fastq/atgwas/*.fastq.gz .cp /data/workshop1/hts/fastq/atgwasmini/*.fastq.gz .We will use TAIR10 genome of A. Thaliana as a reference for variant calling. The TAIR10 genome sequences and annotations can be downloaded from Ensembl website. Here, we use wget command to download both from Ensembl directly.
x
cd ${PROJECT_PATH}/genome# reference sequencewget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz# gene annotationwget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gz# decompressgunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gzgunzip Arabidopsis_thaliana.TAIR10.40.gff3.gzIf you use AI super computer, you can copy the FASTQ files from /data/workshop1/hts/genome/TAIR10 with the following commands.
xxxxxxxxxxcd ${PROJECT_PATH}/genomecp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa .cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.40.gff3 .
The resequencing data in FASTQ files may contain low quality parts, such as artificial contaminations, poly-A tails, and so on. To determine the detailed these erros, we use FastQC (fastqc) to check all FASTQ files.
xxxxxxxxxxcd ${PROJECT_PATH}/fastqfastqc *.gzFastQC generates a quality report for each library. Considering the reports, we then determine the thresholds to perform quality control (QC) against all FASTQ files by using Trimmomatic (trimmomatic). Since we will perfrom the same QC processes agains all FASTQ files, we will use for syntax to processs all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY =, and DO NOT ADD ANY CHARACTERS AFTER \ including a space.
x
cd ${PROJECT_PATH}/fastqfor fpath in `ls *_1.fastq.gz`do fname=${fpath%_1.fastq.gz} trimmomatic PE -threads 4 \ ${fname}_1.fastq.gz ${fname}_2.fastq.gz \ ${fname}_cleaned_1.fastq.gz ${fname}_unpaired_1.fastq.gz \ ${fname}_cleaned_2.fastq.gz ${fname}_unpaired_2.fastq.gz \ LEADING:20 TRAILING:20 MINLEN:20donemv *_cleaned_*.gz ${PROJECT_PATH}/cleaned_fastq/To make sure the QC was succesfully performed, we check qualities of the all cleaned FASTQ files with FastQC.
xxxxxxxxxxcd ${PROJECT_PATH}/cleaned_fastqfastqc *.gz
For variant calling, STAR is recommended for mapping of RNA-Seq reads and BWA is recommended for DNA-Seq reads. In this tutorial, we use BWA to map DNA-Seq reads to TAIR10 genome sequences. To generate the index for mapping, we use bwa command with index option. We name the index as TAIR10_bwa here.
xxxxxxxxxxcd ${PROJECT_PATH}/genomebwa index -p TAIR10_bwa Arabidopsis_thaliana.TAIR10.dna.toplevel.faThen, we use bwa command to map all sequencing data to TAIR10_bwa.
x
cd ${PROJECT_PATH}/cleaned_fastqfor fpath in `ls *_cleaned_1.fastq.gz`do fname=${fpath%_cleaned_1.fastq.gz} bamRG="@RG\tID:"${fname}"\tPL:ILLUMINA\tSM:"${fname} bwa mem -t 2 -R ${bamRG} ${PROJECT_PATH}/genome/TAIR10_bwa \ ${fname}_cleaned_1.fastq.gz ${fname}_cleaned_2.fastq.gz > ${fname}.samdonemv *.sam ${PROJECT_PATH}/bam/Since the reads are randomly stored in SAM file, we use samtools command with sort opition to convert SAM to BAM, and then sort reads according to chromosome positions.
x
cd ${PROJECT_PATH}/bamfor fpath in `ls *.sam`do samtools sort -@ 4 ${fpath} > ${fpath%.sam}.bamdone
GATK analyzes alignments in BAM files to call variants. Like to other tools, STAR and BWA, GATK also requires index file for data analysis. We use CreateSequenceDictionary function to build a index. The index name should be same as FASTA file, but the extension should be .dict. In addition, we will use the reference and index many times in the downstream analysis, we here create an environment variable REF to remeber the path to the reference file.
xxxxxxxxxxcd ${PROJECT_PATH}/genomeREF=${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.dna.toplevel.fasamtools faidx ${REF}gatk CreateSequenceDictionary -R ${REF} -O ${REF%.fa}.dictNote that, if GATK command dose not work, add --java-options to upgrade the memory for Java.
xxxxxxxxxxgatk --java-options "-Xmx4G" CreateSequenceDictionary -R ${REF} -O ${REF%.fa}.dict
Resequencing reads may contains PCR duplication. The duplicated reads affect the accuracy of variant calling. We use GATK to mark up the duplicated reads in BAM file. The new BAM file generated in this step (*_markdup.bam) still contains the duplicated reads, but the duplicated reads are marked up which enable GATK to ignore these reads.
x
cd ${PROJECT_PATH}/bamfor fpath in `ls *.bam`do gatk MarkDuplicates -I ${fpath} \ -O ${fpath%.bam}_markdup.bam \ -M ${fpath%.bam}_markdup_metrics.txt samtools index ${fpath%.bam}_markdup.bamdoneThen, we use GATK to check the quality of mapping results, such as alignment scores and the insertion sizes of the paired-end reads.
xxxxxxxxxxcd ${PROJECT_PATH}/bamfor fpath in `ls *.bam | grep -v _markdup`do fname=${fpath%.bam} gatk CollectAlignmentSummaryMetrics -I ${fname}_markdup.bam \ -R ${REF} \ -O ${fname}_alignment_metrics.txt gatk CollectInsertSizeMetrics -I ${fname}_markdup.bam \ -O ${fname}_insert_metrics.txt \ -H ${fname}_insert_size_histogram.pdf samtools depth -a ${fname}_markdup.bam > ${fname}_depth.txtdone
The original read quality may not reflect the errors of base calling in HTS platform and PCR errors. These errors may affect the accuracy during variant calling. To reduce the errors, we recalculate the base scores and update the quality scores in BAM files. This process is called as Base Quality Score Recalibration (BQSR) in GATK. To perform BQSR, we need tell GATK the prior-information of variants. If we can obtain these information from public databases or previous studies, we can use that. If there is no prior-information, we can perform variant calling and use that as prior-information.
If we have prior-information of variants, we can use the information for BQSR. We perform BaseRecalibrator against input.bam and get the updated BAM input_sqrt.bam.
x
gatk BaseRecalibrator -R ${REF} -I input.bam \ --known-sites dbsnp_146.b38.vcf.gz \ --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \ --known-sites resources_broad_hg38_v0_1000G.phase3.integrated.sites_only.hg38.vcf \ -O ${fname}_recal_data.tablegatk ApplyBQSR -R ${REF} -I input.bam -bqsr ${fname}_recal_data.table -O input_bqsr.bam
If we do not have prior-information of variants, we can perform variant calling to generate the temporal variant information. We will run GATK with HaplotypeCaller option to call short variants. This step will generate the intermediate results. To distinguish between the intermediate and final results, we use .g.gvf as an suffix for these intermedaite results, and usually we call these files as gVCF files. The process takes more than 1 hour to process 1 GB BAM file.
x
cd ${PROJECT_PATH}/bamfor fpath in `ls *_markdup.bam`do fname=${fpath%_markdup.bam} gatk HaplotypeCaller -R ${REF} --emit-ref-confidence GVCF \ -I ${fname}_markdup.bam -O ${fname}.g.vcfdonemv *.g.vcf ../bqsr/mv *.g.vcf.idx ../bqsr/After we run HaplotypeCaller, i.e., perform haplotype calling and generate the itermediate result for each sample, we will merge these intermediate results into one file (merged.g.vcf). Then, we perform genotyping with the merged file.
x
cd ${PROJECT_PATH}/bqsr# list up all gVCF filesgvcf_files=""for gvcf_file in `ls *.g.vcf`do gvcf_files=${gvcf_files}"-V ${gvcf_file} "donegatk CombineGVCFs -R ${REF} ${gvcf_files} -O merged.g.vcfgatk GenotypeGVCFs -R ${REF} -V merged.g.vcf -O merged.vcfNext, we extract SNPs from the merged VCF file and perform filteration to tag high confidence SNPs. Finally, we will retrive SNPs in *_snps_filtered.vcf file. This file contain all SNPs, SNPs passed the filteration are tagged with PASS, otherwise tagged with filteration name which causes failure of filteration.
x
cd ${PROJECT_PATH}/bqsrgatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include SNP -O merged_snps.vcfgatk VariantFiltration -R ${REF} -V merged_snps.vcf -O merged_snps_filtered.vcf \ -filter "QD < 2.0" --filter-name "QD2" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "SOR > 4.0" --filter-name "SOR4" \ -filter "FS > 60.0" --filter-name "FS60" \ -filter "MQ < 40.0" --filter-name "MQ40" \ -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \ -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8"Same to SNPs, we additionaly perform the same processses to indels to get the final result.
x
cd ${PROJECT_PATH}/bqsrgatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include INDEL -O merged_indels.vcfgatk VariantFiltration -R ${REF} -V merged_indels.vcf -O merged_indels_filtered.vcf \ -filter "QD < 2.0" --filter-name "QD2" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "FS > 200.0" --filter-name "FS200" \ -filter "SOR > 10.0" -filter-name "SOR10" \ -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20"So far, we get merged_snps_filtered.vcf and merged_indels_filtered.vcf files that contain SNPs and indels information, respectively. We use these information as prior-information for BQSR.
x
cd ${PROJECT_PATH}/bamfor fpath in `ls *_markdup.bam`do fname=${fpath%_markdup.bam} gatk BaseRecalibrator -R ${REF} -I ${fname}_markdup.bam \ --known-sites ${PROJECT_PATH}/bqsr/merged_snps_filtered.vcf \ --known-sites ${PROJECT_PATH}/bqsr/merged_indels_filtered.vcf \ -O ${fname}_recal_data.table gatk ApplyBQSR -R ${REF} -I ${fname}_markdup.bam -bqsr ${fname}_recal_data.table -O ${fname}_bqsr.bamdone
To compare the effections between before and after BQSR, we need run BaseRecalibrator again with new BAM file and then use AnalyzeCovariantes to comparison. This step is not required for variant calling, just for checking the effections of BQSR.
x
cd ${PROJECT_PATH}/bamfor fpath in `ls *_bqsr.bam`do fname=${fpath%_bqsr.bam} gatk BaseRecalibrator -R ${REF} -I ${fname}_bqsr.bam \ --known-sites ${PROJECT_PATH}/bqsr/merged_snps_filtered.vcf \ --known-sites ${PROJECT_PATH}/bqsr/merged_indels_filtered.vcf \ -O ${fname}_recal_data.table.2 gatk AnalyzeCovariates -before ${fname}_recal_data.table -after ${fname}_recal_data.table.2 \ -plots ${fname}_recalibration_plots.pdfdone
After BQSR, We will call variant again with GATK HaplotypeCaller against each updated BAM file.
x
cd ${PROJECT_PATH}/bamfor fpath in `ls *_bqsr.bam`do fname=${fpath%_bqsr.bam} gatk HaplotypeCaller -R ${REF} --emit-ref-confidence GVCF \ -I ${fname}_bqsr.bam -O ${fname}.g.vcfdonemv *.g.vcf ../vcf/mv *.g.vcf.idx ../vcf/Then, we merge all gVCF files into one gVCF file with CombineGVCFs function and convert gVCF to VCF file.
x
cd ${PROJECT_PATH}/vcf# list up all gVCF filesgvcf_files=""for gvcf_file in `ls *.g.vcf`do gvcf_files=${gvcf_files}"-V ${gvcf_file} "donegatk CombineGVCFs -R ${REF} ${gvcf_files} -O merged.g.vcfgatk GenotypeGVCFs -R ${REF} -V merged.g.vcf -O merged.vcfNext, we extract SNPs from the merged VCF file and perform filteration to tag high confidence SNPs. Finally, we will retrive SNPs in *_snps_filtered.vcf file. This file contain all SNPs, SNPs passed the filteration are tagged with PASS, otherwise tagged with filteration name which causes failure of filteration.
x
cd ${PROJECT_PATH}/vcfgatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include SNP -O merged_snps.vcfgatk VariantFiltration -R ${REF} -V merged_snps.vcf -O merged_snps_filtered.vcf \ -filter "QD < 2.0" --filter-name "QD2" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "SOR > 4.0" --filter-name "SOR4" \ -filter "FS > 60.0" --filter-name "FS60" \ -filter "MQ < 40.0" --filter-name "MQ40" \ -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \ -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8"Same to SNPs, we additionaly perform the same processses to indels to get the final result.
x
cd ${PROJECT_PATH}/vcfgatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include INDEL -O merged_indels.vcfgatk VariantFiltration -R ${REF} -V merged_indels.vcf -O merged_indels_filtered.vcf \ -filter "QD < 2.0" --filter-name "QD2" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "FS > 200.0" --filter-name "FS200" \ -filter "SOR > 10.0" -filter-name "SOR10" \ -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20"
Variants in VCF files are recorded by chromosome name and positions. We use gene annotation to anontate variants to clear that which variant are belong to what genes. We will use snpEff here. To use snpEff, we need to generate index file. Here we use the downloaded annotation file to generate index.
xxxxxxxxxx# change path to your environmenwhich snpEffsnpeff=/home/lecturer01/anaconda3/envs/atgwasenv/share/snpeff-4.5covid19-1# go to snpEff dicretorycd ${snpeff}# make tair10 directory and save the sequences and annotationsmkdir -p data/tair10cp ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa data/tair10/sequences.facp ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 data/tair10/genes.gff# change the configure file and build a databaseecho "tair10.genome : tair10" >> snpEff.configsnpEff build -gff3 -v tair10Then, we use the index to annotate the VCF files of the high confidence SNPs and indels.
xxxxxxxxxxcd ${PROJECT_PATH}/vcfsnpEff tair10 merged_snps_filtered.vcf > merged_snps_filtered_ann.vcfmv snpEff_genes.txt merged_snps_snpEff_genes.txtmv snpEff_summary.html merged_snps_snpEff_summary.txtbgzip merged_snps_filtered_ann.vcftabix merged_snps_filtered_ann.vcf.gzsnpEff tair10 merged_indels_filtered.vcf > merged_indels_filtered_ann.vcfmv snpEff_genes.txt merged_indels_snpEff_genes.txtmv snpEff_summary.html merged_indels_snpEff_summary.txtbgzip merged_indels_filtered_ann.vcftabix merged_indels_filtered_ann.vcf.gz