Workflow for Reference Alignment and Variant Calling
Summary:
This workflow maps short reads to a reference genome, performs post-alignment processing, and then calls and filters variants. It currently supports short-read data from Illumina and can also handle longer 454 reads (200 bp or more).
Input reads should be in FASTQ format. Fragment (single-end) reads, whether short or long, require a single input file; paired-end reads require a separate FASTQ file for each member of the pair. The reference genome should be in FASTA format.
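As a rough illustration of these input requirements, the following Python sketch reads four-line FASTQ records and performs basic sanity checks. The function names `fastq_records` and `count_reads` are hypothetical, and the sketch assumes every record occupies exactly four lines (no wrapped sequence lines).

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return  # clean end of input
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = chunk
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual


def count_reads(lines: Iterable[str]) -> int:
    """Count the records in a FASTQ stream, validating each one."""
    return sum(1 for _ in fastq_records(lines))
```

For paired-end data, one would expect `count_reads` to return the same value for both files of a pair; a mismatch usually indicates that the files are not properly paired.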
After the reads are mapped to the reference genome, a few post-alignment processing steps are required before variant calling: converting the SAM file to a compressed BAM file, sorting, removing duplicate reads, and indexing the BAM file for efficient access. After variant calling, filtering the variants with user-specified cutoffs helps reduce false positives.
Tools:
BWA
Samtools
    view
    mpileup
    sort
    index
Picard
    MarkDuplicates
Bcftools
    view
vcfutils.pl
    varFilter
Flow Diagram:
Details of Workflow:
Input Files:
FASTQ format short reads
FASTA format Reference Genome
BWA:
Burrows-Wheeler Aligner (BWA) [Li et al. 2009] is an efficient short-read aligner that aligns fragment or paired-end reads against a reference sequence. It can also handle longer reads (200 bp or more). Because it is based on the Burrows-Wheeler transform (BWT), it is generally faster than hash-based aligners. It also allows gaps when aligning reads and produces output in SAM format [Li et al. 2009].
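A minimal sketch of how the BWA step might be assembled for fragment and paired-end input, assuming the `bwa index`, `bwa aln`, `bwa samse` and `bwa sampe` subcommands are the ones used (the workflow does not say which BWA algorithm it runs). The helper name `bwa_steps` and the output file naming are hypothetical. Each step is a pair of the argument list and the file that the tool's standard output should be redirected to.

```python
from typing import List, Optional, Tuple

# (argv, stdout_path); stdout_path is None when the tool writes its own
# output files (bwa index writes its index next to the reference).
Step = Tuple[List[str], Optional[str]]


def bwa_steps(ref: str, reads1: str, reads2: Optional[str] = None,
              prefix: str = "aln") -> List[Step]:
    """Build the index / aln / samse-or-sampe command sequence."""
    steps: List[Step] = [(["bwa", "index", ref], None)]
    sai1 = f"{prefix}_1.sai"
    steps.append((["bwa", "aln", ref, reads1], sai1))
    if reads2 is None:
        # Fragment reads: one .sai file, converted to SAM with samse.
        steps.append((["bwa", "samse", ref, sai1, reads1], f"{prefix}.sam"))
    else:
        # Paired-end reads: one .sai file per mate, combined with sampe.
        sai2 = f"{prefix}_2.sai"
        steps.append((["bwa", "aln", ref, reads2], sai2))
        steps.append((["bwa", "sampe", ref, sai1, sai2, reads1, reads2],
                      f"{prefix}.sam"))
    return steps
```

Each step could then be executed with `subprocess.run(argv, stdout=handle, check=True)`, where `handle` is the opened output file. For longer reads, a different BWA subcommand such as `bwa bwasw` may be more appropriate.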
Samtools view:
Used for converting SAM files to binary BAM format.
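The conversion could be sketched as follows. The helper name `sam_to_bam_cmd` is hypothetical; the options shown (`-b` for BAM output, `-S` for SAM input, `-o` for the output file) are standard `samtools view` options, although `-S` is only needed by older releases and is ignored by newer ones.

```python
from typing import List


def sam_to_bam_cmd(sam_path: str, bam_path: str) -> List[str]:
    """Build a samtools view command that converts SAM to BAM."""
    return [
        "samtools", "view",
        "-b",            # write BAM instead of SAM
        "-S",            # input is SAM (required by old releases only)
        "-o", bam_path,  # output file
        sam_path,
    ]
```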
Samtools sort and index:
Used for sorting the BAM file by coordinate and indexing it for efficient access.
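A sketch of this step using the `samtools sort` and `samtools index` subcommands listed under Tools. The helper name `sort_and_index_cmds` and its `legacy` switch are hypothetical. The switch exists because the `sort` syntax changed between releases: older (0.1.x) versions take an output prefix as a positional argument, while newer versions take the output file through `-o`.

```python
from typing import List


def sort_and_index_cmds(bam_path: str, out_prefix: str,
                        legacy: bool = False) -> List[List[str]]:
    """Build the commands that coordinate-sort a BAM file and index it."""
    sorted_bam = f"{out_prefix}.bam"
    if legacy:
        # samtools 0.1.x: "samtools sort in.bam out_prefix" writes out_prefix.bam
        sort_cmd = ["samtools", "sort", bam_path, out_prefix]
    else:
        # samtools 1.x: output file is given explicitly with -o
        sort_cmd = ["samtools", "sort", "-o", sorted_bam, bam_path]
    # Indexing requires the coordinate-sorted file produced above.
    index_cmd = ["samtools", "index", sorted_bam]
    return [sort_cmd, index_cmd]
```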
Picard MarkDuplicates:
Used for identifying PCR and optical duplicates and removing them from the BAM file.
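A possible invocation is sketched below. The helper name `mark_duplicates_cmd` is hypothetical, and the sketch assumes a Picard release that ships a single `picard.jar`; older releases provide a separate `MarkDuplicates.jar` instead. By default MarkDuplicates only flags duplicates, so `REMOVE_DUPLICATES=true` is set here to drop them from the output, as the workflow describes.

```python
from typing import List


def mark_duplicates_cmd(picard_jar: str, in_bam: str, out_bam: str,
                        metrics_path: str, remove: bool = True) -> List[str]:
    """Build a Picard MarkDuplicates command line."""
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={in_bam}",              # sorted BAM from the previous step
        f"OUTPUT={out_bam}",            # BAM with duplicates marked or removed
        f"METRICS_FILE={metrics_path}",  # duplication statistics report
        f"REMOVE_DUPLICATES={'true' if remove else 'false'}",
    ]
```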
Samtools mpileup:
It collects summary information from the input BAM file, computes the likelihood of the data given each possible genotype, and stores the likelihoods in BCF format.
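To make the idea of a genotype likelihood concrete, here is a deliberately simplified sketch for a single biallelic site. It is not the model samtools actually implements: it assumes independent base errors, a diploid sample, and an error that is equally likely to produce any of the three other bases. The function name `genotype_log_likelihoods` is hypothetical, and base qualities must be greater than zero.

```python
import math
from typing import Dict, List, Tuple


def genotype_log_likelihoods(pileup: List[Tuple[str, int]],
                             ref: str, alt: str) -> Dict[str, float]:
    """Return log10 P(observed bases | genotype) for three diploid genotypes.

    pileup is a list of (base, phred_quality) pairs observed at one site.
    """
    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    result = {}
    for name, alleles in genotypes.items():
        total = 0.0
        for base, qual in pileup:
            err = 10 ** (-qual / 10.0)  # Phred score to error probability
            # A read samples either allele with probability 1/2.
            p = sum((1.0 - err) if base == a else err / 3.0
                    for a in alleles) / 2.0
            total += math.log10(p)
        result[name] = total
    return result
```

With an even mix of reference and alternate bases the heterozygous genotype receives the highest likelihood, while a pileup made up only of reference bases favours the homozygous reference genotype.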
Bcftools view:
Bcftools applies the prior and calls variants. Its output is in VCF (Variant Call Format).
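In the samtools 0.1.x generation of tools, the likelihood and calling steps are commonly chained through a pipe. The sketch below assumes that legacy interface, where `bcftools view` performs the calling; in bcftools 1.x that role moved to `bcftools call`. The helper names `calling_cmds` and `run_calling` are hypothetical.

```python
import subprocess
from typing import List, Tuple


def calling_cmds(ref_fasta: str, bam_path: str) -> Tuple[List[str], List[str]]:
    """Build the mpileup and (legacy) bcftools view commands."""
    # -u: uncompressed BCF on stdout (suited to piping); -f: reference FASTA
    mpileup = ["samtools", "mpileup", "-u", "-f", ref_fasta, bam_path]
    # -v: variant sites only; -c: call variants; -g: call genotypes;
    # "-" reads BCF from stdin. Without -b the output is VCF.
    call = ["bcftools", "view", "-v", "-c", "-g", "-"]
    return mpileup, call


def run_calling(ref_fasta: str, bam_path: str, vcf_out: str) -> None:
    """Run mpileup piped into bcftools view, writing VCF to vcf_out."""
    mp_cmd, call_cmd = calling_cmds(ref_fasta, bam_path)
    with open(vcf_out, "wb") as out:
        mp = subprocess.Popen(mp_cmd, stdout=subprocess.PIPE)
        call = subprocess.Popen(call_cmd, stdin=mp.stdout, stdout=out)
        mp.stdout.close()  # let mpileup see a closed pipe if bcftools exits
        if call.wait() != 0 or mp.wait() != 0:
            raise RuntimeError("variant calling pipeline failed")
```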
vcfutils.pl varFilter:
Used for filtering the input VCF file based on user-specified cutoffs.
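A sketch of how the cutoffs might be passed to the filter. The helper name `var_filter_cmd` is hypothetical and the default values are illustrative only. The options shown are `-Q` (minimum RMS mapping quality), `-d` (minimum read depth) and `-D` (maximum read depth); the filtered VCF is written to standard output.

```python
from typing import List


def var_filter_cmd(vcf_path: str, min_mapq: int = 10,
                   min_depth: int = 2, max_depth: int = 100) -> List[str]:
    """Build a vcfutils.pl varFilter command with user-chosen cutoffs."""
    if min_depth > max_depth:
        raise ValueError("min_depth must not exceed max_depth")
    return [
        "vcfutils.pl", "varFilter",
        "-Q", str(min_mapq),   # minimum RMS mapping quality
        "-d", str(min_depth),  # minimum read depth
        "-D", str(max_depth),  # maximum read depth
        vcf_path,
    ]
```

The maximum depth cutoff is usually chosen relative to the average coverage of the data set, since sites with unusually high depth often come from repeats or mis-mapped reads.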