Workflow for Phylogeny

 

Summary

The workflow builds a phylogenetic tree of the orthologs detected for a given query sequence (nucleotide or protein) using a similarity search tool. User needs to provide only the query sequence to carry out a similarity search using BLAST against a chosen database. Parsers will redirect the output of BLAST to multiple sequence alignment tool, ClustalW, which would then pass the output to Phylip suite for reconstruction of phylogenetic tree of the orthologs. 

 

Fig:Implemented Translation

Parser in this workflow:

 

   BLAST to ClustalW

o        Reads the results of BLAST search and creates a multi-fasta file containing sequences that satisfy user-defined parameters of E-value, % identity, % overlap and bit score.

o         Default values of the parameters are:

·                E-value: 0

·                % identity: >= 30

·                % query coverage: >=50

·                % database hit coverage: >=50

·                Bit score: >=40

·                 

     Sequence Retrieval Tool: Retrieves the fasta sequences of the entries satisfying the above mentioned criteria from the corresponding table of the database stored in a RDBMS like MySQL. The storage of standard databases like UniProt and nr as tables in MySQL increased the pace of retrieval which otherwise had to be extracted from a text file with upto 3GB size and consumed lot of time for the same task.

 

Standard Tools :

 

·         BLAST:  BLASTP and BLASTN are used for detection of orthologs for the query sequence. The BLAST output is parsed in terms of e-value, % query overlap, % sequence identity and bit score.

·         ClustalW: The MPI version of ClustalW is used for multiple sequence alignment of the orthologs detected using BLAST. The MSA output is converted into Phylip format using output format options of ClustalW.

·         Phylip suite: Is used for reconstruction of phylogenetic trees. Multiple datasets are generated using the ‘seqboot’ program. Different methods and the corresponding programs used for tree building are:

o   Distance based: dnadist, protdist, fitch, kitch, neighbor

o   Parimony based: dnapars, protpars

o   Maximum Likelihood based: dnaml, proml

A final majority consensus tree is built using ‘consense’ program. The trees can be viewed by installation of visualizer like ‘TreeView’ on the client machine.