Workflow for Ortholog Prediction
Summary:
The workflow predicts orthologs between two genomes using the bi-directional
best hit (BBH) criterion, with BLAST as the search tool. First, the gene sets of
the two organisms are compared with BLAST in both directions, and two genes are
predicted to be orthologs when each is the best hit of the other. Genes that do
not satisfy this criterion are provisionally treated as unique. Each such gene
is then searched against the genome of the other organism to check whether a
counterpart was missed because of limitations of the gene prediction program.
If a significant match is found at the genome level, the gene is not unique; if
no match is detected, the gene is unique with respect to the other organism.
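The bi-directional best hit criterion described above can be sketched in a few
lines of Python. This is only an illustration, not the code used by the
workflow: the function names, the (query, subject, bit score) tuple layout and
the rule that the first hit wins a tie are all assumptions made for the example.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples
    (a hypothetical layout chosen for this sketch).
    On a tie, the hit seen first is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(genes_a, genes_b, a_vs_b, b_vs_a):
    """Return (ortholog pairs, unpaired genes of A, unpaired genes of B)."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    # A pair is kept only when each gene is the best hit of the other.
    orthologs = sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )
    paired_a = {a for a, _ in orthologs}
    paired_b = {b for _, b in orthologs}
    # Unpaired genes are only candidates for "unique"; they still need
    # the genome-level check described above.
    unpaired_a = sorted(set(genes_a) - paired_a)
    unpaired_b = sorted(set(genes_b) - paired_b)
    return orthologs, unpaired_a, unpaired_b


# Toy data: a3 hits b2, but b2's best hit is a2, so a3 stays unpaired.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 50.0),
          ("a2", "b2", 120.0), ("a3", "b2", 90.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a2", 118.0), ("b2", "a3", 80.0)]
result = bidirectional_best_hits(
    ["a1", "a2", "a3"], ["b1", "b2", "b3"], a_vs_b, b_vs_a
)
print(result)
# ([('a1', 'b1'), ('a2', 'b2')], ['a3'], ['b3'])
```

The unpaired lists correspond to the genes that are carried forward to the
genome-level search; they are not yet confirmed as unique.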
Input files required for this workflow:
Whole genome sequences of both organisms (*.fna files)
Fig: Translated implementation
Standard Tools
· BLAST: Finds regions of similarity between biological sequences.
Custom tools
BBH (bbh.sh) is a custom tool that runs two BLASTP searches at a time: the
proteome of one organism is used as the query and the proteome of the other as
the database, and vice versa. It then extracts the orthologs between the two
organisms using the bi-directional best hit criterion.
Usage of this tool: ./bbh.sh input_seq1 input_seq2 Output_Directory
Where,
bbh.sh is the tool
input_seq1 and input_seq2 are the proteome sequence files of the two organisms
(in multi-FASTA format)
Output_Directory is the directory path for the output files.
Output of this tool:
1. orthologsQuery.out is the list of orthologs reported by the BBH tool.
2. hitReject.out is the list of genes from input_seq1 for which no orthologs
were reported. It is used as input to TBLASTN in the next step.
3. queryReject.out is the list of genes from input_seq2 for which no orthologs
were reported. It is also used as input to TBLASTN in the next step.
4. out1.out and out2.out contain summary reports of the BLAST runs. out1.out
summarizes the run with input_seq1 as query and input_seq2 as database;
out2.out summarizes the run with input_seq2 as query and input_seq1 as
database.
OFB (ofb.pl) is a custom tool that reports the final orthologs, and the unique
genes in each of the two genomes, from the two TBLASTN outputs, given the
desired identity and query coverage. It takes three files as input: the
orthologsQuery.out file from BBH and the two TBLASTN outputs from the previous
nodes. Using the given identity value and percentage query coverage, it looks
for genuine orthologs that were missed previously and appends them to
orthologsQuery.out. Genes that do not satisfy the identity and percentage query
coverage criteria are reported as unique genes.
Usage of this tool: ./ofb.pl first_input_file_name second_input_file_name
orthologsQue Identities Query_Coverage output_directory
Where,
ofb.pl is the tool
first_input_file_name is the first TBLASTN output (i.e., the TBLASTN result of
queryReject.out as query with GenomeB as database)
second_input_file_name is the second TBLASTN output (i.e., the TBLASTN result
of hitReject.out as query with GenomeA as database)
orthologsQue is the orthologsQuery.out file reported by BBH
Identities is the desired identity value, comparable to BLAST identity values;
this parameter takes an integer
Query_Coverage is the percentage query coverage, which indicates how much of
the query is aligned to the database sequence. It is calculated as follows:
% query coverage = ((query end - query start + 1) / query length) * 100
output_directory is the directory path for the output files.
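As a worked example of the query coverage formula, the short Python function
below evaluates it for one alignment. The function and argument names are
hypothetical and chosen only for this illustration; coordinates are assumed to
be 1-based and inclusive, as in BLAST reports.

```python
def percent_query_coverage(query_start, query_end, query_length):
    """Percentage of the query covered by one alignment.

    Implements ((query end - query start + 1) / query length) * 100,
    assuming 1-based inclusive coordinates with query_end >= query_start.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    aligned_length = (query_end - query_start) + 1
    return aligned_length / query_length * 100


# An alignment spanning residues 11..110 of a 200-residue query
# covers 100 residues, i.e. half of the query.
print(percent_query_coverage(11, 110, 200))  # 50.0
```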
Output of this tool:
1. orthologsQuery.out is the appended list of orthologs reported by OFB.
2. Uniquegene_A is the list of genes present in GenomeA (first genome) for
which no orthologs were reported against GenomeB (second genome).
3. Uniquegene_B is the list of genes present in GenomeB (second genome) for
which no orthologs were reported against GenomeA (first genome).
4. temp_out is a file containing summarized reports of the two TBLASTN runs.
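The way rejected genes end up either in the ortholog list or in a unique-gene
list can be sketched as follows. This is a minimal illustration under stated
assumptions, not the logic of ofb.pl itself: the function name and the
(gene, identity, coverage) tuple layout are hypothetical, thresholds are
treated as inclusive, and a gene is assumed to pass if any one of its TBLASTN
hits meets both thresholds.

```python
def classify_rejected_genes(gene_ids, tblastn_hits, min_identity, min_coverage):
    """Split rejected genes into (recovered, unique).

    `tblastn_hits` is an iterable of
    (gene_id, percent_identity, percent_query_coverage) tuples,
    a hypothetical layout for this sketch. A gene with no hits at all,
    or with no hit meeting both thresholds, is reported as unique.
    """
    passing = set()
    for gene_id, identity, coverage in tblastn_hits:
        # Assumption: both thresholds are inclusive lower bounds.
        if identity >= min_identity and coverage >= min_coverage:
            passing.add(gene_id)
    recovered = sorted(g for g in gene_ids if g in passing)
    unique = sorted(g for g in gene_ids if g not in passing)
    return recovered, unique


# a3 passes both thresholds; a4 fails on coverage; a5 has no hit at all.
hits = [("a3", 85.0, 92.0), ("a4", 85.0, 40.0)]
print(classify_rejected_genes(["a3", "a4", "a5"], hits, 70, 80))
# (['a3'], ['a4', 'a5'])
```

In this sketch the recovered genes play the role of the entries appended to
orthologsQuery.out, and the unique genes play the role of a Uniquegene list.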