Introduction
Client Features
Get Started
Software UI Layout
Configuration
- Client Configuration
- Server Configuration
Nodes
Project
PDW
Tools
- Alignace
- Antigenic
- bbh
- blast
- blast2go
- cap3
- cgf
- charge
- cln2qual
- clustalw
- cluster
- compseq
- consense
- consensus
- crossmatch
- custom_cap3
- db_est
- dnadist
- dnaml
- dnapars
- eprimer3
- eslpred2
- estscan
- fasta
- fat
- fitch
- freak
- genscan
- glimmer
- hamming_distance
- hq_est
- interproscan
- kitsch
- mast
- mdscan
- meme
- mpao
- mpt
- mutual_info
- neighbor
- netctl
- ofb
- orthologous_cluster
- pfam
- phred
- predict_primer
- prof_vector
- proml
- protdist
- protpars
- psiblast
- RBSFinder
- repeat_masker
- restrict
- rot
- seqboot
- seqclean
- signalp
- srt
- stt
- targetp
- tmhmm
- trnascan
- ut_est
- vrnafold
- weeder
Workflow
Glossary
Index

Introduction

The advent of next-generation sequencing technologies has sharply increased the number of genomes being sequenced per organism, leading to a steep rise in the volume of genomic data generated. Downstream analysis of such data requires specialized software and hardware that enable high-throughput analysis. Comprehensive analysis of heterogeneous genomic data requires a large number of bioinformatics tools, which demand extensive preprocessing and manual intervention at frequent intervals. Our aim is to provide a flexible platform for running complex queries that is capable of integrating and analyzing large amounts of genomic data. Anvaya is a software package consisting of bioinformatics tools and databases that are loosely coupled in a coordinated system to execute a set of analysis tools in series or in parallel.

Anvaya Client Features

User
Login	The user must log in with valid credentials. The login ID and password can be obtained from the Anvaya Admin Interface.
Logout	The user must log out every time before the client exits.
Project
New Project	When the workflow client is started and the user has logged in, a new project must be created before a pipeline can be built.
Open Project	Opens a previously executed project.
Delete Project	Deletes a previously created project.
Workflow (Pipeline)
Create Workflow	A workflow is created by dragging the required tools from the tool list onto the design canvas. The tools are then connected in logical order.
Save Workflow	The workflow created on the design canvas must be saved before execution.
Run Workflow	When the user clicks "Run Workflow", the input and other configuration files must first be transferred. Once the file transfer is complete, the user can start executing the workflow.
Stop Workflow	Workflow execution can be stopped midway using this option.
Resume Workflow	Workflow execution can be resumed from a particular node using this option.
Status	The status of the workflow can be viewed in tabular format on the Status tab. The status is also depicted pictorially on the design canvas.
UI
Design Canvas	The Design Canvas tab is the work area for creating the pipeline. Tools are dragged onto the design canvas and connected logically.
Project Explorer	The Project Explorer tab lets the user view the input, output and intermediate files of the current project. The files must be transferred from the server (via the Status tab) before they can be viewed in the Project Explorer.
Tool List	The complete list of available tools is shown in the tool list. The list can be viewed sorted by functionality or in alphabetical order.
Pre Defined Workflows	Anvaya provides a set of 11 pre-defined workflows for frequently used pipelines in genome annotation and comparative genomics, ranging from EST assembly and annotation to phylogenetic reconstruction and microarray analysis.
Node Properties	The advanced parameters of a node (tool) can be set by double-clicking the node.
Connecting Tools	Tools are connected by single mouse clicks in the appropriate order.
Sticky Note	A sticky note lets the user store short notes about the associated node or workflow. Notes can be minimized, hidden or expanded again for readability.
Help	Help is provided in the Anvaya Client to assist the user with its functions.
SubLayer	Nodes (tools) can be logically grouped together to form a sublayer. A sublayer can be collapsed or expanded for readability.
Rules Engine	One of the unique features of "Anvaya" is the "rules engine", which defines rules for logical connections between the existing tools. "Anvaya" also offers novel functionality for exhaustive comparative analysis via "custom tools", which are tools with functionality not available in standard tools. When a node is clicked, the nodes to which it is allowed to connect are highlighted.
Tool tip	A tool tip is provided on the tool nodes in the tool list. It shows the version of the associated tool.
Alignment of Nodes	Nodes can be group-selected and arranged horizontally or vertically using this option.
Scroll View	Scroll View lets the user pan across the design canvas.

Workflows - Get Started

Once the Anvaya client is installed, the user must request a login ID from the Anvaya Admin interface. The user can then use the login ID and password to start using the Anvaya Client. The server needs to be configured before login.

Anvaya - Software UI Layout

Anvaya has following layout components:

- Design Canvas : Work Area for creating workflow pipelines

- Project Explorer : To view input/output files for the project

- Status Tab : For viewing Tabular/pictorial status of workflow execution

- Tool List: List of tools available functionality wise or in alphabetical order

- Sub Layer Tree : Tree View for sub layers created in workflow pipeline

- Scroll View : Snapshot view area which has handle to span across the complete design canvas

- Console: Text Message area to display current execution messages for user

Workflows - Configurations

Client Configuration

Once the client is installed, it is configured with default values. The configuration file is available at conf/workflow.conf under the directory where the client was installed. The user will have to set the WORKFLOW_HOME path to indicate the workspace directory for the Anvaya Client. All other paths are configurable through the client itself.
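The syntax of conf/workflow.conf is not described in this help. As a minimal sketch, assuming the file holds simple KEY=VALUE lines with "#" comments (an assumption, not a documented format), the WORKFLOW_HOME setting could be read and checked like this. The function name and the example path are hypothetical.

```python
def parse_conf(text):
    """Parse KEY=VALUE lines; blank lines, '#' comments and malformed lines are skipped.

    The KEY=VALUE layout is an assumption about conf/workflow.conf,
    not a documented Anvaya format.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical file contents, for illustration only.
example = """
# workspace directory for the Anvaya Client
WORKFLOW_HOME=/home/user/anvaya_workspace
"""

settings = parse_conf(example)
print(settings.get("WORKFLOW_HOME"))
```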

Server Configuration

Server configuration is available under the "Server>>Configure Server" option in the client menu. Here, the user will have to configure the server address, where the Anvaya services are installed along with user authentication details. These details are stored in encrypted format at client end for further usage.

Anvaya Client Configuration

Anvaya Server Configuration

Anvaya - Nodes

An Anvaya node represents an individual tool that is part of the created pipeline. Nodes can be logically connected to other nodes. Each node has an input-output panel for configuring its I/O files and advanced panels for configuring the associated tool in detail.

Workflows - Delete Node

To delete a node from the pipeline, right-click the node and then click the 'Delete' option.

Workflows - Drag Node

To create a pipeline, a node (tool) is dragged from the tool list onto the design canvas. The user can drag tools from either the alphabetical list or the list sorted by functionality.

Workflows - Edit Node Properties

The user can edit a node by double-clicking it, which opens the node's Properties dialog. Each node has an input-output panel for configuring its inputs and outputs. The parameters that can be configured for each tool are available in the associated "advanced parameter" panel.

Anvaya - Project

In Anvaya, the user starts by creating a project. Each project can have only one workflow pipeline defined. The user can open or delete previously created projects.

Delete Project

A previously created project can be deleted using this option.

New Project

When the workflow client is started and the user has logged in, a new project must be created before a pipeline can be built.

Open Project

Anvaya Client allows user to open a previously executed project by using this option.

Project Explorer

The Project Explorer tab allows the user to view the input, output and intermediate files created during the execution of the workflow. The client must request a file transfer from the server to make the files available in Project Explorer. The Status tab is used to transfer the files.

Pre-defined Workflow - EST Assembly

EST assembly is one of the challenging workflows in bioinformatics research. Anvaya provides a single workflow that takes the raw trace files from sequencing machines and produces fully annotated, assembled ESTs.

The trace files from sequencing machines are processed by PHRED for base calling. PHRED produces files containing the sequences along with the corresponding quality values. The sequences are then processed by the Cross_match tool for vector masking. STT (Sequence Trimming Tool) takes these vector-masked EST sequences as input and removes the vector-masked nucleotides, polyA/T tails, polyC/G tails and adapter/linker sequences. STT also applies the same edits to the corresponding quality values. The STT-processed sequence files are then processed by Seqclean to remove contamination such as mitochondrial sequences, yielding the filtered ESTs. The same modifications can be applied to the corresponding quality values with the cln2qual tool from the seqclean package. Seqclean can also be used to remove host/parasite contamination.

The filtered EST sequences thus obtained are then annotated using BLAST. The FAT tool parses the BLAST results and outputs the EST sequences with their annotation in the header lines. The "dbEST submission files Tool" takes these annotated EST sequences as input and produces dbEST submission files in the format specified by NCBI.

The filtered EST sequences are submitted to the Cap3 assembly tool. The "Unique Transcripts EST Tool" combines the output files from Cap3 (assembled ESTs, i.e. the contigs file, and unassembled ESTs, i.e. the singlets file) into a single FASTA-formatted file containing the union of contig and singlet sequences. This assembled EST dataset is then processed by BLAST, InterProScan and BLAST2GO for functional annotation. The outputs from all these tools are parsed by FAT (Functional Annotation Tool) to produce FASTA-formatted sequences with the annotation information in their respective header lines. ESTScan helps determine whether the sequenced ESTs are true ESTs.
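The union step performed on the Cap3 output can be pictured with a short sketch. This is not the code of the Anvaya custom tool; it only illustrates, under the assumption that both inputs are plain FASTA text, how contig and singlet records can be merged into one FASTA output. The function names are hypothetical.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def union_fasta(contigs_text, singlets_text, width=60):
    """Return one FASTA string holding all contig records followed by all singlet records."""
    records = list(read_fasta(contigs_text)) + list(read_fasta(singlets_text))
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        # Wrap each sequence to a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""


contigs = ">Contig1\nACGTACGT\n"
singlets = ">EST42\nGGCC\nTTAA\n"
print(union_fasta(contigs, singlets))
```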

Pre-defined Workflow - Functional Annotation

The Functional Annotation workflow template provides an easy way to annotate a whole proteome set. This workflow uses multiple tools to annotate the given protein sequences. BLAST is used for functional annotation based on similarity with existing protein databases such as UniProt and nr. The PfamHmm tool is used to identify the functional domains present within a sequence. TMHMM and SignalP can be used to predict subcellular localization. InterProScan is used to assign Gene Ontology terms to the given protein sequences. The output of these programs is processed further by FAT (Functional Annotation Tool) to provide protein sequences with annotations in the respective header lines of the FASTA sequences.

Pre-defined Workflow - Genome Annotation

The Genome Annotation workflow is used to annotate a newly sequenced genome. This workflow can be used for both prokaryotic and eukaryotic organisms.

Glimmer/Genscan predicts the genes from a given input genome sequence. For prokaryotes, the Glimmer results are supplemented by ribosomal binding site prediction from RBSfinder. The tRNA genes are predicted by the tRNAScan-SE program. Repeats in the genome are masked using the RepeatMasker program. The results from Glimmer/Genscan, RBSfinder, tRNAScan and RepeatMasker are combined into a standardized format, namely GenBank format. The user can view this GenBank-formatted result file in the Artemis visualization tool for detailed analysis.

The compseq utility from the EMBOSS package reports the composition of dimer/trimer/etc. words in a sequence. The freak utility from the EMBOSS package provides residue/base frequency tables or plots. The restrict utility from the EMBOSS package finds restriction enzyme cleavage sites within the given genome.

Pre-defined Workflow - Promoter Identification using Micro-array data

The workflow helps identify consensus patterns in the upstream regions of genes that cluster together, which may also lead to the identification of novel target genes for therapeutic purposes. Gene expression data is given as input to the workflow and is expected to be pre-normalized at the user's end. The normalized data is then passed to cluster analysis (preferably K-Means or hierarchical clustering). For each cluster obtained, the corresponding gene IDs are obtained from the database, followed by retrieval of the upstream regions of the genes. The motif analysis tool MEME is used to identify conserved patterns/motifs. The motifs thus obtained are searched against the desired motif databases using MAST.

Pre-defined Workflow - Motif Identification

The workflow identifies conserved patterns or motifs using DNA or protein sequences as input. The input sequences are searched against a preferred database using BLAST, and the significant hits obtained (orthologs) are parsed for the next step of sequence retrieval. Motif discovery programs like MEME, AlignACE, MDScan, Weeder and Consensus are used for pattern identification. The conserved regions obtained are then parsed and analyzed through a custom tool, the "Motif Processing Tool".

Pre-defined Workflow - Ortholog Prediction

The workflow predicts orthologs given a set of two genomes using the criteria of bi-directional best hits. BLAST is used for detecting the orthologs. Initially the gene sets of two organisms are subjected to BLAST and orthologs are predicted in cases where the criteria of bi-directional best hits are satisfied. Genes are referred to as unique if they do not satisfy the criteria of bi-directional best hits. Such unique genes of a given organism are then searched against the genome of the other organism so as to cross-check if the genes have been missed due to the limitation of the gene prediction program. If significant matches are found at the genome level then it is not a unique gene and if no matches are detected then the genes are unique with respect to the other organism in consideration.
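The bi-directional best hit criterion can be sketched as follows. This is an illustration only, not the Anvaya BBH tool: it assumes each BLAST result has already been reduced to (query, subject, bit score) tuples, and the function and variable names are hypothetical.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(genes_a, a_vs_b, b_vs_a):
    """Return (ortholog pairs, genes of A without a bi-directional best hit)."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    # A pair (a, b) qualifies only if b is a's best hit AND a is b's best hit.
    orthologs = sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
    paired = {a for a, _ in orthologs}
    unique_a = sorted(set(genes_a) - paired)
    return orthologs, unique_a


a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b1", 90)]
b_vs_a = [("b1", "a1", 210), ("b2", "a2", 80)]
pairs, unique = bidirectional_best_hits(["a1", "a2", "a3"], a_vs_b, b_vs_a)
print(pairs, unique)
```

In the workflow, the genes reported here as unique would still be searched against the genome of the other organism before being accepted as unique.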

Pre-defined Workflow - Phylogenetic Profiling

The workflow aims to infer functional linkages using phylogenetic profiling. The input protein sequence(s), usually from an organism with a completely sequenced genome, are searched against the proteome data of other organisms with completely sequenced genomes using either BLASTP or SSEARCH. The e-values obtained from the search of every protein sequence are parsed, normalised and represented as a profile vector/matrix in which the rows are individual proteins and the columns are organisms. The profiles obtained are analysed for statistical significance using measures such as mutual information content, Hamming distance and correlation coefficient.
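Two of the profile comparison measures named above can be illustrated with a small sketch. It assumes, for simplicity, that each profile is reduced to presence/absence values by applying an e-value cutoff; the cutoff value and the function names are illustrative assumptions and do not describe the normalisation used by the Anvaya tools.

```python
import math
from collections import Counter


def binarize(evalues, cutoff=1e-5):
    """Turn per-organism e-values into 1 (hit at or below cutoff) or 0 (no hit)."""
    return [1 if e is not None and e <= cutoff else 0 for e in evalues]


def hamming_distance(p, q):
    """Number of organisms in which the two profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same organisms")
    return sum(a != b for a, b in zip(p, q))


def mutual_information(p, q):
    """Mutual information (in bits) between two discrete profiles."""
    if len(p) != len(q) or not p:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(p)
    joint = Counter(zip(p, q))
    count_p, count_q = Counter(p), Counter(q)
    total = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((count_p[a] / n) * (count_q[b] / n)))
    return total


protein_x = binarize([1e-30, 0.5, 1e-12, None])
protein_y = binarize([1e-20, 1e-8, 1e-15, 2.0])
print(hamming_distance(protein_x, protein_y))
print(mutual_information(protein_x, protein_y))
```

Proteins with a small Hamming distance or high mutual information have similar patterns of presence across organisms, which is the basis for inferring a functional link.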

Pre-defined Workflow - Phylogeny

The workflow builds a phylogenetic tree of the orthologs detected for a given query sequence (nucleotide or protein) using a similarity search tool. User needs to provide only the query sequence to carry out a similarity search using BLAST against a chosen database. Parsers will be provided to read the output of BLAST and submit sequences to multiple sequence alignment tools like ClustalW, which would then pass the output to Phylip suite for reconstruction of phylogenetic tree of the orthologs.

Pre-defined Workflow - Prediction for Potential Antigenic Sites

The workflow is designed to predict the potential antigenic regions in a given set of proteins of a pathogenic organism. In this workflow multiple BLAST programs (BLASTP, TBLASTN) are run for submitted Pathogen sequences against the host sequence (nr for BLASTP and human/mouse for TBLASTN) databases. The E value for BLAST searches is 0. A parser is also built to write the rejected Ids of sequences which show significant hits from BLAST into one file and write the remaining sequence Ids into a separate file to pass them to next node (Take the geneid as input from BLASTP/TBLASTN and output is used as input to SRT).

The short-listed sequences have a BLAST bit score <= 40 and query coverage >= 50 by default. Sequences having a low score or no match are the unique pathogen sequences. These sequences are retrieved from the BLAST output using the GeneID from the multiple-BLAST parser and saved as a flat file of FASTA sequences using the in-house Sequence Retrieval Tool.

Selected sequences are then used as input for tools like netCTL, Antigenic, TargetP and ESLpred2 to predict T cell and B cell epitopes along with their subcellular location.

netCTL predicts cytotoxic T lymphocyte epitopes. netCTL was chosen because of its higher predictive performance than EpiJen, MAPPP, MHC-pathway and WAPP on all performance measures. T cell epitopes are predicted and stored in flat file format using the custom tool "Map potential antigenic output".

Antigenic (EMBOSS) is used to predict B cell epitopes. The probable B cell epitopes are predicted and retrieved using the "Map potential antigenic output" custom tool.

TargetP and ESLpred2 are used to predict the subcellular localization of the given sequences. If a sequence is detected as a secretory or extracellular protein, or contains a signal peptide, more weight is given to the netCTL and Antigenic results.

Pre-defined Workflow - Identification of Primer/Marker

The workflow is for identification of primer sequence. The input nucleotide sequence is subjected to tools like Primer3 for prediction of primers. The primers predicted are then subjected to search against the database of choice using BLASTN and to RNA secondary structure prediction. Outputs from BLASTN search and RNA secondary structure prediction are then combined to give an output, suggesting the most probable primer sequence pair.

Pre-defined Workflow - Remote Ortholog and Conserved Domain Prediction

The workflow finds probable remote orthologs for a test input sequence. It also identifies conserved domains among the closely related sequences of the same input.

Tools available in Workflows

Sr.No	Functionality Name	Tools
1	Gene Prediction	Glimmer-HMM
1	Gene Prediction	GenScan
2	Annotation	RepeatMasker
		TMHMM
		InterProScan
		SignalP
		RBS_finder
		PFAM
		ESLPRED2
		TargetP
		ESTScan
		Blast2GO
3	RNA prediction	tRNAscan-SE
3	RNA prediction	vRNAFold
4	Assembly	Cap3
		Phred
		SeqClean
		Cln2Qual
5	DBsearch	BLAST
		FASTA
		MAST
		PSIBlast
		CrossMatch
6	Motif Prediction	AlignACE
		Weeder
		Consensus
		MDScan
		MEME
7	MSA	CLUSTALW
8	Sequence properties	COMPSEQ
		FREAK
		RESTRICT
		CHARGE

Sr.No	Functionality Name	Tools
9	Phylogeny	Seqboot
		Dnadist
		Protdist
		Dnaml
		Proml
		Dnapars
		Protpars
		Fitch
		Kitsch
		Neighbor
		Consense
10	Custom Tools	BBH
		OFB
		CGF
		SRT
		ROT
		OrthologousCluster
		FAT
		Unique Transcript EST
		HammingDistance
		CreateProfileVector
		MapPotentialAntigenic
		Motif Processing
		SeqTrimmingTool
		DbEstSubmission
		HQ_EST
		PredictPrimer
		Cap3InputCustomization
11	Primer Prediction	ePrimer3
12	Epitope Prediction	NETCTL
12	Epitope Prediction	ANTIGENIC
13	Other	MutualInformation
13	Other	Cluster

Tool Name : AlignACE
- Aligns Nucleic Acid Conserved Elements

Version : 4.0

Description

AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences

Reference : http://atlas.med.harvard.edu/

Tool Name : Antigenic
- From EMBOSS Package

Version : 4.1.0

Description

Finds antigenic sites in proteins

Reference : http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/antigenic.html

Tool Name : BBH
- BiDirectional Blast Hit (Anvaya Custom Tool)

Version : 1.0

Description

It is a custom tool that runs two BLASTP searches, taking the proteome of one organism as the input file and the proteome of the other as the database, and vice versa. It then extracts the orthologs between the two organisms using the bi-directional best hit criterion.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input sequence for organism1 in BLASTp (complete protein sequence)	Fasta sequences	None
Database (complete protein sequences of the second organism against which orthologs are to be predicted)	Fasta sequences	None

Tool Name : Blast
- Basic Local Alignment Search Tool

Version : 2.2.14

Description

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

Advanced Parameters

Parameter	Allowed Values	Default Value
Program Name	blastp, blastn, tblastn, tblastx, blastx	blastp
Database	Alphanumeric	nr
Query File	Alphanumeric	stdin
Expectation value (E)	Real	10.0
Output file	alphanumeric	stdout
Filter	T/F	T
Gap Opening Penalty	blastp:7,8,9,10,11,12	-1
Gap Extension Penalty	blastp:1,2	-1
Mismatch penalty	-1,-2,-3,-4,-5	-3
Match score	1,2,4	2
Number of hits to show alignments	Any integer	250
No. of processors	integer	1
Scoring matrix	blastp:BLOSUM62, BLOSUM80, BLOSUM45, PAM30, PAM70	BLOSUM62
Word size	blastn :7,11, 15 all others: 2,3	blastn:11 all others:3

Tool Name : BLAST2GO
- B2G4Pipe

Version : 2.2.2

Description

Blast2GO is an ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data.

Reference : http://www.blast2go.org

Tool Name : Cap3
- A DNA sequence assembly program

Version : Versions as with Phred 2.2.14

Description

The program has a capability to clip 5' and 3' low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward-reverse constraints to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented.

Tool Name : CGF
- Convert to Genbank Format (Anvaya Custom Tool)

Version : 1.0

Description

This program will take the input from different nodes and will summarize the results in genbank format.

Advanced Parameters

Mandatory options:

1. Input Genome sequence file
2. Glimmer/Genscan prediction output file

Optional input files:

1. RBSFinder output file
2. tRNAScan output file
3. RepeatMasker output file (*.out)

Tool Name : Charge
- From EMBOSS Package

Version : 4.1.0

Description

Charge reads a protein sequence and writes a file (or plots a graph) of the charges of the amino acids within a window of specified length as the window is moved along the sequence

Reference : http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/charge.html

Advanced Parameters

Parameter	Allowed Values	Default Value
Query input	Alphanumeric	-seqall
Sequence associated option	Integer	N/A
Sequence format	Alphanumeric (Fasta, Embl,)	Fasta
Produce graph	ps, hpgl, png, gif, x11.	ps
Window size	Integer	5
Graphics	Toggle value Yes/No (png, ps)	No
Amino acids properties and molecular weight data file	Data file	Eamino.dat
Output File	Alphanumeric	outfile
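The sliding-window idea behind Charge can be sketched as below. This is not the EMBOSS implementation: the per-residue charge values are assumed for illustration (the tool reads its values from the amino acid data file listed above), and averaging the charge over each window is our reading of the tool's behaviour.

```python
# Assumed, illustrative charges; the real values come from the EMBOSS data file.
ASSUMED_CHARGE = {"K": 1.0, "R": 1.0, "H": 0.5, "D": -1.0, "E": -1.0}


def window_charges(sequence, window=5):
    """Return the mean charge of each window as it slides along the sequence."""
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [ASSUMED_CHARGE.get(residue, 0.0) for residue in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# One value per window position; sequences shorter than the window give an empty list.
print(window_charges("MKKDEAARLH", window=5))
```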

Tool Name : Cln2Qual

Version : 2.2.14

Description

Cln2Qual parses the trimming ("clear range") coordinates and trash codes from the cleaning report and applies them to the quality records.

Reference : http://jimmy.harvard.edu

Tool Name : ClustalW
- Multiple sequence alignment program

Version : 0.13

Description

ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input	Alphanumeric	None
Output	Alphanumeric	Same as input file name
Type of sequence	protein or dna	Protein
Format of output	gcg, gde, phylip, pir, nexus, clustal	Clustal
Format of output tree	nj, phylip, dist, nexus
Matrix for pairwise alignment	BLOSUM, PAM, GONNET, ID	Gonnet
Matrix for pairwise and multiple alignment	IUB, CLUSTALW or filename	IUB
Gap opening penalty	Float	DNA: 15.0 Protein: 10.0
Gap extension penalty	Float	DNA:6.66 Protein:0.1

Tool Name : Cluster

Version : 3.0

Description

The tool provides the most commonly used clustering methods for gene expression data analysis.

Reference : http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/

Tool Name : CompSeq
- From EMBOSS Package

Version : 4.1.0

Description

Compseq counts composition of dimer/trimer/etc words in a sequence.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input Sequence file	File	NA
Word size	1 =< n < 20	2
Output file name	File	NA
Previously produced compseq output file. It can be used to set the expected frequencies of words in this analysis.	File	NA
Frame of word	0 =< n < word_number	0
Ignore code B and Z	Yes/no	Yes (Checked)
Reverse complement	Yes/no	No (Unchecked)
Calculate from Observed frequency	Yes/no	No (Unchecked)
Zero count	Yes/no	Yes (Checked)
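Word counting of the kind Compseq reports can be illustrated with a few lines. This is only a sketch of overlapping word counting for a chosen word size; it does not model the frame, reverse complement or expected frequency options in the table above, and the function name is hypothetical.

```python
from collections import Counter


def count_words(sequence, word=2):
    """Count every overlapping word of the given size in the sequence."""
    if word < 1:
        raise ValueError("word size must be at least 1")
    seq = sequence.upper()
    return Counter(seq[i:i + word] for i in range(len(seq) - word + 1))


counts = count_words("ACGTAC", word=2)
for dimer, n in sorted(counts.items()):
    print(dimer, n)
```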

Tool Name : Consense
- From PHYLIP Package

Version : 3.67

Description

CONSENSE reads a file of computer-readable trees and prints out (and may also write out onto a file) a consensus tree. Basically the consensus tree consists of monophyletic groups that occur as often as possible in the data. If a group occurs in more than 50% of all the input trees it will definitely appear in the consensus tree. The tree printed out has at each fork a number indicating how many times the group which consists of the species to the right of (descended from) the fork occurred.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Intree
Output file	alphanumeric	Outfile
Consensus method	Majority rule (extended)	Majority rule (extended)
	Strict
	Majority Rule
	M_l
Outgroup root	Yes/No	No, use as Outgroup species 1
Rooted tree (Trees to be treated as Rooted)	Yes/No	No
Terminal type	ANSI	ANSI
	IBM PC
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Print out tree	Yes/No	Yes
Write out trees onto tree file?	Yes/No	Yes
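The "more than 50% of the input trees" rule can be sketched as follows. The sketch assumes each tree has already been reduced to the set of species groups it contains; it shows plain majority rule only, not the extended majority rule that is the default above, and it is not PHYLIP code.

```python
from collections import Counter


def majority_groups(trees):
    """Return the groups found in more than half of the trees, with their counts.

    trees: list of sets; each set holds frozensets of species names,
    one frozenset per monophyletic group in that tree.
    """
    counts = Counter(group for tree in trees for group in set(tree))
    cutoff = len(trees) / 2
    return {group: n for group, n in counts.items() if n > cutoff}


trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
]
for group, n in majority_groups(trees).items():
    print(sorted(group), n)
```

The count attached to each group corresponds to the number printed at each fork of the consensus tree.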

Tool Name : Consensus

Version : 6c

Description

Consensus is a pattern recognition program that can be used in identifying pattern in a set of unaligned DNA, RNA and protein sequences.

Reference : http://bifrost.wustl.edu/consensus/

Tool Name : CrossMatch

Version : 0.990319

Description

Cross_match is a program for rapid protein and nucleic acid sequence comparison and database search.

Reference : http://jimmy.harvard.edu

Tool Name : inCap3
- Cap3 Input Customization (Anvaya Custom Tool)

Version : 1.0

Description

This is a customized tool provided as a part of Anvaya package. This will create the input file names compatible with the CAP3 software.

Input files:

1. Filtered sequence file from seqclean tool
2. Filtered qual file from cln2qual tool
3. Tag of output file
a. Default value: Cap3_In

Output files:

The sequence and qual files should be at the same location. The user cannot provide the names of the output files, only the output tag. The Cap3 input files are created according to the output tag.

1. Renamed sequence file i.e. <Tag>.seq
a. Default: Cap3_In.seq
2. Renamed quality value file i.e. <Tag>.seq.qual
a. Default: Cap3_In.seq.qual
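The naming convention above can be sketched in a few lines. This is not the Anvaya tool itself: the function name is hypothetical, and copying the files (rather than renaming them in place) is an assumption made so that the originals stay untouched.

```python
import shutil
from pathlib import Path


def prepare_cap3_input(seq_file, qual_file, out_dir, tag="Cap3_In"):
    """Copy the filtered files to <tag>.seq and <tag>.seq.qual in one directory."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seq_out = out_dir / f"{tag}.seq"
    qual_out = out_dir / f"{tag}.seq.qual"
    shutil.copyfile(seq_file, seq_out)    # filtered sequences from seqclean
    shutil.copyfile(qual_file, qual_out)  # matching quality values from cln2qual
    return seq_out, qual_out
```

Both output files end up in the same directory and share the tag as their base name.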

Tool Name : Db_EST
- DB EST Submission Tool (Anvaya Custom Tool)

Version : 1.0

Description

This program is a part of Anvaya package, which creates the submission files for dbEST from the filtered and annotated EST sequences.

Input files:

1. File of EST sequences with their annotation (output file from FAT tool)
2. Laboratory information file having information like, Name of the research group, Contact address of research lab, organism details, primers details, vector information, Restriction enzyme details, publication details etc.

Output file (must be specified as a parameter):

The output file containing records (EST sequence information) in NCBI dbEST submission format

Tool Name : dnadist
- From PHYLIP Package

Version : 3.67

Description

DnaDist uses nucleotide sequences to compute a distance matrix, under three different models of nucleotide substitution. The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. This is an alternative to use of the sequence data itself in the maximum likelihood program DNAML or the parsimony program DNAPARS.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	Alphanumeric	Outfile
Method	F84	F84
	Kimura
	Jukes-Cantor
	LogDet
	Similarity Table
Gamma distribution	Yes/No	No
Gamma distribution	Gamma+Invariant	No
Transition/transversion option	Any number or ratio	2.0
Number of categories	Any number between 0- 9	Yes
Weights	Yes/No	No
Frequencies	Yes/No	No
Output file with distance matrix in lower triangular form	Square	Square
Output file with distance matrix in lower triangular form	Lower triangular	Square
Multiple data sets	Multiple data sets (type D)	No
Multiple data sets	Multiple weights (type W)	No
Input Sequence	Interleaved	Interleaved
Input Sequence	Sequential	Interleaved
Terminal Type	IBM PC	ANSI
	ANSI
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	YES
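Of the methods listed above, Jukes-Cantor is the simplest to write down: if p is the fraction of differing sites between two aligned sequences, the distance is d = -(3/4) ln(1 - (4/3) p). The sketch below applies this textbook formula to two aligned sequences. It is not PHYLIP code; skipping sites with gaps or ambiguity codes is a simplifying assumption, and PHYLIP may treat such sites differently.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, using only sites where both bases are A, C, G or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance; infinite when the sequences are saturated (p >= 0.75)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


print(jukes_cantor("ACGTACGTAC", "ACGTACGTTC"))
```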

Tool Name : dnaml
- From PHYLIP Package

Version : 3.67

Description

Dnaml implements the maximum likelihood method for DNA sequences. This program is fairly slow, and can be expensive to run.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	Alphanumeric	Outfile
Search for best tree	Yes/No	Yes
Transition/transversion ratio	Any Number	2.0
Empirical Base Frequencies	Yes/No	Yes
Number of categories	Any number between 1- 9	Yes
Hidden Markov Model rates	Constant rate	Constant rate
	Gamma distributed rates
	Gamma+Invariant sites
	user-defined HMM of rates
Weight	Yes/No	No
Speedier but rougher analysis	Yes/No	Yes
Global Arrangement	Yes/No	No
Randomize input order of sequences?	Yes/No	No. Use input order
Outgroup root?	Yes/No	No, use as outgroup species 1
Analyze multiple Data sets	Multiple data sets (type D)	No
Analyze multiple Data sets	Multiple weights (type W)	No
Input sequence	Sequential	Interleaved
Input sequence	Interleaved	Interleaved
Terminal Type	IBM PC	ANSI
	ANSI
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Print out tree?	Yes/No	Yes
Write out trees onto tree file?	Yes/No	Yes
Reconstruct hypothetical sequences?	Yes/No	No
Use lengths from user trees?	Yes/No	No
Rates at adjacent sites correlated?	Yes/No	No, they are independent

Tool Name : dnapars
- From PHYLIP Package

Version : 3.67

Description

DnaPars carries out unrooted parsimony (analogous to Wagner trees) on DNA sequences. The method of Fitch is used to count the number of changes of base needed on a given tree. Other than that, the algorithm is a direct modification of program WAGNER (an ancestor of MIX which was formerly in this package).

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	Alphanumeric	Outfile
Search for best tree?	No, use user trees in input file	Yes
Search for best tree?	Yes	Yes
Search option?	More thorough search	More thorough search
Search option?	Less thorough	More thorough search
Number of trees to save?	Any number	10000
Randomize input order of sequences?	No/Yes	No. Use input order
Outgroup root?	No/Yes	No, use as outgroup species 1
Use Threshold parsimony?	No/Yes	No, use ordinary parsimony
Use Transversion parsimony?	No/ Yes, count only transversions	No, count all steps
Sites weighted?	No/Yes	No
Analyze multiple data sets?	Multiple data sets (type D)	No
Analyze multiple data sets?	Multiple weights (type W)	No
Input sequences interleaved?	Yes/No, sequential	Yes
Terminal type	IBM PC/ANSI/ none	ANSI
Print out the data at start of run	No/Yes	No
Print indications of progress of run	Yes/No	No
Print out tree	Yes/No	Yes
Print out steps in each site	Yes/No	No
Print sequences at all nodes of tree	Yes/No	No
Write out trees onto tree file?	Yes/No	Yes
Dot-differencing to display	Yes/No	No

Tool Name : ePrimer3
- From EMBOSS Package

Version : 4.1.0

Description

Picks PCR primers and hybridization oligos

Reference : http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/eprimer3.html

Tool Name : ESLPred2

Version : NA

Description

"ESLpred2" is an improved version of our previous most popular method, ESLpred , which can predict four major localizations (cytoplasmic, mitochondrial, nuclear and extracellular) with an accuracy of 88%.

Reference : http://www.imtech.res.in/raghava/eslpred2/

Advanced Parameters

Parameter	Allowed Values	Default Value
Query input	Alphanumeric	Stdin/Fasta format
Organism group	-A , -F, -P, -G	Generalized
Method for prediction	1 (amino acid composition), 2(PSSM composition), 3 (hybrid AAC,PSSM, PSI-BLAST)	3
Output file	Alphanumeric	stdout

Tool Name : ESTScan

Version : 2.0b

Description

ESTScan is a program that can detect coding regions in DNA sequences, even if they are of low quality. ESTScan will also detect and correct sequencing errors that lead to frameshifts.

Reference : http://www.isrec.isb-sib.ch/ftp-server/ESTScan/

Tool Name : FASTA

Version : 3.4

Description

Provides sequence similarity searching against protein databases using the FASTA and SSEARCH programs. SSEARCH does a rigorous Smith-Waterman search for similarity between a query sequence and a database. GGSEARCH compares a protein or DNA sequence to a sequence database, producing global-global alignments (Needleman-Wunsch). GLSEARCH compares a protein or DNA sequence to a sequence database, producing global-local alignments (global in the query, local in the database sequence). FASTA can be very specific when identifying long regions of low similarity, especially for highly diverged sequences.

Reference : http://www.ebi.ac.uk/Tools/fasta/index.html

Tool Name : FAT
- Functional Annotation Tool (Anvaya Custom Tool)

Version : 1.0

Description

FAT parses the outputs of different programs and produces a FASTA-formatted file in which the annotation details are placed in the header line of each sequence.

Mandatory options:

1. Input file containing fasta-formatted protein sequences

Optional input files:

1. Output file from BLASTP program
i. Identity cutoff [Default value: 90%]
ii. Query Coverage cut-off [Default value: 80%]
2. Output file from PfamHMM program
3. Output file from TMHMM program
4. Output file from SignalP program
5. Output file from InterProScan program (XML output)
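
The two BLASTP cut-offs listed above can be illustrated with a short Python sketch. The dictionary keys, the function name passes_cutoffs, and the choice of an inclusive comparison are assumptions made for the example; they do not describe FAT's actual parser or file format.

```python
# Hypothetical filter for BLASTP hits using the default FAT cut-offs
# (identity 90%, query coverage 80%). Field names are illustrative only.

def passes_cutoffs(hit, min_identity=90.0, min_query_coverage=80.0):
    """Return True if a hit meets both the identity and coverage cut-offs."""
    # Query coverage: share of the query that is covered by the alignment.
    coverage = 100.0 * hit["aligned_query_length"] / hit["query_length"]
    return (hit["percent_identity"] >= min_identity
            and coverage >= min_query_coverage)


hits = [
    {"subject": "hitA", "percent_identity": 95.0,
     "aligned_query_length": 90, "query_length": 100},
    {"subject": "hitB", "percent_identity": 85.0,
     "aligned_query_length": 95, "query_length": 100},
    {"subject": "hitC", "percent_identity": 97.0,
     "aligned_query_length": 70, "query_length": 100},
]
accepted = [hit["subject"] for hit in hits if passes_cutoffs(hit)]
print(accepted)
```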

Tool Name : fitch
- From PHYLIP Package

Version : 3.67

Description

Fitch carries out Fitch-Margoliash, Least Squares, and a number of similar methods as described in the documentation file for distance methods.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	alphanumeric	Outfile
Method	Fitch-Margoliash	Fitch-Margoliash
Method	Minimum Evolution	Fitch-Margoliash
Search for best tree	Yes	Yes
Search for best tree	No, use user trees in input file	Yes
Power	Any Number	2.0
Negative branch lengths allowed?	Yes/No	No
Lower-triangular data matrix?	Yes/No	No
Upper-triangular data matrix?	Yes/No	No
Subreplicates	Yes/No	No
Global rearrangements?	Yes/No	No
Randomize input order of species?	Yes/No	No. Use input order
Analyze multiple data sets?	Yes/No	No
Terminal type	IBM PC	ANSI
	ANSI
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Print out tree	Yes/No	Yes
Write out trees onto tree file?	Yes/No	Yes
Use lengths from user trees	Yes/No	Yes

Tool Name : Freak
- From EMBOSS Package

Version : 4.1.0

Description

Freak takes one or more sequences as input and a set of bases or residues to search for. It then calculates the frequency of these bases/residues in a window as it moves along the sequence. The frequency is output to a data file or (optionally) plotted.

Advanced Parameters

Parameter	Allowed Values	Default Value
Sequence file	File	NA
Residue letters	Any string	“gc”
Output file	File	NA
Stepping value	Any integer value	1
Averaging window	Any integer value	30
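
The windowed calculation can be sketched in Python as below, using the defaults from the table (residue letters "gc", stepping value 1, averaging window 30). This is an illustration of the idea, not the EMBOSS implementation; in particular, reporting each window by its 1-based start position is an assumption of this example.

```python
# Sliding-window residue frequency, in the spirit of freak.
# Reporting the 1-based window start is an assumption of this sketch.

def window_frequencies(sequence, letters="gc", window=30, step=1):
    """Return (window_start, frequency) pairs along the sequence."""
    wanted = set(letters.lower())
    seq = sequence.lower()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        matches = sum(1 for residue in chunk if residue in wanted)
        results.append((start + 1, matches / window))
    return results


profile = window_frequencies("ggccaatt", letters="gc", window=4, step=2)
print(profile)
```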

Tool Name : genscan

Version : NA

Description

GenScan is a tool to identify complete gene structures in genomic DNA. It is a GHMM-based gene finder for human sequences.

Reference : http://genes.mit.edu/GENSCAN.html

Advanced Parameters

Display Name	Allowed Values	Default Value
verbose output (extra explanatory info)	NA	NA
Print predicted coding sequences (nucleic acid)	NA	NA

Tool Name : Glimmer
- Gene Locator and Interpolated Markov ModelER

Version : 3.02

Description

Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.

Advanced Parameters

Tool Name : Hamming
- Hamming Distance (Anvaya Custom Tool)

Version : 1.0

Description

This is a customized tool provided as part of the Anvaya package.
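
The manual does not describe how the tool is invoked or what input it expects. For orientation only, the standard definition of the Hamming distance (the number of positions at which two equal-length strings differ) can be sketched in Python; the function name and the error handling are assumptions of this example.

```python
# Standard Hamming distance between two equal-length strings.
# This shows the definition only, not the Anvaya tool's interface.

def hamming_distance(first, second):
    """Count the positions at which two equal-length strings differ."""
    if len(first) != len(second):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(first, second))


print(hamming_distance("GATTACA", "GACTATA"))
```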

Tool Name : HQ_EST
- (Anvaya Custom Tool)

Version : 1.0

Description

This program is part of the Anvaya package.

Tool Name : InterProScan

Version : 4.4

Description

InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

Reference : http://www.ebi.ac.uk/interpro

Tool Name : kitsch
- From PHYLIP Package

Version : 3.67

Description

Kitsch carries out the Fitch-Margoliash and Least Squares methods, plus a variety of others of the same family, with the assumption that all tip species are contemporaneous, and that there is an evolutionary clock (in effect, a molecular clock). This means that branches of the tree cannot be of arbitrary length, but are constrained so that the total length from the root of the tree to any species is the same. The quantity minimized is the same weighted sum of squares described in the Distance Matrix Methods documentation file.

Advanced Parameters

Parameter and Display Name	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	alphanumeric	Outfile
Method	Fitch-Margoliash	Fitch-Margoliash
Method	Minimum Evolution	Fitch-Margoliash
Search for best tree	Yes/No	Yes
Power	Any Number	2.0
Negative branch lengths allowed?	Yes/No	No
Lower-triangular data matrix?	Yes/No	No
Upper-triangular data matrix?	Yes/No	No
Subreplicates	Yes/No	No
Randomize input order of species?	Yes/No	No. Use input order
Analyze multiple data sets?	Yes/No	No
Terminal type	IBM PC	ANSI
	ANSI
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Print out tree	Yes/No	Yes
Write out trees onto tree file?	Yes/No	Yes

Tool Name : MAST
- Motif Alignment and Search Tool

Version : 4.1.0

Description

MAST is a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs.

Reference : http://meme.sdsc.edu/meme/mast-intro.html

Tool Name : MDScan

Version : 2004

Description

A Fast and Accurate Motif Finding Algorithm With Applications To Chromatin Immunoprecipitation Microarray Experiments

Reference : http://robotics.stanford.edu/~xsliu/MDscan/

Tool Name : MEME

Version : 4.1.0

Description

Meme is a motif finding tool for DNA as well as Protein.

Reference : http://meme.nbcr.net

Tool Name : MPAO
- Map Potential Antigenic Output (Anvaya Custom Tool)

Version : 1.0

Description

Parses the outputs of different antigen prediction tools and writes a single file containing the combined output.

Tool Name : MPT
- Motif Processing Tool (Anvaya Custom Tool)

Version : 1.0

Description

MPT is a custom tool that takes the outputs of different motif prediction programs and converts them into a text-based visual format for easy comparison and interpretation of results. It requires the output of at least two tools. By default it takes the first five motifs predicted by each tool; if a tool predicts fewer than five, all of its predicted motifs are considered.

Tool Name : Mutual Info

Version : 0.64

Description

Calculates mutual information (MI) values from a table of continuous and discontinuous variables. It is used in the detection of functional linkages by phylogenetic profiling.
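
As an illustration of the quantity being calculated, the mutual information between two discrete profiles can be sketched in Python. The sketch assumes presence/absence (0/1) profiles and reports the result in bits; how the tool itself handles continuous variables (for example by binning) is not described here and is not reproduced.

```python
from collections import Counter
from math import log2

# Mutual information between two discrete profiles of equal length.
# Assumes discrete values only; continuous data would need binning first.

def mutual_information(profile_x, profile_y):
    """Return the mutual information (in bits) of two discrete profiles."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must have the same length")
    total = len(profile_x)
    count_x = Counter(profile_x)
    count_y = Counter(profile_y)
    count_xy = Counter(zip(profile_x, profile_y))
    mi = 0.0
    for (x, y), joint in count_xy.items():
        p_joint = joint / total
        p_independent = (count_x[x] / total) * (count_y[y] / total)
        mi += p_joint * log2(p_joint / p_independent)
    return mi


# Two genes with identical presence/absence patterns across four genomes.
print(mutual_information([1, 0, 1, 0], [1, 0, 1, 0]))
```

Genes whose profiles share high mutual information are candidates for a functional linkage.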

Tool Name : neighbor
- From PHYLIP Package

Version : 3.67

Description

Neighbor implements the Neighbor-Joining method. It constructs a tree by successive clustering of lineages, setting branch lengths as the lineages join. The tree is not rearranged thereafter. The tree does not assume an evolutionary clock, so that it is in effect an unrooted tree.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	alphanumeric	Outfile
Method	Neighbor-joining	Neighbor-joining
Method	UPGMA tree	Neighbor-joining
Lower-triangular data matrix?	Yes/No	No
Upper-triangular data matrix?	Yes/No	No
Subreplicates	Yes/No	No
Randomize input order of species?	Yes/No	No. Use input order
Analyze multiple data sets?	Yes/No	No
Terminal type	IBM PC	ANSI
	ANSI
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Print out tree	Yes/No	Yes
Write out trees onto tree file?	Yes/No	Yes
Outgroup root	Yes/No	No, use as outgroup species 1

Tool Name : NETCTL

Version : 1.2

Description

NetCTL 1.2 predicts CTL epitopes in protein sequences

Reference : http://www.cbs.dtu.dk/services/NetCTL/

Tool Name : OFB
- Ortholog From Blast (Anvaya Custom Tool)

Version : 1.0

Description

OFB is a custom tool that reports the final orthologs found from two TBLASTN outputs, given the desired identity and query coverage, as well as the unique genes found in each of the two genomes. It takes three input files: the orthologsQuery.out file from BBH and the two TBLASTN outputs from the previous nodes. Given the desired identity value and percentage query coverage, it looks for genuine orthologs that were missed previously and appends them to the file orthologsQuery.out. Genes that do not satisfy the identity and query coverage criteria are reported as unique genes.

Advanced Parameters

Three files to be reported by OFB appender tool
- Orthologs reported by both BBH and tBlastn (output1)
- Unique genes in Genome A (output2)
- Unique genes in Genome B (output3)

Tool Name : OCP
- Orthologous Cluster of conserved Protein (Anvaya Custom Tool)

Version : 1.0

Description

The .aln file of Clustalw is used by Orthologous Cluster tool, which extracts the conserved regions from given set of orthologous sequences.

Tool Name : PFAM

Version : Pfam_scan 0.7, Pfam 23.0

Description

The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

Reference : http://pfam.sanger.ac.uk/

Tool Name : Phred
- Base calling Program

Version : 2.2.14

Description

Phred is a base-calling program for DNA sequence traces. Phred reads DNA sequence chromatogram files and analyzes the peaks to call bases, assigning quality scores ("Phred scores") to each base call.
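
The quality score is conventionally defined from the estimated probability P that a base call is wrong, as Q = -10 log10(P). A small Python sketch of the conversion in both directions follows; the function names are chosen for this example and are not part of Phred.

```python
from math import log10

# Conversion between base-call error probability and Phred quality score,
# using the conventional definition Q = -10 * log10(P).

def phred_quality(error_probability):
    """Quality score for a given probability of a wrong base call."""
    return -10 * log10(error_probability)


def error_probability(quality):
    """Probability of a wrong base call for a given quality score."""
    return 10 ** (-quality / 10)


print(phred_quality(0.001))
print(error_probability(20))
```

Under this definition a score of 20 corresponds to a 1 in 100 chance of an incorrect call, and a score of 30 to 1 in 1000.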

Tool Name : PredPrimer
- Predict Primer (Anvaya Custom Tool)

Version : 1.0

Description

PredPrimer is a custom tool that summarizes the primers predicted by Primer3 after a similarity search against a desired database using BLASTN and a secondary structure prediction using RNAfold. It takes two input files: the BLASTN output file and the RNAfold output. It reports a summary of at least two hits reported by BLASTN for a query, together with the corresponding secondary structure prediction at a given temperature.

Tool Name : Prof_Vec
- Create Profile Vector (Anvaya Custom Tool)

Version : 1.0

Description

Reads in the results of independent Smith-Waterman database searches and creates a matrix containing normalized E-values.

The SSEARCH output is parsed in terms of % query overlap (default: 40%). The E-value is normalized using the formula: normalized E-value = -1 / log E.

The tool reads in outputs of independent search (one per organism) and creates a matrix with normalized E-values. The following are some of the assumptions made for the construction of the matrix:

5 --> refers to a query with no significant hits found.

1 --> refers to eVal >= 1

0 --> refers to eVal = 0

2 --> refers to eVal >0 and <1 BUT NOT satisfying conditions.

normaliseEval --> refers to eVal > 0 and < 1 satisfying conditions.
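
The rules above can be expressed as a single function that fills one cell of the matrix. This is a hedged Python sketch: the base of the logarithm is not stated in the manual and base 10 is assumed here, and the "conditions" are represented by a plain boolean supplied by the caller (presumably the query overlap cut-off, although the manual does not spell this out). The function name is hypothetical.

```python
from math import log10

# One cell of the profile matrix, following the rules listed above.
# Assumptions: log base 10, and `satisfies_conditions` is decided elsewhere.

def profile_entry(e_value, satisfies_conditions):
    """Return the matrix value for one query/organism search result."""
    if e_value is None:          # no significant hits found
        return 5
    if e_value >= 1:
        return 1
    if e_value == 0:
        return 0
    if not satisfies_conditions:  # 0 < E-value < 1, conditions not met
        return 2
    return -1 / log10(e_value)    # normalized E-value


row = [profile_entry(None, False), profile_entry(10.0, True),
       profile_entry(0.0, True), profile_entry(1e-5, False),
       profile_entry(1e-10, True)]
print(row)
```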

Tool Name : protml
- From PHYLIP Package

Version : 3.67

Description

PROTML is a PASCAL program for inferring evolutionary trees from protein (amino acid) sequences by using maximum likelihood. A maximum likelihood method for inferring trees from DNA or RNA sequences was developed by Felsenstein (1981). The method does not impose any constraint on the constancy of evolutionary rate among lineages.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	alphanumeric	Outfile
Search for best tree	Yes/No	Yes
Models of amino acid change	Jones-Taylor-Thornton	Jones-Taylor-Thornton
	Henikoff/Tillier PMB
	Dayhoff PAM
Number of categories	Any number between 1 and 9	Yes
Hidden Markov Model rates	Constant rate	Constant rate
	Gamma distributed rates
	Gamma+Invariant sites
	user-defined HMM of rates
Weight	Yes/No	No
Speedier but rougher analysis	Yes/No	Yes
Global Arrangement	Yes/No	No
Randomize input order of sequences?	Yes/No	No. Use input order
Outgroup root?	Yes/No	No, use as outgroup species 1
Analyze multiple Data sets	Multiple data sets (type D)	No
Analyze multiple Data sets	Multiple weights (type W)	No
Input sequence	Sequential	Interleaved
Input sequence	Interleaved	Interleaved
Terminal Type	IBM PC	ANSI
	ANSI
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Print out tree?	Yes/No	Yes
Write out trees onto tree file?	Yes/No	Yes
Reconstruct hypothetical sequences?	Yes/No	No
Use lengths from user trees?	Yes/No	No
Rates at adjacent sites correlated?	Yes/No	No, they are independent

Tool Name : protdist
- From PHYLIP Package

Version : 3.67

Description

ProtDist uses protein sequences to compute a distance matrix, under three different models of amino acid replacement. The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. This is an alternative to use of the sequence data itself in the parsimony program PROTPARS.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	alphanumeric	Outfile
Method	JTT	Jones-Taylor-Thornton matrix
	Henikoff/Tillier PMB matrix
	Dayhoff PAM matrix
	Kimura formula
	Similarity Table
	Categories model
Gamma distribution	Yes/No	No
Gamma distribution	Gamma+Invariant	No
Number of categories	Any number between 0 and 9	Yes
Weights	Yes/No	No
Multiple data sets	Multiple weights (type W)	No
Multiple data sets	Multiple data sets (type D)	No
Input Sequence	Sequential	Interleaved
Input Sequence	Interleaved	Interleaved
Terminal Type	ANSI	ANSI
	IBM PC
	None
Print out the data at start of run	Yes/No	No
Print indications of progress of run	Yes/No	Yes
Genetic codes	Universal	Universal
	Mitochondrial
	Vertebrate mitochondrial
	Fly mitochondrial
	Yeast mitochondrial
Categories of Amino acids	Chemical	George/Hunt/Barker
	George/Hunt/Barker
	Hall
Ease of changing category of amino acid	Any number below 1.0. Can’t be negative.	0.4570
Transition/transversion option	Any Number or ratio	2.0
Base Frequencies	Equal	Equal
Base Frequencies	Any number but the Frequencies must sum to 1.	Equal

Tool Name : protpars
- From PHYLIP Package

Version : 3.67

Description

ProtPars infers an unrooted phylogeny from protein sequences, using a new method intermediate between the approaches of Eck and Dayhoff (1966) and Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid to change to any other, and counted the number of such changes needed to evolve the protein sequences on each given phylogeny. This has the problem that it allows replacements which are not consistent with the genetic code, counting them equally with replacements that are consistent. Fitch, on the other hand, counted the minimum number of nucleotide substitutions that would be needed to achieve the given protein sequences. This counts silent changes equally with those that change the amino acid.

Advanced Parameters

Parameter	Allowed Values	Default Value
Input file	Alphanumeric	Infile
Output file	alphanumeric	Outfile
Search for best tree?	No, use user trees in input file	Yes
Search for best tree?	Yes	Yes
Randomize input order of sequences	No/Yes	No. Use input order
Outgroup root	No/Yes	No, use as outgroup species 1
Use Threshold parsimony	No/Yes	No, use ordinary parsimony
Genetic code	Universal	Universal
	Mitochondria
	Vertebrate mitochondrial
	Fly mitochondrial
	Yeast mitochondrial
Use Transversion parsimony	No/ Yes, count only transversions	No, count all steps
Sites weighted	No/Yes	No
Analyze multiple data sets	Multiple data sets (type D)	No
Analyze multiple data sets	Multiple weights (type W)	No
Input sequences interleaved	Yes/No, sequential	Yes
Terminal type	IBM PC/ANSI/ none	ANSI
Print out the data at start of run	No/Yes	No
Print indications of progress of run	Yes/No	No
Print out tree	Yes/No	Yes
Print out steps in each site	Yes/No	No
Print sequences at all nodes of tree	Yes/No	No
Write out trees onto tree file?	Yes/No	Yes
Dot-differencing to display	Yes/No	No

Tool Name : psiBLAST

Version : 2.0

Description

Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity.

Reference : http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html

Tool Name : RBSFinder

Version : Not Available

Description

RBSfinder searches for regions in the vicinity of the gene start where the ribosome might bind. Based on its findings, RBSfinder might propose a different gene start. In most cases the use of RBSfinder increases the accuracy of gene start prediction.

Advanced Parameters

Parameter	Allowed Values
Genome sequence file in fasta	File name ( Input File )
Glimmer output file	File name ( Input File )
Output file	File name
Length of upstream region of gene, where RBS will be searched (in bp)	<=300
consensus sequence	String of a, t, g, c (any length)
File containing position to relocate or check for RBS site	File name ( Input File)

Tool Name : RepeatMasker

Version : 3.0

Description

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
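
The default masking behaviour (annotated repeats replaced by Ns) can be illustrated with a few lines of Python. The interval format (0-based start, end-exclusive) and the function name are assumptions of this sketch; it does not detect repeats, it only shows what the masked output looks like once the repeat coordinates are known.

```python
# Replace annotated repeat intervals with a mask character.
# Intervals are (start, end) pairs, 0-based and end-exclusive (assumed).

def mask_repeats(sequence, repeats, mask_char="N"):
    """Return the sequence with every repeat interval masked."""
    masked = list(sequence)
    for start, end in repeats:
        for position in range(start, end):
            masked[position] = mask_char
    return "".join(masked)


print(mask_repeats("ACGTACGTAC", [(2, 5)]))
```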

Advanced Parameters

Parameter	Allowed Values	Default Value
Alternate search engine	Cross_match, WuBlast, DeCypher	Cross_match
Parallel version	No. of processors available	NA
Slow search	N.A	Unchecked
Quick search	N.A	Unchecked
Rush job	N.A	Unchecked
Don’t mask low complexity regions / simple repeats	N.A	Unchecked
Only masks low complex/simple repeats	N.A	Unchecked
Do not mask small RNA or pseudo genes	N.A	Unchecked
Mask Alus, 7SLRNA, SVA and LTR5	N.A	Unchecked
Maximum % divergence from consensus sequence	0 to 100	NA
Use custom library	File name (Input File)	NA
Cutoff score for masking repeats	> 0	225
Specify species of input sequence	String	N.A
Only clip E.coli insertion elements	N.A	Unchecked
Clip IS element before analysis	N.A	Unchecked
Skip bacterial insertion element check	N.A	Unchecked
Check for rodent specific repeats	N.A	Unchecked
Checks for primate specific repeats	N.A	Unchecked
Use matrices calculated for x% background GC level	1 to 100	N.A
Calculate GC content	N.A	Unchecked
Max. sequence length masked without fragmenting	Any integer	300000 if the search engine (-e) is DeCypher; otherwise 40000
Skip the steps in which repeats are excised	N.A	Unchecked
Write alignment in .align output file	N.A	Unchecked
Present alignment in the orientation of repeats	N.A	Unchecked
Outputs ambiguous DNA transposon in lower case	N.A	Unchecked
Mask Sequence in lower case	N.A	Unchecked
Returns repetitive regions in lowercase	N.A	Unchecked
Repetitive region masked with X’s	N.A	Unchecked
Reports simple repeats that may be polymorphic	N.A	N.A
Annotation with the HSP evidence	N.A	Unchecked
Create an additional output in xhtml format	N.A	Unchecked
Output in ACeDB format	N.A	Unchecked
Gene Feature Finding format output	N.A	Unchecked
Annotation output file not processed by ProcessRepeats	N.A	Unchecked
Output file in cross_match format	N.A	Unchecked
Create annotation file with fixed column width	N.A	Unchecked
Does not write final column with unique ID for each element	N.A	Unchecked

Tool Name : Restrict
- From EMBOSS Package

Version : 4.1.0

Description

Restrict finds restriction enzyme cleavage sites. Restrict uses the REBASE database of restriction enzymes to predict cut sites in a DNA sequence. The program allows you to select a range of cuts, whether the DNA is circular, whether IUB ambiguity codes are used, and whether blunt ends, sticky ends or both are reported.
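
The basic idea of locating cut sites can be sketched in Python for a single enzyme. The example uses EcoRI, which recognizes GAATTC and cuts after the first G. It is a simplified illustration only: it scans the forward strand of a linear sequence, does not read REBASE, and ignores ambiguity codes; the function name and the way positions are reported are assumptions of this sketch.

```python
# Locate cut positions for one recognition site on a linear sequence.
# A position is the number of bases to the left of the cut (forward strand).

def find_cut_sites(sequence, recognition_site, cut_offset):
    """Return cut positions for every occurrence of the recognition site."""
    seq = sequence.upper()
    site = recognition_site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        # Continue one base further so overlapping sites are not missed.
        start = seq.find(site, start + 1)
    return positions


# EcoRI: G^AATTC, so the cut lies one base into the site.
print(find_cut_sites("AAGAATTCTTGAATTC", "GAATTC", 1))
```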

Advanced Parameters

Parameter	Allowed Values	Default Value
Input	Alphanumeric	Stdin
-Enzymes	Alphabetic	all
Minimum site length	Integer	4
Output	Alphanumeric	Stdin
Minimum cuts for an enzyme	Integer	1
Maximum cuts for an enzyme	Integer	2000000000
Fragment lengths	Boolean Yes/No	No
Solo Fragment lengths of each enzyme	Boolean Yes/No	No
Only one fragment per enzyme	Boolean Yes/No	No
Allow blunt cut	Boolean Yes/No	Yes
Allow Sticky ends	Boolean Yes/No	Yes
Allow ambiguity	Boolean Yes/No	Yes
Span Plasmid/end sequence	Boolean Yes/No	No
commercial enzymes	Boolean Yes/No	No
Alternate RE datafile	Alphanumeric (Input file)	Stdin
Isoschizomers limit	Boolean Yes/No	Yes
Sort out Alphabetically	Boolean Yes/No	No

Tool Name : ROT
- Remote Ortholog Tool (Anvaya Custom Tool)

Version : 1.0

Description

The output of PSI-BLAST is used as input for the Remote Ortholog Tool (ROT). ROT extracts the hits with 30% to 60% identity from each round of PSI-BLAST. These are probable remote ortholog sequences and can be saved in FASTA format for further analysis.
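
The selection step can be sketched in Python as a simple filter over parsed hits. The record layout (round number, subject identifier, percent identity) and the inclusive treatment of the 30% and 60% bounds are assumptions of this example, not a description of ROT's parser.

```python
# Keep PSI-BLAST hits whose identity falls in the remote-ortholog range.
# The hit records and the inclusive bounds are assumptions of this sketch.

def remote_ortholog_candidates(hits, low=30.0, high=60.0):
    """Return the hits with percent identity between `low` and `high`."""
    return [hit for hit in hits if low <= hit["percent_identity"] <= high]


hits = [
    {"round": 1, "subject": "seq1", "percent_identity": 82.0},
    {"round": 1, "subject": "seq2", "percent_identity": 45.5},
    {"round": 2, "subject": "seq3", "percent_identity": 31.0},
    {"round": 2, "subject": "seq4", "percent_identity": 22.0},
]
candidates = [hit["subject"] for hit in remote_ortholog_candidates(hits)]
print(candidates)
```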

Tool Name : SeqBoot
- From PHYLIP Package

Version : 3.67

Description

SEQBOOT is a general bootstrapping tool. It is intended to allow you to generate multiple data sets that are resampled versions of the input data set. Since almost all programs in the package can analyze these multiple data sets, this allows almost anything in this package to be bootstrapped, jackknifed, or permuted. SEQBOOT can handle molecular sequences, binary characters, restriction sites, or gene frequencies.
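
For molecular sequences, bootstrap resampling means drawing alignment columns (sites) at random with replacement until a replicate of the original length is obtained. The following Python sketch illustrates that idea only; the dictionary-based alignment, the function names and the seed handling are assumptions of the example and do not mirror the SEQBOOT input or output formats.

```python
import random

# Bootstrap replicates of an alignment by resampling columns with replacement.
# The alignment is a dict of equal-length sequences (an assumed layout).

def bootstrap_alignment(alignment, rng):
    """Return one replicate with sites drawn at random with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(sequence[i] for i in columns)
            for name, sequence in alignment.items()}


def bootstrap_replicates(alignment, replicates=100, seed=1):
    """Return a list of resampled alignments."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(replicates)]


alignment = {"taxonA": "ACGT", "taxonB": "ACGA"}
replicate_sets = bootstrap_replicates(alignment, replicates=5)
print(len(replicate_sets))
```

Because whole columns are sampled, every species keeps the same site in the same position within a replicate, which is what allows the downstream tree programs to analyze each replicate as an ordinary data set.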

Advanced Parameters

Parameter	Allowed Values		Default Value
Input file	Alphanumeric		Infile
Output file	alphanumeric		Outfile
Method	Molecular Sequence		Molecular Sequence
	Discrete Morphological Characters
	Restriction Sites
	Gene Frequencies
	Bootstrap		Bootstrap
	Delete half Jackknife
	Permute	Permute species for each character
		Permute character order
		Permute within species
	Rewrite data
Sampling fraction	Regular		Regular
	Altered
Block Bootstrap	Size of Block: Any Number		1
Number of replicate datasets	Any Number		100
Weight	Read weights of characters: Yes/No		No
Categories	Read categories of sites: Yes/No		No
	Data sets		Data sets
	Just weights
Input Sequence	Interleaved		Interleaved
	Sequential
Terminal Type	IBM PC
	ANSI
	None
Print out the data at start of run	Yes/No		No
Print indications of progress of run	Yes/No		No
Number of enzymes	Present in input file		Present in input file
	Not present in input file
All alleles present at each locus	No, one absent at each locus		No, one absent at each locus
	Yes
Factors	Yes/No		No
Ancestors	Yes/No		No
Mixture of methods	Yes/No		No
Dot-differencing to display	Yes/No		No
Output format	PHYLIP		PHYLIP
	NEXUS
	XML
Type of molecular sequences	DNA RNA Protein		DNA

Tool Name : seqClean

Version : 2.2.14

Description

A script for automated trimming and validation of ESTs or other DNA sequences by screening for various contaminants, low quality and low-complexity sequences.

Reference : http://jimmy.harvard.edu

Tool Name : SignalP

Version : 3.0

Description

SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models.

Reference : http://www.cbs.dtu.dk/services/SignalP/

Tool Name : SRT
- Sequence Retrieval Tool (Anvaya Custom Tool)

Version : 1.0

Description

Retrieves the FASTA sequences of the entries that satisfy the specified criteria from the corresponding table of a database stored in an RDBMS such as MySQL. Storing standard databases such as UniProt and nr as MySQL tables speeds up retrieval; the same task is time-consuming when the sequences have to be extracted from a text file of up to 3 GB.

Tool Name : STT
- Sequence Trimming Tool (Anvaya Custom Tool)

Version : 1.0

Description

Sequence trimming tool will remove the vector-masked regions from the EST sequences. It also removes polyA/T tails, poly C/G tails, adaptor and linker sequences.

Input files:

1. The cross_match vector masked sequence output file (*.seq.screen file)
2. The Quality value file produced by Phred tool

Output files (must be specified as parameters):

1. Filtered Sequence file (*.seq)
2. Filtered Quality value file (*.qual)
3. Data filtering report file
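
Two of the trimming steps can be sketched in Python: removing vector-masked ends (cross_match marks screened vector with X) while keeping the quality values in step, and removing a poly-A tail. The function names and the minimum tail length of 10 are hypothetical choices for this illustration; the thresholds actually used by STT are not given here.

```python
# Trim vector-masked ends and a poly-A tail from an EST.
# `min_run` is a hypothetical threshold chosen for this sketch.

def trim_masked_ends(sequence, qualities, mask="X"):
    """Strip masked bases from both ends, trimming qualities in step."""
    start, end = 0, len(sequence)
    while start < end and sequence[start].upper() == mask:
        start += 1
    while end > start and sequence[end - 1].upper() == mask:
        end -= 1
    return sequence[start:end], qualities[start:end]


def trim_poly_tail(sequence, base="A", min_run=10):
    """Remove a trailing run of `base` if it is at least `min_run` long."""
    end = len(sequence)
    while end > 0 and sequence[end - 1].upper() == base:
        end -= 1
    if len(sequence) - end >= min_run:
        return sequence[:end]
    return sequence


seq, qual = trim_masked_ends("XXXACGTXX", [1, 2, 3, 4, 5, 6, 7, 8, 9])
print(seq, qual)
print(trim_poly_tail("ACGT" + "A" * 12))
```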

Tool Name : TargetP

Version : v1.1b

Description

TargetP predicts the subcellular location of eukaryotic proteins. The location assignment is based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP).

Reference : http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?targetp

Advanced Parameters

Parameter	Allowed Values	Default Value
Query input	Alphanumeric	Stdin/Fasta format
Organism group	N/A	Non-plant
Include cleavage site prediction	N/A	N/A
Cutoff for predicting cTP.	Float 0.0-1.0	0.0
Cutoff for predicting mTP.	Float 0.0-1.0	0.0
Cutoff for SP.	Float 0.0-1.0	0.0
Cutoff for other.	Float 0.0-1.0	0.0
Output file	Alphanumeric	stdout

Tool Name : TMHMM
- Prediction of transmembrane helices in proteins

Version : 2.0.c

Description

TMHMM predicts transmembrane helices and the location of the intervening loop regions.

If the whole sequence is labeled as inside or outside, the prediction is that it contains no membrane helices. It is probably not wise to interpret it as a prediction of location. The prediction gives the most probable location and orientation of transmembrane helices in the sequence.

Reference : http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm

Advanced Parameters

Parameter	Command line mapping	Allowed Values
Input file	NA	Alphanumeric
Model	-v

Tool Name : tRNAScan

Version : 1.21

Description

tRNA detection in large-scale genome sequence.

tRNAscan-SE detects ~99% of eukaryotic nuclear or prokaryotic tRNA genes, with a false positive rate of less than one per 15 gigabases, and with a search speed of about 30 kb/second.

Reference : http://selab.janelia.org/software.html

Tool Name : UT_EST
- Unique Transcripts EST Tool (Anvaya Custom Tool)

Version : 1.0

Description

This is a customized tool provided as part of the Anvaya package. It produces a concatenated file containing all contigs and singlets.

Input files:

1. File containing contigs sequences. (Cap3 tool output)
2. File containing singlets sequences. (Cap3 tool output)

The output file name must be specified as an input parameter.

Tool Name : vRNAFold

Version : NA

Description

Predicts secondary structures of single-stranded RNA or DNA sequences.

Tool Name : Weeder

Version : 1.3

Description

Weeder is a program for finding novel motifs (transcription factor binding sites) conserved in a set of regulatory regions of related genes.

Reference : http://159.149.109.9/modtools/

Workflows - Pipeline

A Workflow-Pipeline is a logical connection of commonly used Bioinformatics tools, which are run either in serial mode or in parallel mode, to achieve a scientific target.

The pipeline is created by dragging appropriate tools from the available tool list, connecting them logically on the design canvas, and then setting the appropriate input-output files and advanced parameters.
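
The serial and parallel modes can be pictured as a dependency graph: a tool runs once all the tools feeding it have finished, and tools with no dependency on each other can run side by side. The Python sketch below uses graphlib from the standard library (Python 3.9 or later) to derive such an order. It illustrates the concept only and is not how Anvaya schedules jobs; the function name and the example connections, chosen because FAT accepts BLASTP, Pfam, TMHMM and SignalP outputs, are assumptions.

```python
from graphlib import TopologicalSorter

# Group the tools of a pipeline into batches: tools in the same batch
# do not depend on each other and could run in parallel.

def execution_batches(dependencies):
    """`dependencies` maps each tool to the set of tools it waits for."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical wiring: four annotation tools feed the FAT node.
pipeline = {"fat": {"blast", "pfam", "tmhmm", "signalp"}}
print(execution_batches(pipeline))
```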

Create Workflow

A workflow is created by dragging associated tools from the tool list onto the design canvas. The tools are then connected in logical order.

Execute Workflow

When the user clicks "Run Workflow", the input and other configuration files must be transferred before the workflow is actually executed. Once the file transfer is complete, the user can start executing the workflow.

The status of the workflow can be viewed in tabular format on the status tab. The status is also depicted pictorially on the design canvas.

Save Workflow

The workflow created on design canvas must be saved before execution.

Workflows - Glossary

Terminology	Description
Pipeline	Logical connection of tools
Rules Engine	A component of Anvaya Client which controls the connection of tools. The rules engine defines which tools can be logically connected. The rules engine is pre-defined and should not be modified by the end user.
Custom Tool	Tools that are custom made for Anvaya. They add value to the pre-defined workflows available in Anvaya.
SubLayer	If a workflow spans a large area of the design canvas, sub-parts of the workflow can be grouped together in a sub-layer, which the user can then collapse or expand.

Index

A
- Alignace
- Antigenic
B
- bbh
- blast
- blast2go
C
- cap3
- cgf
- charge
- Client Configuration
- Client Features
- cln2qual
- clustalw
- cluster
- compseq
- Configurations
- consense
- consensus
- Create Workflow
- crossmatch
- custom_cap3
D
- db_est
- Delete Node
- Delete Project
- dnadist
- dnaml
- dnapars
- Drag Node
E
- Edit Node Properties
- eprimer3
- eslpred2
- EST Assembly
- estscan
- Execute Workflow
F
- fasta
- fat
- fitch
- freak
- Functional Annotation
G
- Genome Annotation
- genscan
- Get Started
- glimmer
- Glossary
H
- hamming_distance
- hq_est
I
- interproscan
- Introduction
K
- kitsch
M
- mast
- mdscan
- meme
- Micro Array
- Motif Identification
- mpao
- mpt
- mutual_info
N
- neighbor
- netctl
- New Project
O
- ofb
- Open Project
- orthologous_cluster
- Ortholog Prediction
P
- pfam
- phred
- Phylogenetic Profiling
- Phylogeny
- Potential Antigenic Sites
- predict_primer
- Primer Marker
- prof_vector
- Project Explorer
- proml
- protdist
- protpars
- psiblast
R
- RBSFinder
- Remote Ortholog
- repeat_masker
- restrict
- rot
S
- Save Workflow
- seqboot
- seqclean
- Server Configuration
- signalp
- srt
- stt
T
- targetp
- tmhmm
- trnascan
U
- ut_est
V
- vrnafold
W
- weeder
- Workflow Pipeline
- Workflows Nodes
- Workflows Project
- Workflows Software UI Layout

Table of Contents

Introduction

Anvaya Client Features

Workflows - Get Started

Anvaya - Software UI Layout

Workflows - Configurations

Anvaya Client Configuration

Anvaya Server Configuration

Anvaya - Nodes

Workflows - Delete Node

Workflows - Drag Node

Workflows - Edit Node Properties

Anvaya - Project

Delete Project

New Project

Open Project

Project Explorer

Pre-defined Workflow - EST Assembly

Pre-defined Workflow - Functional Annotation

Pre-defined Workflow - Genome Annotation

Pre-defined Workflow - Promoter Identification using Micro-array data

Pre-defined Workflow - Motif Identification

Pre-defined Workflow - Ortholog Prediction

Pre-defined Workflow - Phylogenetic Profiling

Pre-defined Workflow - Phylogeny

Pre-defined Workflow - Prediction for Potential Antigenic Sites

Selected sequences are then used as input for tools such as netCTL, Antigenic, TargetP and ESLpred2 to predict T-cell and B-cell epitopes along with their subcellular location.

Antigenic (EMBOSS) is used to predict B-cell epitopes. The probable B-cell epitopes are predicted and retrieved using the Map potential antigenic output custom tool.

Pre-defined Workflow - Identification of Primer/Marker

Pre-defined Workflow - Remote Ortholog and Conserved Domain Prediction

Tools available in Workflows
