Algorithm

Feature generation

The composition-transition-distribution (CTD) feature is used to transform protein sequences into numeric feature vectors. In the CTD feature, composition (C) is the frequency of a particular amino acid type, transition (T) is the percent frequency with which amino acids of one type are followed by amino acids of another type, and distribution (D) measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular type are located. With L amino acid types, each protein sequence is encoded by a numeric vector of length L+{L*(L-1)/2}+(L*5). For the L = 20 standard amino acids this gives 20+190+100 = 310 features, which are generated for every sequence in both the nitrogen fixation protein (nif) dataset and the non-nif dataset.
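The CTD encoding described above can be sketched in a few lines of Python. This is only an illustration, not the server's own code. The function name ctd_features, the percent scaling, the treatment of a transition as an unordered pair of types, and the rounding used to locate the 25%, 50% and 75% residues are all assumptions.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def ctd_features(seq):
    """Return a 310-element CTD vector for one protein sequence (illustrative)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    # Composition: percent frequency of each amino acid type (20 values).
    comp = [100.0 * seq.count(a) / n for a in AMINO_ACIDS]

    # Transition: percent of adjacent positions where one type is followed
    # by a different type, counted in either order (20*19/2 = 190 values).
    pair_counts = {}
    for x, y in zip(seq, seq[1:]):
        if x != y:
            key = frozenset((x, y))
            pair_counts[key] = pair_counts.get(key, 0) + 1
    denom = max(n - 1, 1)
    trans = [100.0 * pair_counts.get(frozenset(p), 0) / denom
             for p in combinations(AMINO_ACIDS, 2)]

    # Distribution: for each type, the chain position (as a percent of the
    # sequence length) of its first, 25%, 50%, 75% and last occurrence
    # (20*5 = 100 values). Types absent from the sequence get zeros.
    dist = []
    for a in AMINO_ACIDS:
        pos = [i + 1 for i, r in enumerate(seq) if r == a]
        if not pos:
            dist.extend([0.0] * 5)
            continue
        k = len(pos)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            occurrence = max(int(frac * k), 1)  # assumed rounding rule
            dist.append(100.0 * pos[occurrence - 1] / n)

    return comp + trans + dist
```

Calling ctd_features on any non-empty sequence returns 20 composition values, then 190 transition values, then 100 distribution values, which matches the total of 310 stated above.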

Support vector machine for prediction

A support vector machine with an RBF kernel (default parameters) is used for prediction. Prediction for a test instance is done in two stages. In the first stage, the sequence is classified as nif or non-nif. If the test sequence is predicted as nif, it is passed to the second stage, where it is classified into one of the six categories of nif proteins.
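The two-stage scheme can be illustrated with a short Python sketch. It assumes two already-trained classifiers that expose a scikit-learn-style predict method, for example two fitted sklearn.svm.SVC(kernel="rbf") models. The function name predict_two_stage and the label string "non-nif" are hypothetical, and the library used by the server is not stated here.

```python
NON_NIF = "non-nif"  # hypothetical label emitted by the first-stage model


def predict_two_stage(features, stage1, stage2):
    """Apply a nif/non-nif model, then a nif-category model (illustrative).

    features: list of numeric feature vectors, one per test sequence.
    stage1:   fitted binary classifier (nif vs. non-nif).
    stage2:   fitted multi-class classifier (the six nif categories).
    Both models only need a predict(list_of_vectors) method.
    """
    first = list(stage1.predict(features))

    # Only sequences predicted as nif are passed to the second stage.
    nif_idx = [i for i, label in enumerate(first) if label != NON_NIF]

    results = list(first)
    if nif_idx:
        second = stage2.predict([features[i] for i in nif_idx])
        for i, category in zip(nif_idx, second):
            results[i] = category
    return results
```

The returned list holds "non-nif" for sequences rejected at the first stage and the second-stage category label for all others.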

Accuracy of nifPred

The overall accuracy of the nifPred server is >90% at both stages of prediction, as evaluated on several datasets.