GOMI

Algorithm

Data Sets

Blood samples were collected from a total of 1037 unrelated animals belonging to twenty two different Indian goat breeds viz. Blackbengal, Ganjam, Gohilwari, Jharkhandblack, Attapaddy, Changthangi, Kutchi, Mehsana, Sirohi, Malabari, Jamunapari, Jhakarana, Surti, Gaddi, Marwari, Barbari, Beetal, Kanniadu, Sangamnari, Osmanabadi, Zalawari and Cheghu across India, over 10 years. The breeds selected were from diverse geographical regions and climatic conditions with varying utilities and body sizes. Genomic DNA was isolated from the blood samples by using SDS-Proteinase-K method. The quality and quantity of the DNA extracted was assessed by Nanodrop 1000 (Thermo Scientific, USA) before further use. 55000 allelic data of microsatellite marker based DNA fingerprinting of 22 goat breeds. These data are on 25 loci viz. ILST008, ILSTS059, ETH225, ILSTS044, ILSTS002, OarFCB304, OarFCB48, OarHH64, OarJMP29, ILSTS005, ILSTS019, OMHC1, ILSTS087, ILSTS30, ILSTS34, ILSTS033, ILSTS049, ILSTS065, ILSTS058, ILSTS029, RM088, ILSTS022, OarAE129, ILSTS082 and RM4 (Table 1). The system had been trained for 55000 microsatellite data on 25 loci to achieve the best fitted model for breed identification.

Table 1. List of 25 loci along with the primer pairs

Locus

Forward Primer

Reverse Primer

Dye

Size Range

ILST008 gaatcatggattttctgggg tagcagtgagtgaggttggc FAM 167-195

ILSTS059 gctgaacaatgtgatatgttcagg gggacaatactgtcttagatgctgc FAM 105-135

ETH225 gatcaccttgccactatttcct acatgacagccaagctgctact VIC 146-160

ILST044 agtcacccaaaagtaactgg acatgttgtattccaagtgc NED 145-177

ILSTS002 tctatacacatgtgctgtgc cttaggggtgtattccaagtgc VIC 113-135

OarFCB304 ccctaggagctttcaataaagaatcgg cgctgctgtcaactgggtcaggg FAM 119-169

OarFCB48 gagttagtacaaggatgacaagaggcac gactctagaggatcgcaaagaaccag VIC 149-181

OarHH64 cgttccctcactatggaaagttatatatgc cactctattgtaagaatttgaatgagagc PET 120-138

OarJMP29 gtatacacgtggacaccgctttgtac gaagtggcaagattcagaggggaag NED 120-140

ILSTS005 ggaagcaatgaaatctatagcc tgttctgtgagtttgtaagc VIC 174-190

ILSTS019 aagggacctcatgtagaagc acttttggaccctgtagtgc FAM 142-162

OMHC1 atctggtgggctacagtccatg gcaatgctttctaaattctgaggaa NED 179-209

ILSTS087 agcagacatgatgactcagc ctgcctcttttcttgagagc NED 142-164

ILSTS30 ctgcagttctgcatatgtgg cttagacaacaggggtttgg FAM 159-179

ILSTS34 aagggtctaagtccactggc gacctggtttagcagagagc VIC 153-185

ILSTS033 tattagagtggctcagtgcc atgcagacagttttagaggg PET 151-187

ILSTS049 caattttcttgtctctcccc gctgaatcttgtcaaacagg NED 160-184

ILSTS065 gctgcaaagagttgaacacc aactattacaggaggctccc PET 105-135

ILSTSO58 gccttactaccatttccagc catcctgactttggctgtgg PET 136-188

ILSTSO29 tgttttgatggaacacagcc tggatttagaccagggttgg PET 148-191

RM088 gatcctcttctgggaaaaagagac cctgttgaagtgaaccttcagaa FAM 109-147

ILSTS022 agtctgaaggcctgagaacc cttacagtccttggggttgc PET 186-202

OARE129 aatccagtgtgtgaaagactaatccag gtagatcaagatatagaatatttttcaacacc FAM 130-175

ILSTS082 ttcgttcctcatagtgctgg agaggattacaccaatcacc PET 100-136

RM4 cagcaaaatatcagcaaacct ccacctgggaaggccttta NED 104-127

Input/ Submission
STR alleles can be submitted directly in numerical values of base pair. Goat being diploid thus needs submission of both alleles of homologous chromosome. Alternatively the submission can also be done in a single file with format .txt or .csv and 50 records.

Machine Learning Workbench Used
WEKA machine learning workbench, (developed by The University of Waikato), with extensive collection of machine learning algorithms and data pre-processing methods was used for classiﬁcation and prediction. A suitable algorithm for generating an accurate predictive model was identified from it.
This is available at http://www.cs.waikato.ac.nz/ml/weka.
Bayesian Network as Classifier
Classification is a technique to identify class labels for instances based on a set of features (attributes). Application of BNs techniques to classification involves BN learning (training) and BN inference to classify instances. It has powerful probabilistic representation for classification and has received considerable attention in the recent past.
A Bayesian network B may be induced and encodes a probability distribution from a given training set. The resulting model can be used so that given a set of attributes , the classifier based on B returns the label c which maximizes the posterior probability, i.e.
Let denotes the training data set. Here, each is a tuple of the form which assigns values to the attributes and to the class variable C. The log likelihood function, which measures the quality of learned model can be written as

The first term in this equation measures efficiency of B toestimates the probability of a class given set of attribute values. The second term measures how well B estimates the joint distribution of the attributes. Since the classification is determined based on , only the first term is related to the score of the network as a classifier i.e., its predictive accuracy. This term is dominated by the second term, when there are many observations. As n grows larger, the probability of each particular assignment to becomes smaller, since the number of possible assignments grows exponentially in n.

Cross Validation
Five-fold cross validation technique was implemented, where the data sets were randomly divided into five equal sets and each set containing almost equal number of observations. These sets were grouped into training and test set. Among these, four sets were used for training and the remaining one set for testing. The process was repeated five times such that each set gets the opportunity to fall under testing. Average of five sets is calculated finally.

Assessment of Prediction Accuracy
Computational models that are valid, relevant, and properly assessed for accuracy can be used for planning of complementary laboratory experiments. The prediction quality was examined by testing the model, obtained after training the system, with test data set. Several measures are available for the statistical estimation of the accuracy of prediction models. The common statistical measures are Sensitivity, Specificity, Precision or Positive predictive value (PPV), Negative predictive value (NPV), False Discovery Rate (FDR), Accuracy and Mathew’s correlation coefficient (MCC).

The Sensitivity indicates the ‘‘quantity’’ of predictions, i.e., the proportion of real positives correctly predicted. The Specificity indicates the ‘‘quality’’ of predictions, i.e., the proportion of true negatives correctly predicted. The PPV indicates the proportion of true positives in predicted positives-“the success rate” while NPV is the proportion of true negatives in predicted negatives.

These measures are defined as follows:

Where

TP = True Positive : are defined as the correctly identified instances
TN = True Negative : are defined as the correctly rejected instances
FP = False Positive : are defined as the incorrectly identified instances
FN = False Negative : are defined as the incorrectly rejected instances