Accurate automatic classification of protein function remains a challenge for genome annotation.
We have benchmarked the automatic annotation of four bacterial genomes using a 5-fold
cross-validation procedure and three machine learning methods (linear regression analysis, k-nearest neighbors, and
associative neural networks). The analyzed genomes had previously been manually annotated with FunCat categories in
MIPS, providing a gold standard. Features describing pairs of sequences rather than each sequence alone were used.
The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, lengths of sequences,
and calculated protein properties. Following training, we scored all pairs from the validation sets.
For each target protein, we selected the pair with the highest predicted score and annotated the target protein with
the functional categories of the prototype protein. The neural network approach achieved the highest annotation accuracy.
Moreover, the predicted annotation scores distinguished reliable from unreliable annotations.
The sequence alignment scores and descriptors derived from InterPro domains provided the largest contribution to
the performance of the algorithm. The method was applied to annotate the protein sequences from 180 complete bacterial genomes.
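
The selection step described above (score every target-prototype pair, keep the best-scoring pair for each target, and copy the prototype's functional categories) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the names `ScoredPair`, `Annotation`, `transfer_annotations`, and `min_score`, as well as the identifiers, category codes, and score values, are hypothetical, and the scores are assumed to come from an already trained model.

```python
from typing import Dict, Iterable, NamedTuple, Set


class ScoredPair(NamedTuple):
    """A (target, prototype) pair with the score predicted by a trained model."""
    target: str      # protein to be annotated
    prototype: str   # already annotated protein
    score: float     # predicted annotation score (hypothetical scale)


class Annotation(NamedTuple):
    prototype: str
    score: float
    categories: Set[str]
    reliable: bool   # True if the score reaches the chosen threshold


def transfer_annotations(
    pairs: Iterable[ScoredPair],
    prototype_categories: Dict[str, Set[str]],
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, Annotation]:
    """Annotate each target with the categories of its best-scoring prototype."""
    best: Dict[str, ScoredPair] = {}
    for pair in pairs:
        current = best.get(pair.target)
        # Keep only the highest-scoring pair seen so far for this target.
        if current is None or pair.score > current.score:
            best[pair.target] = pair

    return {
        target: Annotation(
            prototype=pair.prototype,
            score=pair.score,
            categories=set(prototype_categories.get(pair.prototype, set())),
            reliable=pair.score >= min_score,
        )
        for target, pair in best.items()
    }


# Illustrative data only: identifiers, scores, and category codes are made up.
pairs = [
    ScoredPair("target_1", "proto_A", 0.40),
    ScoredPair("target_1", "proto_B", 0.90),
    ScoredPair("target_2", "proto_A", 0.30),
]
prototype_categories = {"proto_A": {"01.01"}, "proto_B": {"20.01"}}

annotations = transfer_annotations(pairs, prototype_categories, min_score=0.5)
for target, annotation in sorted(annotations.items()):
    print(target, annotation.prototype, sorted(annotation.categories), annotation.reliable)
```

Flagging low-scoring transfers rather than discarding them keeps every target annotated while still exposing the predicted score as a reliability indicator, in the spirit of the reliable versus unreliable distinction reported above.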
Tetko, I.V.; Rodchenkov, I.V.; Walter, M.C.; Rattei, T.; Mewes, H.W. Beyond the "Best" Match: Machine Learning Annotation of Protein Sequences by Integration of Different Sources of Information. Bioinformatics. 2008, 24(5):621-8.
This study was partially supported by the DFG grant TE 380/1-1 to Dr. I.V. Tetko and Prof. H.W. Mewes.
This server is no longer supported. You can still search for old results, but no new annotations will be submitted for calculation.
See also other servers developed by us