Project Details
Functional annotation of protein sequences using machine-learning algorithms
Applicant
Dr. Igor V. Tetko
Subject Area
Theoretical Computer Science
Term
from 2003 to 2006
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 5400561
Numerous genome-sequencing projects have caused a rapid growth of protein databases. In contrast to the pre-genomic era, when only patchworks of sequenced genes from many organisms were available for analysis, the systematic exploration of gene function can now assign more, and more precise, functional properties. However, the manual annotation of sequences is laborious and costly. Thus there is strong interest in developing new methods for the automatic functional classification of genome sequences that can reliably predict functional properties. In this project we propose to develop an automatic system for protein annotation that will accurately assign functional categories to new protein sequences using machine-learning algorithms. The main idea of our approach is to perform protein annotation using both labeled and unlabeled data within the framework of expectation-maximization and co-training approaches, which have been successfully introduced in the field of text classification of WWW pages. In contrast to previous approaches that used the naive Bayes classifier, we will apply neural network algorithms specially developed to handle data with a large number (10^3 to 10^4) of correlated input parameters, such as Volume Learning Algorithms and Associative Neural Networks. These methods estimate the similarity of data cases in the space of neural network models and combine supervised and unsupervised algorithms. Our approach will predict functional categories of protein sequences according to the functional catalog of MIPS, which is currently used intensively for the functional projection of experimental data. The input data will include BLAST/FASTA similarity scores, protein domains and motifs, as well as secondary and tertiary structures as predicted by the methods employed by the PEDANT system. The use of these different and complementary sources of information will provide a considerable improvement in the accuracy of annotation over traditional approaches.
Results generated by the methods of this project will be made available through the MIPS WWW pages.
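The co-training idea described above, in which two complementary feature views label unlabeled sequences for each other, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nearest-centroid classifier is only a simple stand-in for the neural network methods named in the abstract, the two views (similarity-score features and domain features) are reduced to two made-up numbers each, and the category names are hypothetical rather than taken from the MIPS functional catalog.

```python
from math import dist

def centroids(examples, view):
    """Mean feature vector per label for one view (index 0 or 1)."""
    sums, counts = {}, {}
    for views, label in examples:
        x = views[view]
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def predict(cents, x):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    ranked = sorted((dist(c, x), lab) for lab, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin

def co_train(labeled, unlabeled, rounds=10):
    """Each view in turn labels the unlabeled example it is most confident about."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                return labeled
            cents = centroids(labeled, view)
            best = max(range(len(unlabeled)),
                       key=lambda i: predict(cents, unlabeled[i][view])[1])
            label, _ = predict(cents, unlabeled[best][view])
            # The newly labeled example becomes training data for both views.
            labeled.append((unlabeled.pop(best), label))
    return labeled

def classify(labeled, views):
    """Label a new example with the prediction of the more confident view."""
    votes = [predict(centroids(labeled, v), views[v]) for v in (0, 1)]
    return max(votes, key=lambda vote: vote[1])[0]

# Hypothetical toy data: (similarity-score view, domain view), category.
seed = [
    (([0.9, 0.1], [1.0, 0.0]), "metabolism"),
    (([0.1, 0.9], [0.0, 1.0]), "transport"),
]
pool = [
    ([0.8, 0.2], [0.9, 0.1]),
    ([0.2, 0.8], [0.1, 0.9]),
    ([0.7, 0.3], [0.8, 0.0]),
]
model = co_train(seed, pool)
new_label = classify(model, ([0.85, 0.15], [0.95, 0.05]))
```

Adding only the single most confident example per view and per round keeps early labeling mistakes from spreading; a realistic system would more likely add examples in batches and apply a confidence threshold.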
DFG Programme
Priority Programmes
Subproject of
SPP 1063:
Informatikmethoden zur Analyse und Interpretation großer genomischer Datenmengen (Informatics Methods for the Analysis and Interpretation of Large Genomic Datasets)
Participating Person
Professor Dr. Hans-Werner Mewes