Lernen mit abhängigen Daten: Mit Anwendungen in der rechnergestützten Genomanalyse
Theoretische Informatik
Zusammenfassung der Projektergebnisse
The project, which was co-hosted at two leading New York research institutions at the same time (Memorial Sloan-Kettering Cancer Center and the New York University), lead to a better theoretical understanding of the setting of learning from dependent data, to new algorithms capable of dealing with such data effectively, and new software for biological applications in the field of computational genome annotation. The major results can be summarized as follows. On the theoretical side, a novel Talagrand concentration inequality was shown for uniform sampling processes without replacement, which was applied to show, for the first time, fast-rate excess risk bounds for the transductive setting of learning theory, using local Rademacher theory. Here it is assumed that the learner is provided both with labeled training and unlabeled test data (this takes into account the fact that in practice the test data might be accessible, but not labeled yet; for instance, in visual image recognition, some of the images disseminated in the internet are labeled, but many of which are yet unlabeled). A novel algorithm was proposed for learning to select a kernel from a parametric set of highly dependent candidate kernels by optimizing a local Rademacher complexity criterion. The algorithm was evaluated for its applicability in transcription start site detection, where it surpassed the accuracy of a winner of an international comparison of 19 models. This work received media attention in that it was described in a well visited blog by Google Research, where it received the Google Most Influential Papers 2013 Award. Another algorithm for learning from dependent data was developed, capable of learning from a set of mutually related organisms to detect genomic signals such as splice or transcription start sites. The algorithm exploits potential similarities between organisms as induced by their evolutionary relationship. To evaluate the method, we curated a large dataset, where we aligned RNASeq data to the reference genomes of 30 organisms; this uniform procedure avoids biases between data sets. We experimented on six genomic signals, the results showing that the proposed approach can effectively exploit the information contained in the multiple related organisms, surpassing prevalent approaches that discard this information.
Projektbezogene Publikationen (Auswahl)
- Learning Kernels Using Local Rademacher Complexity. Advances in Neural Information Processing Systems 26 (NIPS 2013), 2760–2768
C. Cortes, M. Kloft, M. Mohri
- Localized Complexities for Transductive Learning. JMLR Workshop and Conference Proceedings 35 (COLT 2014), 857-884
I. Tolstikhin, G. Blanchard, M. Kloft
- Framework for Multi-task Multiple Kernel Learning and Applications in Computational Biology
C. Widmer, M. Kloft, V. Sreedharan, G. Rätsch