Learning with Dependent Data: With Applications in Computational Genome Analysis

Applicant Professor Dr. Marius Kloft

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Theoretical Computer Science

Term from 2012 to 2014

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 225910935

Final Report Year 2015

Final Report Abstract

The project, which was co-hosted at two leading New York research institutions at the same time (Memorial Sloan-Kettering Cancer Center and the New York University), lead to a better theoretical understanding of the setting of learning from dependent data, to new algorithms capable of dealing with such data eﬀectively, and new software for biological applications in the ﬁeld of computational genome annotation. The major results can be summarized as follows. On the theoretical side, a novel Talagrand concentration inequality was shown for uniform sampling processes without replacement, which was applied to show, for the ﬁrst time, fast-rate excess risk bounds for the transductive setting of learning theory, using local Rademacher theory. Here it is assumed that the learner is provided both with labeled training and unlabeled test data (this takes into account the fact that in practice the test data might be accessible, but not labeled yet; for instance, in visual image recognition, some of the images disseminated in the internet are labeled, but many of which are yet unlabeled). A novel algorithm was proposed for learning to select a kernel from a parametric set of highly dependent candidate kernels by optimizing a local Rademacher complexity criterion. The algorithm was evaluated for its applicability in transcription start site detection, where it surpassed the accuracy of a winner of an international comparison of 19 models. This work received media attention in that it was described in a well visited blog by Google Research, where it received the Google Most Inﬂuential Papers 2013 Award. Another algorithm for learning from dependent data was developed, capable of learning from a set of mutually related organisms to detect genomic signals such as splice or transcription start sites. The algorithm exploits potential similarities between organisms as induced by their evolutionary relationship. To evaluate the method, we curated a large dataset, where we aligned RNASeq data to the reference genomes of 30 organisms; this uniform procedure avoids biases between data sets. We experimented on six genomic signals, the results showing that the proposed approach can eﬀectively exploit the information contained in the multiple related organisms, surpassing prevalent approaches that discard this information.

Publications

Learning Kernels Using Local Rademacher Complexity. Advances in Neural Information Processing Systems 26 (NIPS 2013), 2760–2768
C. Cortes, M. Kloft, M. Mohri
Localized Complexities for Transductive Learning. JMLR Workshop and Conference Proceedings 35 (COLT 2014), 857-884
I. Tolstikhin, G. Blanchard, M. Kloft
Framework for Multi-task Multiple Kernel Learning and Applications in Computational Biology
C. Widmer, M. Kloft, V. Sreedharan, G. Rätsch

Servicenavigation

Hauptnavigation

Learning with Dependent Data: With Applications in Computational Genome Analysis

Final Report Abstract

Publications

Additional Information

Servicenavigation

Hauptnavigation

Learning with Dependent Data: With Applications in Computational Genome Analysis

Final Report Abstract

Publications

Additional Information

Textvergrößerung und Kontrastanpassung