Project Details

Further development of machine learning methods for sequences with application to computational gene finding

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term from 2009 to 2014
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 110857523
 
Final Report Year 2017

Final Report Abstract

In the course of this project, we were able to significantly improve upon the state of the art in sequence learning. Specifically, we provided solutions for the following problems.

We significantly improved the memory and runtime behavior of binary SVMs, making application to large-scale genomes possible. For the hidden Markov SVM, we reduced the training time to a fraction of that of the previous state of the art. Furthermore, the quality of optimization can now be measured in terms of the duality gap without additional computational cost.

We increased the flexibility of machine learning models and, in turn, gained a better representation of the underlying problem, which ultimately resulted in higher prediction performance. During the project, we developed various extensions and enhancements, e.g. to go beyond simple decomposable loss functions such as the Hamming loss. We also developed efficient algorithms for slack rescaling in structured SVMs and were able to speed up the training of structured SVMs by devising a novel optimization method based on bundle methods.

Complex machine learning models demand a certain amount of training data to achieve high detection performance. We developed new binary SVMs and hidden Markov SVMs that leverage information across tasks according to a given structure (i.e. a task taxonomy).

Building upon the previous successful approaches POIM and FIRM, we developed an automatic motif reconstruction method that was able to identify human splice site motif factors much more accurately than competing methods. Furthermore, we extended the previous constraint methodology to arbitrary learning machines and feature representations.

We applied our novel methods to a variety of biologically relevant tasks, e.g. de novo gene finding for human and mouse genomes, human splice site detection, and motif extraction for human splice sites. Indeed, most of our work features at least one real-world genomic application.
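The abstract refers to decomposable losses such as the Hamming loss and to slack rescaling in structured SVMs. As a rough illustration of these terms only, the following sketch computes the textbook margin-rescaled and slack-rescaled structured hinge losses on a toy labeling problem. It is not the project's algorithm: it enumerates all label sequences by brute force, whereas the report concerns efficient methods, and the scoring function, the names emit and trans, and the example weights are hypothetical choices made for this example.

```python
from itertools import product


def hamming_loss(y_true, y_pred):
    """Count the positions at which two label sequences differ.

    The loss is decomposable: it is a sum of independent per-position terms.
    """
    return sum(a != b for a, b in zip(y_true, y_pred))


def score(emit, trans, x, y):
    """Toy linear score of labeling y for input x (hypothetical parameterization).

    emit maps (observation, label) pairs to weights and trans maps
    (label, next_label) pairs to weights; missing entries count as 0.
    """
    s = sum(emit.get((xi, yi), 0.0) for xi, yi in zip(x, y))
    s += sum(trans.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return s


def structured_hinge(emit, trans, x, y_true, labels, rescaling="margin"):
    """Structured hinge loss by exhaustive enumeration of all labelings.

    margin rescaling: max_y [ delta(y_true, y) + s(x, y) - s(x, y_true) ]
    slack rescaling:  max_y [ delta(y_true, y) * (1 + s(x, y) - s(x, y_true)) ]
    Enumeration is exponential in len(x) and only feasible for toy inputs.
    """
    s_true = score(emit, trans, x, y_true)
    worst = 0.0  # y == y_true always contributes 0, so the loss is non-negative
    for y in product(labels, repeat=len(x)):
        delta = hamming_loss(y_true, y)
        s = score(emit, trans, x, y)
        if rescaling == "margin":
            value = delta + s - s_true
        elif rescaling == "slack":
            value = delta * (1.0 + s - s_true)
        else:
            raise ValueError("rescaling must be 'margin' or 'slack'")
        worst = max(worst, value)
    return worst


# Toy example with made-up weights: two observations, two possible labels.
x = ("a", "b")
y_true = (0, 1)
emit = {("a", 0): 0.5, ("b", 1): 0.5}
trans = {}

margin_loss = structured_hinge(emit, trans, x, y_true, labels=(0, 1), rescaling="margin")
slack_loss = structured_hinge(emit, trans, x, y_true, labels=(0, 1), rescaling="slack")
print(margin_loss, slack_loss)
```

With these toy weights the margin-rescaled loss is 1.0 and the slack-rescaled loss is 0.5. The margin-rescaled maximum is reached at the labeling that is wrong in both positions, while the slack-rescaled variant assigns that labeling a value of 0 because its score already trails the true labeling's score by the unit margin.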
With our Oqtans online web service, we provide a convenient and sophisticated means of quantitative transcriptome analysis for researchers around the globe. Furthermore, the SHOGUN machine learning toolbox attracted considerable interest and, thanks to the help of many volunteers, could be expanded significantly. Moreover, we are committed to Open Science: the developed methods and source code are publicly available to the scientific community via
• https://git.ratschlab.org
• https://github.com/nicococo
• https://github.com/shogun-toolbox

With the rise of deep learning techniques and their recent success in computational biology tasks, we would like to extend our newly derived models so that they learn feature representations automatically, using deep neural networks to provide the input features. Naturally, these more expressive models will pose a challenge to our explanation methods, which will need to be adapted accordingly. Another major challenge is posed by the massively increasing amount and diversity of data available for genomic tasks. One possible extension of our evidence-driven mtim method, which segments gene transcripts based on RNA-seq data, would be to support multiple RNA-seq observations from, e.g., various tissues.

Publications

 
 

Additional Information
