Project Details
Machine learning methods for genome reconstruction in metagenomics
Applicant
Dr. Peter Meinicke
Subject Area
Bioinformatics and Theoretical Biology
Term
from 2016 to 2020
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 324226106
Metagenomics has become a standard approach for the analysis of microbial communities. Sequencing environmental or clinical samples yields large collections of short sequence fragments that enable a comprehensive analysis of species composition and metabolic potential. Progress in sequencing technologies has substantially increased sequencing depth, so that long contiguous regions (contigs) of microbial genomes can now be assembled even for the most complex communities. Recent studies have shown that nearly complete genomes can be reconstructed by grouping these contigs into genome-specific bins. Metagenome binning is a computationally challenging problem. Although current binning tools for genome recovery are all based on similar clustering techniques, the quality of the results depends on user specifications and varies widely across communities and tools. To fully exploit the potential of metagenome binning for genome reconstruction, we aim to develop a machine learning framework that automatically optimizes the accuracy and reproducibility of results. To achieve this goal, we will combine state-of-the-art machine learning models with statistical models for count data and a simulation-based control of overall model quality. In collaboration with the Joint Genome Institute in Walnut Creek, we will develop a new approach to assess the completeness and contamination of recovered genomes based on protein domain content and on alignments with known genomes. In contrast to the common metagenome binning workflow, which applies quality checks only to the final bins, we will make quality control an integral part of the machine learning-based bin optimization process.
In an ongoing collaboration with the Department of Genomic and Applied Microbiology in Göttingen, we will also investigate how long reads from third-generation sequencing technologies can be used to measure and improve binning accuracy. Together with our Göttingen collaboration partner from the Department of Forest Botany and Tree Physiology, we aim to extend the binning approach to fungal communities, which will open up new possibilities for research in ecology.
DFG Programme
Research Grants