Algorithms for the Analysis of Approximate Gene Cluster (3AGC)

Applicants Professor Dr. Sebastian Böcker; Professor Dr. Jens Stoye

Subject Area Bioinformatics and Theoretical Biology

Term from 2010 to 2014

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 156864160

Final Report Year 2015

Final Report Abstract

The sequential ordering of genes on chromosomes underlies functional constraints that keep some genes collocated across large evolutionary distances. Detecting these collocations oﬀers evidence on functional coupling and the roles of genes with previously unknown function. These collocations, known as gene clusters, can be rather faint signals, as their locations in the genomes may be incomplete, interspersed with unrelated genes, internally rearranged, and completely lost in some genomes. Preexisting methods for gene cluster detection use either oversimpliﬁed models that overlook clusters with diverse conservation patterns, or their runtime scales insuﬃciently with the number of studied genomes. The latter shortcoming exacerbates steadily with the increasing number of genomes available for whole genome analysis. The main goal of this project was to drive de novo detection in the direction of the analysis of hundreds of (bacterial) genomes at a time. We developed a new model for gene clusters that covers faint conservation patterns and at the same time conﬁnes the de novo detection problem to a polynomial search space. The algorithm we developed for this project has an eﬃcient polynomial runtime. Substantial practical speed-up and a reduction of memory consumption was achieved by extensive algorithmic engineering: For a dataset of 678 bacteria, the initial version of the algorithm takes about 22.5 minutes and 10 GB RAM; the current release version takes about 20 minutes and uses only 2.5 GB RAM; a speed-up of about 10% while at the same time using 75% less memory. In a second part of this project, we developed statistical methods to evaluate the statistical signiﬁcance of the predicted clusters using random gene order as background model. We also study alternative approaches when dealing with genomes too closely related to justify this null hypothesis. We combined the detection algorithm and the statistical evaluation in an easy-to-use software package. The graphical user-interface displays the predictions ranked by signiﬁcance, hiding co-optimal variants of the same cluster. We applied our approach on fungal genomes: The reduced number of inner genome duplications in L. corymbifera compared to R. oryzae gives a strong indication that the WGD reported for R. oryzae happened after the speciation event separating both species. In an application to bacterial genomes, we demonstrated that reference gene clusters are highly informative signals for the automatic detection of operons. Genome analysis on the level of gene order depends on preprocessing of the genomic raw data. The crucial step for gene cluster detection is the partitioning of genes into homology families which is usually based on pairwise sequence similarities. This is known to be an error-prone process as genes do not naturally cluster into disjoint cliques. Therefore the goal was to extend our models and algorithms to work directly with pairwise sequence similarities. We have shown that this type of information can be represented when genomes are modeled as indeterminate strings, i.e sequences that can have more than one character per position and rephrased our models accordingly to work with this more complex structures. Existing gene cluster detection algorithms can be extended to work on this type of sequences. However, our newly developed runtime heuristic proofed to outperform these approaches in practice. Currently, these approaches are available for pairwise genome comparison, the extension to multiple genomes is ongoing work.

Publications

Eﬃcient Computation of Approximate Gene Clusters Based on Reference Occurrences. Proceedings of RECOMB-CG 2010, vol. 6398 of LNBI, pages 264-277, 2010.
K. Jahn
Swiftly Computing Center Strings. Proc. of WABI 2010, vol. 6293 of LNBI, pages 325–336
F. Hufsky, L. Kuchenbecker, K. Jahn, J. Stoye, S. Böcker
Eﬃcient Computation of Approximate Gene Clusters Based on Reference Occurrences. J. Comp. Biol. 18(9):1255-1274
K. Jahn
Swiftly Computing Center Strings. BMC Bioinformatics 12(1):106
F. Hufsky, L. Kuchenbecker, K. Jahn, J. Stoye, S. Böcker
Statistics for approximate gene clusters. BMC Bioinformatics 14(Suppl 15):S14. Proc. of RECOMB-CG 2013
K. Jahn, S. Winter, J. Stoye, S. Böcker
The Potential of Family-Free Genome Comparison. In: Models and Algorithms for Genome Evolution. C. Chauve, N. El- Mabrouk, E. Tannier (Eds); Computational Biology Series 19, Springer Verlag: 287–307
M.D.V. Braga, C. Chauve, D. Dörr, K. Jahn, J. Stoye, A. Thévenin, R. Wittler
Gene expansion shapes genome architecture in the human pathogen Lichtheimia corymbifera: an evolutionary genomics analysis in the ancient terrestrial Mucorales (Mucoromycotina) PLOS Genetics 10(8):e1004496
Schwartze, Volker U.; Winter, Sascha; Shelest, Ekaterina; Marcet-Houben, Marina; Horn, Fabian; Wehner, Stefanie; Linde, Jörg; Valiante, Vito; Sammeth, Michael; Riege, Konstantin; Nowrousian, Minou; Kaerger, Kerstin; Jacobsen, Ilse D.; Marz, Manja; Brakhage, Axel A.; Gabaldón, Toni; Böcker, Sebastian & Voigt, Kerstin
Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings. BMC Genomics 15(Suppl. 6):S2
Doerr, Daniel; Stoye, Jens; Böcker, Sebastian & Jahn, Katharina

Servicenavigation

Hauptnavigation

Algorithms for the Analysis of Approximate Gene Cluster (3AGC)

Final Report Abstract

Publications

Additional Information

Servicenavigation

Hauptnavigation

Algorithms for the Analysis of Approximate Gene Cluster (3AGC)

Final Report Abstract

Publications

Additional Information

Textvergrößerung und Kontrastanpassung