Algorithms for the Analysis of Approximate Gene Cluster (3AGC)
Final Report Abstract
The sequential ordering of genes on chromosomes underlies functional constraints that keep some genes collocated across large evolutionary distances. Detecting these collocations offers evidence on functional coupling and the roles of genes with previously unknown function. These collocations, known as gene clusters, can be rather faint signals, as their locations in the genomes may be incomplete, interspersed with unrelated genes, internally rearranged, and completely lost in some genomes. Preexisting methods for gene cluster detection use either oversimplified models that overlook clusters with diverse conservation patterns, or their runtime scales insufficiently with the number of studied genomes. The latter shortcoming exacerbates steadily with the increasing number of genomes available for whole genome analysis. The main goal of this project was to drive de novo detection in the direction of the analysis of hundreds of (bacterial) genomes at a time. We developed a new model for gene clusters that covers faint conservation patterns and at the same time confines the de novo detection problem to a polynomial search space. The algorithm we developed for this project has an efficient polynomial runtime. Substantial practical speed-up and a reduction of memory consumption was achieved by extensive algorithmic engineering: For a dataset of 678 bacteria, the initial version of the algorithm takes about 22.5 minutes and 10 GB RAM; the current release version takes about 20 minutes and uses only 2.5 GB RAM; a speed-up of about 10% while at the same time using 75% less memory. In a second part of this project, we developed statistical methods to evaluate the statistical significance of the predicted clusters using random gene order as background model. We also study alternative approaches when dealing with genomes too closely related to justify this null hypothesis. We combined the detection algorithm and the statistical evaluation in an easy-to-use software package. The graphical user-interface displays the predictions ranked by significance, hiding co-optimal variants of the same cluster. We applied our approach on fungal genomes: The reduced number of inner genome duplications in L. corymbifera compared to R. oryzae gives a strong indication that the WGD reported for R. oryzae happened after the speciation event separating both species. In an application to bacterial genomes, we demonstrated that reference gene clusters are highly informative signals for the automatic detection of operons. Genome analysis on the level of gene order depends on preprocessing of the genomic raw data. The crucial step for gene cluster detection is the partitioning of genes into homology families which is usually based on pairwise sequence similarities. This is known to be an error-prone process as genes do not naturally cluster into disjoint cliques. Therefore the goal was to extend our models and algorithms to work directly with pairwise sequence similarities. We have shown that this type of information can be represented when genomes are modeled as indeterminate strings, i.e sequences that can have more than one character per position and rephrased our models accordingly to work with this more complex structures. Existing gene cluster detection algorithms can be extended to work on this type of sequences. However, our newly developed runtime heuristic proofed to outperform these approaches in practice. Currently, these approaches are available for pairwise genome comparison, the extension to multiple genomes is ongoing work.
Publications
-
(2010) Efficient Computation of Approximate Gene Clusters Based on Reference Occurrences. Proceedings of RECOMB-CG 2010, vol. 6398 of LNBI, pages 264-277, 2010.
K. Jahn
-
(2010) Swiftly Computing Center Strings. Proc. of WABI 2010, vol. 6293 of LNBI, pages 325–336
F. Hufsky, L. Kuchenbecker, K. Jahn, J. Stoye, S. Böcker
-
(2011) Efficient Computation of Approximate Gene Clusters Based on Reference Occurrences. J. Comp. Biol. 18(9):1255-1274
K. Jahn
-
(2011) Swiftly Computing Center Strings. BMC Bioinformatics 12(1):106
F. Hufsky, L. Kuchenbecker, K. Jahn, J. Stoye, S. Böcker
-
(2013) Statistics for approximate gene clusters. BMC Bioinformatics 14(Suppl 15):S14. Proc. of RECOMB-CG 2013
K. Jahn, S. Winter, J. Stoye, S. Böcker
-
(2013) The Potential of Family-Free Genome Comparison. In: Models and Algorithms for Genome Evolution. C. Chauve, N. El- Mabrouk, E. Tannier (Eds); Computational Biology Series 19, Springer Verlag: 287–307
M.D.V. Braga, C. Chauve, D. Dörr, K. Jahn, J. Stoye, A. Thévenin, R. Wittler
-
(2014) Gene expansion shapes genome architecture in the human pathogen Lichtheimia corymbifera: an evolutionary genomics analysis in the ancient terrestrial Mucorales (Mucoromycotina) PLOS Genetics 10(8):e1004496
V. Schwartze, S. Winter, E. Shelest, M. Marcet-Houben, F. Horn, S. Wehner, J. Linde, V. Valiante, M. Sammeth, K. Riege, M. Nowrousian, o K. Kaerger, I. D. Jacobsen, M. Marz, A. A. Brakhage, T. Gabaldón, S. Böcker, K. Voigt
-
(2014) Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings. BMC Genomics 15(Suppl. 6):S2
D. Dörr, J. Stoye, S. Böcker, K. Jahn