Scalable Algorithms for Reconstruction of Plant Phylogenies in Conjunction with the NSF (National Science Foundation) iPlant Collaborative
Summary of Project Results
The main goal of this project was to develop the software required to infer the largest Maximum Likelihood-based plant phylogeny to date. Although the resulting tree is smaller than initially planned, mainly because of data assembly and availability issues, it is the largest plant phylogeny inferred to date. We also showed that such large trees are accurate enough for meaningful post-analyses such as the identification of diversification rate shifts. In addition, we developed and released the RAxML-Light software, which can infer trees that are huge with respect to both the number of taxa and the number of genes on supercomputing systems using the Message Passing Interface (MPI). RAxML-Light is currently the only production-level software that offers a fine-grained MPI-based parallelization of the phylogenetic likelihood function (PLF) for inferring huge trees on distributed-memory systems. Its checkpointing feature is a prerequisite for deploying such codes on typical cluster and supercomputer configurations. We also carried out low-level technical work to exploit modern hardware by using 256-bit wide AVX vector instructions, which makes efficient use of the hardware resources available on modern servers. We have also shown that the code can be used to infer trees comprising more than 100,000 taxa. The tree search convergence criterion we developed performs well and avoids unnecessary likelihood calculations in the asymptotic convergence phase of the tree search. The criterion has been integrated as a standard program option in RAxML and RAxML-Light. Using single-precision instead of double-precision floating-point arithmetic did not work well, because single precision led to a high degree of numerical instability that appears to be impossible to resolve. Furthermore, we explored novel tree search techniques and developed two novel mechanisms for reducing the memory footprint of likelihood-based analyses.
The two orthogonal techniques, Subtree Equality Vectors (SEVs) and recomputation, are generic, that is, they can be deployed in all likelihood-based programs (ML and Bayesian inference codes). We are convinced that RAM requirements will be a limiting factor for future large-scale phylogenetic analyses. A project that is still ongoing is the development of the perpetual tree pipeline, which can be used to maintain an up-to-date, comprehensive reference tree containing all taxa of a specific taxonomic rank. This framework can be used to maintain perpetually updated trees for any organismal group. We anticipate making it publicly available by mid-2013. In summary, we have developed and released open-source code that can analyze datasets one order of magnitude larger than was possible before this project. Moreover, we have shown that so-called mega-phylogenies inferred with RAxML-Light on real biological data are plausible and accurate. Finally, we have developed and published novel, generally applicable methods for reducing the number of floating-point operations and the memory requirements of likelihood calculations.
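The recomputation idea mentioned above, trading running time for memory, can be illustrated with a minimal Python sketch. It is not the RAxML-Light implementation. It assumes a toy four-state model with fixed, hypothetical transition probabilities (`P_SAME`, `P_DIFF`), a hypothetical `RecomputingStore` class that keeps at most `slots` inner-node likelihood vectors, and a simple least-recently-used replacement rule chosen only for illustration. Vectors that have been evicted are recomputed from the children when they are requested again.

```python
from collections import OrderedDict

# Hypothetical fixed transition probabilities (each row sums to 1.0).
P_SAME, P_DIFF = 0.7, 0.1


def encode(seq):
    """One-hot tip vectors: one 4-entry list (A, C, G, T) per alignment site."""
    return [[1.0 if base == b else 0.0 for b in "ACGT"] for base in seq]


def combine(child_vectors):
    """Toy pruning step: combine the children's vectors into the parent's."""
    parent = []
    for site in range(len(child_vectors[0])):
        entry = []
        for s in range(4):
            value = 1.0
            for child in child_vectors:
                value *= sum((P_SAME if s == t else P_DIFF) * child[site][t]
                             for t in range(4))
            entry.append(value)
        parent.append(entry)
    return parent


class Node:
    def __init__(self, name, children=(), tip_vector=None):
        self.name = name
        self.children = list(children)
        self.tip_vector = tip_vector  # set for leaves only


class RecomputingStore:
    """Keeps at most `slots` inner vectors; evicted ones are recomputed."""

    def __init__(self, slots):
        self.slots = slots
        self.cache = OrderedDict()
        self.computations = 0  # number of combine() calls performed

    def vector(self, node):
        if node.tip_vector is not None:
            return node.tip_vector
        if node.name in self.cache:
            self.cache.move_to_end(node.name)  # mark as recently used
            return self.cache[node.name]
        # Miss: rebuild from the children (temporaries live on the call stack).
        result = combine([self.vector(c) for c in node.children])
        self.computations += 1
        self.cache[node.name] = result
        if len(self.cache) > self.slots:
            self.cache.popitem(last=False)  # evict least recently used
        return result


def demo():
    a, b = Node("A", tip_vector=encode("AC")), Node("B", tip_vector=encode("AC"))
    c, d = Node("C", tip_vector=encode("AG")), Node("D", tip_vector=encode("TG"))
    x, y = Node("X", [a, b]), Node("Y", [c, d])
    root = Node("R", [x, y])
    full, small = RecomputingStore(slots=3), RecomputingStore(slots=1)
    results = []
    for store in (full, small):
        store.vector(root)
        store.vector(x)
        results.append(store.vector(root))
    return full.computations, small.computations, results[0] == results[1]
```

Calling `demo()` queries the same nodes against a store that holds all three inner vectors and against one that holds a single vector. Both return the same root vector, but the constrained store performs more `combine` steps, which is the running-time cost paid for the smaller memory footprint.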
Project-Related Publications (Selection)
-
Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees. BMC Bioinformatics, 12(1):470, 2011
F. Izquierdo-Carrasco, S.A. Smith, and A. Stamatakis
-
Computing the phylogenetic likelihood function out-of-core. In Proceedings of the IPDPS-HiCOMB 2011, 2011
F. Izquierdo-Carrasco and A. Stamatakis
-
Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany, 98(3):404–414, March 2011
Stephen A. Smith, Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue
-
"Computing the Phylogenetic Likelihood Function Out-of-Core", IEEE HiCOMB 2011 workshop (held in conjunction with IPDPS 2011), Anchorage, USA, May 2011
Fernando Izquierdo-Carrasco
-
RAxML-Light: a tool for computing terabyte phylogenies. Bioinformatics, 28(15):2064–2066, 2012
A. Stamatakis, A. J. Aberer, C. Goll, S. A. Smith, S. A. Berger, and F. Izquierdo-Carrasco
-
Trading running time for memory in phylogenetic likelihood computations. In Jan Schier, Carlos Manuel B. A. Correia, Ana L. N. Fred, and Hugo Gamboa, editors, BIOINFORMATICS, pages 86–95. SciTePress, 2012
Fernando Izquierdo-Carrasco, Julien Gagneur, and Alexandros Stamatakis
-
"Inference of Huge Trees under Maximum Likelihood", IPDPS PhD forum, Shanghai, China, May 2012
Fernando Izquierdo-Carrasco, Alexandros Stamatakis
-
"Trading Memory for Running Time in Phylogenetic Likelihood Computations", 2012 Bioinformatics conference, Vilamoura, Portugal, February 2012
Fernando Izquierdo-Carrasco