Project Details
Deep Learning of Protein Families and Multiple Sequence Alignments
Applicant
Professor Dr. Mario Stanke
Subject Area
Bioinformatics and Theoretical Biology
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 539129343
Alignments serve as a foundational tool for understanding living organisms at the molecular level. However, accurately constructing multiple sequence alignments (MSAs) for diverse and increasingly large protein families remains an unsolved problem. As the pool of available related sequences expands, challenges emerge, primarily because existing alignment methods are not equipped to capitalize on the growing data. At the same time, opportunities have arisen. Deep learning methods now determine more precisely whether two residues from different protein sequences evolved from the same site in a common ancestor, outperforming the comparatively simple amino acid scoring schemes of current alignment programs. This is due to rich structural, evolutionary and biophysical features that are implicitly learned across millions of diverse protein sequences at the resolution of individual residues. Our objective is to construct more precise MSAs using modern end-to-end machine learning techniques that combine emerging protein language models with established evolutionary models. Moreover, we aim to develop the first model-based, alignment-free tool capable of sensitively searching for homologs of a protein family, eliminating the need to construct an MSA for the family at all. The proposal builds on and extends our tool learnMSA, which follows a new paradigm for obtaining MSAs: a profile hidden Markov model (HMM) is learned directly from unaligned sequences using gradient descent. This constitutes a change in perspective: traditionally, the MSA was the starting point and the profile HMM was derived from it; we instead learn the model first, which makes it possible to bypass the MSA in any downstream task that depends on a profile model.
DFG Programme
Research Grants