Project Details
GML4Space: Generative Machine Learning Operating on Chemical Fragment Spaces
Applicant
Professor Dr. Matthias Rarey
Subject Area
Organic Molecular Chemistry - Synthesis and Characterisation
Theoretical Chemistry: Molecules, Materials, Surfaces
Term
since 2025
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 561190157
In early-phase drug discovery, several methods exist for identifying novel, small, organic bioactive compounds, including chemical similarity searching by topology and shape, pharmacophore matching, molecular docking and, more recently, supervised machine learning (ML) models. Traditionally, the search was performed on large catalogues of small molecules, either experimentally (high-throughput screening) or computationally (virtual screening). Given the sheer size of chemical space, newer approaches, especially fragment-based and de novo design, are promising alternatives. Here, small fragment binders are located first and then either combined or grown into larger molecules. Recently, de novo design based on generative ML has received significant attention. The disadvantage of these approaches is that every designed compound has to be synthesized individually, which is time-consuming and costly. In parallel to the rise of ML, the concepts of combinatorial chemistry and chemical fragment spaces emerged. Building on these concepts, compound vendors like Enamine or WuXi created large make-on-demand compound collections. Today, Enamine REAL contains about 50 billion compounds, and other collections contain trillions or more. Since these spaces are too large to be handled molecule by molecule, combinatorial algorithms have emerged to search and navigate them. While solutions for handling chemical fragment spaces exist for many search scenarios, the combination of fragment spaces with generative ML is largely unexplored.
This project aims to combine generative ML of molecules with chemical fragment spaces. A cascade of new methods enabling the efficient use of supervised ML on chemical fragment spaces will be developed. In a first phase, generic optimization algorithms will be combined with ML models to identify bioactive molecules in fragment spaces. Next, new techniques for describing chemical matter as molecules from fragment spaces will be developed. These encodings ensure that all compounds described are indeed contained in the search space. At the same time, they sensitively model aspects of molecular similarity. Generative machine learning can thereby operate directly on chemical fragment spaces, creating only molecules that are contained in a predefined search space like Enamine REAL. In a final phase, explainable ML techniques will be used to extract knowledge about the importance of individual fragments in compounds and to apply this knowledge directly to the selection of optimized bioactive compounds. After careful validation, a series of new ML approaches operating directly on chemical fragment spaces will emerge.
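To illustrate why fragment spaces are handled combinatorially rather than molecule by molecule, and what it means for an encoding to describe only compounds contained in the search space, the following Python sketch may help. It is a toy model under stated assumptions, not the project's method: the `Reaction` class, the `SPACE` contents, and the (reaction index, building-block indices) encoding are all hypothetical, and the sketch does not model molecular similarity or real chemistry.

```python
from dataclasses import dataclass
from math import prod
from random import Random


@dataclass(frozen=True)
class Reaction:
    """Hypothetical reaction scheme with one building-block pool per position."""
    name: str
    pools: tuple[tuple[str, ...], ...]

    def size(self) -> int:
        # The number of products is the product of the pool sizes;
        # no individual product is ever enumerated.
        return prod(len(pool) for pool in self.pools)


# Toy fragment space. All names are placeholders, not real vendor data.
SPACE = (
    Reaction("amide_coupling", (("acid_A", "acid_B", "acid_C"), ("amine_A", "amine_B"))),
    Reaction("three_component", (("x1", "x2"), ("y1", "y2", "y3"), ("z1", "z2"))),
)


def space_size(space) -> int:
    """Total number of products across all reactions, computed without enumeration."""
    return sum(reaction.size() for reaction in space)


def decode(space, code):
    """Map a (reaction index, building-block indices) code to named fragments."""
    r_idx, bb_idx = code
    reaction = space[r_idx]
    if len(bb_idx) != len(reaction.pools):
        raise ValueError("wrong number of building blocks for this reaction")
    return reaction.name, tuple(pool[i] for pool, i in zip(reaction.pools, bb_idx))


def sample(space, rng):
    """Draw a uniformly random member of the space, again without enumeration."""
    weights = [reaction.size() for reaction in space]
    r_idx = rng.choices(range(len(space)), weights=weights)[0]
    bb_idx = tuple(rng.randrange(len(pool)) for pool in space[r_idx].pools)
    return r_idx, bb_idx


if __name__ == "__main__":
    print(space_size(SPACE))
    print(decode(SPACE, sample(SPACE, Random(0))))
```

Because every valid code decodes to a combination of building blocks from the predefined pools, anything a generator emits in this representation is a member of the space by construction. The pool sizes multiply, which is why real make-on-demand spaces reach billions of compounds from comparatively small building-block lists.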
DFG Programme
Priority Programmes
