Project Details

Machine Learning Methods for Chemical Informatics II (Maschinelle Lernmethoden für die Chemische Informatik II)

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term from 2007 to 2012
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 51114943
 
Final Report Year 2012

Final Report Abstract

The use of machine learning models in biochemical research places specific requirements on the reliability, robustness, and interpretability of such models. The complexity and constraints of many applications in these fields call for tailored or advanced methods.

For risk assessment, we investigated existing metrics for quantifying the confidence of predictions in chemoinformatics. Gaussian processes have been successfully applied in chemoinformatics to predict properties of new chemical compounds. They provide an estimate of variance, or error bar, along with the prediction itself. This error bar may be used, e.g., to discard individual uncertain predictions. We investigated the limits of error bars and compared different methods to assess the quality of confidence estimates.

We introduced visual explanations for individual predictions of kernel-based learning models. In our approach, the training examples that contribute most to a classifier decision are visualized, along with a quantification of their importance for the prediction. This allows users to understand predictions in terms of the objects in question, here molecules. A comprehensive study with test persons was conducted to quantify the effectiveness of our approach. The study used Ames mutagenicity as the biochemical application and revealed significant improvements in users' ability to judge the reliability of a prediction. Besides improving prediction accuracy, the explanatory components revealed insufficient coverage and important chemical characteristics of the training data.

Screening large libraries of chemical compounds against a biological target, e.g., a receptor or enzyme, is crucial in the hit discovery phase of drug discovery. Virtual screening can be seen as a ranking problem in which as many actives as possible should appear at the top of the ranking. Current methods use regression to predict each molecule's activity and then sort the predictions to obtain a ranking. We developed a top-k ranking algorithm (StructRank) that solves this problem directly, without the intermediate regression step. Our approach empirically outperforms regression methods and a common ranking algorithm (RankSVM) in terms of actives found. StructRank is publicly available. Its corresponding journal publication is among the most read in the Journal of Chemical Information and Modeling.

When new biological targets are investigated, often only few measurement data are available for the new target, while more data exist for related targets. Similarly, in classical quantitative structure-property relationships, separate linear models for the same property are established per group of compounds, assuming substituent effects to be additive inside each group, but not across groups. We investigated the use of multi-task learning to exploit relationships between data sets. Improvements over single models were found empirically in situations where only limited annotated data was available.

The Institute for Pure and Applied Mathematics (IPAM) long program on "Navigating Chemical Compound Space for Materials and Bio Design" (2011) at the University of California, Los Angeles, USA brought together researchers from physics, chemistry, biology, and mathematics. Three new international collaborations were established, dealing with estimation of atomization energies of organic molecules, estimation of kinetic energies based on electron densities, and characterization of transition state surfaces. The program proved to be highly productive, yielded new applications of machine learning, and raised awareness of machine learning in several communities, including physical chemistry and materials science.
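
As an illustration of the Gaussian process error bars mentioned in the abstract, the following minimal sketch computes a predictive mean and standard deviation and discards predictions whose error bar exceeds a threshold. It is not the project's implementation: the squared-exponential kernel, the noise level, the one-dimensional toy inputs (real applications would use molecular descriptors), the threshold, and all function names are assumptions chosen for illustration.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two scalars (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def solve_upper_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-2):
    """Return (mean, std) of the latent function at x_new (zero-mean prior)."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_transposed(L, solve_lower(L, y_train))
    k_star = [rbf(x, x_new) for x in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_new, x_new) - sum(t * t for t in v)
    return mean, math.sqrt(max(var, 0.0))

# Toy data (hypothetical): one query close to the training data, one far away.
x_train = [0.0, 1.0, 2.0]
y_train = [0.0, 0.8, 0.9]
queries = [1.5, 10.0]
threshold = 0.5  # assumed cut-off on the error bar

predictions = [(x, *gp_predict(x_train, y_train, x)) for x in queries]
kept = [(x, mean) for x, mean, std in predictions if std <= threshold]

for x, mean, std in predictions:
    status = "kept" if std <= threshold else "discarded"
    print(f"x={x}: mean={mean:.3f}, std={std:.3f} ({status})")
```

In this toy setting the query far from the training data receives an error bar close to the prior standard deviation and is discarded, while the query between training points is kept. How well such error bars reflect actual prediction errors is exactly the question the project investigated, so the threshold here should be read as a placeholder.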

Publications

  • Explaining kernel based predictions in drug design. In 4th International Workshop on Machine Learning in Systems Biology (MLSB 2010), Edinburgh, Scotland, October 15-16, 2010
    Katja Hansen, David Baehrens, and Klaus-Robert Müller
  • From machine learning to novel agonists of the peroxisome proliferator-activated receptor. In 24th Annual Conference on Neural Information Processing Systems (NIPS 2010) Workshop on Charting Chemical Space: Challenges and Opportunities for AI and Machine Learning, Whistler, Canada, December 10-11, 2010
    Matthias Rupp
  • Graph kernels for chemoinformatics. A critical discussion. In 6th German Conference on Chemoinformatics, Goslar, Germany, November 7-9, 2010
    Matthias Rupp
  • Predictive variance of Gaussian processes and confidence estimation. In NIPS (Advances in Neural Information Processing Systems) Workshop on Charting Chemical Space, Whistler, Canada, December 11, 2010
    Katja Hansen
  • StructRank: A new approach for ligand-based virtual screening. Journal of Chemical Information and Modeling, 51(1):83-92, 2010
    Fabian Rathke, Katja Hansen, Ulf Brefeld, and Klaus-Robert Müller
  • Editorial: Charting Chemical Space: Challenges and Opportunities for Artificial Intelligence and Machine Learning. Molecular Informatics, 30(9):751-752, 2011
    Pierre Baldi, Klaus-Robert Müller and Gisbert Schneider
  • Interpretation and explanation of kernel-based prediction models. In 242nd Annual Meeting of the American Chemical Society, Denver, Colorado, USA, August 28-September 1, 2011
    Katja Hansen, David Baehrens, Timon Schroeter, Matthias Rupp, and Klaus-Robert Müller
  • Spherical harmonics coefficients for ligand-based virtual screening of cyclooxygenase inhibitors. PLoS ONE, 6(7):e21554, 2011
    Quan Wang, Kerstin Birod, Carlo Angioni, Sabine Grösch, Tim Geppert, Petra Schneider, Matthias Rupp, and Gisbert Schneider
  • Visual interpretation of kernel-based prediction models. Molecular Informatics, 30(9):817-826, 2011
    Katja Hansen, David Baehrens, Timon Schroeter, Matthias Rupp, and Klaus-Robert Müller
  • DOGS: Reaction-driven de novo design of bioactive compounds. PLoS Computational Biology, 8(2):e1002380, 2012
    Markus Hartenfeller, Heiko Zettl, Miriam Walter, Matthias Rupp, Felix Reisen, Ewgenij Proschak, Sascha Weggen, Holger Stark, and Gisbert Schneider
 
 
