Peak intensity prediction in mass spectrometry data using machine learning algorithms

Applicant Professor Tim Nattkemper, Ph.D., since 7/2006

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing

Term from 2006 to 2010

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 16666471

Final Report Year 2009

Final Report Abstract

The aim of this project was to model peptide-specific sensitivities in mass spectra in order to enhance label-free protein quantification by mass spectrometry. Prior to this work, it was unknown whether this is feasible at all. A combination of simulation and unsupervised learning methods for non-linear regression was chosen to map peptides to their peptide-specific sensitivities. As a model system, spectra from matrix-assisted laser desorption ionization (MALDI) of proteins separated with a two-dimensional gel (2D-PAGE) were used. In these spectra, usually only one protein is found. This implies that peak intensities extracted from these spectra can be interpreted directly as peptide-specific sensitivities. This work is the first to evaluate the prediction of peptide-specific sensitivities on new peptides. Our results show that the achieved prediction accuracies are promising and that the prediction of peptide-specific sensitivities is indeed feasible. With our approach, significant correlations between target and predicted values are achieved. With support vector regression (SVR) and a low-dimensional, purely string-based peptide representation, a Pearson’s correlation of r = 0.68 is achieved. The supervised non-linear regression method SVR outperforms the other methods in most cases, again underlining the strong predictive capability of this method. Our main focus was to determine an appropriate encoding of peptides, which are given as strings, as input for the learning algorithm. Knowledge extraction with feature selection methods leads to the rediscovery of known properties, as well as the discovery of new ones, that are relevant for this problem. Our results indicate a higher sensitivity for arginine-containing peptides, which is well known from other works. In addition, the feature selection results indicate that conformation seems to be more important for predicting peak intensities than specific secondary structure elements.
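To make the two ingredients mentioned above more concrete, the following minimal sketch shows one possible string-based peptide encoding and the computation of Pearson's correlation between target and predicted values. The feature set (peptide length plus relative amino acid frequencies) and the function names `encode_peptide` and `pearson_r` are assumptions made for illustration; they are not the feature set or code used in the project, and the regression step itself (for example SVR) is omitted.

```python
import math

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_peptide(sequence):
    """Hypothetical string-based encoding of a peptide.

    Returns a 21-dimensional vector: the peptide length followed by the
    relative frequency of each standard amino acid. This is an illustrative
    assumption, not the representation used in the project.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty peptide sequence")
    return [float(n)] + [seq.count(aa) / n for aa in AMINO_ACIDS]


def pearson_r(targets, predictions):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(targets) != len(predictions) or len(targets) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(targets) / len(targets)
    mean_p = sum(predictions) / len(predictions)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, predictions))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)
```

In such a setup, the vectors returned by `encode_peptide` would serve as input to a regression method, and `pearson_r` would be applied to the measured and predicted intensities of peptides that were held out from training.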
In addition to incorporating existing regression algorithms such as SVR or local linear models, we also developed a new regression algorithm based on the principles of unsupervised learning, local linear models, and ensemble learning. This new ensemble architecture, called LERRANCO [1,2], combines the benefits of efficient, fast, and compact local linear modeling with ensemble learning ideas and is able to achieve better prediction accuracies. In fact, we found the proposed ensemble architecture to be superior in a regression application for the prediction of peak intensities. For other benchmark datasets, we were able to achieve results comparable to those obtained by other reference ensemble architectures. Based on our studies and newly proposed feature computation approaches, we are now able to propose the first integrated pipeline for automated peak intensity prediction. This is a vital step towards the improvement of label-free quantitative proteomics. It is very important to repeat our study with experimental datasets from shotgun proteomics: here, proteins are first digested, and the resulting peptides are separated by liquid chromatography and then measured using mass spectrometry. If one wants to quantify proteins from such data, peak intensity prediction allows peptide intensities to be corrected and hence results in better quantification accuracy. Shotgun proteomics uses a different ionization process (electrospray ionization, ESI, instead of MALDI), which requires retraining of the predictors. Unfortunately, no reference training sets from shotgun proteomics were available when this project started. Another next step is the integration of the developed software into proteomics analysis toolboxes such as the recently proposed Qupe software. This would be the basis for applications and evaluations of our approach in wet lab experiments on a larger scale.
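The idea of correcting peptide intensities with predicted sensitivities can be sketched as follows. The report does not specify the form of the correction, so this example assumes the simplest one: each observed peak intensity is divided by the predicted sensitivity of its peptide, and the protein abundance is then estimated as the median of the corrected values. The function names, the `min_sensitivity` threshold, and the use of the median are all hypothetical choices.

```python
from statistics import median


def correct_intensities(observed, predicted_sensitivity, min_sensitivity=1e-9):
    """Divide observed peak intensities by predicted peptide sensitivities.

    observed: dict mapping peptide sequence -> measured peak intensity
    predicted_sensitivity: dict mapping peptide sequence -> predicted sensitivity
    Peptides without a usable prediction are skipped (an assumption).
    """
    corrected = {}
    for peptide, intensity in observed.items():
        sensitivity = predicted_sensitivity.get(peptide)
        if sensitivity is None or sensitivity < min_sensitivity:
            continue  # no reliable prediction for this peptide
        corrected[peptide] = intensity / sensitivity
    return corrected


def protein_abundance(corrected):
    """Estimate protein abundance as the median corrected peptide intensity."""
    if not corrected:
        raise ValueError("no corrected peptide intensities available")
    return median(corrected.values())
```

With this correction, two peptides of the same protein that ionize with different efficiency would yield similar corrected values, which is the property that quantification across peptides relies on.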

Publications

Peak intensity prediction for PMF mass spectra using support vector regression. In Proc. of the 7th International FLINS Conference on Applied Artificial Intelligence, 2006
W. Timm, S. Böcker, T. Twellmann, and T. W. Nattkemper
Neural network approach for mass spectrometry prediction by peptide prototyping. In Proc. of International Conference on Artificial Neural Networks, ICANN 2007, Part II, LNCS 4669, pages 90–99
A. Scherbart, W. Timm, S. Böcker, T. W. Nattkemper
SOM-based Peptide Prototyping for Mass Spectrometry Peak Intensity Prediction. In: Proc. of Workshop on Self-Organizing Maps, WSOM 2007
A. Scherbart, W. Timm, S. Böcker, T. W. Nattkemper
Improved Mass Spectrometry Peak Intensity Prediction by Adaptive Feature Weighting. In: Proc. of International Conference on Neural Information Processing, ICONIP 2008
A. Scherbart, W. Timm, S. Böcker, T. W. Nattkemper
Peak intensity prediction in MALDI-TOF mass spectrometry: a machine learning study to support quantitative proteomics. BMC Bioinformatics 9 (2008) 443
Timm, Wiebke; Scherbart, Alexandra; Böcker, Sebastian; Kohlbacher, Oliver & Nattkemper, Tim W.
The Diversity of Regression Ensembles Combining Bagging and Random Subspace Method. In: Proc. of International Conference on Neural Information Processing, ICONIP 2008
A. Scherbart, T. W. Nattkemper
Contribution to OpenMS - An open-source framework for mass spectrometry: Class PeakIntensityPredictor added for peak intensity prediction since version 1.3 (February 13th, 2009)
A. Scherbart
