Project Details
A community-centered open-source framework for large-scale analysis of tandem mass spectrometry data
Applicant
Professor Dr. Florian Huber
Subject Area
Bioinformatics and Theoretical Biology
Medical Informatics and Medical Bioinformatics
Term
since 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 528775510
Tandem mass spectrometry (MS/MS) is a critical tool in modern molecular biology, enabling the detection, structural identification, and quantification of biomolecules with high accuracy and specificity. Over the past two decades, mass spectrometry has undergone a rapid technological transformation, with a shift to high-resolution tandem mass spectrometry (HRMS/MS) and to systems with increased throughput. The resulting rapid growth in the size and number of mass spectral datasets creates entirely new possibilities for knowledge extraction but requires new computational tools and methods. Currently, however, the landscape of HRMS/MS data analysis methods is fragmented, so that many researchers develop their own ad hoc solutions rather than relying on robust, reliable, accessible, community-endorsed tools. We believe that matchms, an open-source Python library we first released in May 2020, can become such a central framework in the field of HRMS/MS data analysis. Matchms was developed by a team of professional research software developers together with domain researchers. Matchms and several accessory libraries for cutting-edge HRMS/MS data analysis (e.g., Spec2Vec, MS2DeepScore) were developed according to high code standards and software development best practices. This manifests, for instance, in very high unit test coverage, a clean code design, and extensive documentation. At the same time, the development was guided by actual scientific research questions, which has led to rapid adoption among developers in the fields of computational metabolomics and cheminformatics. Matchms is a powerful tool for HRMS/MS analysis but comes with two major limitations that we aim to address with the proposed project. The first limitation is the computational expertise required for effective use of our software. This makes it difficult to reach a wider community of researchers who could greatly benefit from its capabilities.
We will address this gap by developing more accessible graphical user interfaces for running central analysis workflows without the technical barrier of Python scripts. The second limitation we have identified is scalability. Matchms has already been optimized for computational efficiency and can handle relatively large datasets. Still, we foresee a drastic growth in both the number and size of HRMS/MS datasets. To facilitate entirely novel large-scale HRMS/MS data analysis workflows, we will aim at maximum scalability by implementing parallelization and further optimizing central algorithms. Finally, facilitated by an international group of collaborators as well as through annual workshops and frequent introductory courses, this project aims to expand both the user and co-developer community of matchms. This will not only result in a much larger impact of our research software, but will also go hand in hand with our goal to secure long-term reliability and support.
DFG Programme
Research Grants
International Connection
Belgium, Czech Republic, Denmark, Netherlands, Switzerland
Co-Investigator
Professorin Alina Huldtgren, Ph.D.