Project Details

HySim: Hybrid-parallel similarity search for the analysis of big genomic and proteomic data

Subject Area Bioinformatics and Theoretical Biology; Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term from 2016 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 329350978
 
Recent years have seen a tremendous increase in the volume of data generated in the life sciences. The analysis of these data sets poses difficult computational challenges and is an active field of research. A popular strategy in data-rich scenarios across many areas of science and industry is currently to adopt big data technologies. However, the characteristics of typical biological data sets and their intended uses differ significantly from most other big data application areas. Biological data processing often requires more complex analysis techniques than big data technology can afford, as it is often constrained to algorithms or heuristics with linear or sublinear complexity. In many application scenarios, rough approximations of the true outcome are perfectly acceptable, but in the life sciences this is rarely the case: a biomedical application will typically be unable to tolerate even moderate numbers of classification mistakes. Consequently, the computational life sciences today tend to rely on a different computational model for large-scale applications, namely high-performance computing (HPC). However, HPC is tailored more towards problems with a significant amount of computational work (big compute) than towards those with enormous storage requirements (big data). The peculiarities of biological data sets and the complexity of the required data analysis pose challenges that neither of the two approaches is perfectly suited to overcome. Instead, a hybrid approach that combines ideas from big data with HPC methodologies might be preferable, as ideas from big data algorithms can help flexible and highly performant HPC methods scale to data sets that would otherwise be too large for them.

In this project, we propose to study such hybrid methods in order to meet the challenge of processing large-scale genomic and proteomic data sets efficiently yet accurately. Our particular focus is similarity search, an important algorithmic technique for a number of applications in both genomics and proteomics. The corresponding data sets are produced by two types of high-throughput technologies: Next Generation Sequencers (NGS) and Mass Spectrometers (MS).

Our specific project goals are threefold: (i) Design of efficient and accurate big data algorithms for similarity search in NGS data, based on locality-sensitive hashing (LSH) techniques, with applications to metagenomics and read error correction. (ii) Design of efficient and accurate big data algorithms for similarity search in MS raw data, based on LSH techniques, with applications to proteomics. (iii) Development of efficient implementations of these new algorithms on a hybrid big data/HPC platform that provide strong scalability for large-scale NGS and MS data sets.
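To give a concrete flavour of the LSH-based similarity search named in goal (i), the following minimal Python sketch applies MinHash with banding to the k-mer sets of short sequencing reads: reads whose signatures collide in any band become candidate pairs for a subsequent exact comparison. The parameter choices (k-mer length, number of hash functions, band width) and all function names are illustrative assumptions, not the project's actual algorithms or implementation.

    import random
    from collections import defaultdict
    from typing import Dict, List, Set, Tuple

    K = 16           # assumed k-mer length (illustrative)
    NUM_HASHES = 32  # MinHash signature length (illustrative)
    BAND = 4         # rows per LSH band (illustrative)

    def kmer_set(read: str, k: int = K) -> Set[int]:
        """All k-mers of a read, hashed to integers (consistent within one run)."""
        return {hash(read[i:i + k]) for i in range(len(read) - k + 1)}

    def minhash(kmers: Set[int], seeds: List[int]) -> Tuple[int, ...]:
        """MinHash signature: for each seed, keep the minimum seeded hash value.
        Reads with high k-mer (Jaccard) similarity tend to agree per position."""
        return tuple(min((h ^ s) & 0xFFFFFFFFFFFFFFFF for h in kmers) for s in seeds)

    def lsh_candidates(reads: List[str]) -> Dict[Tuple[int, Tuple[int, ...]], List[int]]:
        """Band the signatures; reads colliding in any band are candidate pairs
        that a more expensive, exact similarity check would then verify."""
        seeds = [random.getrandbits(64) for _ in range(NUM_HASHES)]
        buckets: Dict[Tuple[int, Tuple[int, ...]], List[int]] = defaultdict(list)
        for rid, read in enumerate(reads):
            sig = minhash(kmer_set(read), seeds)
            for b in range(0, NUM_HASHES, BAND):
                buckets[(b, sig[b:b + BAND])].append(rid)
        return buckets

    if __name__ == "__main__":
        reads = [
            "ACGTACGTACGTACGTACGTACGTACGTACGT",
            "ACGTACGTACGTACGTACGTACGTACGTACGA",  # near-duplicate of the first read
            "TTGGCCAATTGGCCAATTGGCCAATTGGCCAA",
        ]
        for bucket, members in lsh_candidates(reads).items():
            if len(members) > 1:
                print("candidate pair:", members)

The sketch illustrates the general big-data pattern the abstract refers to: a sublinear candidate-generation step (hash collisions instead of all-vs-all comparison) followed by an exact verification step, which is where the accuracy requirements of biomedical applications are enforced.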
DFG Programme Research Grants
 
 
