Project Details
Coupled Storage System for Efficient Management of Self-Describing Data Formats
Applicant
Professor Dr. Michael Kuhn
Subject Area
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term
from 2019 to 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 417705296
Over the last decades, societies have come to rely more than ever on progress in information technology. In scientific research in particular, this progress makes it possible to solve increasingly complex problems, which nowadays require the computational power of supercomputers. The rising complexity of these problems and the growth in computational power lead to rapidly increasing data volumes; the globally produced data volume doubles approximately every two years, resulting in an exponentially growing data deluge. This poses a serious problem because storage and network technologies are developing considerably more slowly. The result is a widening gap between the performance of computing and storage devices, which creates a storage bottleneck. This is especially true for the large-scale systems found in high-performance computing. To ease this situation, a hierarchy of different storage devices is used to meet the demand for high capacity on the one hand and for high speed and reliability on the other. By combining the advantages of different storage technologies, overall performance is significantly increased while the costs of acquisition, operation and maintenance are reduced. However, for future exascale systems, these difficulties will become even more severe, requiring critical improvements in order to exploit the systems' capabilities. The existing input/output (I/O) stack leads to additional performance and management issues.
The produced data is typically stored using self-describing data formats to facilitate exchange and analysis within the scientific community. The goal of the project is to explore the benefits of a coupled storage system for these formats. It will introduce a novel hybrid approach that leverages storage technologies from the fields of high-performance computing and database systems, using each technology according to its respective strengths and weaknesses.
By coupling the storage system tightly with self-describing data formats, it can use structural information about the data to select appropriate storage technologies and tiers. Because such information is currently not available, storage systems have to employ heuristics, which often lead to suboptimal performance as well as unnecessary and expensive data movement. Moreover, the storage system will support adaptable I/O semantics so that its performance can be tuned to application and data format requirements. Together, these features will enable completely new data management methods and provide significant performance improvements. Existing workflows of scientific users will be supported through a dedicated data analysis interface. All changes will be thoroughly tested to ensure backwards compatibility with existing applications and interfaces. Consequently, no modifications will be necessary to run applications on top of the proposed storage system, CoSEMoS, which helps preserve past investments in scientific software development.
DFG Programme
Research Grants
Cooperation Partner
Professor Dr. Thomas Ludwig