Project Details

C5: Collaborative and Cross-Context Cluster Configuration for Distributed Data-Parallel Processing

Subject Area Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Security and Dependability, Operating-, Communication- and Distributed Systems
Term since 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 506529034
 
Many organizations routinely analyze large datasets today. For this, they use distributed data-parallel processing systems and take advantage of clusters of commodity resources. Data processing frameworks and cloud computing enable smaller organizations and individual users in particular to work with large datasets at a high level of abstraction. Still, users are required to configure adequate resources for their data processing jobs. This is often not straightforward, and users frequently overprovision resources for their jobs, leading to low resource utilization as well as high costs and energy consumption. Numerous works have addressed this problem in the last decade for big data frameworks, scientific workflows, and machine learning systems, using statistical tools and performance models. However, much of the effort has focused on industry settings, either assuming that data on previous executions of jobs is available or relying on potentially costly dedicated profiling. Little research has addressed use cases where runtime data is not as easily available. Addressing this research gap, we aim to develop new methods for the collaborative usage of runtime data in the proposed project, C5. We believe that sharing runtime information across different execution contexts presents a significant opportunity for performance modeling and model-based resource management in many situations, especially when the availability of runtime data is limited, and that it will improve the efficiency of distributed data-parallel processing.
The methods we plan to develop and evaluate in this project include:
- Similarity measures for computational resources and processing jobs to support the use of runtime data and performance models across execution contexts
- Model selection and combination methods for robust performance estimations, even if limited training data is available or model components were trained in other contexts
- Adjustment strategies that allow training data, performance models, and resource configurations to be updated efficiently at runtime

In addition to new methods for cross-context cluster configuration optimization based on shared performance data and models, we plan to conduct a thorough analysis of real workloads, design reproducible experiments based on infrastructure-as-code definitions and benchmarks, and provide a working implementation of the envisioned data sharing platform to the general public and ongoing collaborative research projects.
DFG Programme Research Grants
 
 

