Project Details

Fault Tolerance and Locality-Aware Work Stealing for Dynamically Generated Dependent Tasks on Clusters

Subject Area Computer Architecture, Embedded and Massively Parallel Systems
Software Engineering and Programming Languages
Term since 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 512078735
 
Programming environments for today's supercomputers must support the design of efficient programs and handle issues such as: programmer productivity, i.e., the human effectiveness in writing programs; application irregularity, i.e., the limited planability of the computation; and fault tolerance, i.e., the ability to cope with hardware failures during program execution.
While these issues are hard to address with traditional parallel programming environments, a promising paradigm to tackle all of them together is Asynchronous Many-Task (AMT) programming. Here, the computation is coded into many small work packages (tasks), which may depend on each other. The tasks are processed by a limited number of workers (e.g., processes), which often balance their load via work stealing.
AMT programming environments enjoy growing popularity on single multicore computers, where they provide powerful functionalities such as dynamic task generation at runtime to facilitate the expression of irregularity, and load balancing via work stealing to improve resource utilization.
On large supercomputers, i.e., clusters of such machines, in contrast, AMT functionalities are still limited. A major hurdle for AMT deployment there is the need to combine load balancing with low communication costs. In particular, tasks should run close to their data, a property known as locality. Similarly, fault tolerance takes on crucial importance in clusters, due to their larger number of hardware components. Although AMT is potentially well-suited to provide fault tolerance, concrete algorithms and techniques are still rare.
This project aims to realize both fault tolerance and locality awareness.
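As an illustration of the AMT model described above, the following minimal Python sketch simulates work stealing with dynamic task generation on a single machine. It is not the project's library: the names `run_work_stealing` and `make_sum_task`, the round-based simulation, and the rule of stealing from the longest queue are all assumptions made for this example.

```python
from collections import deque


def make_sum_task(lo, hi):
    """Hypothetical task: sum the integers in [lo, hi), splitting large ranges."""
    def task():
        if hi - lo <= 4:
            # Small enough: compute directly and spawn no children.
            return sum(range(lo, hi)), []
        mid = (lo + hi) // 2
        # Dynamic task generation: spawn two child tasks at runtime.
        return None, [make_sum_task(lo, mid), make_sum_task(mid, hi)]
    return task


def run_work_stealing(num_workers, root_task):
    """Simulate workers in rounds; returns (leaf results, number of steals)."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root_task)
    results, steals = [], 0
    while any(queues):  # a deque is truthy while it holds tasks
        for w in range(num_workers):
            if queues[w]:
                task = queues[w].pop()  # owner takes its newest task
            else:
                # Idle worker: pick the worker with the longest queue as victim
                # (a simplification chosen for this sketch).
                victim = max(range(num_workers), key=lambda v: len(queues[v]))
                if not queues[victim]:
                    continue  # nothing to steal in this round
                task = queues[victim].popleft()  # thief takes the oldest task
                steals += 1
            value, children = task()
            if value is not None:
                results.append(value)
            queues[w].extend(children)
    return results, steals


results, steals = run_work_stealing(4, make_sum_task(0, 100))
print(sum(results), steals)
```

The owner works on the newest end of its queue while thieves take the oldest task, which in this example tends to hand the larger pieces of work to thieves. On a cluster, each steal would additionally incur communication, which is why the text stresses locality.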
As a basis for our study, we will first develop a prototypical cluster AMT programming library that captures the primary features of current single-machine environments in a simple form: dynamic task generation at runtime, global work stealing, dependent tasks (realized with the future construct), and global data accesses (to special types of variables). For this environment, we will then devise a transparent work stealing scheme that runs tasks close to their data, and provide fault tolerance for it. Both hard failures, i.e., the loss of one or multiple processes, and soft errors, which include silent data corruption that may be detected late, will be considered. Regarding locality, we consider the placement of both tasks and their data.
Methodologically, the project includes the development of novel algorithms and other techniques, their implementation in the prototypical AMT environment, and their experimental evaluation. The techniques will be developed stepwise for increasingly complex patterns of task cooperation and failure scenarios. The different techniques will be integrated with each other.
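The idea of dependent tasks realized with futures can be illustrated with a small sketch using Python's standard `concurrent.futures` module on a single machine. It is only an analogy for the planned cluster library, whose interface is not described here; the functions `square` and `add_when_ready` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Independent task."""
    return x * x


def add_when_ready(fa, fb):
    """Dependent task: waits on the futures of two other tasks."""
    return fa.result() + fb.result()


with ThreadPoolExecutor(max_workers=3) as pool:
    fa = pool.submit(square, 3)               # returns a future immediately
    fb = pool.submit(square, 4)
    fc = pool.submit(add_when_ready, fa, fb)  # depends on fa and fb
    total = fc.result()                       # block until the task graph is done

print(total)  # 25
```

The dependency is expressed purely through the futures that are passed to the dependent task, so no explicit synchronization is needed in the program.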
DFG Programme Research Grants
 
 

