Project Details

Dynamic Redundancy for Many-core Systems

Subject Area Computer Architecture, Embedded and Massively Parallel Systems
Term from 2017 to 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 337312426
 
Final Report Year 2021

Final Report Abstract

Safety-critical systems require redundancy for fault detection and fault tolerance. Depending on the application mode or execution state, different types of redundancy are required: Dual Modular Redundancy (DMR) for fail-safe modes and Triple Modular Redundancy (TMR) or even higher redundancy for fail-operational modes. Future safety-critical systems will switch between applications with different criticality and fault-tolerance demands, which requires more dynamic redundancy modes. An automotive example is switching from a parking assistant to piloted driving, which has much higher safety demands but runs on the same embedded multi-core processor. The ARMA (Adaptively Redundant Multicore Processor) project investigated dynamic switching between redundancy modes in response to external causes. We call this dynamic redundancy. We investigated dynamic redundancy switching between hardware modes (no redundancy, DMR, and TMR), between the corresponding software modes, and between combinations of both. Our target hardware platform is a tile-based multicore processor that was enhanced with fault-tolerant hardware to form an Adaptively Redundant Processor. Dynamic redundancy switching is controlled by a Redundancy Management Unit per tile and by a Network-on-Chip (NoC) managed redundancy enhancement that enables scalable designs by introducing NoC voting capabilities. The multicore processor was prototyped on an FPGA. On the software side, we designed an actor-based dataflow execution model called RAPID, which executes tasks, called dataflow actors, in parallel depending on the availability of their input data. RAPID is based on Apache Spark's Resilient Distributed Datasets (RDDs) and adapted for use in embedded systems. We enhanced the dataflow actor model to run actors redundantly in DMR or TMR mode. Because dataflow actors are free of side effects, they can easily be re-executed in case of failure.
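The redundant execution of side-effect-free actors can be illustrated with a minimal sketch. It is not RAPID's actual interface: the names `Mode`, `REQUIRED_VOTES`, `Mismatch`, and `run_redundant`, as well as the retry policy, are assumptions made purely for illustration, and the replicas run sequentially here rather than on separate cores. In DMR both replicas must agree (fault detection), in TMR two out of three suffice (fault masking), and a disagreement leads to re-execution on the unchanged inputs.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    """Hypothetical redundancy modes; the value is the number of replicas."""
    NONE = 1
    DMR = 2
    TMR = 3


# Votes needed to accept a result: DMR needs both replicas, TMR a 2-of-3 majority.
REQUIRED_VOTES = {Mode.NONE: 1, Mode.DMR: 2, Mode.TMR: 2}


class Mismatch(Exception):
    """Raised when the replicas still disagree after all retries."""


def run_redundant(actor, inputs, mode, max_retries=1):
    """Run a side-effect-free actor `mode.value` times and vote on the results.

    Assumes results are hashable and inputs are immutable, so that
    re-executing the actor after a disagreement is always safe.
    """
    for _ in range(max_retries + 1):
        results = [actor(*inputs) for _ in range(mode.value)]
        value, votes = Counter(results).most_common(1)[0]
        if votes >= REQUIRED_VOTES[mode]:
            return value
        # No sufficient agreement: fall through and re-execute all replicas.
    raise Mismatch(f"no agreement in {mode.name} mode after {max_retries} retries")
```

For example, `run_redundant(lambda x: x * 2, (21,), Mode.TMR)` runs the actor three times and returns the majority value. Because the actor has no side effects, the retry loop needs no rollback of state.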
To support dynamic redundancy, we devised a horizontal sectioning of the dataflow graph that allows the redundancy mode to be changed at each section entry. For the evaluations, we developed a cross-layer system architecture, called Adaptively Redundant Many-core Architecture (ARMA), that incorporates dynamic redundancy at the software and hardware levels. ARMA is intended to combine high-performance parallel execution with safety mechanisms such as fault tolerance and timing predictability, covering the different requirements of a broad range of embedded domains such as automotive, avionics, and space. We investigated dynamic redundancy switching between software and hardware modes as well as timing behavior and real-time schedulability in case of failures. The dynamic redundancy switching concept was demonstrated on an FPGA-based ARMA multi-core processor, yet the insights gained and techniques developed in this project should be applicable to future MPSoC architectures in general. We have shown the feasibility of a holistic hardware/software co-design approach to achieve high-performance fault-tolerant application execution on a scalable many-core platform. With a coarse-grained dataflow paradigm, the resulting side-effect-free dataflow actors are small enough to be easily analyzable. We used immutable datasets to avoid complex and costly NoC re-transmission, which keeps re-execution costs small and predictable. We have successfully demonstrated the applicability and continuous integration of an adaptive and reconfigurable hardware platform for fault-aware systems. We were able to realize adaptive fault handling in software, in hardware, and in combinations of both.
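The sectioning idea, in which the redundancy mode may change only at a section entry, can be sketched as follows. This is an illustrative model under stated assumptions, not the project's implementation: the function `run_sections`, the `REPLICAS` table, the `mode_at_entry` callback, and the representation of a section as a plain list of actors are all hypothetical, and datasets are modeled as immutable tuples.

```python
from collections import Counter

# Hypothetical mapping from mode name to number of replicas per actor.
REPLICAS = {"none": 1, "dmr": 2, "tmr": 3}


def run_sections(sections, data, mode_at_entry):
    """Execute a dataflow graph that has been cut into consecutive sections.

    `sections` is a list of lists of side-effect-free actors, `data` is an
    immutable (hashable) dataset, and `mode_at_entry(index)` returns the
    redundancy mode requested when section `index` is entered. The mode is
    sampled once per section and stays fixed until the next section entry.
    """
    trace = []
    for index, section in enumerate(sections):
        mode = mode_at_entry(index)  # the only point where the mode may change
        trace.append(mode)
        for actor in section:
            results = [actor(data) for _ in range(REPLICAS[mode])]
            value, votes = Counter(results).most_common(1)[0]
            if votes * 2 <= len(results):  # no strict majority among replicas
                raise RuntimeError(f"replica mismatch in section {index}")
            data = value
    return data, trace


# Usage: section 0 runs without redundancy, section 1 in TMR.
sections = [
    [lambda d: tuple(x + 1 for x in d)],
    [lambda d: (sum(d),)],
]
modes = ["none", "tmr"]
result, trace = run_sections(sections, (1, 2, 3), lambda i: modes[i])
```

Sampling the mode only at section entries keeps every actor inside a section under one consistent redundancy level, which is one plausible reason for restricting switches to section boundaries.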
