Dynamic Redundancy for Many-core Systems
Final Report Abstract
Safety-critical systems require redundancy for fault detection and fault tolerance. Depending on the application mode or execution state, different types of redundancy are required: Dual Modular Redundancy (DMR) for fail-safe modes and Triple Modular Redundancy (TMR) or even higher redundancy for fail-operational modes. Future safety-critical systems will switch between applications of different criticality and fault-tolerance demands, which requires more dynamic redundancy modes. An automotive example is switching from a parking assistant to piloted driving, which has much higher safety demands but is executed on the same embedded multi-core processor. The ARMA (Adaptively Redundant Multicore Processor) project investigated dynamic switching between redundancy modes in response to external causes. We call this dynamic redundancy. We investigated dynamic redundancy switching between hardware modes (no redundancy, DMR, and TMR), between the corresponding software modes, and between combinations of both. Our target hardware platform is a tile-based multicore processor that was extended with fault-tolerant hardware to form an Adaptively Redundant Processor. Dynamic redundancy switching is controlled by a Redundancy Management Unit per tile and by a Network-on-Chip (NoC) managed redundancy enhancement, which enables scalable designs by introducing NoC voting capabilities. The multicore processor was prototyped on an FPGA. On the software side, we designed an actor-based dataflow execution model called RAPID, which executes tasks, called dataflow actors, in parallel once their input data is available. RAPID is based on Apache Spark’s Resilient Distributed Datasets (RDDs) and adapted for use in embedded systems. We enhanced the dataflow actor model to run actors redundantly to reach DMR or TMR modes. Because dataflow actors are free of side effects, they can easily be re-executed in case of failure.
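The redundant execution of side-effect-free actors described above can be sketched as follows. This is a minimal illustration in Python and not RAPID's actual interface: the names `Mode`, `run_redundant`, and `make_flaky`, as well as the retry limit, are assumptions made for this example. Under DMR a mismatch can only be detected, so the actor is re-executed; under TMR a majority vote masks a single faulty result.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    NONE = 1  # single execution, faults go undetected
    DMR = 2   # two replicas: a mismatch is detected but cannot be attributed
    TMR = 3   # three replicas: a majority vote masks one faulty result


def run_redundant(actor, inputs, mode, max_retries=3):
    """Run a pure actor `mode.value` times and accept a strict-majority result.

    Results must be hashable. Because the actor has no side effects,
    re-running it after a detected fault is always safe.
    """
    needed = mode.value // 2 + 1  # NONE: 1 of 1, DMR: 2 of 2, TMR: 2 of 3
    for _ in range(max_retries + 1):
        results = [actor(inputs) for _ in range(mode.value)]
        value, votes = Counter(results).most_common(1)[0]
        if votes >= needed:
            return value
        # No majority: fall through and re-execute all replicas.
    raise RuntimeError("no agreeing result within the retry limit")


def make_flaky(actor, faulty_calls):
    """Fault injection for the demo: corrupt the result of the listed calls."""
    count = 0

    def wrapped(inputs):
        nonlocal count
        count += 1
        result = actor(inputs)
        return result + 1 if count in faulty_calls else result

    return wrapped


def square_sum(xs):
    return sum(x * x for x in xs)


data = (1, 2, 3)  # immutable input, shared safely by all replicas
tmr_result = run_redundant(make_flaky(square_sum, {2}), data, Mode.TMR)
dmr_result = run_redundant(make_flaky(square_sum, {1}), data, Mode.DMR)
unprotected = run_redundant(make_flaky(square_sum, {1}), data, Mode.NONE)
```

In this sketch the TMR run outvotes the corrupted second replica, the DMR run detects the mismatch and succeeds on re-execution, and the unprotected run silently returns the corrupted value.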
To support dynamic redundancy, we devised a horizontal sectioning of the dataflow graph that allows the redundancy mode to be changed at each section entry. To perform the evaluations, we developed a cross-layer system architecture, called Adaptively Redundant Many-core Architecture (ARMA), which incorporates dynamic redundancy on the software and hardware levels. ARMA is intended to combine high-performance parallel execution with safety mechanisms such as fault tolerance and timing predictability to cover the different requirements of a broad range of embedded domains, such as automotive, avionics, and space. We investigated dynamic redundancy switching between software and hardware modes as well as timing behavior and real-time schedulability in case of failures. The dynamic redundancy switching concept was proven on an FPGA-based ARMA multi-core processor, yet the insights gained and techniques developed in this project should be applicable to future MPSoC architectures in general. We have shown the feasibility of a holistic hardware/software co-design approach to achieve high-performance fault-tolerant application execution on a scalable many-core platform. With a coarse-grained dataflow paradigm, the resulting side-effect-free dataflow actors are small enough to be easily analyzable. We used immutable datasets to avoid complex and costly NoC re-transmission and thus keep re-execution costs small and predictable. We have successfully demonstrated the applicability and continuous integration of an adaptive and reconfigurable hardware platform for fault-aware systems. We were able to realize adaptive fault handling in software, in hardware, and in combinations of both.
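The horizontal sectioning can be pictured as a list of sections whose redundancy level is read once at each section entry. The following Python sketch is illustrative only and rests on several assumptions: the graph is reduced to a linear sequence of sections, every actor in a section consumes the full output tuple of the previous section, and the names `vote`, `run_graph`, and `replicas_at_entry` are hypothetical.

```python
from collections import Counter


def vote(results):
    """Return the strict-majority value, or raise if the replicas disagree."""
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def run_graph(sections, source, replicas_at_entry):
    """Execute the sections in order, fixing the replica count per section.

    sections: list of lists of pure actors, each taking a tuple of inputs.
    replicas_at_entry: callable(section_index) -> 1, 2 or 3; a hypothetical
    hook standing in for the external cause that triggers a mode switch.
    """
    data = tuple(source)  # immutable, so a replica cannot corrupt its inputs
    trace = []
    for index, actors in enumerate(sections):
        replicas = replicas_at_entry(index)  # the mode may change only here
        trace.append(replicas)
        data = tuple(
            vote([actor(data) for _ in range(replicas)]) for actor in actors
        )
    return data, trace


# Section 0 runs without redundancy, section 1 runs with three replicas.
sections = [[sum, max], [lambda d: d[0] * d[1]]]
result, trace = run_graph(sections, (1, 2, 3), lambda i: 1 if i == 0 else 3)
```

Because each section hands over an immutable tuple, a section can be re-executed from its entry data without re-fetching anything from earlier sections, which is the property the abstract relies on to keep re-execution costs predictable.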
Publications
-
“A Functional Programming Model for Embedded Dataflow Applications”. In: 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). Vol. 2. July 2019, pp. 646–651
Christoph Kühbacher, Christian Mellwig, Florian Haas, and Theo Ungerer
-
“A Network on Chip Adapter for Real-Time and Safety-Critical Applications”. In: 2019 32nd IEEE International System-on-Chip Conference (SOCC). Sept. 2019, pp. 39–44
Fabian Kempf, Nidhi Anantharajaiah, Leonard Masing, and Jürgen Becker
-
“An Adaptive Lockstep Architecture for Mixed-Criticality Systems”. In: 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 2021
Fabian Kempf, Thomas Hartmann, Steffen Baehr, and Juergen Becker