Project Details

Design and Architecture for Racetrack based Hybrid Memory Systems

Subject Area Computer Architecture, Embedded and Massively Parallel Systems
Term from 2020 to 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 437232907
 
Final Report Year 2023

Final Report Abstract

The overall goal of this project was to lay the foundation for highly efficient data management on hybrid memory systems by devising memory-aware and application-aware data management strategies. With this goal in mind, various compiler, hardware, controller, and system optimizations were devised that systematically manage the diverse memories, exploiting the advantages of the component memory types in a cost-effective manner while mitigating their drawbacks. In this project, we explored different hybrid memory systems for different application scenarios, including the general-purpose, genomics, computational fluid dynamics, and machine learning domains. Exploiting application knowledge and the underlying memory architecture, we developed a range of hardware and software optimizations. Concretely, we worked on the efficient integration of the relatively new racetrack memory (RTM) with other memories at different levels of the memory hierarchy, including scratchpad memory, caches, and main memory. In particular, multi-bit storage cells in RTM are inherently sequential and thus require intelligent techniques to mitigate the performance and energy impact of shifting during data accesses.

On the hardware side, two novel RTM-based cache architectures were proposed, and their integration with DRAM-based off-chip memory was realized for general-purpose applications. The first, BlendCache, addresses the energy consumption of large caches by storing the tags in leakage-optimized multi-bit RTM cells. To overcome the performance penalty of the long shifting overhead in multi-bit RTM cells, BlendCache is equipped with a novel data mapping that exploits a program's spatial locality by assigning consecutive main memory blocks to consecutive locations in RTM. This data mapping reduces the RTM shifting overhead and makes the latency-critical tag store in leakage-optimized multi-bit RTM cells practical, thereby achieving high energy efficiency. We also found that existing cache replacement policies incur a high shift penalty for ultra-dense multi-bit RTM cells. To address this problem, we devised a shift-aware replacement policy that first calculates the shift cost of the rarely reused cache lines in a cache set and then chooses, from that group, the victim cache line that incurs the minimum shift cost. We used a hybrid STT/RTM architecture in which the latency-critical tags are stored in STT while the data is stored in RTM. In addition, we proposed a hybrid tag-data storage scheme for the DRAM LLC that reduces conflict misses by combining associative caching with LLC bypassing. It retains the benefits of associative LLCs while overcoming their limitations by reducing the tag serialization effect, the number of tag lookups, and the in-package traffic. We also proposed an algorithm-hardware co-design that exploits the data reuse in the DNA seed location filtering operation and cuts the number of memory accesses compared to the state-of-the-art.

On the controller side, we designed intelligent controllers for RTM and DRAM that interleave memory accesses across independent memory subarrays, reducing latency by pipelining accesses. We proposed a near-memory pre-alignment filter for RTM that uses a novel data layout, preshifting, and circular buffers to significantly reduce shift operations in RTM. In addition, we proposed a contention-aware DRAM controller that reduces the impact of different sources of contention by efficiently overlapping computation with memory accesses, improving both performance and energy efficiency.
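As an illustration of the shift-aware victim selection described above, the following minimal Python sketch models one cache set on a single-port racetrack. The `CacheLine` fields, the reuse threshold, and the distance-based cost model are simplifying assumptions for illustration, not the implemented policy.

```python
# Sketch of a shift-aware replacement policy for a multi-bit RTM cache set.
# Assumption: one access port per track; aligning a line costs a number of
# shifts equal to the distance between its domain and the port.
from dataclasses import dataclass

@dataclass
class CacheLine:
    tag: int
    domain_index: int    # position of the line's data along the racetrack
    reuse_counter: int   # low value = rarely reused

def shift_cost(line: CacheLine, port_position: int) -> int:
    # Shifts needed to align the line's domain with the access port.
    return abs(line.domain_index - port_position)

def select_victim(cache_set: list[CacheLine], port_position: int,
                  reuse_threshold: int = 1) -> CacheLine:
    # Step 1: restrict the candidates to rarely reused lines in the set.
    candidates = [l for l in cache_set if l.reuse_counter <= reuse_threshold]
    if not candidates:        # fall back to the whole set if none qualify
        candidates = cache_set
    # Step 2: among those candidates, evict the line with minimum shift cost.
    return min(candidates, key=lambda l: shift_cost(l, port_position))

# Example: the victim is the rarely reused line closest to the access port.
ways = [CacheLine(tag=0x1a, domain_index=6, reuse_counter=0),
        CacheLine(tag=0x2b, domain_index=1, reuse_counter=0),
        CacheLine(tag=0x3c, domain_index=2, reuse_counter=5)]
print(select_victim(ways, port_position=0).tag)   # -> 0x2b (43)
```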
On the compiler side, we proposed several techniques, including optimal and near-optimal algorithms for placing decision trees in RTM, as required for executing machine learning models on distributed devices. We further proposed layout transformations that produce RTM-friendly code and significantly reduce shift operations in RTM. At the system level, performance- and energy-efficient methods for tensor contractions were proposed for a hybrid memory system combining an RTM-based scratchpad memory (SPM) with DRAM-based off-chip memory. Compiler optimizations such as data layout transformations were paired with architectural optimizations such as prefetching and preshifting to reduce the shifting overhead in RTM, while off-chip memory optimizations such as memory access ordering and data mapping were employed to reduce contention in the off-chip memory. We evaluated our optimizations across the different application domains and demonstrated their advantages in terms of performance and energy efficiency compared to state-of-the-art memory systems.
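To illustrate why RTM-friendly layout transformations reduce shift operations, the following sketch compares the shift count of a row-wise tile traversal under two data placements on a single-port track. The cost model (shifts equal to the distance between consecutively accessed domains) and the 4x4 example are illustrative assumptions rather than the project's actual compiler transformation.

```python
# Sketch: shift cost of an access sequence under a given data placement.
def shift_count(access_sequence, placement):
    """Total shifts for an access sequence, where `placement` maps each
    data element to a domain position on the track."""
    port = 0
    shifts = 0
    for element in access_sequence:
        target = placement[element]
        shifts += abs(target - port)   # move the track so the port aligns
        port = target
    return shifts

# Example access pattern: a row-wise traversal of a 4x4 tensor tile.
accesses = [(i, j) for i in range(4) for j in range(4)]

# A column-major placement interacts badly with the row-wise traversal...
column_major = {(i, j): j * 4 + i for i in range(4) for j in range(4)}
# ...while a layout transformed to match the traversal keeps shifts minimal.
row_major = {(i, j): i * 4 + j for i in range(4) for j in range(4)}

print("column-major layout:", shift_count(accesses, column_major), "shifts")
print("row-major layout:   ", shift_count(accesses, row_major), "shifts")
```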

Publications

  • ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering. In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), vol. 10, no. 3, pp. 1464-1475
    F. Hameed, A.A. Khan, and J. Castrillon
    (See online at https://doi.org/10.1109/TETC.2021.3093840)
  • BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), vol. 41, no. 12, pp. 5288-5298
    F. Hameed and J. Castrillon
    (See online at https://doi.org/10.1109/TCAD.2022.3161198)
  • DNA Pre-alignment Filter using Processing Near Racetrack Memory. In IEEE Computer Architecture Letters, vol. 21, no. 2, pp. 53-56
    F. Hameed, A.A. Khan, S. Ollivier, A.K. Jones, and J. Castrillon
    (See online at https://doi.org/10.1109/LCA.2022.3194263)
  • ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees. In IEEE Transactions on Computers, 14 pp
    C. Hakert, A.A. Khan, K-H. Chen, F. Hameed, J. Castrillon, and J-J. Chen
    (See online at https://doi.org/10.1109/TC.2022.3197094)
 
 
