Project Details

Resource-Efficient Deep Models for Embedded Systems

Subject Area: Computer Architecture, Embedded and Massively Parallel Systems; Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term: 2016 to 2020
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 285966169

Final Report Year: 2022

Final Report Abstract

Machine Learning (ML) is among the most promising strategies for addressing learning and reasoning under uncertainty in Artificial Intelligence (AI). The overwhelming majority of recent advances in AI stem from Deep Neural Networks (DNNs) trained on big data, and today's deep learning algorithms dramatically advance the state of the art in accuracy for the vast majority of AI tasks. Examples include image and speech processing, with applications as broad as robotics, medicine, autonomous navigation, and recommender systems. Still, the main application domain of ML today is the "virtual world". To address the requirements of upcoming applications such as autonomous navigation for personal transport and delivery services, a transition of ML into the "wild" is required. This transition demands processing complex ML models close to the point of interest, which usually means limited compute capability, unreliable online connectivity, and limited battery life. It therefore requires closing the gap between the tremendous compute requirements of such ML models and the available hardware capability.

The present project pursued a two-fold approach to this problem. On the one hand, methods were developed that compress existing ML models so that compute and memory requirements are substantially reduced, making deployment on resource-constrained mobile devices feasible. Examples include work on quantization, which replaces high-precision floating-point operators with low-precision fixed-point ones, and pruning, which introduces sparsity into the models by training certain model weights towards zero. On the other hand, new ML models were investigated that have lower compute and memory requirements or already include built-in support for compression. One example is a Bayesian network classifier whose structure is learned during training so that the resulting models are as small as possible. Another example is a Bayesian neural network in which scalar weights are replaced by distributions, from which well-chosen quantized values are later sampled.
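To make the two compression ideas concrete, the following is a minimal NumPy sketch of post-training uniform quantization (mapping float weights onto an 8-bit fixed-point grid) and magnitude-based pruning (zeroing the smallest weights). This is an illustration only, not the project's actual implementation; the bit width, sparsity level, and function names are assumptions chosen for the example.

    import numpy as np

    def quantize_uniform(w, num_bits=8):
        """Map float weights to signed fixed-point integers on a uniform grid."""
        qmax = 2 ** (num_bits - 1) - 1               # e.g. 127 for 8 bits
        scale = np.max(np.abs(w)) / qmax             # one scale factor per tensor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale                              # w is approximated by q * scale

    def prune_by_magnitude(w, sparsity=0.9):
        """Zero out the smallest-magnitude weights, keeping a (1 - sparsity) fraction."""
        threshold = np.quantile(np.abs(w), sparsity)
        mask = np.abs(w) >= threshold
        return w * mask, mask

    # Toy usage on a random weight matrix
    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_uniform(w)
    w_sparse, mask = prune_by_magnitude(w)
    print("max quantization error:", np.abs(w - q.astype(np.float32) * scale).max())
    print("fraction of remaining nonzeros:", mask.mean())

Both transformations reduce memory footprint directly (8-bit storage, sparse storage of nonzeros) and enable cheaper fixed-point or sparse arithmetic on embedded targets.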
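The idea of sampling quantized weights from a Bayesian neural network can be sketched in a similarly simplified way: each weight is modeled by a distribution (here a Gaussian with per-weight mean and standard deviation, an assumption made purely for illustration), and a drawn value is snapped to the nearest point of a small discrete grid (a ternary grid in this toy example). This is a hedged illustration of the concept described above, not the project's concrete algorithm.

    import numpy as np

    def sample_quantized_weights(mu, sigma, levels, rng=None):
        """Draw one realisation from per-weight Gaussians and round each
        sample to the nearest value of a discrete quantization grid."""
        rng = rng or np.random.default_rng(0)
        sample = rng.normal(mu, sigma)                        # one draw per weight
        idx = np.argmin(np.abs(sample[..., None] - levels), axis=-1)
        return levels[idx]                                    # quantized weight tensor

    # Toy usage: ternary grid {-1, 0, +1} for a small weight matrix
    mu = np.random.randn(4, 4) * 0.5
    sigma = np.full((4, 4), 0.1)
    levels = np.array([-1.0, 0.0, 1.0])
    print(sample_quantized_weights(mu, sigma, levels))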
While the project produced a variety of results, we would like to briefly highlight a few major insights. It was surprising to see that ARM processors can be competitive with specialized processors such as GPUs and FPGAs if the software architecture is well chosen. This does not mean that ARM processors are ultimately faster or reach higher accuracy, but the gap between these processors can be substantially reduced if a good compression method is chosen. Essentially, this means that ubiquitously available ARM processors can be leveraged more than previously thought. While large parts of the community believe that FPGAs are the best choice for machine learning, this is only partially true. Comprehensive experiments with different compression techniques on different processors, all within a similar power budget of about 5 Watts, have shown that GPUs ultimately allow for the highest accuracy because they support the largest models. In contrast, FPGAs cannot support such large models, but they excel in throughput if they can hold a model on-chip. In summary, the need for highest accuracy is today best served by GPUs, while the highest throughput is usually achieved with FPGAs.

General-purpose processors such as ARM are usually a trade-off in between: while they can reach the top accuracy of GPUs (given the right software architecture), they always lag behind GPUs and FPGAs in throughput. Still, their ubiquitous availability can make them an important candidate. Finally, we believe that FPGAs are in principle excellently suited for machine learning, but they lack the memory bandwidth required to support models larger than their on-chip capacity. It will be interesting to see how the computing landscape changes if, for instance, 3D die stacking allows FPGA vendors to overcome this bandwidth limitation. We also note that all of the above statements only hold for CMOS-based processors and for ML models based on the "Deep Learning" paradigm, i.e. deep convolutional neural networks. Innovation in computer architecture as well as in machine learning might change this situation substantially.

