Analysis of Dataset Shifts in Mobile Malware
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Final Report Abstract
In recent years, mobile devices have become a popular target for malware authors, resulting in a steady increase in new variants of mobile malware. Unfortunately, traditional solutions for detecting malware do not provide adequate protection against this threat, as they generally rely on manually crafted detection patterns. Researchers have therefore started to explore whether machine learning techniques can be used to derive effective detection patterns automatically. As a result, many learning-based approaches for detecting mobile malware have been proposed in recent years, showing promising results in laboratory settings. Unfortunately, recent research has shown that the detection performance of these learning-based approaches is often overestimated. A key reason for this overestimation is that the evaluation of learning-based approaches commonly assumes that the underlying data distribution does not change over time. However, this assumption generally does not hold for mobile malware. Instead, the distribution changes continuously over time, a phenomenon known as "dataset shift" in learning theory. As a consequence, the detection performance of current learning approaches decreases drastically in real-world settings. Although some factors contributing to dataset shifts in this domain are already known, the exact causes have so far remained largely unclear. The goal of this research project was to develop novel techniques for analyzing the root causes of dataset shifts in mobile applications and to use the gained knowledge to improve learning-based detection systems. As a result, in collaboration with researchers from University College London, King's College London, and Technische Universität Berlin, a framework has been developed that can identify and provide insights into dataset shifts in evaluation datasets using explainable AI (XAI) techniques.
These insights can, in turn, be used to improve the detection performance of learning-based methods and to uncover possible biases in evaluation datasets. As a further outcome of this research, we identified and systematized additional pitfalls that can lead to an overestimation of the capabilities of machine learning techniques. The corresponding publication received the Distinguished Paper Award at the renowned USENIX Security Symposium in 2022.
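The evaluation pitfall described above can be illustrated with a minimal, self-contained sketch. The data below is synthetic and not from the project: the "malicious" feature distribution drifts over (simulated) time, and a classifier evaluated with the common random train/test split appears stronger than the same classifier evaluated with a temporally consistent split, where training samples strictly precede test samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: sample index corresponds to time. The class-conditional
# distribution of the "malicious" class drifts as time progresses.
n = 4000
t = np.linspace(0.0, 1.0, n)                 # normalized timestamp per sample
y = rng.integers(0, 2, n)                    # 0 = benign, 1 = malicious
X = rng.normal(size=(n, 2))
X[y == 1, 0] += 2.0 - 3.0 * t[y == 1]        # malicious mean drifts from +2 to -1

# (a) Random split: mixes past and future samples (the common pitfall).
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X, y, test_size=0.25, random_state=0)
acc_random = LogisticRegression().fit(Xa_tr, ya_tr).score(Xa_te, ya_te)

# (b) Temporal split: train strictly on the past, test on the future.
split = int(0.75 * n)
acc_temporal = LogisticRegression().fit(X[:split], y[:split]).score(X[split:], y[split:])

print(f"random split accuracy:   {acc_random:.2f}")
print(f"temporal split accuracy: {acc_temporal:.2f}")
```

Under drift, the random split leaks future information into training and thus overestimates real-world performance; the temporal split exposes the degradation the abstract refers to.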
Publications
- D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro & K. Rieck. Dos and Don'ts of Machine Learning in Computer Security. In Proc. of USENIX Security Symposium, 2022.
- V. Wesselkamp, K. Rieck, D. Arp & E. Quiring. Misleading Deep-Fake Detection with GAN Fingerprints. 2022 IEEE Security and Privacy Workshops (SPW), 59-65. IEEE.
- S. Czybik, D. Arp & K. Rieck. Quantifying the Risk of Wormhole Attacks on Bluetooth Contact Tracing. Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, 264-275. ACM.
- T. Chow, Z. Kan, L. Linhardt, L. Cavallaro, D. Arp & F. Pierazzi. Drift Forensics of Malware Classifiers. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 197-207. ACM.
- D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro & K. Rieck. Lessons Learned on Machine Learning for Computer Security. IEEE Security & Privacy, 21(5), 72-77.
- S. Agarwal & D. Arp. Return of a New Version of Drinik Android Malware Targeting Indian Taxpayers.
- T. Chow, Z. Kan, L. Linhardt, L. Cavallaro, D. Arp & F. Pierazzi. Code repository of the drift forensics project.
