Project Details

REFOCuS: Robust Estimation for Cell- and Casewise Contamination in Sparse Regression Models

Subject Area Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering
Term from 2019 to 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 425884435
 
Final Report Year 2023

Final Report Abstract

With the rapid advances in data science and signal processing, there is an ever-increasing need for reliable and robust information extraction and processing. Regression analysis is one of the most widely used techniques for investigating and modeling the relations between variables, with applications in engineering, economics, biomedicine, the social sciences, and other fields. In recent years, however, data science has quickly expanded the boundaries of signal processing and statistical learning beyond their accustomed domains. The DFG project REFOCuS develops advanced robust regression methods that provide statistical guarantees even in the challenging case of high-dimensional, outlier-contaminated data. The combination of small sample sizes and high dimensionality is the worst-case setting both for classical robust methods, which are derived from asymptotic arguments (i.e., the sample size going to infinity), and for data-driven methods (e.g., deep learning), which assume an abundance of training data. Coming revolutions, e.g., in biotechnology, nevertheless demand new learning methods that operate in such a high-dimensional regime and provide non-asymptotic statistical robustness guarantees. The most important result of this project is the development of the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables and thus achieving a high true positive rate (TPR). To this end, a completely new learning framework has been developed that fuses the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite-sample proof of the FDR control property, based on martingale theory, is provided.
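For orientation, the two rates named in the abstract are commonly defined as shown here. The notation is an assumption introduced for illustration and is not taken from the report: $\mathcal{A}$ denotes the set of truly active variables, $\widehat{\mathcal{A}}$ the set of selected variables, and $\alpha \in [0, 1]$ the user-defined target level.

```latex
\mathrm{FDR} = \mathbb{E}\!\left[ \frac{|\widehat{\mathcal{A}} \setminus \mathcal{A}|}{\max\{1, |\widehat{\mathcal{A}}|\}} \right],
\qquad
\mathrm{TPR} = \mathbb{E}\!\left[ \frac{|\widehat{\mathcal{A}} \cap \mathcal{A}|}{\max\{1, |\mathcal{A}|\}} \right],
\qquad
\text{FDR control at level } \alpha:\ \mathrm{FDR} \le \alpha .
```

In words, the FDR is the expected share of false selections among all selections, and the TPR is the expected share of truly active variables that are recovered.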
We were able to prove, under mild conditions, that the dummies can be sampled from any univariate probability distribution with finite expectation and variance. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. Two open-source R software packages were developed and published on CRAN. The outcome of this project has led to three research grants that all build upon the T-Rex methods developed in this DFG project: the ERC Starting Grant ScReeningData; the project curAIsig, which is part of the BMBF Cluster for Future curATime; and an innovation project within the LOEWE Center emergenCITY. Applications of the T-Rex framework in biomedicine, robotics, and finance are currently being explored.
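The idea of fusing several early-terminated random experiments on original plus dummy predictors can be illustrated with a small toy sketch. This is not the T-Rex selector itself and carries no FDR guarantee: it swaps in a plain greedy forward selection as the per-experiment solver, and it fixes the number of dummies, the stopping point, and the voting threshold by hand rather than calibrating them. All function names, parameters, and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def forward_select_until_dummies(XD, y, n_orig, max_dummies):
    """Greedy forward selection (a stand-in solver, assumed for illustration).

    Columns [0, n_orig) are original predictors, the rest are dummies.
    The run terminates early once `max_dummies` dummies have entered.
    Returns the original-variable indices selected before termination.
    """
    n, p_total = XD.shape
    active, dummies_in = [], 0
    residual = y.copy()
    while len(active) < min(n - 1, p_total):
        corr = np.abs(XD.T @ residual)
        if active:
            corr[active] = -np.inf  # never pick a column twice
        j = int(np.argmax(corr))
        active.append(j)
        if j >= n_orig:
            dummies_in += 1
            if dummies_in >= max_dummies:
                break
        # Refit on the active set and update the residual.
        coef, *_ = np.linalg.lstsq(XD[:, active], y, rcond=None)
        residual = y - XD[:, active] @ coef
    return [j for j in active if j < n_orig]


def dummy_voting_selector(X, y, n_experiments=20, n_dummies=None,
                          max_dummies=1, vote_threshold=0.75, seed=0):
    """Fuse several random experiments by relative selection frequency.

    The threshold and stopping point are hand-picked here (assumption);
    this toy version does not calibrate them to a target FDR.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_dummies = p if n_dummies is None else n_dummies
    votes = np.zeros(p)
    y_centered = y - y.mean()
    for _ in range(n_experiments):
        # Fresh dummies per experiment; standard normal is one admissible choice.
        D = rng.standard_normal((n, n_dummies))
        XD = np.hstack([X, D])
        XD = (XD - XD.mean(axis=0)) / XD.std(axis=0)
        chosen = forward_select_until_dummies(XD, y_centered, p, max_dummies)
        if chosen:
            votes[chosen] += 1
    rel_freq = votes / n_experiments
    return np.flatnonzero(rel_freq > vote_threshold), rel_freq


# Toy data (hypothetical): 300 samples, 500 predictors, 5 truly active.
rng = np.random.default_rng(42)
n, p = 300, 500
true_support = np.arange(5)
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[true_support] = 3.0
y = X @ beta + rng.standard_normal(n)

selected, rel_freq = dummy_voting_selector(X, y)
n_true = len(np.intersect1d(selected, true_support))
fdp = (len(selected) - n_true) / max(1, len(selected))  # false discovery proportion
tpp = n_true / len(true_support)                        # true positive proportion
print("selected:", selected, "FDP:", fdp, "TPP:", tpp)
```

The sketch shows why dummies are useful: a dummy entering the model signals that the solver has started to pick up noise, so stopping at that point and voting across experiments keeps variables that consistently beat the dummies.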
