REFOCuS: Robust Estimation for Cell- and Casewise Contamination in Sparse Regression Models
Final Report Abstract
With the rapid advances in data science and signal processing, there is an ever-increasing need for reliable and robust information extraction and processing. Regression analysis is one of the most widely used techniques for investigating and modeling the relations between variables, with many applications in engineering, economics, biomedicine, the social sciences, and other fields. However, in recent years, data science has quickly expanded the boundaries of signal processing and statistical learning beyond their accustomed domains. The DFG project REFOCuS develops advanced robust regression methods that provide statistical guarantees, even for the challenging case of high-dimensional and outlier-contaminated data. The combination of small sample sizes and high-dimensional data is the worst-case setting, both for classical robust methods that are derived from asymptotic arguments (i.e., sample size going to infinity) and for data-driven methods (e.g., deep learning) that assume an abundance of training data. Coming revolutions, for example in biotechnology, nevertheless demand new learning methods that operate in such a high-dimensional regime and provide non-asymptotic statistical robustness guarantees.
The most important result of this project is the development of the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables, thereby achieving a high true positive rate (TPR). A completely new learning framework has been developed that fuses the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite-sample proof of the FDR control property, based on martingale theory, is provided.
We were able to prove under mild conditions that the dummies can be sampled from any univariate probability distribution with finite expectation and variance. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. Two open-source R software packages were developed and published on CRAN. The outcome of this project has led to three research grants, namely the ERC Starting Grant ScReeningData, the project curAIsig, which is part of the BMBF Cluster for Future curATime, and an innovation project within the LOEWE Center emergenCITY, all of which build on the T-Rex methods that were developed in this DFG project. Applications of the T-Rex framework in biomedicine, robotics, and finance are currently being explored.
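The idea of running several early-terminated random experiments on the original predictors plus dummy predictors can be illustrated with a short Python sketch. This is a simplified illustration and not the T-Rex selector itself: it swaps in a plain greedy forward selection (refit by least squares) for the project's forward selection method, draws the dummies from a standard normal distribution, and uses a fixed voting threshold, so it makes no attempt at FDR control. All function names and parameters (`early_terminated_selection`, `relative_occurrences`, `fdp_and_tpp`, `max_dummies`, the threshold `0.5`) are assumptions introduced here for illustration.

```python
import numpy as np


def early_terminated_selection(X, y, num_dummies, max_dummies, rng):
    """One random experiment: greedy forward selection on [X, dummies].

    The search stops as soon as `max_dummies` dummy predictors have entered
    the model. Returns the indices of the original predictors selected so far.
    """
    n, p = X.shape
    dummies = rng.standard_normal((n, num_dummies))  # assumed dummy distribution
    Z = np.hstack([X, dummies])
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)         # standardize columns
    y_centered = y - y.mean()
    residual = y_centered.copy()
    in_model = np.zeros(Z.shape[1], dtype=bool)
    active, dummies_in = [], 0
    max_steps = min(n - 1, Z.shape[1])
    while dummies_in < max_dummies and len(active) < max_steps:
        corr = np.abs(Z.T @ residual)
        corr[in_model] = -np.inf                     # never re-select a column
        j = int(np.argmax(corr))
        in_model[j] = True
        active.append(j)
        if j >= p:                                   # column j is a dummy
            dummies_in += 1
        coef, *_ = np.linalg.lstsq(Z[:, active], y_centered, rcond=None)
        residual = y_centered - Z[:, active] @ coef
    return [j for j in active if j < p]


def relative_occurrences(X, y, num_experiments=20, max_dummies=1, seed=0):
    """Fraction of experiments in which each original predictor was selected."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    counts = np.zeros(p)
    for _ in range(num_experiments):
        selected = early_terminated_selection(X, y, p, max_dummies, rng)
        counts[np.asarray(selected, dtype=int)] += 1
    return counts / num_experiments


def fdp_and_tpp(selected, true_support):
    """False discovery proportion and true positive proportion of one selection."""
    selected, true_support = set(selected), set(true_support)
    fdp = len(selected - true_support) / max(len(selected), 1)
    tpp = len(selected & true_support) / max(len(true_support), 1)
    return fdp, tpp


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p, support = 100, 300, [0, 1, 2, 3, 4]        # fewer samples than variables
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[support] = 3.0
    y = X @ beta + rng.standard_normal(n)
    phi = relative_occurrences(X, y)
    chosen = np.flatnonzero(phi > 0.5)               # fixed, uncalibrated threshold
    print("selected:", chosen.tolist())
    print("FDP, TPP:", fdp_and_tpp(chosen.tolist(), support))
```

Averaging the false discovery proportion and the true positive proportion over many simulated data sets gives empirical estimates of the FDR and TPR referred to above.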
Publications
-
A robust adaptive Lasso estimator for the independent contamination model. Signal Processing, 174, 107608.
Machkour, Jasin; Muma, Michael; Alt, Bastian & Zoubir, Abdelhak M.
-
False Discovery Rate Control for Grouped Variable Selection in High-Dimensional Linear Models Using the T-Knock Filter. 2022 30th European Signal Processing Conference (EUSIPCO) (2022-08-29), 892-896.
Machkour, Jasin; Muma, Michael & Palomar, Daniel P.
-
tlars: The T-LARS Algorithm: Early-Terminated Forward Variable Selection. CRAN: Contributed Packages (2022-07-15).
Machkour, Jasin; Tien, Simon; Palomar, Daniel P. & Muma, Michael
-
TRexSelector: T-Rex Selector: High-Dimensional Variable Selection & FDR Control. CRAN: Contributed Packages (2022-08-17).
Machkour, Jasin; Tien, Simon; Palomar, Daniel P. & Muma, Michael
-
The terminating-random experiments selector: Fast high-dimensional variable selection with false discovery rate control. Signal Processing, 231, 109894.
Machkour, Jasin; Muma, Michael & Palomar, Daniel P.
