Multicollinearity in the statistical genomics era: Proposals to account for dependencies between molecular covariates with application to animal breeding
Final Report Abstract
Tens of thousands of molecular markers are often used as predictor variables in a linear regression model to evaluate a performance or health trait in livestock. On the one hand, extensive genetic information enables the identification of genome regions that are causative for trait expression; on the other hand, it leads to high model dimensions that challenge any statistical approach. In a typical genomic evaluation, there are many more predictor variables than observations, which causes linear dependencies among predictors (called multicollinearity). Because markers lie close together on the genome, linkage and linkage disequilibrium (LD) between markers add a biologically justified source of dependency. A penalised or grouped penalised regression approach provides a framework to account for multicollinearity adequately. First, we provide an efficient implementation of commonly used approaches, namely lasso, group lasso and sparse-group lasso. In contrast to previous work, our implementation is based on proximal gradient descent for numerically solving the optimisation problem, which results in good computational performance and high accuracy in predicting the outcome. The most promising approach for genomic evaluations, the sparse-group lasso, was designed to identify groups of markers that are associated with trait expression and to provide a sparse solution of regression coefficients. Groups of predictors need to be specified in advance. The extent of dependence between markers, which depends on the family design in a breeding population, helps to group predictor variables. We further developed the sparse-group lasso approach and proposed a “fitted” variant in which the penalty term affects the single regression coefficients and groups of fitted values. With this variant, we aim to select the animals with extreme breeding values more precisely, which provides, for instance, guidance for selecting the best animals for breeding.
Furthermore, the (fitted) sparse-group lasso approach is applicable not only to genome-wide regression models in livestock but also to any field of application where grouped predictors may occur.
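To make the proximal gradient idea in the abstract concrete, the following is a minimal Python sketch of a sparse-group lasso fit. It is not the project's own implementation and does not reproduce the interface of any published package. The function names, the loss scaling 1/(2n), the group weights sqrt(group size), the mixing parameter `alpha` and the fixed step size 1/L are all assumptions chosen for illustration. The penalty assumed here is `lam * (alpha * ||b||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||b_g||_2)`; its proximal operator is an element-wise soft threshold followed by a group-wise shrinkage.

```python
import numpy as np

def soft_threshold(z, tau):
    """Element-wise soft-thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_sparse_group_lasso(z, groups, step, lam, alpha):
    """Proximal operator of the (assumed) sparse-group lasso penalty.

    groups: integer array assigning each coefficient to a group (hypothetical layout).
    """
    # Lasso part: shrink every coefficient individually.
    out = soft_threshold(z, step * alpha * lam)
    # Group part: shrink each group's Euclidean norm, or zero the whole group.
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(out[idx])
        tau = step * (1.0 - alpha) * lam * np.sqrt(idx.sum())
        out[idx] = 0.0 if norm <= tau else (1.0 - tau / norm) * out[idx]
    return out

def sparse_group_lasso(X, y, groups, lam, alpha=0.5, n_iter=500):
    """Proximal gradient descent for least squares plus the penalty above."""
    n, p = X.shape
    # Fixed step 1/L, where L = sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n  # gradient of (1/(2n)) * ||y - X beta||^2
        beta = prox_sparse_group_lasso(beta - step * grad, groups, step, lam, alpha)
    return beta
```

In this sketch, `alpha = 1` reduces the penalty to the lasso and `alpha = 0` to the group lasso. A practical implementation would typically add a convergence check, a backtracking line search and a path of decreasing `lam` values; those are omitted here for brevity.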
Publications
- A sparse-group lasso variant for whole-genome regression models in half sibs, 69th Annual Meeting of the EAAP in Dubrovnik, Croatia, August 27-30, 2018
Klosa, J.
- Generalised sparse-group lasso for whole-genome regression and genomic selection, 70th Annual Meeting of the EAAP in Ghent, Belgium, August 26-29, 2019
Klosa, J.
- Sparse-group lasso variants for whole-genome regression models in livestock, DAGStat Conference in Munich, Germany, March 18 - 22, 2019
Klosa, J.
(See online at https://doi.org/10.1101/2020.02.13.947473)
- (2020) Seagull: lasso, group lasso and sparse-group lasso regularisation for linear regression models via proximal gradient descent, BMC Bioinf., 21, 407
Klosa, J., Simon, N., Westermark, P. O., Liebscher, V. & Wittenburg, D.
(See online at https://doi.org/10.1186/s12859-020-03725-w)
- (2021) Grouping of genomic markers in populations with family structure, BMC Bioinf., 22, 79
Wittenburg, D., Doschoris, M. & Klosa, J.
(See online at https://doi.org/10.1186/s12859-021-04010-0)