Likelihood-based component-wise boosting methods for effect selection in Cox frailty models
Statistics and Econometrics
Summary of project results
As in many other regression settings, survival analysis increasingly has to deal with high-dimensional data containing many potentially influential covariates. The effects of these covariates can be constant over time or time-varying, and the modeler often does not know a priori which type applies. One possible solution is to use regularized estimation methods that select the relevant covariates and distinguish between these effect types. The main goal of this project is therefore to develop, implement and test a likelihood-based component-wise boosting approach for both variable and model selection in a specific Cox-type model, the so-called Cox frailty model, together with a corresponding R package. In particular, the project pursues two major objectives: a) Frailty distribution: equip the planned regularization approach with a powerful and flexible class of frailty distributions, namely the (multiplicative) log-normal frailty distribution; b) Regularization: develop a likelihood-based boosting approach for variable and model selection in Cox frailty models with time-varying coefficients, such that each effect is either included as time-varying, included as a constant effect, or excluded entirely.
For the update step of the boosting algorithm, it is essential that a potentially time-varying effect γ(t) of a covariate z, which is expanded in P-splines (i.e., penalized B-splines), can be split into a parametric part consisting of an unpenalized polynomial and the nonparametric deviation from this polynomial. By ensuring that the linear and the smooth component have comparable complexity, variable selection and model choice are obtained simultaneously, because the method can select the linear part of an effect, its smooth part, or neither. Altogether, the method yields flexible and sparse hazard rate models for survival data.
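The split of γ(t) described above can be written out explicitly. The following is a sketch of one standard way to do it (the mixed-model representation of P-splines), not necessarily the exact construction used in the project. The symbols M (number of B-spline basis functions B_m), D_k (difference matrix of order k), U, W, β, u and the smoothing parameter λ are notation assumed here for illustration.

```latex
\begin{align}
\gamma(t) &= \sum_{m=1}^{M} \alpha_m B_m(t) = \mathbf{B}(t)^\top \boldsymbol{\alpha},
\qquad \text{penalty: } \lambda\, \boldsymbol{\alpha}^\top \mathbf{D}_k^\top \mathbf{D}_k \boldsymbol{\alpha}, \\
\boldsymbol{\alpha} &= \mathbf{U}\boldsymbol{\beta} + \mathbf{W}\mathbf{u},
\qquad \mathbf{D}_k \mathbf{U} = \mathbf{0}, \quad
\mathbf{W} = \mathbf{D}_k^\top \left(\mathbf{D}_k \mathbf{D}_k^\top\right)^{-1}, \\
\gamma(t) &= \underbrace{\mathbf{B}(t)^\top \mathbf{U}\boldsymbol{\beta}}_{\text{unpenalized polynomial}}
 + \underbrace{\mathbf{B}(t)^\top \mathbf{W}\mathbf{u}}_{\text{penalized deviation}},
\qquad \lambda\, \boldsymbol{\alpha}^\top \mathbf{D}_k^\top \mathbf{D}_k \boldsymbol{\alpha}
 = \lambda\, \mathbf{u}^\top \mathbf{u}.
\end{align}
```

Because D_k U = 0 and D_k W is the identity, the penalty acts only on u, so β remains unpenalized. For k = 1, U is a single column of ones and, since B-splines sum to one, the parametric part reduces to a time-constant effect. In each boosting step the algorithm can then update the parametric part, the deviation, or neither, which corresponds to the three outcomes named above. For k ≥ 2 the parametric part is a polynomial of degree k − 1 under the assumption of equidistant knots.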
The method is currently being evaluated in extensive simulation studies, and an application to the time until pregnancy of German women is planned in order to illustrate that it can strongly reduce the complexity of the influence structure. During the research stay at Stanford University, the method was implemented in the statistical software R and partly in C++. The implementation took up most of the time. Most of the remaining time was invested in several simulation studies to check that the method works correctly and to compare its performance with that of competing methods. In addition, we worked on theoretical aspects of deriving the effective degrees of freedom corresponding to a single boosting step. Overall, the research stay laid important foundations for the research project, which is now being pursued further. In particular, correctly determining the effective degrees of freedom of a single boosting step would make it possible to choose the optimal number of boosting steps by information criteria such as AIC or BIC, which might substantially reduce computation time. This is a very important aspect of the method's attractiveness and usability and could strongly increase the value of the whole project.
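The idea of choosing the number of boosting steps by an information criterion can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `select_stopping_step` is hypothetical, the log-likelihood and effective-degrees-of-freedom values are made-up numbers, and using the number of observations in the BIC penalty is an assumption (for survival data the number of events is sometimes used instead).

```python
import math


def select_stopping_step(loglik, edf, n_obs, criterion="AIC"):
    """Return (best_step, scores) for a boosting path.

    loglik[m] : (partial) log-likelihood after boosting step m + 1
    edf[m]    : effective degrees of freedom accumulated after step m + 1
    n_obs     : sample size used in the BIC penalty (an assumption, see text)
    """
    if len(loglik) != len(edf):
        raise ValueError("loglik and edf must have the same length")
    if criterion == "AIC":
        weight = 2.0
    elif criterion == "BIC":
        weight = math.log(n_obs)
    else:
        raise ValueError("criterion must be 'AIC' or 'BIC'")
    # Information criterion per step: -2 * log-likelihood + weight * edf
    scores = [-2.0 * ll + weight * df for ll, df in zip(loglik, edf)]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index + 1, scores  # steps are counted from 1


# Hypothetical boosting path with five steps (illustrative numbers only).
loglik_path = [-120.0, -112.0, -108.5, -107.2, -107.0]
edf_path = [1.0, 2.0, 3.0, 4.0, 5.0]

aic_step, aic_scores = select_stopping_step(loglik_path, edf_path, n_obs=100)
bic_step, bic_scores = select_stopping_step(
    loglik_path, edf_path, n_obs=100, criterion="BIC"
)
```

Compared with cross-validation, which refits the whole boosting path on several data splits, such a criterion needs only a single path. That is where a saving in computation time would come from, provided the effective degrees of freedom per step are determined correctly.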