Project Details

Machine-Learning Methods for Conditional Logistic Regression

Subject Area: Epidemiology and Medical Biometry/Statistics; Statistics and Econometrics
Term: since 2025
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 562549888
 
Conditional Logistic Regression is used in two very different application areas: matched case-control studies and the analysis of discrete choice data. In matched case-control studies, a case (i.e., a person suffering from the disease in question) is matched to one or more controls (i.e., healthy persons). Matching typically relies on variables such as age, gender, and place of residence, which may differ only slightly or not at all between matched cases and controls. The goal is to identify factors that may either promote or prevent the development of the disease in question. Discrete choice data arise when a person selects exactly one option from a given set of alternatives, for example, when choosing a type of treatment for a particular patient (with treatments A, B, and C as the available options). In both cases, the data have a specific stratified structure that must be taken into account in the analysis: either (a) only one individual in each matched set is a case, or (b) a person can choose only one of the possible alternatives.

The standard method of analysis in both settings is Conditional Logistic Regression. However, it relies on very restrictive assumptions, such as linearity and additivity of effects, which prevent it from modeling complex functional relationships. Classic machine learning methods could address this issue but cannot be applied directly, as they do not handle the stratified data structure appropriately.

This research project aims to develop and implement machine learning methods that can replace Conditional Logistic Regression, specifically tree-based methods and boosting methods. In the main preliminary work for this project, decision trees and random forests for matched case-control studies have already been developed; random forests combine a large number of decision trees, leading to even greater flexibility. Since the data structure of matched case-control studies and discrete choice data differs in some respects, the tree-based methods developed for matched case-control studies need to be adapted for use with discrete choice data. In addition, a flexible boosting method for Conditional Logistic Regression will be developed. Boosting is a stepwise algorithm that can combine various model components (e.g., linear effects, smooth effects, spatial effects) into one model, which makes it a very flexible and powerful estimation technique. The boosting method will be applicable to both matched case-control studies and discrete choice data.
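
As a clarifying illustration (not part of the official project description), the model that the project seeks to generalize can be written down explicitly. Using notation introduced here, for strata s = 1, ..., S with n_s members (or alternatives), covariate vectors x_{sj}, coefficient vector beta, and i_s denoting the case (or the chosen alternative) in stratum s, the conditional likelihood of Conditional Logistic Regression is

L(\beta) = \prod_{s=1}^{S} \frac{\exp(x_{s i_s}^{\top} \beta)}{\sum_{j=1}^{n_s} \exp(x_{s j}^{\top} \beta)}

The linear predictor x^{\top}\beta makes the linearity and additivity assumptions visible; the project's methods aim to relax this restriction while respecting the stratum-wise conditioning.

As a further hedged sketch, a standard Conditional Logistic Regression of this kind could be fitted in Python with statsmodels' ConditionalLogit; the simulated 1:1 matched data, variable names, and effect sizes below are illustrative assumptions only, not material from the project.

# Minimal sketch: conditional logistic regression on simulated 1:1 matched
# case-control data; all settings here are assumptions for demonstration.
import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(0)
n_strata = 200                                   # number of matched sets (strata)
x = rng.normal(size=(2 * n_strata, 2))           # two exposure variables per person
groups = np.repeat(np.arange(n_strata), 2)       # stratum id: case plus matched control

# Within each stratum, exactly one member is the case, drawn with probability
# proportional to exp(x beta), i.e., the conditional-logit data-generating model.
beta_true = np.array([0.8, -0.5])                # assumed true effects
eta = x @ beta_true
y = np.zeros(2 * n_strata)
for s in range(n_strata):
    idx = np.where(groups == s)[0]
    p = np.exp(eta[idx])
    y[rng.choice(idx, p=p / p.sum())] = 1.0      # pick the case within the stratum

result = ConditionalLogit(y, x, groups=groups).fit()
print(result.summary())                          # estimates should be near (0.8, -0.5)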
DFG Programme: Research Grants
 
 
