Beyond prediction: Statistical inference with machine learning

Applicant Professor Dr. Marvin Wright

Subject Area Medical Informatics and Medical Bioinformatics

Term since 2020

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 437611051

Project Description

In the age of digital epidemiology, gigantic amounts of data that provide information about the population's state of health are available due to modern technologies. Genetic data, mobility and behavioral data as well as electronic health records provide a comprehensive and consistent picture of health behavior and disease progression. The complexity and amount of these data pose a challenge for statistical modeling. Machine learning has proved to be excellent at providing predictions and decisions based on such complex data collections. These methods autonomously learn to recognize patterns in unstructured data without the need to prespecify rules or algorithms. However, a major objective of epidemiology is to analyze the determinants of disease, that is, to explain the underlying disease mechanisms. Here, the current machine learning methods reach their limits. In order to enable logical conclusions and causal interpretations with machine learning, and not just predictions, this research project will develop statistical inference methods for machine learning.
In order to successfully tackle this challenge, we focus on four important aspects, each of which is represented by a work package. In the first work package, we will develop a model-agnostic conditional independence test and methods for confounder adjustment in machine learning. In light of the recent success of deep learning, in the second work package we will derive statistical properties of neural networks, extend methods developed for image analysis, natural language processing or similar applications to epidemiological research questions and implement a software package for statistical inference with neural networks. In the third work package, we will build upon the first work package to develop machine learning methods to detect associations between genetic variants and diseases as well as methods to handle population stratification.
In the fourth work package, we will develop methods for statistical inference with competing risks and for estimation of time-specific effects and extend a method for estimation of heterogeneous treatment effects to survival outcomes. In summary, we will develop machine learning methods to understand underlying disease mechanisms. We will put special emphasis on statistical inference and problems faced in epidemiology such as confounding, high-dimensional data and survival outcomes. The project is of methodological nature but with a strong focus on applications. All methods will be made publicly available as software packages, ready to be used by practitioners and applied researchers.

DFG Programme Emmy Noether Independent Research Groups
