Statistical theory on finite alphabet structures: inference, algorithms, and applications

Applicant Professorin Dr. Merle Behr

Subject Area Mathematics

Term from 2018 to 2020

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 411042450

Final Report Year 2021

Final Report Abstract

Modern data problems consist of increasingly complex, often high-dimensional data sources. In order to extract meaningful and interpretable information it is therefore often necessary to incorporate additional structural assumption or pre-knowledge into the analysis. Arguably, one of the most widely used assumptions in this context is sparsity. This means that, although, a model can include potentially many different variables, only a few of them have a non-zero inﬂuence in the model. In this project we focused on a related alternative simplifying assumption, namely, a given ﬁnite alphabet (FA) assumption. This means that, although many different variables can enter the model simultaneously, each of them can only take a few different values, represented via the giving FA. There are various domains where such a FA assumption appears naturally. A particularly suitable area of application for FA structures is the ﬁeld of genetics, which we also mainly, but not exclusively, focused on in this project. For instance, the ﬁnitely many base pairs of the DNA constitute a natural FA, but there are various other examples, such as copy-number-variations and single-nucleotide polymorphisms, on both a diploid or haploid level. In the ﬁrst part of this project we analyzed FA structures for speciﬁc matrix factorization problems, which are particularly relevant for genetic data from pooled sequencing. We discovered that FA’s are extremely powerful in this setting. In particular, we were able to show that the optimal statistical accuracy in such estimation problems is essentially the same as if the FA matrix was known exactly. We also derived a computationally efﬁcient algorithm which can achieve this optimal estimation rate. Based on this, we develop a completely new approach for haplotype reconstruction from pool sequencing data of multiple samples, which is both, more accurate as well as computationally more efﬁcient than other methods. In the second part of this project we focused on a high-dimensional Boolean interaction model (where the Boolean components relate to a FA structure), which can be used, for example, to describe higher-order gene interactions. In recent years, there have been several empirical works which found that tree based ensemble methods, such as the random forest algorithm, can be very successful in term of interaction discovery, in particular, for genetics data. To the best of our knowledge, we were the ﬁrst to obtain a theoretical foundation for these empirical ﬁndings. In particular, we showed that consistent interaction discovery of such Boolean interaction components is feasible with random forest. In the third part of this projects we focused on FA structures that correspond to the ﬁnite set of inner nodes in a tree structure. On the one hand, we considered the situation where the tree structure is ﬁxed and given. There we were the ﬁrst to derived a precise statistical testing procedure, treeSeg, to detect sub-branches simultaneously on all hierarchy levels in the tree. treeSeg is applicable in various domains, where detection of sub-groups is of interest. For example, treeSeg can be applied to detect sub-lineages within phylogentic trees (e.g., of virus or bacteria strains), which show a signiﬁcantly different outcome distribution. It allows for precise uncertainty quantiﬁcation and therefore, increases interpretability and reproducibility across many ﬁelds of science. Besides settings where the tree structure is given in advance (independent of the outcome variables), we also considered settings where the tree is learned directly from the data in a supervised setting via decision tree algorithms. In this context we derived a pipeline, epiTree, which uses such tree structures in order to provide an estimation and testing procedure for gene interactions.

Publications

Modeling epistatic polygenic contributions to phenotypes using machine learning with boolean interactions
M. Behr, K. Kumbier, M. Aguirre, R. Arnaout, E. Ashley, B. Brown, J. Priest, B. Yu
Testing for dependence on tree structures, Proceedings of the National Academy of Sciences (PNAS), 117(18):9787-9792 (2020)
Behr, Merle; Ansari, M. Azim; Munk, Axel & Holmes, Chris
Multiple haplotype reconstruction from allele frequency data, Nature Computational Science (2021)
Pelizzola, Marta; Behr, Merle; Li, Housen; Munk, Axel & Futschik, Andreas
Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests
M. Behr, H. Wang, X. Li, B. Yu

Servicenavigation

Hauptnavigation

Statistical theory on finite alphabet structures: inference, algorithms, and applications

Final Report Abstract

Publications

Additional Information

Servicenavigation

Hauptnavigation

Statistical theory on finite alphabet structures: inference, algorithms, and applications

Final Report Abstract

Publications

Additional Information

Textvergrößerung und Kontrastanpassung