Trustworthy multi-scale manifold learning for genomic and transcriptomic data

Applicant Dr. Dmitry Kobak

Subject Area Bioinformatics and Theoretical Biology

Term since 2021

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 471473934

Project Description

In recent years, large high-dimensional datasets have become commonplace in biology. For example, single-cell transcriptomics routinely produces datasets with sample sizes in the hundreds of thousands of cells and dimensionality in the tens of thousands of genes. Similarly, genomic datasets can encompass hundreds of thousands of people’s genomes, profiled using millions of single-nucleotide polymorphisms. One defining feature of such datasets is their hierarchical organization, with biologically meaningful structure present on several levels. Such datasets require adequate computational methods for data analysis, including unsupervised data exploration, so that researchers can compactly represent and make sense of their data. It is commonplace in single-cell transcriptomics to generate low-dimensional embeddings of the data using algorithms such as t-SNE or UMAP, but the existing methods fall short of representing the hierarchical structure of the data. Whereas they excel at preserving local structure, they are unable to recapitulate the larger-scale global structure often present in the data, which makes it difficult to interpret the embedding correctly. In this project, our first aim is to develop a dimensionality reduction method able to preserve crucial properties of high-dimensional data, such as the local cluster structure, continuous trajectories, and global hierarchical organization. The second aim is to develop a suite of quality metrics that will allow us to benchmark existing and novel algorithms on a range of challenging datasets. Finally, the third aim is to adapt this machinery to ultra-high-dimensional data from population genomics. On the technical level, we will rely on k-nearest-neighbour graphs and graph coarse-graining. Our work will be useful in practical applications in biology and bioinformatics, while at the same time being of high interest to the manifold learning part of the machine learning community.
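To make the notion of an embedding quality metric concrete, here is a minimal sketch of one commonly used measure of local structure preservation: the fraction of each point's k nearest neighbours in the high-dimensional data that are also among its k nearest neighbours in the embedding. This is an illustration only and is not taken from the project; the function names `knn_indices` and `knn_preservation`, the brute-force neighbour search, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row of X (brute force, Euclidean)."""
    # Squared Euclidean distances between all pairs of rows.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Exclude each point from its own neighbour list.
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]


def knn_preservation(X_high, X_low, k=10):
    """Mean fraction of k nearest neighbours shared between the two spaces."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k


# Synthetic stand-in for real data (hypothetical: 200 samples, 50 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# A crude 2D "embedding" for illustration: keep the first two features only.
Z = X[:, :2]

score_identity = knn_preservation(X, X, k=10)  # identical spaces share all neighbours
score_crude = knn_preservation(X, Z, k=10)     # some value between 0 and 1
```

A score of this kind only probes local neighbourhoods. It says nothing about whether the larger-scale arrangement of clusters is faithful, which is why a suite of complementary metrics is needed.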

DFG Programme Research Grants
