Enabling haplotype-level genomics: Whole-chromosome integrative read-based phasing
Final Report Abstract
We have entered an era in which genomics significantly impacts individuals and society. Recent advances in sequencing technology are transforming medical and fundamental research: large genotype-phenotype studies are now carried out routinely and yield new insights into the genetic basis of disease and drug response. These advances in medical genomics enable precision-medicine approaches to patient treatment, which are becoming increasingly widespread and successful. Other fields, such as population genomics, benefit from the ability to study millions of loci in large populations.
However, individual genomes are currently studied predominantly at the level of genotypes. Genotyping refers to determining the two alleles (one inherited from each parent) present at a particular genetic locus and can be achieved using various technologies, including microarrays and short-read sequencing. Genotype-level genomics does not reveal whether a heterozygous variant resides on the paternal or the maternal chromosomal copy, so the information passed on to downstream analyses is incomplete. The full sequences of the two chromosomal copies are known as haplotypes, and moving from (sequences of) genotypes to haplotypes is known as phasing.
Haplotype-level genomics will enable researchers to examine genomic sequences at full resolution. Besides addressing important questions in population genetics, for instance about demographic history and selection, it is particularly relevant for medical genomics. In this project, we provide the algorithmic basis for entering the era of haplotype-level genomics. This will pave the way to a better understanding of the regulatory mechanisms underlying disease and non-disease phenotypes and to explaining missing heritability, that is, the fact that only a small fraction of heritable disease risk has been successfully linked to genetic variants.
We will design, implement, and benchmark read-based phasing algorithms to achieve three main goals. First, we solve problem instances that resist current approaches by developing novel algorithms. This applies in particular to instances that can deliver chromosome-length haplotypes by integrating different technologies and/or by combining sequencing reads with pedigree information. Second, we deliver an experimental map that precisely delineates the strengths and weaknesses of different (combinations of) technologies and hence guides future study design. This is made possible through close collaboration with the Human Genome Structural Variation Consortium. Third, all algorithmic advances are integrated into our open-source WhatsHap software suite for direct inclusion in production pipelines.
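To make the terms genotype, haplotype, and read-based phasing concrete, the following minimal Python sketch may help. It is a toy illustration under stated assumptions, not the algorithm implemented in WhatsHap, which this abstract does not describe. It assumes biallelic heterozygous sites coded 0/1, a diploid genome, and a simple objective that picks the haplotype pair requiring the fewest allele corrections in the reads. The function names `genotypes` and `phase_reads` and the example data are hypothetical.

```python
from itertools import product


def genotypes(hap1, hap2):
    """Collapse two haplotypes into unordered per-site genotypes.

    Sorting each allele pair discards which chromosomal copy carried
    which allele, which is exactly the information phasing restores.
    """
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


def phase_reads(reads, n_sites):
    """Toy brute-force read-based phasing over heterozygous sites.

    reads: list of dicts mapping site index -> observed allele (0 or 1).
    Assumption: every site is heterozygous, so the second haplotype is
    the complement of the first.
    Returns (haplotype, cost), where cost is the number of read alleles
    that must be flipped so that every read matches one haplotype.
    Exponential in n_sites; for illustration only.
    """
    best = None
    # Fix the allele at site 0 so each haplotype pair is tried once.
    for rest in product((0, 1), repeat=n_sites - 1):
        hap = (0,) + rest
        comp = tuple(1 - a for a in hap)
        cost = 0
        for read in reads:
            mismatches_hap = sum(read[s] != hap[s] for s in read)
            mismatches_comp = sum(read[s] != comp[s] for s in read)
            # Assign the read to whichever haplotype it fits better.
            cost += min(mismatches_hap, mismatches_comp)
        if best is None or cost < best[1]:
            best = (hap, cost)
    return best


# Hypothetical reads over four heterozygous sites; the last read
# carries one sequencing error at site 2.
reads = [
    {0: 0, 1: 1},
    {0: 1, 1: 0},
    {1: 1, 2: 1},
    {1: 0, 2: 0},
    {2: 1, 3: 0},
    {1: 1, 2: 0},
]

if __name__ == "__main__":
    # Two different haplotype pairs with identical genotypes.
    print(genotypes((0, 1, 1, 0), (1, 0, 0, 1)))
    print(genotypes((0, 0, 1, 1), (1, 1, 0, 0)))
    print(phase_reads(reads, 4))
```

Both calls to `genotypes` yield the same list, which shows why genotype-level data cannot distinguish the two haplotype pairs, whereas reads that span two or more heterozygous sites can. Enumerating all candidate haplotypes is feasible only for a handful of sites; practical tools rely on far more efficient algorithms.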
Publications
-
Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nature Biotechnology, 39(3), 302-308.
Porubsky, David; Ebert, Peter; Audano, Peter A.; Vollger, Mitchell R.; Harvey, William T.; Marijon, Pierre; Ebler, Jana; Munson, Katherine M.; Sorensen, Melanie; Sulovari, Arvis; Haukness, Marina; Ghareghani, Maryam; Lansdorp, Peter M.; Paten, Benedict; Devine, Scott E.; Sanders, Ashley D.; Lee, Charles; Chaisson, Mark J. P. ... & Marschall, Tobias
-
Haplotype threading: accurate polyploid phasing from long reads. Genome Biology, 21(1).
Schrinner, Sven D.; Serra Mari, Rebecca; Ebler, Jana; Rautiainen, Mikko; Seillier, Lancelot; Reimer, Julia J.; Usadel, Björn; Marschall, Tobias & Klau, Gunnar W.
-
The Longest Run Subsequence Problem. In Proc. WABI 2020: 20th International Workshop on Algorithms in Bioinformatics. Editors: Carl Kingsford and Nadia Pisanti; Article No. 6; pp. 6:1–6:13
Schrinner, Sven; Goel, Manish; Wulfert, Michael; Spohr, Philipp; Schneeberger, Korbinian & Klau, Gunnar W.
-
The Lost Recipes from the Four Schools of Amathus. Lecture Notes in Computer Science, 16-23. Springer International Publishing.
Klau, Gunnar W.
-
Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science, 372(6537).
Ebert, Peter; Audano, Peter A.; Zhu, Qihui; Rodriguez-Martin, Bernardo; Porubsky, David; Bonder, Marc Jan; Sulovari, Arvis; Ebler, Jana; Zhou, Weichen; Serra Mari, Rebecca; Yilmaz, Feyza; Zhao, Xuefang; Hsieh, PingHsun; Lee, Joyce; Kumar, Sushant; Lin, Jiadong; Rausch, Tobias; Chen, Yu; Ren, Jingwen ... & Eichler, Evan E.
-
Using the longest run subsequence problem within homology-based scaffolding. Algorithms for Molecular Biology, 16(1).
Schrinner, Sven; Goel, Manish; Wulfert, Michael; Spohr, Philipp; Schneeberger, Korbinian & Klau, Gunnar W.
-
Genetic polyploid phasing from low-depth progeny samples. iScience, 25(6), 104461.
Schrinner, Sven; Serra Mari, Rebecca; Finkers, Richard; Arens, Paul; Usadel, Björn; Marschall, Tobias & Klau, Gunnar W.
-
Haplotype-resolved assembly of a tetraploid potato genome using long reads and low-depth offspring data. Genome Biology, 25(1).
Serra Mari, Rebecca; Schrinner, Sven; Finkers, Richard; Ziegler, Freya Maria Rosemarie; Arens, Paul; Schmidt, Maximilian H.-W.; Usadel, Björn; Klau, Gunnar W. & Marschall, Tobias
