Project Details

CADD-SV – Scoring functional effects and deleteriousness of structural variants using machine learning

Subject Area: Bioinformatics and Theoretical Biology; General Genetics and Functional Genome Biology; Human Genetics
Term: since 2023
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 528500855
 
In light of recent advances in structural variant (SV) detection and in the study of regulatory genome architecture, we propose a computational approach to estimate the effects of SVs across the human genome. Because of their size, SVs may encompass different types of genomic sequence: sequences encoding proteins or functional RNAs, regulatory sequences, and sequences not anticipated to be functional. In particular, SVs can interfere with the regulatory architecture of the genome and have therefore moved into the focus of research, as they can help explain previously unexplained disease phenotypes. In our preliminary work, we derived an unbiased and sufficiently large training dataset to differentiate functional SVs from neutral variants and to train machine learning models for insertions, deletions and duplications. This work also enables fast SV annotation and data summarization and allows us to combine a large collection of features in a machine learning model that identifies functional and disease-relevant SVs. Here, we will develop this idea further and address the following aims: (1) improving the scoring of SVs by integrating sequence-based scores, e.g. by predicting the potential functional content of inserted sequences; (2) including new model features (e.g. SCREEN candidate regulatory elements and gene fusions) and applying convolutional neural networks (CNNs) to generalize functional data (e.g. across many cell types) or to predict molecular assay data for new sequences (e.g. Hi-C contacts with deepC); and (3) developing a robust and superior score for SVs across the whole genome, confirmed by an unbiased benchmark, together with model interpretation to identify the most relevant predictive features and to assess the contribution of mechanistic effects in pathogenic SVs (e.g. 3D architecture vs. coding sequence effects).
The result will be an improved general framework (Combined Annotation Dependent Depletion for Structural Variants, CADD-SV) for the computational scoring of structural variants, integrating diverse information that ranges from regulatory genome architecture to coding sequence effects. We will develop an innovative machine learning tool and a scoring website to make SV prioritization easily accessible to the community. Interpreting our models can provide mechanistic insights into genome regulation and a resource for discovering new genotype-phenotype effects.
DFG Programme: Research Grants
 
 

