Project Details
Functional annotation of genomic innovations in a densely populated clade with deep learning
Applicants
Professor Dr. Erich Bornberg-Bauer; Professor Dr. Gregor Bucher; Privatdozentin Dr. Katharina Hoff
Subject Area
Evolution, Anthropology
General Genetics and Functional Genome Biology
Bioinformatics and Theoretical Biology
Term
since 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 503348080
Comparative Evolutionary Genomics aims to exploit the available richness of genomic data in order to understand: (i) the principles of molecular evolution, (ii) the modes of genome evolution, (iii) the underlying evolutionary history of organisms, and (iv) the likely functional properties of a gene, which in turn can facilitate targeted experiments. This offers a glimpse into the past and therefore allows us to understand how genomic innovations such as novel genes or transposable elements come about and shape new traits. This proposal tackles a core obstacle in comparative genomics: reliably finding and annotating fast-evolving, recently arisen genes, which classical tools overlook and which are notoriously difficult to annotate because they lack clear homology. Building on the rich, high-quality data set from GEvol's first funding period (dense insect genomes, transcriptomes, ribo-seq) and on existing pipelines, we will merge these resources with deep learning methods to create an automated clade annotation framework. We will train the deep learning gene finder Tiberius for insects and thereby reach state-of-the-art accuracy across the main insect orders. A redesigned loss function, class balancing, and a ClaMSA track that flags absent evolutionary constraint teach the model to spot genuine de novo genes, while RNA-seq, Iso-Seq and ribo-seq data refine exon–intron boundaries, add precise UTRs and estimate fixation likelihood. A containerised Nextflow workflow then pipes protein sequences through FANTASIA to obtain GO terms and employs Llama 3 to turn them into harmonised product names that pass GenBank filters, producing INSDC-ready GFF3 files at scale. Raw annotations become searchable knowledge through EMOBase, a cluster of gene-centered phenotypic databases cloned from iBeetle-Base for emerging models. Orthology inference via OrthoFinder and fDOG, together with BUSCO, fCAT and microsynteny metrics, links entries across species and to FlyBase, while AI distils literature summaries.
The design of EMOBase will enable easy uploading of new custom tracks, ensuring FAIR data without full-time curators. The team will also provide progressive (re)annotation services and workshops to GEvol members. Once trained, the models annotate new genomes with minimal compute, making the approach both economical and green. By illuminating the most elusive segment of insect gene space and standardising functional metadata at unprecedented scale, the project will unlock robust tests of how genomic novelty drives phenotypic innovation and entrench machine learning genomics expertise in Germany.
DFG Programme
Priority Programmes
