Project Details
Diffusion-Based Deep Generative Models for Speech Processing
Applicant
Professor Dr.-Ing. Timo Gerkmann
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 545210893
Recently, a novel and very exciting generative machine learning approach has gained increasing interest in the machine learning, computer vision, and speech communities: diffusion-based generative models, or simply diffusion models. These models are based on the idea of gradually turning data into noise (forward diffusion process) and training a neural network to invert this process at different noise scales (reverse diffusion process). The forward and reverse diffusion processes have been modeled using either Markov chains or stochastic differential equations (SDEs). We recently proposed to employ SDE-based diffusion models for speech enhancement by integrating a drift term that also allows recorded real-world environmental noise to be used during training. We have shown that this generative approach is very powerful and outperforms competing discriminative approaches in cross-corpora evaluations, which highlights very good generalization performance. However, many open questions remain that we want to tackle in this project. Our objectives are to make diffusion models capable of real-time processing with only modest latency by reducing their memory and computational footprint, and to investigate novel methods that increase the robustness of diffusion models in challenging acoustic scenarios.
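To make the forward process concrete, the following minimal Python sketch simulates such an SDE with Euler-Maruyama steps. The specific drift term gamma * (y - x_t), which pulls the current state toward the noisy recording y, and the exponential noise schedule g(t) are assumptions modeled on published score-based speech enhancement formulations; the project's exact parameterization may differ, and all parameter values and signal shapes below are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

def forward_sde_step(x_t, y, t, dt, gamma=1.5, sigma_min=0.05, sigma_max=0.5):
    """One Euler-Maruyama step of an assumed forward diffusion SDE.

    The drift gamma * (y - x_t) pulls the state toward the noisy
    recording y, so the process ends near y plus Gaussian noise;
    g(t) is an exponential noise schedule (illustrative values).
    """
    drift = gamma * (y - x_t)
    g = sigma_min * (sigma_max / sigma_min) ** t * np.sqrt(
        2.0 * np.log(sigma_max / sigma_min)
    )
    return x_t + drift * dt + g * np.sqrt(dt) * rng.standard_normal(x_t.shape)

# Toy signals standing in for clean and noise-corrupted speech.
x_clean = np.sin(np.linspace(0.0, 8.0 * np.pi, 1024))
y_noisy = x_clean + 0.3 * rng.standard_normal(1024)

# Forward process: 100 Euler-Maruyama steps from t = 0 (clean) to t = 1.
n_steps = 100
x = x_clean.copy()
for k in range(n_steps):
    x = forward_sde_step(x, y_noisy, t=k / n_steps, dt=1.0 / n_steps)
# x now lies close to y_noisy plus schedule-dependent Gaussian noise.

In training, a score network conditioned on y and the noise level would learn to estimate the score of the perturbed data distribution; at inference, the learned reverse SDE is integrated from an initial sample around y back toward an estimate of the clean speech.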
DFG Programme
Research Grants