Project Details

Deep speech representation learning for research in phonetics

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Applied Linguistics, Computational Linguistics
Communication Technology and Networks, High-Frequency Technology and Photonic Systems, Signal Processing and Machine Learning for Information Technology
Term since 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 446378607
 
The speech signal is a rich source of information: it conveys not only linguistic but also extra- and paralinguistic information, such as the speaker's identity, gender, emotional state, age, or social status. However, these traits are hidden in complex, non-transparent variations of the speech signal and remain largely opaque to speech research. Given the recent progress in speech synthesis and voice conversion brought about by deep learning, notably deep generative modeling, we argue that synthesized speech can become a valuable tool for research in phonetics.
The overarching goal of this project is thus to explore the potential of deep generative modeling of speech as a tool to support basic research in phonetics. To constrain the task, we will not consider the synthesis of stimuli from text, but will concentrate on the targeted manipulation of speech to generate new speech signals with desired properties. The goal is to develop generative models that represent the speech signal by latent variables, where the representation is compact and informative about the observed speech signal, captures different sources of variation in different dimensions, allows the targeted manipulation of a phonetic cue along phonetically plausible dimensions, and is amenable to human interpretation.
With these tools, a phonetician will be given control over both low-level acoustic-phonetic properties and high-level abstract concepts. The high-level concepts considered here are the disentanglement and targeted manipulation of speaker-related and linguistic-content-related variations of a speech signal, as well as the isolation of dialectal cues. Being data-driven, the developed techniques also have the potential to be useful for studying other extra- or paralinguistic traits, provided appropriate training data is available.
The performance and utility of the developed tools will be evaluated in machine and human perception experiments, and the generated speech will be scrutinized by phonetic experts for phonetic plausibility.
DFG Programme Research Grants
 
 
