Project Details

Structured Hybrid Models for Audiovisual Speech Processing

Subject Area Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering; Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term from 2014 to 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 253379932
 
In difficult acoustic conditions, human speech perception and speech recognition improve notably when the face of the speaker is visible. The visual data not only helps with speaker localization, but also contains important information about the place of articulation and about the temporal segmentation of the utterance. Until recently, however, making comprehensive use of such video information was precluded by the lack of sensors and of the necessary computational power. Today, the increasing availability of multi-modal and specifically audiovisual speech data - e.g. in voice-over-IP communication, smartphones, speech- and gesture-controlled video games, and in the quickly growing body of multimedia data on the Internet - finally allows for the wide use of video information in traditional audio classification and signal processing tasks. This additional information can be of great value for automatic speech recognition as well, as was shown in the first funding period of this project.

Motivated by these developments, by the significant improvements achieved in the first funding period, and by the rapid progress in machine learning, especially in neural-network-based speech recognition, this continuation proposal aims to create new hybrid, i.e. neural/probabilistic, models for multi-modal speech data. In this way, the project will contribute novel methods and algorithms for highly robust audiovisual speech recognition in complex acoustic environments.

This promises a broad range of applications: audiovisual speech recognition can be used for voice control in acoustically difficult environments, for reliable transcription of multimedia data, e.g. for subtitling voice-over-IP communication and web content, and for audiovisual speaker identification. In addition, robust audiovisual speech recognition will lay the groundwork for multi-modal speech enhancement, which can improve speech intelligibility in difficult acoustic conditions by having access to a reliable estimate of the phonetic state of the speaker of interest.

On the theoretical side, we expect to contribute results and insights on the optimal fusion of neural and probabilistic components for the reliable recognition of time-series data, attained by considering reliability information and the structure of the underlying state spaces, and refined in an end-to-end optimization.
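As a rough illustration of the kind of reliability-aware fusion described above, the sketch below combines per-frame state posteriors from an audio and a video model using a dynamic, log-linear stream weight, as is common in hybrid neural/HMM audiovisual recognizers. All names, shapes, and the idea of deriving the weight from a per-frame SNR estimate are illustrative assumptions, not the project's actual method.

    import numpy as np
    from scipy.special import logsumexp

    def fuse_stream_posteriors(log_p_audio, log_p_video, audio_reliability):
        """Reliability-weighted log-linear fusion of per-frame posteriors.

        log_p_audio, log_p_video: (T, S) log posteriors over S HMM states
        for T frames, one array per modality.
        audio_reliability: (T,) dynamic stream weight in [0, 1], e.g.
        derived from a per-frame SNR estimate (a hypothetical choice).
        """
        lam = audio_reliability[:, None]                  # (T, 1) weights
        fused = lam * log_p_audio + (1.0 - lam) * log_p_video
        # Renormalize so each frame is again a proper log distribution
        # before handing it to an HMM decoder.
        return fused - logsumexp(fused, axis=1, keepdims=True)

    # Toy usage: 5 frames, 3 states, audio degraded in the middle frames.
    rng = np.random.default_rng(0)
    log_a = np.log(rng.dirichlet(np.ones(3), size=5))
    log_v = np.log(rng.dirichlet(np.ones(3), size=5))
    weights = np.array([0.9, 0.7, 0.2, 0.2, 0.8])
    print(fuse_stream_posteriors(log_a, log_v, weights))

In such a scheme, the stream weight is itself a natural target for the end-to-end optimization mentioned above, since it can be predicted by a small network and trained jointly with the recognizer.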
DFG Programme Research Grants
 
 
