Automatic Transcription of Conversations

Applicants Professor Dr.-Ing. Reinhold Häb-Umbach; Privatdozent Dr. Ralf Schlüter

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Communication Technology and Networks, High-Frequency Technology and Photonic Systems, Signal Processing and Machine Learning for Information Technology

Funding From 2021 to 2024

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 448568305

Year of creation 2025

Summary of project results

Multi-talker conversational speech recognition is concerned with transcribing meetings recorded with distant microphones. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by microphones at a distance is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, speech overlaps for a significant percentage of the time, with multiple speakers talking simultaneously. Finally, the interaction dynamics are challenging because speakers talk intermittently, so that segments of speech inactivity, single-talker speech, and multi-talker speech alternate.
This project was concerned with developing a transcription system that can operate on arbitrarily long input, correctly handles overlapped as well as non-overlapped speech, and transcribes the speech of different speakers consistently into separate output streams. Such a multi-talker Automatic Speech Recognition (ASR) system typically consists of three components: a source separation and enhancement block, a diarization stage that attributes segments of input speech to speakers, and an ASR stage. Different processing orders have been proposed for these components; they differ in the point at which diarization is carried out.
While existing approaches employed separately trained subsystems for diarization, separation, and recognition, our research hypothesis was that a joint approach, optimized under a single training objective, should lead to better solutions than the separate optimization of individual components. Such a coherent formulation, however, would not necessarily mean that the three tasks had to be carried out in a single, monolithic (probably neural) integrated system.
Indeed, the research carried out showed that it is beneficial to have separate subsystems, provided that they are tightly coupled. Examples of such systems that we developed are:
• TS-SEP, which carries out diarization and separation/enhancement with a tight coupling between the two.
• Mixture encoder, which leverages explicit speech separation but also forwards the not-yet-separated speech to the ASR module to mitigate error propagation from the separator to the recognizer.
• Joint diarization and separation, realized by a statistical mixture model that integrates a mixture model for diarization and one for separation, which share a common hidden state variable.
• Transcription-supported diarization, which uses the sentence- and word-level boundaries from the ASR module to support speaker turn detection.
Furthermore, we developed new approaches to the individual subsystems and shared several tools and data sets with the research community.
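
The three interaction regimes named in the summary (speech inactivity, single-talker speech, and overlapped speech) can be made concrete with a small calculation. The following is a minimal illustrative sketch, not code from the project: it assumes a diarization-style annotation given as per-speaker lists of (start, end) times in seconds, and the function name `activity_breakdown` and its parameters are hypothetical. It further assumes that the segments of one speaker do not overlap each other and that all segments lie within the recording duration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def activity_breakdown(
    segments: Dict[str, List[Interval]], duration: float
) -> Dict[str, float]:
    """Return seconds of silence, single-talker, and overlapped speech.

    Hypothetical helper for illustration only. Assumes that segments of
    the same speaker do not overlap and lie within [0, duration].
    """
    # Sweep line: +1 when a speaker starts talking, -1 when they stop.
    events = []
    for speaker_segments in segments.values():
        for start, end in speaker_segments:
            events.append((start, 1))
            events.append((end, -1))
    events.sort()  # at equal times, ends (-1) sort before starts (+1)

    totals = {"silence": 0.0, "single": 0.0, "overlap": 0.0}
    active = 0       # number of speakers currently talking
    previous = 0.0   # time of the previously processed event
    for time, delta in events:
        if active == 0:
            key = "silence"
        elif active == 1:
            key = "single"
        else:
            key = "overlap"
        totals[key] += time - previous
        active += delta
        previous = time
    # Everything after the last event is silence.
    totals["silence"] += duration - previous
    return totals


# Toy annotation: speaker B starts before A finishes and A interrupts B.
example = {"A": [(0.0, 4.0), (6.0, 8.0)], "B": [(3.0, 7.0)]}
print(activity_breakdown(example, duration=10.0))
# {'silence': 2.0, 'single': 6.0, 'overlap': 2.0}
```

In this toy example, 20% of the recording is overlapped speech. A transcription system of the kind described above has to handle all three regimes within one continuous input.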

Project-related publications (selection)

A meeting transcription system for an Ad-Hoc acoustic sensor network
Gburrek, T.; Boeddeker, C.; von Neumann, T.; Cord-Landwehr, T.; Schmalenstroeer, J. & Haeb-Umbach, R.
An Initialization Scheme for Meeting Separation with Spatial Mixture Models. Interspeech 2022, 271-275. ISCA.
Boeddeker, Christoph; Cord-Landwehr, Tobias; von Neumann, Thilo & Haeb-Umbach, Reinhold
Monaural Source Separation: From Anechoic To Reverberant Environments. 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), 1-5. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph; von Neumann, Thilo; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
SA-SDR: A Novel Loss Function for Separation of Meeting Style Data. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6022-6026. IEEE.
von Neumann, Thilo; Kinoshita, Keisuke; Boeddeker, Christoph; Delcroix, Marc & Haeb-Umbach, Reinhold
A Teacher-Student Approach for Extracting Informative Speaker Embeddings From Speech Mixtures. INTERSPEECH 2023, 4703-4707. ISCA.
Cord-Landwehr, Tobias; Boeddeker, Christoph; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
Frame-Wise and Overlap-Robust Speaker Embeddings for Meeting Diarization. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch. 2022 IEEE Spoken Language Technology Workshop (SLT), 287-294. IEEE.
Raissi, Tina; Zhou, Wei; Berger, Simon; Schlüter, Ralf & Ney, Hermann
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems. 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), 27-32. ISCA.
von Neumann, Thilo; Boeddeker, Christoph; Delcroix, Marc & Haeb-Umbach, Reinhold
Mixture Encoder for Joint Speech Separation and Recognition. INTERSPEECH 2023, 3527-3531. ISCA.
Berger, Simon; Vieting, Peter; Boeddeker, Christoph; Schlüter, Ralf & Haeb-Umbach, Reinhold
Multi-stage diarization refinement for the CHiME-7 DASR scenario. 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), 51-56. ISCA.
Boeddeker, Christoph; Cord-Landwehr, Tobias; von Neumann, Thilo & Haeb-Umbach, Reinhold
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. IEEE.
von Neumann, Thilo; Boeddeker, Christoph; Kinoshita, Keisuke; Delcroix, Marc & Haeb-Umbach, Reinhold
Combining TF-GridNet And Mixture Encoder For Continuous Speech Separation For Meeting Transcription. 2024 IEEE Spoken Language Technology Workshop (SLT), 155-162. IEEE.
Vieting, Peter; Berger, Simon; von Neumann, Thilo; Boeddeker, Christoph; Schlüter, Ralf & Haeb-Umbach, Reinhold
Geodesic Interpolation of Frame-Wise Speaker Embeddings for the Diarization of Meeting Scenarios. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11886-11890. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 775-779. IEEE.
von Neumann, Thilo; Boeddeker, Christoph; Cord-Landwehr, Tobias; Delcroix, Marc & Haeb-Umbach, Reinhold
Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment. Interspeech 2024, 1615-1619. ISCA.
Boeddeker, Christoph; Cord-Landwehr, Tobias & Haeb-Umbach, Reinhold
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, 1185-1197.
Boeddeker, Christoph; Subramanian, Aswin Shanmugam; Wichern, Gordon; Haeb-Umbach, Reinhold & Le Roux, Jonathan
Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph & Haeb-Umbach, Reinhold
Word Error Rate Definitions and Algorithms for Long-Form Multi-Talker Speech Recognition. IEEE Transactions on Audio, Speech and Language Processing, 33, 3174-3188.
von Neumann, Thilo; Boeddeker, Christoph; Delcroix, Marc & Haeb-Umbach, Reinhold
