Automatic Transcription of Conversations
Communication Technology and Networks, High-Frequency Technology and Photonic Systems, Signal Processing and Machine Learning for Information Technology
Summary of Project Results
Multi-talker conversational speech recognition is concerned with transcribing meetings recorded with distant microphones. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by microphones at a distance is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, a significant fraction of the recording contains overlapped speech, where multiple speakers talk at the same time. Finally, the interaction dynamics are challenging because speakers talk intermittently, so that segments of silence, single-talker speech, and multi-talker speech alternate.
This project was concerned with developing a transcription system that can operate on arbitrarily long input, correctly handles segments of overlapped as well as non-overlapped speech, and transcribes the speech of different speakers consistently into separate output streams. Such a multi-talker Automatic Speech Recognition (ASR) system typically consists of three components: a source separation and enhancement block, a diarization stage that attributes segments of input speech to speakers, and an ASR stage. Different processing orders have been proposed for these components, which differ mainly in the point at which diarization is carried out. While existing approaches employed separately trained subsystems for diarization, separation, and recognition, our research hypothesis was that a joint approach, optimized under a single training objective, should lead to superior solutions compared to optimizing the individual components separately. Such a coherent formulation, however, would not necessarily mean that the three aforementioned tasks had to be carried out in a single, monolithic (probably neural) integrated system.
Indeed, the research showed that it is beneficial to keep separate subsystems, provided that they are tightly coupled. Examples of such systems we developed are:
• TS-SEP, which carries out diarization and separation/enhancement with a tight coupling between the two.
• Mixture encoder, which leverages explicit speech separation but also forwards the not-yet-separated speech to the ASR module to mitigate error propagation from the separator to the recognizer.
• Joint diarization and separation, realized by a statistical mixture model that integrates a mixture model for diarization and one for separation, which share a common hidden state variable.
• Transcription-supported diarization, which uses sentence- and word-level boundaries from the ASR module to support speaker turn detection.
Furthermore, we developed new approaches to the individual subsystems and shared several tools and data sets with the research community.
Project-Related Publications (Selection)
-
A meeting transcription system for an Ad-Hoc acoustic sensor network
T. Gburrek, C. Boeddeker, T. von Neumann, T. Cord-Landwehr, J. Schmalenstroeer & R. Haeb-Umbach
-
An Initialization Scheme for Meeting Separation with Spatial Mixture Models. Interspeech 2022, 271-275. ISCA.
Boeddeker, Christoph; Cord-Landwehr, Tobias; von Neumann, Thilo & Haeb-Umbach, Reinhold
-
Monaural Source Separation: From Anechoic To Reverberant Environments. 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), 1-5. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph; von Neumann, Thilo; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
-
SA-SDR: A Novel Loss Function for Separation of Meeting Style Data. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6022-6026. IEEE.
von Neumann, Thilo; Kinoshita, Keisuke; Boeddeker, Christoph; Delcroix, Marc & Haeb-Umbach, Reinhold
-
A Teacher-Student Approach for Extracting Informative Speaker Embeddings From Speech Mixtures. INTERSPEECH 2023, 4703-4707. ISCA.
Cord-Landwehr, Tobias; Boeddeker, Christoph; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
-
Frame-Wise and Overlap-Robust Speaker Embeddings for Meeting Diarization. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
-
HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch. 2022 IEEE Spoken Language Technology Workshop (SLT), 287-294. IEEE.
Raissi, Tina; Zhou, Wei; Berger, Simon; Schlüter, Ralf & Ney, Hermann
-
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems. 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), 27-32. ISCA.
von Neumann, Thilo; Boeddeker, Christoph; Delcroix, Marc & Haeb-Umbach, Reinhold
-
Mixture Encoder for Joint Speech Separation and Recognition. INTERSPEECH 2023, 3527-3531. ISCA.
Berger, Simon; Vieting, Peter; Boeddeker, Christoph; Schlüter, Ralf & Haeb-Umbach, Reinhold
-
Multi-stage diarization refinement for the CHiME-7 DASR scenario. 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), 51-56. ISCA.
Boeddeker, Christoph; Cord-Landwehr, Tobias; von Neumann, Thilo & Haeb-Umbach, Reinhold
-
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. IEEE.
von Neumann, Thilo; Boeddeker, Christoph; Kinoshita, Keisuke; Delcroix, Marc & Haeb-Umbach, Reinhold
-
Combining TF-GridNet And Mixture Encoder For Continuous Speech Separation For Meeting Transcription. 2024 IEEE Spoken Language Technology Workshop (SLT), 155-162. IEEE.
Vieting, Peter; Berger, Simon; von Neumann, Thilo; Boeddeker, Christoph; Schlüter, Ralf & Haeb-Umbach, Reinhold
-
Geodesic Interpolation of Frame-Wise Speaker Embeddings for the Diarization of Meeting Scenarios. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11886-11890. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph; Zorilă, Cătălin; Doddipatla, Rama & Haeb-Umbach, Reinhold
-
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 775-779. IEEE.
von Neumann, Thilo; Boeddeker, Christoph; Cord-Landwehr, Tobias; Delcroix, Marc & Haeb-Umbach, Reinhold
-
Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment. Interspeech 2024, 1615-1619. ISCA.
Boeddeker, Christoph; Cord-Landwehr, Tobias & Haeb-Umbach, Reinhold
-
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, 1185-1197.
Boeddeker, Christoph; Subramanian, Aswin Shanmugam; Wichern, Gordon; Haeb-Umbach, Reinhold & Le Roux, Jonathan
-
Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. IEEE.
Cord-Landwehr, Tobias; Boeddeker, Christoph & Haeb-Umbach, Reinhold
-
Word Error Rate Definitions and Algorithms for Long-Form Multi-Talker Speech Recognition. IEEE Transactions on Audio, Speech and Language Processing, 33, 3174-3188.
von Neumann, Thilo; Boeddeker, Christoph; Delcroix, Marc & Haeb-Umbach, Reinhold
