Project Details

Iterative Information Fusion in Automatic Speech Recognition According to the Turbo Principle

Subject Area Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering
Term from 2019 to 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 414091002
 
Final Report Year 2022

Final Report Abstract

The work in this project initially focused on turbo information fusion, which in previous publications had achieved promising results by transferring the principle of turbo codes from communications to automatic speech recognition. We started by extending the turbo information fusion method with recently successful neural network architectures. We developed posterior-in-posterior-out (PIPO-)BLSTMs, a type of recurrent neural network specifically designed for turbo information fusion, which replaces the original turbo forward-backward algorithm and revealed surprising properties during the project. Owing to the probability interface at both input and output, PIPO-BLSTMs are fully modular state sequence enhancers and can be combined with various acoustic models that provide acoustic state probabilities. Two key properties of PIPO-BLSTMs were discovered in the project: First, PIPO-BLSTMs perform best when trained with state probabilities from acoustic models that process little or even no temporal input context, but are then advantageously combined during inference with acoustic models that process large temporal input context. Second, turbo information fusion can be used to augment the PIPO-BLSTM's training data by using the probabilities iteratively exchanged over multiple iterations on the training set as additional data. These capabilities enabled PIPO-BLSTM-based turbo information fusion to achieve a competitive phoneme error rate of 18.02% on the well-known TIMIT database in a completely new fusion setting of the same features but different DFT window lengths, an improvement of 1.95% absolute compared to a common reference fusion method using a multi-stream HMM.
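The iterative exchange at the heart of turbo information fusion can be illustrated with a deliberately simplified NumPy sketch: two streams of per-frame state posteriors repeatedly refine each other, and the intermediate posteriors collected along the way correspond to the kind of additional training data described above. The multiplicative combination rule and the function names are illustrative assumptions for this sketch, not the project's actual PIPO-BLSTM or forward-backward formulation.

```python
import numpy as np

def normalize(p, axis=-1):
    """Renormalize each row so it is a valid probability distribution."""
    return p / p.sum(axis=axis, keepdims=True)

def turbo_fusion(p_a, p_b, n_iter=3):
    """Simplified sketch of iterative posterior exchange between two streams.

    p_a, p_b: (frames, states) state posterior matrices from two acoustic
    models over the same utterance. In each iteration, one stream's estimate
    is refined using the other stream's current estimate; the intermediate
    posteriors are collected, mirroring their use as augmented training data.
    """
    history = []
    q_a, q_b = p_a, p_b
    for _ in range(n_iter):
        q_a = normalize(p_a * q_b)  # stream A absorbs stream B's estimate
        q_b = normalize(p_b * q_a)  # stream B absorbs stream A's estimate
        history.extend([q_a, q_b])
    return q_a, q_b, history
```

In this toy version the exchange converges toward states on which both streams agree; in the project, the combination step is instead learned by the PIPO-BLSTM.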
Further work was driven by a paradigm shift that is currently moving the focus of automatic speech recognition research from HMM-based hybrid approaches to so-called end-to-end methods, in which a single neural network converts the acoustic speech signal directly into a sequence of graphemes. Information fusion had been poorly explored in these networks, so in this project we first developed several novel fusion methods and evaluated them with respect to computational complexity and recognition rates. In addition, we developed multi-encoder learning, a method that uses fusion with an additional feature stream only during training and can thus improve a standard transformer model without adding any complexity during inference. Surprisingly, however, by analyzing this method very closely, we were able to develop another performance-improving method that makes fusion, and thus the added complexity of additional features and encoders, obsolete. This so-called relaxed attention performs a simple smoothing of the attention weights within the multi-head attention function of the transformer model. The benefits of relaxed attention are twofold: First, as relaxed self-attention, it regularizes the transformer encoder; second, as relaxed cross-attention, it suppresses the internally learned language model in the decoder and thereby improves the combination of transformer models with large external language models. Thus, we were ultimately able to achieve consistent improvements across various applications, even beyond automatic speech recognition, and outperformed the state of the art on several tasks: On the Wall Street Journal database for automatic speech recognition we achieved a word error rate of 3.19% (vs. 3.4% for the best attention-based encoder-decoder model to date), on the LRS3 database for automatic lip-reading we achieved a word error rate of 25.51% (vs. 26.90%), and on the IWSLT14 database for machine translation we increased the BLEU score to 37.85 (vs. 37.60).
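The smoothing of attention weights described above can be sketched as blending the softmax attention weights with a uniform distribution over the key positions. The blending formula and the parameter name `gamma` are assumptions for this illustration; setting `gamma = 0` recovers standard scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relaxed_attention(q, k, v, gamma=0.1):
    """Scaled dot-product attention with relaxed (smoothed) weights.

    The attention weights alpha are blended with a uniform distribution:
    alpha' = (1 - gamma) * alpha + gamma / T, where T is the number of
    key positions. gamma = 0 recovers standard attention; larger gamma
    pulls the weights toward uniform, acting as a simple regularizer.
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # (queries, keys)
    alpha = softmax(scores, axis=-1)
    T = k.shape[-2]
    alpha_relaxed = (1.0 - gamma) * alpha + gamma / T
    return alpha_relaxed @ v, alpha_relaxed
```

Since the blend of two distributions is again a distribution, each relaxed weight row still sums to one, so the function drops into the multi-head attention computation without further changes.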

Publications

