Project Details

Iterative Information Fusion in Automatic Speech Recognition According to the Turbo Principle

Subject Area Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering
Term from 2019 to 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 414091002
 
Final Report Year 2022

Final Report Abstract

The work in this project initially focused on turbo information fusion, which in previous publications had achieved promising results by transferring the principle of turbo codes from communications to automatic speech recognition. We started by extending the turbo information fusion method with recently successful neural network architectures. We developed posterior-in-posterior-out (PIPO-)BLSTMs, a type of recurrent neural network specifically designed for turbo information fusion, which replaces the original turbo forward-backward algorithm and revealed surprising properties during the project. Owing to the probability interface at both input and output, PIPO-BLSTMs are fully modular state sequence enhancers and can be combined with various acoustic models that provide acoustic state probabilities. Two key properties of PIPO-BLSTMs were discovered in the project: First, PIPO-BLSTMs perform best when trained with state probabilities from acoustic models that process little or even no temporal input context, but are then advantageously combined during inference with acoustic models that process large temporal input context. Second, turbo information fusion can be used to augment the PIPO-BLSTM's training data by using the probabilities iteratively exchanged over multiple iterations on the training set as additional data. These capabilities enabled PIPO-BLSTM-based turbo information fusion to achieve a competitive phoneme error rate of 18.02% on the well-known TIMIT database in a completely new fusion setting of the same features but different DFT window lengths, an improvement of 1.95% absolute compared to a common reference fusion method using a multi-stream HMM.
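The iterative exchange at the heart of turbo information fusion can be illustrated with a deliberately simplified NumPy sketch: two streams of per-frame state posteriors repeatedly refine each other, and the intermediate posteriors collected along the way correspond to the kind of additional training data described above. The multiplicative combination rule and the function names are illustrative assumptions for this sketch, not the project's actual PIPO-BLSTM or forward-backward formulation.

```python
import numpy as np

def normalize(p, axis=-1):
    """Renormalize each row so it is a valid probability distribution."""
    return p / p.sum(axis=axis, keepdims=True)

def turbo_fusion(p_a, p_b, n_iter=3):
    """Simplified sketch of iterative posterior exchange between two streams.

    p_a, p_b: (frames, states) state posterior matrices from two acoustic
    models over the same utterance. In each iteration, one stream's estimate
    is refined using the other stream's current estimate; the intermediate
    posteriors are collected, mirroring their use as augmented training data.
    """
    history = []
    q_a, q_b = p_a, p_b
    for _ in range(n_iter):
        q_a = normalize(p_a * q_b)  # stream A absorbs stream B's estimate
        q_b = normalize(p_b * q_a)  # stream B absorbs stream A's estimate
        history.extend([q_a, q_b])
    return q_a, q_b, history
```

In this toy version the exchange converges toward states on which both streams agree; in the project, the combination step is instead learned by the PIPO-BLSTM.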
Further work was driven by a paradigm shift that is currently moving the focus of automatic speech recognition research from HMM-based hybrid approaches to so-called end-to-end methods, in which a single neural network converts the acoustic speech signal directly into a sequence of graphemes. Information fusion had been poorly explored in these networks, so in this project we first developed several novel fusion methods and evaluated them with respect to computational complexity and recognition rates. In addition, we developed multi-encoder learning, a method that uses fusion with an additional feature stream only during training and can thus improve a standard transformer model without adding any complexity during inference. Surprisingly, however, by analyzing this method very closely, we were able to develop another performance-improving method that makes fusion, and thus the added complexity of additional features and encoders, obsolete. This so-called relaxed attention performs a simple smoothing of the attention weights within the multi-head attention function of the transformer model. The benefits of relaxed attention are twofold: First, as relaxed self-attention, it regularizes the transformer encoder; second, as relaxed cross-attention, it suppresses the internally learned language model in the decoder and thereby improves the combination of transformer models with large external language models. Thus, we were ultimately able to achieve consistent improvements across various applications, even beyond automatic speech recognition, and outperformed the state of the art on several tasks: On the Wall Street Journal database for automatic speech recognition we achieved a word error rate of 3.19% (vs. 3.4% for the best attention-based encoder-decoder model to date), on the LRS3 database for automatic lip-reading we achieved a word error rate of 25.51% (vs. 26.90%), and on the IWSLT14 database for machine translation we increased the BLEU score to 37.85 (vs. 37.60).
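The smoothing of attention weights described above can be sketched as blending the softmax attention weights with a uniform distribution over the key positions. The blending formula and the parameter name `gamma` are assumptions for this illustration; setting `gamma = 0` recovers standard scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relaxed_attention(q, k, v, gamma=0.1):
    """Scaled dot-product attention with relaxed (smoothed) weights.

    The attention weights alpha are blended with a uniform distribution:
    alpha' = (1 - gamma) * alpha + gamma / T, where T is the number of
    key positions. gamma = 0 recovers standard attention; larger gamma
    pulls the weights toward uniform, acting as a simple regularizer.
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # (queries, keys)
    alpha = softmax(scores, axis=-1)
    T = k.shape[-2]
    alpha_relaxed = (1.0 - gamma) * alpha + gamma / T
    return alpha_relaxed @ v, alpha_relaxed
```

Since the blend of two distributions is again a distribution, each relaxed weight row still sums to one, so the function drops into the multi-head attention computation without further changes.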

Publications

