
Methods of Language Acquisition Based on Sparse Coding

Subject area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Funding Funded from 2011 to 2016
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 200293401
 
Year of creation 2015

Summary of Project Results

The goal of this project was to develop a learning system that can reveal the latent structure of a language solely from spoken input. The system was to learn a hierarchical representation consisting of phone-like units at the lower level of abstraction and word-like units at the higher level. Furthermore, a probabilistic coupling between phone and word discovery was to be established, so that each layer can benefit from information gleaned at the other. To achieve this goal, dictionary learning methods were to be augmented with temporal modeling approaches, such as Dynamic Time Warping (DTW) or Hidden Markov Models (HMMs), to capture the temporal nature of speech.
Initial investigations concentrated on DTW-based approaches. A two-layer learning system was developed that discovered phone-like and word-like units. To overcome the high speaker variability in the acoustic realisation of subword units, a DTW-based initialisation was proposed. It exploits the fact that the temporal sequence of the subword units in a word is more speaker independent than the acoustic realisation of each individual unit. Furthermore, a probabilistic lexicon was developed in which a word’s pronunciation is modeled by a hidden Markov model with discrete emission probabilities over the set of subword units. In this way, pronunciation variants and errors from the subword unit discovery stage could be accommodated. However, this approach is applicable only to small-vocabulary tasks, where several utterances of each word to be modeled are available for learning the probabilistic pronunciation lexicon.
To allow for a lexicon that grows with the amount of available data, the focus was shifted to a new methodology: nonparametric Bayesian methods. Here, the number of parameters is determined by the learning system and need not be specified manually in advance, which is much in the spirit of autonomous learning.
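To illustrate how a nonparametric Bayesian prior lets the model size grow with the data, the following minimal sketch simulates the Chinese restaurant process view of a Pitman-Yor process, the prior family behind Pitman-Yor language models. It is an illustration only, not the project's implementation: the function name `simulate_pitman_yor_crp` and the parameter values are assumptions chosen for the example, and the sketch restricts the concentration parameter to positive values for simplicity.

```python
import random


def simulate_pitman_yor_crp(num_customers, discount=0.5, concentration=1.0, seed=0):
    """Seat customers one by one under a Pitman-Yor Chinese restaurant process.

    Returns (table_sizes, history), where history[i] is the number of
    occupied tables after customer i + 1 has been seated.
    All names and defaults are illustrative assumptions.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= 0.0:
        raise ValueError("this sketch assumes a positive concentration")

    rng = random.Random(seed)
    table_sizes = []  # table_sizes[k] = customers at table k
    history = []

    for _ in range(num_customers):
        num_tables = len(table_sizes)
        # An existing table k is chosen in proportion to (size_k - discount),
        # a new table in proportion to (concentration + discount * num_tables).
        weights = [size - discount for size in table_sizes]
        weights.append(concentration + discount * num_tables)
        choice = rng.choices(range(num_tables + 1), weights=weights)[0]
        if choice == num_tables:
            table_sizes.append(1)  # open a new table
        else:
            table_sizes[choice] += 1
        history.append(len(table_sizes))

    return table_sizes, history


table_sizes, history = simulate_pitman_yor_crp(500, seed=1)
print("tables after 50 customers: ", history[49])
print("tables after 500 customers:", history[-1])
```

Each table can be read as one discovered unit, for example a lexicon entry. The number of tables is never fixed beforehand and can keep growing as more data arrives.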
A word discovery system was developed that employs a nested hierarchical Pitman-Yor language model and is able to operate on the noisy phoneme sequence produced by an automatic speech recognition (ASR) phoneme recognizer. The system iterates between ASR phoneme recognition, given a language model, and word segmentation with language model estimation, given the best-scoring phoneme sequence. Word segmentation and language model estimation is itself an iterative process: it alternates between estimating the word and phoneme language models, given a segmentation of the input phoneme sequence, and segmenting the input sequence into words, given the language model probabilities. This system is suitable for an open vocabulary, since the lexicon size need not be known a priori.
The algorithm has been used successfully on the Wall Street Journal corpus. Notably, the phoneme error rate of the ASR decoder was reduced from an initial 33% to 25% after word discovery. This shows that coupling the two levels of the hierarchy does lead to improvements, as conjectured in the initial project proposal. The encouraging results obtained with nonparametric Bayesian methods led us to apply for a project continuation within the DFG priority programme, which has since been granted.
Work has also been conducted towards using the developed algorithms in specific applications. The first was a contribution to a speech interface for people with speech impairments, which is being developed at KU Leuven. Since standard speech recognition does not work for dysarthric speech, our unsupervised learning system served as an alternative. We developed a semantic inference algorithm, based on Markov Logic Networks, that maps speech input to actions, bypassing verbatim word-by-word recognition. The second application is the use of the proposed unsupervised word discovery system as a tool to help linguists document rare languages.
In an ongoing cooperation with the Institute of Linguistics at the University of Cologne, we are experimenting with word segmentation based on nonparametric Bayesian methods on two Austronesian languages, Wooi and Waima’a. Both are threatened by extinction, and our colleagues from Cologne have provided us with speech data for them.
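The segmentation half of the word discovery loop summarised above, splitting a phoneme sequence into words given language model probabilities, can be sketched with a small dynamic program. This is a simplified illustration under stated assumptions: it uses a fixed unigram word model instead of the nested hierarchical Pitman-Yor language model, returns the single most probable segmentation, and treats letters as stand-ins for phonemes. The names `segment_phonemes`, `word_log_probs`, `max_word_len` and `unknown_log_prob` are hypothetical.

```python
import math


def segment_phonemes(phonemes, word_log_probs, max_word_len=6, unknown_log_prob=-20.0):
    """Find the most probable split of `phonemes` into words (unigram model).

    word_log_probs maps a tuple of phonemes to the log probability of that word.
    Unknown words are penalised with unknown_log_prob per phoneme (an assumption
    of this sketch, standing in for a phoneme-level language model).
    """
    n = len(phonemes)
    best_score = [0.0] + [-math.inf] * n  # best log score of phonemes[:end]
    back_pointer = [0] * (n + 1)          # start index of the last word

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = tuple(phonemes[start:end])
            log_prob = word_log_probs.get(word, unknown_log_prob * len(word))
            score = best_score[start] + log_prob
            if score > best_score[end]:
                best_score[end] = score
                back_pointer[end] = start

    # Trace the back pointers from the end of the sequence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back_pointer[end]
        words.append(tuple(phonemes[start:end]))
        end = start
    words.reverse()
    return words, best_score[n]


# Toy lexicon with made-up probabilities; letters stand in for phonemes.
toy_lexicon = {
    tuple("the"): math.log(0.4),
    tuple("cat"): math.log(0.3),
    tuple("sat"): math.log(0.2),
    tuple("at"): math.log(0.1),
}
words, score = segment_phonemes(list("thecatsat"), toy_lexicon)
print(["".join(w) for w in words], score)
```

In the full system the language model would be re-estimated from the resulting segmentation and the two steps repeated; the sketch covers only a single segmentation pass with a fixed model.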

Project-Related Publications (Selection)

  • A Hierarchical System for Word Discovery Exploiting DTW-Based Initialization, in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Olomouc, Czech Republic, Dec. 2013 (Best Student Paper Award at ASRU 2013)
    O. Walter, T. Korthals, R. Haeb-Umbach and B. Raj
  • Unsupervised Word Discovery from Phonetic Input Using Nested Pitman-Yor Language Modeling, in Proc. IEEE International Conference on Robotics and Automation, Karlsruhe, May 2013
    O. Walter, R. Haeb-Umbach, S. Chaudhuri and B. Raj
  • Unsupervised Word Segmentation from Noisy Input, in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Olomouc, Czech Republic, Dec. 2013
    J. Heymann, O. Walter, R. Haeb-Umbach and B. Raj
  • An Evaluation of Unsupervised Acoustic Model Training for a Dysarthric Speech Interface, in Proc. Interspeech, Singapore, Sept. 2014
    O. Walter, V. Despotovic, R. Haeb-Umbach, J. Gemmeke, B. Ons and H. Van hamme
  • Iterative Bayesian Word Segmentation for Unsupervised Vocabulary Discovery from Phoneme Lattices, in Proc. 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), Florence, Italy, May 2014
    J. Heymann, O. Walter, R. Haeb-Umbach and B. Raj
    (See online at https://doi.org/10.1109/ICASSP.2014.6854364)
  • Semantic Analysis of Spoken Input Using Markov Logic Networks, in Proc. Interspeech, Dresden, Sept. 2015
    V. Despotovic, O. Walter and R. Haeb-Umbach
 
 
