
Breaking the Unwritten Language Barrier

Subject area: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Funding: Funded from 2014 to 2019
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 259117245

Year of creation: 2019

Summary of Project Results

Of the 4,000 to 7,000 languages in the world today, many are in danger of becoming extinct. Experts estimate that up to half of these languages might die out by the end of the century. With every language that becomes extinct, a whole culture vanishes, as culture and language are inseparably intertwined. In order to preserve the knowledge and wealth that would vanish with them, documentary linguists try to document these languages before they disappear. However, documenting a language is a laborious, time-consuming process that requires experts, of whom only few exist. The documentation process is complicated and slowed down even further by the fact that most endangered languages, like most of the world's languages, are unwritten. Within BULB we therefore researched and created methods that support and facilitate the work of documentary linguists by employing specialized speech and language processing technologies, either to automate certain processes or at least to provide preliminary results automatically that linguists can later refine. The technologies we developed were specifically targeted at unwritten languages, and thus focus on extracting results such as phone boundaries, phone sets or word-like units directly from speech recordings for which no textual representation is available because the languages lack a writing system. In these efforts we made use of the fact that in communities of endangered languages a second language is often spoken, normally a well-known language with a writing system. We conducted a special data collection in which we recorded speech in the language of interest and, at the same time, oral translations of those sentences into the second language. We then exploited the parallelism between the well-known language, for which we can produce automatic transcriptions, and the audio recordings of the language to be documented.
We tested our approaches on three mostly unwritten languages from the Bantu family, spoken in sub-Saharan Africa: Basaa, Embosi and Myene. While the French partners of the bi-national project focused on developing a data collection tool for our use case and on exploiting the parallel data to find word-like units, the German partners focused on collecting data for Basaa and on automatically detecting a phone set from the audio recordings of the target languages without any supervision. Within the project, a corpus of approximately 55 hours of Basaa data with French translations was collected. For detecting the phone set we used a three-step approach: a) detecting phone boundaries, b) classifying articulatory features for each detected segment, and c) clustering the detected segments into a phone set based on the assigned articulatory features. Unlike approaches that use multilingual phone models for this task, our approach can detect arbitrary phones and not only those present in the multilingual model. To perform steps a) and b) we trained dedicated models based on deep bidirectional long short-term memory (DBLSTM) neural networks. So that the models generalize better across languages and also work on previously unseen languages, we employed unsupervised language adaptation techniques such as language feature vectors.
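
The three-step approach can be illustrated with a minimal sketch. It is not the project's implementation: the neural models for steps a) and b) are replaced by precomputed toy outputs (per-frame boundary probabilities and per-frame articulatory feature labels), and step c) simply groups segments with identical feature bundles, which is an assumed simplification because the summary does not specify the clustering method. All function names, the threshold and the feature labels are hypothetical.

```python
from collections import Counter, defaultdict


def pick_boundaries(probs, threshold=0.5):
    """Step a) stand-in: frames that are local peaks of boundary probability."""
    peaks = []
    for i in range(1, len(probs) - 1):
        is_peak = probs[i] > probs[i - 1] and probs[i] >= probs[i + 1]
        if probs[i] >= threshold and is_peak:
            peaks.append(i)
    return peaks


def segment_features(frame_labels, boundaries):
    """Step b) stand-in: majority vote of frame labels inside each segment."""
    edges = [0] + list(boundaries) + [len(frame_labels)]
    segments = []
    for start, end in zip(edges, edges[1:]):
        if end > start:
            bundle = Counter(frame_labels[start:end]).most_common(1)[0][0]
            segments.append((start, end, bundle))
    return segments


def cluster_phones(segments):
    """Step c) simplification: identical feature bundles form one phone candidate."""
    clusters = defaultdict(list)
    for start, end, bundle in segments:
        clusters[bundle].append((start, end))
    return dict(clusters)


# Toy data (assumed, not from the project): 12 frames, two alternating sounds.
P = ("stop", "bilabial", "voiceless")
A = ("vowel", "open", "voiced")
NOISE = ("vowel", "close", "voiced")  # one misclassified frame

probs = [0.0, 0.1, 0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.0, 0.7, 0.1, 0.0]
frame_labels = [P, P, P, A, NOISE, A, P, P, P, A, A, A]

boundaries = pick_boundaries(probs)
segments = segment_features(frame_labels, boundaries)
phone_set = cluster_phones(segments)
print(len(phone_set), "phone candidates")
```

Because the clusters are keyed by feature bundles and not by entries of a fixed phone inventory, any combination of articulatory features that occurs in the data can surface as a phone candidate, which mirrors the advantage over multilingual phone models described above.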

Project-Related Publications (Selection)

  • "Phoneme boundary detection using deep bidirectional LSTMs." In Speech Communication; 12. ITG Symposium, pp. 1-5. VDE, 2016
    Franke, Jörg, Markus Müller, Fatima Hamlaoui, Sebastian Stüker, and Alex Waibel
  • "Unsupervised Phoneme Segmentation of Previously Unseen Languages." In INTERSPEECH, pp. 3544-3548. 2016
    Vetter, Marco, Markus Müller, Fatima Hamlaoui, Graham Neubig, Satoshi Nakamura, Sebastian Stüker, and Alex Waibel
    (See online at https://doi.org/10.21437/Interspeech.2016-1440)
  • “Innovative Technologies for Under-Resourced Language Documentation: The BULB Project.” In Proceedings of the 2nd Workshop on Collaboration and Computing for Under-Resourced Languages ‘Towards an Alliance for Digital Language Diversity’, Portoroz, Slovenia, May 23, 2016
    Sebastian Stüker, Gilles Adda, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Hélène Maynard, Elodie Gauthier, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov, Guy-Noel Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Markus Müller, Annie Rialland, Mark Van de Velde, François Yvon, Sabine Zerbian
  • “Intonation in Sotho-Tswana”. In: L.J. Downing & A. Rialland (eds.). Intonation in African Tone Languages. Berlin: Mouton de Gruyter. Pp. 393-433. 2016
    S. Zerbian
    (See online at https://doi.org/10.1515/9783110503524-012)
  • “The prosody of focus and emphasis in Sepedi”. Proceedings of PRASA (Pattern Recognition Association of South Africa). Stellenbosch. Pp. 15-18. 2016
    M. Raborife, G. Turco & S. Zerbian
    (See online at https://doi.org/10.1109/RoboMech.2016.7813146)
  • "Towards phoneme inventory discovery for documentation of unwritten languages." In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5200-5204. IEEE, 2017
    Müller, Markus, Jörg Franke, Alex Waibel, and Sebastian Stüker
    (See online at https://doi.org/10.1109/ICASSP.2017.7953148)
  • “DBLSTM Based Multilingual Articulatory Feature Extraction for Language Documentation”. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, Okinawa, Japan, December 16-20, 2017
    Müller, Markus, Sebastian Stüker, and Alex Waibel
    (See online at https://doi.org/10.1109/ASRU.2017.8268966)
  • “BULBasaa: A Bilingual Bàsàá-French Speech Corpus for the Evaluation of Language Documentation Tools”. LREC 2018
    Fatima Hamlaoui, Emmanuel-Moselly Makasso, Markus Müller, Jonas Engelmann, Gilles Adda, Alex Waibel, Sebastian Stüker
  • “Multilingual Modulation by Neural Language Codes,” Dissertation, Karlsruhe, 2018
    Markus Müller
  • “Neural Language Codes for Multilingual Acoustic Models”. In Proceedings of Interspeech 2018, Hyderabad, India, pp. 2419-2423
    Markus Müller, Sebastian Stüker, Alex Waibel
    (See online at https://doi.org/10.21437/Interspeech.2018-1241)
 
 
