Breaking the Unwritten Language Barrier
Final Report Abstract
Of the 4,000 to 7,000 languages spoken in the world today, many are in danger of becoming extinct. Experts estimate that up to half of these languages might die by the end of the century. With every language that becomes extinct, a whole culture vanishes, as culture and language are inseparably intertwined. To preserve the knowledge that would otherwise be lost, documentary linguists try to document these languages before they disappear. However, documenting a language is time-consuming and labor-intensive, and it requires experts, of whom only few exist. The process is complicated and slowed down even more by the fact that most endangered languages, like most of the world's languages, are unwritten.
Within BULB we therefore researched and created methods that support the work of documentary linguists by employing specialized speech and language processing technologies, either to automate certain processes or at least to provide preliminary results automatically that linguists can later refine. The technologies we developed specifically target unwritten languages: they extract results such as phone boundaries, phone sets, or word-like units directly from speech recordings for which no textual representation is available because the language lacks a writing system. In these efforts we made use of the fact that the communities of endangered languages often also speak a second language, normally a well-known language with a writing system. We conducted a special data collection in which we recorded speech in the language of interest together with oral translations of the recorded sentences into the well-known language. We then exploited the parallelism between the well-known language, for which we can produce automatic transcriptions, and the audio recordings of the language to be documented.
We tested our approaches on three mostly unwritten languages from the Bantu family, spoken in sub-Saharan Africa: Basaa, Embosi and Myene. While the French partners of the bi-national project focused on developing a data collection tool for our use case and on exploiting the parallel data to find word-like units, the German partners focused on collecting data for Basaa and on automatically detecting a phone set from the audio recordings of the target languages without any supervision. Within the project, a corpus of approximately 55 hours of Basaa data with French translations was collected. For detecting the phone set we used a three-step approach: a) detecting phone boundaries, b) classifying articulatory features for each detected segment, and c) clustering the detected segments into a phone set based on the assigned articulatory features. Unlike approaches that use multilingual phone models for this task, ours can detect arbitrary phones and not only those present in the multilingual model. To perform steps a) and b) we trained dedicated models based on deep bidirectional long short-term memory (DBLSTM) neural networks. So that the models generalize better across languages and also work on previously unseen languages, we employed unsupervised language adaptation techniques such as language feature vectors.
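The data flow of the three-step approach can be illustrated with a small sketch. This is not the project's implementation: the feature inventory `FEATURES`, the function names, the 0.5 threshold and the toy numbers are all assumptions made for illustration. The boundaries and articulatory feature posteriors that the DBLSTM models would produce in steps a) and b) are hard-coded here, and step c) is reduced to grouping segments whose binarized feature vectors are identical, which is a much simpler stand-in for the clustering used in the project.

```python
from collections import defaultdict

# Hypothetical articulatory feature inventory (an assumption for illustration).
FEATURES = ("voiced", "nasal", "plosive", "fricative", "vowel")


def segments_from_boundaries(boundaries):
    """Turn sorted boundary times (output of step a) into (start, end) segments."""
    return list(zip(boundaries[:-1], boundaries[1:]))


def binarize(posteriors, threshold=0.5):
    """Turn per-feature scores (output of step b) into a binary feature tuple."""
    return tuple(int(p >= threshold) for p in posteriors)


def cluster_segments(segments, feature_posteriors, threshold=0.5):
    """Step c, simplified: group segments with identical binarized features."""
    clusters = defaultdict(list)
    for segment, posteriors in zip(segments, feature_posteriors):
        clusters[binarize(posteriors, threshold)].append(segment)
    return dict(clusters)


def describe(feature_vector):
    """Readable label for a cluster, built from its active features."""
    active = [name for name, on in zip(FEATURES, feature_vector) if on]
    return "+".join(active) if active else "none"


# Toy stand-ins for model outputs: boundary times in seconds and one
# score per articulatory feature for each of the four resulting segments.
boundaries = [0.00, 0.08, 0.21, 0.30, 0.42]
posteriors = [
    [0.90, 0.80, 0.10, 0.00, 0.10],
    [0.90, 0.10, 0.00, 0.10, 0.95],
    [0.80, 0.90, 0.20, 0.10, 0.00],
    [0.95, 0.00, 0.10, 0.00, 0.90],
]

segments = segments_from_boundaries(boundaries)
phone_set = cluster_segments(segments, posteriors)

for feature_vector, members in phone_set.items():
    print(describe(feature_vector), members)
```

Each key of `phone_set` is one candidate phone, described only by its articulatory features. Because the inventory is derived from feature combinations rather than from a fixed list of phone labels, a combination that no training language contains can still appear as its own cluster, which mirrors the advantage over multilingual phone models described above.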
Publications
-
"Phoneme boundary detection using deep bidirectional LSTMs." In Speech Communication; 12. ITG Symposium, pp. 1-5. VDE, 2016
Franke, Jörg, Markus Müller, Fatima Hamlaoui, Sebastian Stüker, and Alex Waibel
-
"Unsupervised Phoneme Segmentation of Previously Unseen Languages." In INTERSPEECH, pp. 3544-3548. 2016
Vetter, Marco, Markus Müller, Fatima Hamlaoui, Graham Neubig, Satoshi Nakamura, Sebastian Stüker, and Alex Waibel
-
“Innovative Technologies for Under-Resourced Language Documentation: The BULB Project.” In Proceedings of the 2nd Workshop on Collaboration and Computing for Under-Resourced Languages ‘Towards an Alliance for Digital Language Diversity’, Portoroz, Slovenia, May 23, 2016
Sebastian Stüker, Gilles Adda, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Hélène Maynard, Elodie Gauthier, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov, Guy-Noel Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Markus Müller, Annie Rialland, Mark Van de Velde, François Yvon, Sabine Zerbian
-
“Intonation in Sotho-Tswana.” In: L.J. Downing & A. Rialland (eds.), Intonation in African Tone Languages. Berlin: Mouton de Gruyter, pp. 393-433. 2016
S. Zerbian
-
“The prosody of focus and emphasis in Sepedi.” In Proceedings of PRASA (Pattern Recognition Association of South Africa), Stellenbosch, pp. 15-18. 2016
M. Raborife, G. Turco & S. Zerbian
-
"Towards phoneme inventory discovery for documentation of unwritten languages." In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5200-5204. IEEE, 2017
Müller, Markus, Jörg Franke, Alex Waibel, and Sebastian Stüker
-
“DBLSTM Based Multilingual Articulatory Feature Extraction for Language Documentation”. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, Okinawa, Japan, December 16-20, 2017
Müller, Markus, Sebastian Stüker, and Alex Waibel
-
“BULBasaa: A Bilingual Bàsàá-French Speech Corpus for the Evaluation of Language Documentation Tools.” In LREC 2018
Fatima Hamlaoui, Emmanuel-Moselly Makasso, Markus Müller, Jonas Engelmann, Gilles Adda, Alex Waibel, Sebastian Stüker
-
“Multilingual Modulation by Neural Language Codes,” Dissertation, Karlsruhe, 2018
Markus Müller
-
“Neural Language Codes for Multilingual Acoustic Models”. In Proceedings of Interspeech 2018, Hyderabad, India, pp. 2419-2423
Markus Müller, Sebastian Stüker, Alex Waibel