
Breaking the Unwritten Language Barrier (BULB)

Subject area: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Funding: Funded from 2014 to 2019
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 259117245
Year of creation: 2019

Summary of Project Results

Of the 4,000 to 7,000 languages in the world today, many are in danger of becoming extinct. Experts estimate that up to half of these languages might die by the end of the century. With every language that becomes extinct, a whole culture vanishes, as culture and language are inseparably intertwined. To preserve the knowledge and cultural wealth that would vanish with the dying languages, documentary linguists try to document them before they disappear. However, documenting a language is a laborious, time-consuming process that requires experts, of whom only a few exist. The process is complicated and slowed down even further by the fact that most endangered languages, like most languages in the world, are unwritten. Within BULB we therefore researched and created methods that support the work of documentary linguists by employing specialized speech and language processing technologies, either to automate certain processes or at least to provide preliminary results automatically that linguists then refine. The technologies we developed specifically target unwritten languages: they extract results such as phone boundaries, phone sets, or word-like units directly from speech recordings for which no textual representation is available because the language lacks a writing system. In these efforts we made use of the fact that communities speaking endangered languages often also speak a second language, normally a well-known language with a writing system. We conducted a special data collection in which we recorded speech in the language of interest and, at the same time, oral translations of the recorded sentences into the second language. We then exploited the parallelism between the well-known language, for which we can produce automatic transcriptions, and the audio recordings of the language to be documented.
We tested our approaches on three mostly unwritten languages from the Bantu family, spoken in sub-Saharan Africa: Basaa, Embosi and Myene. The French partners of the bi-national project focused on developing a data collection tool for our use case and on exploiting the parallel data to find word-like units. The German partners focused on collecting data for Basaa and on automatically detecting a phone set from the audio recordings of the target languages without any supervision. Within the project, a corpus of approximately 55 hours of Basaa data with French translations was collected. To detect the phone set we used a three-step approach: a) detecting phone boundaries, b) classifying articulatory features for each detected segment, and c) clustering the detected segments into a phone set based on the assigned articulatory features. Unlike approaches that use multilingual phone models for this task, our approach can detect arbitrary phones and not only those present in the multilingual model. To perform steps a) and b) we trained dedicated models based on deep bidirectional long short-term memory neural networks. So that the models generalize better across languages and also work on previously unseen languages, we employed unsupervised language adaptation techniques such as language feature vectors.
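
Step c) of the three-step approach can be illustrated with a small Python sketch. This is not the project's implementation: the feature inventory `FEATURES`, the function `cluster_segments`, and the toy segments are all hypothetical. Grouping segments with identical feature signatures stands in here for whatever clustering method was actually used. The sketch assumes that step a) has already produced segment boundaries and step b) a set of articulatory features per segment.

```python
from collections import defaultdict

# Hypothetical articulatory feature inventory (an assumption, not from the project).
FEATURES = ("voiced", "nasal", "plosive", "fricative", "vowel")


def cluster_segments(segments):
    """Group segments whose assigned articulatory features coincide.

    `segments` is a list of (start_sec, end_sec, feature_set) tuples, where
    feature_set holds the features a classifier assigned to that segment.
    Returns a dict mapping a cluster label to its list of (start, end) spans.
    """
    by_signature = defaultdict(list)
    for start, end, feats in segments:
        # Binary signature: one flag per feature in the inventory.
        signature = tuple(f in feats for f in FEATURES)
        by_signature[signature].append((start, end))

    # Name each distinct signature, e.g. "+voiced,+nasal".
    phone_set = {}
    for signature, spans in by_signature.items():
        active = [f for f, on in zip(FEATURES, signature) if on]
        label = ",".join("+" + f for f in active) or "none"
        phone_set[label] = spans
    return phone_set


# Toy input: boundaries as from step a), features as from step b) (invented values).
segments = [
    (0.00, 0.08, {"voiced", "nasal"}),
    (0.08, 0.21, {"voiced", "vowel"}),
    (0.21, 0.27, {"plosive"}),
    (0.27, 0.40, {"voiced", "vowel"}),
]
phone_set = cluster_segments(segments)
for label, spans in phone_set.items():
    print(label, spans)
```

Because each cluster is defined by a combination of features rather than by an entry in a fixed phone inventory, any feature combination the classifier produces can form its own unit, which mirrors the point above about detecting arbitrary phones.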
