
Computational Structural Analysis of German-Turkish Code-Switching

Subject Area: General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Funding: 2017 to 2022
Project Identifier: Deutsche Forschungsgemeinschaft (DFG) - Project Number 392877234

Final Report Year: 2022

Summary of Project Results

The main goal of the SAGT project was the systematic and comprehensive structural analysis of code-switching (CS), with a focus on Turkish-German. The main challenges in CS research include complex structures arising from the mixing of languages, noisy data owing to the nature of the sources (such as social media and speech), and a lack of labelled and unlabelled resources. In our project we addressed these issues as follows:

• Understanding the data: We collected speech data from bilingual speakers and observed the distribution of the code-switched data in terms of languages, POS tags, and syntactic structures. We also inventoried new structures we came across and developed representations for them.

• Creating resources: We created a Turkish-German code-switching treebank in which each sentence has at least one CS point, manually annotated with language ID, lemma, POS, morphology, and dependency layers (an illustrative annotation sketch follows this summary). The treebank sentences were drawn from interview transcriptions; the audio files of these interviews were also prepared as a speech resource with CS points, language IDs, and normalisation layers. We also contributed to a Turkish-German corpus of bilingual conversations and to an Egyptian Arabic-English corpus.

• Structured analysis: We employed several state-of-the-art methodologies and created our own tools to analyse code-switched text. We built models for normalisation, language identification, POS tagging, morphological analysis, and parsing, and included qualitative analysis in our studies for each task (a toy sketch of the language identification task appears at the end of this summary).

• Data-driven modelling: Our architectures were based on language-independent machine learning approaches, so whenever datasets were available for a task we included them in our experiments. The language pairs we examined (Arabic-English, Indonesian-English, Hindi-English, Wixarika-Spanish, Frisian-Dutch, and Komi Zyrian-Russian) come from a wide variety of language families and helped us better understand the behaviour of our models.

From the experiments, two generalisations could be drawn for each task: i) our proposed models improved performance across all datasets and achieved the best results; ii) the size of the improvement varied from language pair to language pair. Three major factors played a role in this variation: language typology; the size of the tested CS data, and hence possible CS data sparsity; and the size (or absence) of the CS and monolingual resources that could be used in training, from labelled CS data to monolingual corpora to embeddings.

During the course of the project, deep learning methods, particularly transformer-based language models, came to dominate the field. We used them successfully to solve lower-level tasks and to improve base models. Nevertheless, we observed that state-of-the-art models did not always outperform other machine learning approaches on CS data, mainly because such models contain little code-switching in their training data and require large amounts of CS data for fine-tuning.

All the insights derived from our project experience point to the importance of the size and variety of CS resources, both for in-depth analysis and for computational implementations. As our efforts showed, resource creation is highly costly; artificial data generation is therefore a future direction worth pursuing in CS research.
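As an illustration of the treebank's annotation layers, the fragment below shows how a code-switched sentence can be represented in the CoNLL-U format of Universal Dependencies, with lemma, POS, dependency, and per-token language ID layers. The sentence ("Sonra der Lehrer kam." - Turkish sonra 'then' followed by a German clause, 'Then the teacher came.') and the Lang= attribute in the MISC column are invented for this sketch; they are not excerpts from the SAGT treebank:

    # text = Sonra der Lehrer kam .
    1   Sonra    sonra    ADV    _  _  4  advmod  _  Lang=TR
    2   der      der      DET    _  _  3  det     _  Lang=DE
    3   Lehrer   Lehrer   NOUN   _  _  4  nsubj   _  Lang=DE
    4   kam      kommen   VERB   _  _  0  root    _  Lang=DE
    5   .        .        PUNCT  _  _  4  punct   _  Lang=DE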
While research on code-switching has increased substantially, in most tasks NLP tools have not yet caught up with the performance they achieve on data that has been a long-term focus of the field (e.g., edited English). The language resources and computational models we developed during the project are valuable assets for future work on this performance gap, and they are publicly available for research purposes.
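To make the language identification task referenced above concrete, here is a minimal, self-contained sketch of a token-level language identifier using character n-grams and logistic regression. This illustrates the shape of the task only; it is not the project's actual architecture, and the tiny training word lists are invented for demonstration:

    # Toy token-level language identifier for code-switched text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented toy word lists; a real system would train on an
    # annotated code-switching corpus.
    train_tokens = ["sonra", "geldi", "okula", "ben", "evde",
                    "der", "Lehrer", "kam", "und", "Schule"]
    train_langs = ["TR"] * 5 + ["DE"] * 5

    # Character n-grams within word boundaries capture sub-word cues,
    # such as typical suffixes and letter combinations, that differ
    # between Turkish and German tokens.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_tokens, train_langs)

    # Tag each token of a code-switched utterance independently;
    # stronger models would also use sentence context.
    sentence = "Sonra der Lehrer kam".split()
    for token, lang in zip(sentence, model.predict(sentence)):
        print(token, lang)

A classifier this small is of course unreliable; the point is only to show the per-token prediction setup that richer, context-aware models refine.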

