Computational Structural Analysis of German-Turkish Code-Switching
Summary of Project Results
Our main goal when starting the SAGT project was the systematic and comprehensive structural analysis of code-switching, with a focus on Turkish-German. The main challenges in code-switching research include complex structures that arise from mixing languages, noisy data due to the nature of the sources (such as social media and speech), and a lack of labelled and unlabelled resources. In our project we addressed these issues by:

• Understanding the data: We collected speech data from bilingual speakers and observed the distribution of the code-switched data in terms of languages, POS tags, and syntactic structures. We also inventoried new structures we came across and developed representations for them.

• Creating resources: We created a Turkish-German code-switching treebank in which each sentence has at least one CS point. We manually annotated this treebank with language ID, lemma, POS, morphology, and dependency layers. The source of the treebank sentences was interview transcriptions. The audio files of these interviews were also prepared as a speech resource with CS points, language IDs, and normalisation layers. We also contributed to a Turkish-German corpus of bilingual conversations and to an Egyptian Arabic-English corpus.

• Structured analysis: We employed several state-of-the-art methodologies and created our own tools to analyse code-switched text. We built models for normalisation, language identification, POS tagging, morphological analysis, and parsing. For each task we included qualitative analysis in our studies.

• Data-driven modelling: Our architectures were based on language-independent machine learning approaches. Therefore, as long as datasets were available for a task, we included them in our experiments.
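To make the multi-layer annotation concrete, the following is a minimal illustrative sketch (not the project's actual data format or release code): an invented Turkish-German code-switched utterance represented with per-token lemma, POS, and language-ID layers, rendered in a CoNLL-U-like tab-separated layout. The field names and language codes here are assumptions for illustration only.

```python
# Hypothetical sketch of a multi-layer annotation for a code-switched
# sentence; the column layout and language codes (TR/DE) are invented
# for illustration, not taken from the project's released treebank.
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface form
    lemma: str  # lemma layer
    upos: str   # POS layer (Universal POS tag)
    lang: str   # language-ID layer: TR or DE in this toy example


# Invented mixed utterance: "Sonra haben wir das gemacht"
# (Turkish "sonra" = "then", followed by a German clause).
sentence = [
    Token("Sonra",   "sonra",  "ADV",  "TR"),
    Token("haben",   "haben",  "AUX",  "DE"),
    Token("wir",     "wir",    "PRON", "DE"),
    Token("das",     "das",    "DET",  "DE"),
    Token("gemacht", "machen", "VERB", "DE"),
]


def to_conllu_like(tokens):
    """Render tokens as tab-separated lines: ID, FORM, LEMMA, UPOS, LANG."""
    return "\n".join(
        f"{i}\t{t.form}\t{t.lemma}\t{t.upos}\t{t.lang}"
        for i, t in enumerate(tokens, start=1)
    )


print(to_conllu_like(sentence))
```

In this toy layout the CS point is recoverable wherever the language-ID column changes between adjacent tokens (here between tokens 1 and 2).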
The language pairs we examined, namely Arabic-English, Indonesian-English, Hindi-English, Wixarika-Spanish, Frisian-Dutch, and Komi Zyrian-Russian, come from a wide variety of language families and helped us better understand the behaviour of our models. From the experiments we could derive two generalisations for each task: i) our proposed models improved performance over all datasets and achieved the best results; ii) the size of the improvement varied from language pair to language pair. Three major factors played a role in this variation: language typology; the size of the tested CS data, and hence possible CS data sparsity; and the size of (or lack of) the CS and monolingual resources available for training (from labelled CS data to monolingual corpora to embeddings).

During the course of our project, deep learning methods, particularly transformer-based language models, came to dominate the field. We too used them successfully, both in solving lower-level tasks and in improving base models. Nevertheless, we observed that such state-of-the-art models did not always outperform other machine learning approaches on CS data. The main reason is that their training data contains little code-switching, and they require large amounts of CS data for fine-tuning. All the insights derived from our project experience point to the importance of the size and variety of CS resources, both for in-depth analysis and for computational implementations. As our efforts showed, resource creation is highly costly; artificial data generation is therefore a future direction worth pursuing in CS research. While research on code-switching has increased substantially, in most tasks NLP tools have not yet caught up with the performance they achieve on data that has been a long-term focus of the field (e.g., edited English).
The language resources and computational models we developed during the project are valuable assets for future work targeting this performance gap, and they are publicly available for research purposes.
Project-Related Publications (Selection)
- Çetinoğlu, Özlem & Çöltekin, Çağrı: Challenges of Annotating a Code-Switching Treebank. Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 82-90. Association for Computational Linguistics.
- Mager, Manuel; Çetinoğlu, Özlem & Kann, Katharina: Subword-Level Language Identification for Intra-Word Code-Switching. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2005-2011. Association for Computational Linguistics.
- Balabel, M.; Hamed, I.; Abdennadher, S.; Vu, N. T. & Çetinoğlu, Ö.: Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus. Proceedings of the 12th Language Resources and Evaluation Conference, 3973-3977. European Language Resources Association.
- Özateş, Şaziye Betül & Çetinoğlu, Özlem: A Language-aware Approach to Code-switched Morphological Tagging. Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, 72-83. Association for Computational Linguistics.
- Sabty, Caroline; Mesabah, Islam; Çetinoğlu, Özlem & Abdennadher, Slim: Language Identification of Intra-Word Code-Switching for Arabic-English. Array, 12, 100104.
- van der Goot, Rob & Çetinoğlu, Özlem: Lexical Normalization for Code-switched Data and its Effect on POS Tagging. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2352-2365. Association for Computational Linguistics.
- Çetinoğlu, Ö. & Schweitzer, A.: Anonymising the SAGT Speech Corpus and Treebank. Proceedings of the Language Resources and Evaluation Conference, 5557-5564. European Language Resources Association.
- Özateş, Şaziye; Özgür, Arzucan; Güngör, Tunga & Çetinoğlu, Özlem: Improving Code-Switching Dependency Parsing with Semi-Supervised Auxiliary Tasks. Findings of the Association for Computational Linguistics: NAACL 2022, 1159-1171. Association for Computational Linguistics.
- Treffers-Daller, J. & Çetinoğlu, Ö.: TuGeBiC: A Turkish German Bilingual Code-Switching Corpus.
- Çetinoğlu, Özlem & Çöltekin, Çağrı: Two Languages, One Treebank: Building a Turkish-German Code-Switching Treebank and its Challenges. Language Resources and Evaluation, 57(2), 545-579.
