Project Details

The Unit of Representation in Multilingual Language Models

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term since 2025
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 550341764
 
Language technology is increasingly becoming part of our daily lives. Yet access to it remains vastly unequal across the world's languages. This is because the field of Natural Language Processing (NLP) has historically been strongly biased towards work on English. While increasing efforts are being made to make language technology more multilingual, English remains the language for which NLP technology is developed first. When approaches developed for English are applied directly to other languages, performance is often degraded compared to English. A fundamental modelling choice is the tokenizer, which determines the central units of representation for language processing. Although these units determine what a model can learn, alternative input representations remain highly under-researched, especially in a multilingual context. In this project, we plan to systematically compare different choices of representational unit, based on characters, bytes, pixels, and phonemes, for multilingual language models applied to typologically diverse languages. We will first examine the interaction between unit choices, typological characteristics of languages, and model performance. Based on these findings, we will develop new approaches to typologically informed multilingual modelling that are more adaptable to new languages, with the aim of increasing cross-lingual fairness.
DFG Programme Research Grants
International Connection Belgium
Cooperation Partner Dr. Miryam de Lhoneux
 
 
