Project Details
The Unit of Representation in Multilingual Language Models
Applicant
Professor Dr. Lisa Beinborn
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term
since 2025
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 550341764
Language technology is increasingly becoming part of our daily lives. Yet access to it remains vastly unequal across the world's languages. This is because the field of Natural Language Processing (NLP) has historically operated with a strong bias towards English. While increasing efforts are being made to make language technology more multilingual, English remains the language for which NLP technology is developed first. When approaches developed for English are applied directly to other languages, performance is often degraded compared to English. A fundamental modelling choice is the tokenizer, which determines the central units of representation for language processing. Although these units determine what a model can learn, alternative input representations remain highly under-researched, especially in a multilingual context. In this project, we plan to systematically compare different choices for the representational unit, based on characters, bytes, pixels, and phonemes, for multilingual language models applied to typologically diverse languages. We will first examine the interaction between unit choices, typological characteristics of languages, and model performance. Based on these findings, we will develop new approaches for typologically informed multilingual modelling that are more adaptable to new languages, with the aim of increasing cross-lingual fairness.
DFG Programme
Research Grants
International Connection
Belgium
Cooperation Partner
Dr. Miryam de Lhoneux
