Project Details

Computational Models of Semantic Variation in Multi-Word Expression Meanings across Speakers and Languages

Subject Area Applied Linguistics, Computational Linguistics
Term since 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 462212526
 
Multiword expressions (MWEs) are ubiquitous across languages, yet individual languages and language varieties differ in the manifestation and distribution of MWE types, and the mechanisms behind this variation remain unclear. Existing conclusions are weakened by their reliance on broad linguistic categories (e.g., lexical bundles), on surface patterns rather than core semantic properties (e.g., particle placement vs. compositionality), and on separate analyses of potentially interacting factors (e.g., geographic origin and age). Furthermore, it remains unclear which modelling solutions are optimal and, more generally, how well language models represent MWEs. The reported inconsistencies may arise from a host of unexplored factors, including variability in dataset creation, model bias towards specific sociodemographic categories, and variable robustness to unseen data. In our project SemVarMWE we will define a multi-faceted programme on computational models of semantic variation in multiword expressions across speakers and languages. Focusing on two rather different types of multiword expressions, namely noun-noun compounds and particle verbs, we provide three levels of analysis across a broad range of target languages, Germanic (English, German), Romance (French, Italian, Spanish) and Slavic (BCMS: Bosnian, Croatian, Montenegrin, Serbian): (i) cross-lingual variation, comparing general usage across all target languages; (ii) regional variation, comparing country-level varieties for one language per family (English, Spanish, BCMS); (iii) sociodemographic and register variation, focusing on fine-grained distinctions in US English. Our overall aim is to establish gold standards and machine-learning approaches that capture the same types of MWEs across broader as well as more specific language varieties, and to deploy them to quantify differences in cross-variety MWE prominence.
SemVarMWE will put considerable effort into elaborating common strategies for corpus collection, enrichment, and harmonisation as well as for gold standard creation across languages and varieties, including an exploration of the contribution of annotator agreement from a perspectivist standpoint. We will compare traditional and state-of-the-art measures for the interpretability of MWE meanings, specifically to identify salient features for compositionality and to examine alignment with specific groups of speakers. Given the productive nature of our selected types of MWEs, we will employ neologisms to address creativity and generalisation across varieties, which is especially relevant with regard to bias towards certain types of language use. For an in-depth understanding of MWEs, we plan to learn paraphrases as a window into polysemy, ranging from simple systematic pattern-based variations to sophisticated machine-learning and backtranslation variants, including generative models.
DFG Programme Research Grants
International Connection Italy
Cooperation Partner Professor Dr. Dirk Hovy
 
 
