Project Details

Northern Mansi Corpus (NOMAC): A century and more of Northern Mansi in a diachronic corpus

Subject Area: General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages; Applied Linguistics, Computational Linguistics
Term: since 2025
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 568769695
 
NOMAC (Northern Mansi Corpus) will create an openly accessible diachronic corpus of Northern Mansi, a critically endangered Uralic minority language of Siberia. It will make more than a century of language change and cultural history accessible to scholarly and speaker communities, and it will allow us to study how a language changes under immense pressure from a dominant language (in this case, Russian). Northern Mansi is a comparatively close linguistic relative of Hungarian that has long attracted scholarly attention. It is relatively well documented but poorly accessible: a wide range of written records exists, spanning the 19th century to the present day, but these records are highly heterogeneous and disparate. They use many different writing systems (various transcriptions and orthographies) and have largely not been digitized, or have been digitized in idiosyncratic ways that preclude comparison. By digitizing, homogenizing, and publishing the wealth of existing data, NOMAC will create an unprecedented resource for the study and description of language change over more than a century, a resource that is also relevant to the study of language change under contact pressure in general. The corpus will include all texts collected and transcribed by field researchers in the late Russian Imperial period and the early Soviet period, as well as the largest possible selection of late Soviet and contemporary texts, including spoken texts with audio recordings. In the academic work carried out in conjunction with corpus building, we will study diachronic change in, among other things, the formation of complex sentences, the usage of the passive voice, and verb argument structure over the time span covered by the corpus. Modern technologies and standards in digital humanities are what make NOMAC's ambitious goals feasible in the first place, and the project will employ them accordingly.
Digitization will be carried out using the AI-powered OCR software Transkribus; the resulting digital resources will employ consistent Unicode character encoding and adhere to TEI standards. The possibilities offered by modern technology will allow NOMAC to create a comprehensive diachronic corpus for a critically endangered minority language, offering a detailed view of the minutiae of a language's history that has not previously been available to linguists. This is in sharp contrast to the more modest goals that technological limitations imposed on previous corpus-building initiatives. The project will be carried out at the Institute of Finno-Ugric/Uralic Studies at LMU Munich, an institution with a long tradition of research on the Ob-Ugric languages. It will involve researchers with strong backgrounds in general linguistics and typology, Ob-Ugric studies, and digital humanities and computer science.
DFG Programme: Research Grants
 
 

