Project Details

Cross-linguistic phonetics and morphology using a time-aligned multilingual reference corpus built from documentations of 50 languages: Big data on small languages

Applicant Privatdozent Dr. Frank Seifart, since 11/2019
Subject Area General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term from 2019 to 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 411066783
 
Final Report Year 2022

Final Report Abstract

Many prominent theories of human language are based on evidence from only very few, mostly large, European languages, disregarding potential variation across the ca. 7,000 languages that are currently still spoken. Among these are models of speech rate and pauses, which provide us with a window into the cognitive-neural and physiological-articulatory bases of the human language production system. The DoReCo project addressed cross-linguistic variation in this domain by (i) building a database on spontaneously spoken language in a world-wide sample of languages and by (ii) carrying out comparative studies on these data. The project was carried out by an interdisciplinary team bringing together expertise on documentary linguistics, phonetics, typology, and quantitative linguistics at two leading research centers in Germany and France. The DoReCo multilingual reference corpus is primarily based on data that had been collected by linguists during field work aimed at documenting small and often endangered languages. Most of these are narratives that had been audio-recorded, transcribed, and translated into a major language. Within the DoReCo project, these data were time-aligned at the fine-grained level of phonemes required for investigating the temporal structure of speech, and the diverse annotation styles were converted to a common format to facilitate cross-linguistic research. We included such corpora on 51 languages, each consisting, on average, of around 10,000 words. A subset of 36 corpora are additionally annotated for morpheme breaks and morpheme glosses. In DoReCo, corpora on individual languages are treated as citable publications authored by the linguists who originally collected the data; they are provided with permanent identifiers and associated with a CC BY 4.0 license.
Via the website, the DoReCo database offers easy access to a total of almost half a million words of annotated corpus data from languages from six continents and 32 language families for cross-linguistic research on spoken language. This represents an unprecedented contribution to open, reproducible science regarding global linguistic diversity and cultural heritage, also raising the visibility of fieldwork-based documentation efforts and of marginalized speaker populations (see, for example, the interview given by a DoReCo PI on German National Radio on the occasion of the launch of DoReCo 1.0). Research carried out by the DoReCo project assessed the universality of constraints on human language arising from species-wide articulatory and cognitive properties. First, we investigated whether articulation universally slows down before pauses across a representative sample of the world's languages. We found that such decelerations depend more strongly than previously assumed on the grammatical rules of individual languages, rather than on the inertia of articulatory organs. Second, we investigated how information is temporally distributed in speech across different languages. Here, we found that despite drastic cross-linguistic differences in how long and complex words are, languages tend to temporally distribute information in very similar ways, speeding up and slowing down speech depending on informativity, which highlights apparently universal aspects of human cognition.

Publications

  • Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo). Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2657–2666.
    Paschen, Ludger, François Delafontaine, Christoph Draxler, Susanne Fuchs, Matthew Stave & Frank Seifart
  • Combining documentary linguistics and corpus phonetics to advance corpus-based typology. In Geoffrey Haig, Stefan Schnell & Frank Seifart (eds.), Doing corpus-based typology with spoken language corpora. State of the art (Language Documentation & Conservation Special Publication 25), 115–139. Honolulu: University of Hawai’i Press.
    Seifart, Frank
  • Doing corpus-based typology with spoken language corpora. State of the art (Language Documentation & Conservation Special Publication 25). Honolulu: University of Hawai’i Press.
    Haig, Geoffrey, Stefan Schnell & Frank Seifart
  • Optimization of morpheme length: a cross-linguistic assessment of Zipf’s and Menzerath’s laws. Linguistics Vanguard, 7(s3).
    Stave, Matthew; Paschen, Ludger; Pellegrino, François & Seifart, Frank
  • Syllable Complexity and Morphological Synthesis: A Well-Motivated Positive Complexity Correlation Across Subdomains. Frontiers in Psychology, 12.
    Easterday, Shelece; Stave, Matthew; Allassonnière-Tang, Marc & Seifart, Frank
  • Bora DoReCo dataset. In Seifart, Frank, Ludger Paschen and Matthew Stave (eds.). Language Documentation Reference Corpus (DoReCo) 1.1. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2).
    Seifart, Frank
  • Daakie DoReCo dataset. In Seifart, Frank, Ludger Paschen and Matthew Stave (eds.). Language Documentation Reference Corpus (DoReCo) 1.1. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2).
    Krifka, Manfred
  • Final Lengthening and vowel length in 25 languages. Journal of Phonetics, 94, 101179.
    Paschen, Ludger; Fuchs, Susanne & Seifart, Frank
  • Resígaro DoReCo dataset. In Seifart, Frank, Ludger Paschen and Matthew Stave (eds.). Language Documentation Reference Corpus (DoReCo) 1.1. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2).
    Seifart, Frank
  • The role of language documentation in corpus-based typology. Universitätsbibliothek Bamberg.
    Schnell, Stefan; Haig, Geoffrey & Seifart, Frank
 
 
