Project Details

Corpus linguistic methods

Subject Area General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term since 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 313607803
 
Project Pc is both an infrastructure and a research project within RUEG2. It is the successor to project Pd in RUEG1. On the infrastructure and support side, it will continuously provide integration of new and/or corrected annotations, data curation and sustainability, as well as technical support and research engineering, i.e. the improvement of automatic and semi-automatic annotation of non-standard data across two modalities and, more generally, the development of tools and pipelines for information retrieval/text mining and quantitative analysis. It will also provide support and consultation in the choice and application of quantitative research methods for projects P8-P11 in RUEG2.
On the research side, it aims to advance the field of corpus linguistics in two ways: (1) through an evaluation of advanced machine learning techniques and of the feasibility and usefulness of applying them to automatic and semi-automatic annotation and information retrieval in non-standard corpora of limited size; and (2) through a focus on the development, validation, evaluation, and epistemological embedding of methods for the RUEG corpus specifically, as well as for small and mid-sized corpora in general.
The RUEG corpus is mid-sized, very well controlled in terms of topic, structure, setting, and participants' backgrounds, and enriched with ample metadata. It therefore offers the chance to deeply understand, annotate, and analyze the full data set in a collaborative effort of the whole research group. It is in fact one of the few corpora that allow for variationist analyses across samples from different production situations and modes, speaker groups, age groups, and two languages recorded for each speaker. However, the trade-off for capturing this complexity is a smaller sample size for each group, which typically does not reach the representativeness that frequentist statistics would require.
Since there is no existing set of quantitative techniques that yields reliable results for smaller corpora beyond reasonable doubt, methodological development is crucial to the quantitative study of the RUEG data. At the same time, RUEG is unusually well suited as a testing ground for the evaluation of methods. It thus offers exceptional synergetic potential for the development of corpus-linguistic methods overall. Pc will investigate and evaluate several promising techniques: (a) the applicability (including the validity, reliability, and explanatory power) of mixed-effect models (MEMs); (b) two frameworks that are currently almost unused in core linguistics, graph theory or network analysis and Bayesian statistics, but that show promising results in other quantitative fields; and (c) the application of machine learning techniques for knowledge gain (rather than for text mining objectives, for which they are currently mainly used in computational linguistics).
DFG Programme Research Units
 
 

