Linguistische Form und Bedeutung in einer computerlinguistischen Analyse von Lernersprache -- Zur Integration von morphosyntaktischer und semantischer Analyse
Summary of Project Results
Investigating the challenge of integrating bottom-up analysis of learner language with top-down task information, the project made substantial conceptual, architectural, and empirical contributions.

We spelled out the conceptual basis for NLP analyses of learner language and the need to integrate top-down and bottom-up information to obtain valid analyses of learner language in context. While such a combination of bottom-up form-based and top-down meaning-based analysis is arguably relevant for language interpretation in general, the particularly variable nature of the interlanguage forms produced by language learners and the language-based task context of reading comprehension activities made the data collected in the German learner corpus of reading comprehension questions (CREG) a good empirical test case, which we built on and extended in this project.

To explore an architecture capable of integrating bottom-up processing of the learner answers with top-down guidance from the task context, we conceptualized and implemented a new NLP architecture using Probabilistic Soft Logic (PSL). To our knowledge, it is the first general-purpose NLP system to use statistical relational learning as its inference backbone, which opens a new avenue for integrating state-of-the-art statistical and neural NLP tools with linguistically informed, hand-crafted rules. This combination becomes particularly relevant in light of the need for Explainable AI, where the internals of a trainable system become relevant and need to support the generation of meaningful feedback on the components of an analysis. The approach can accommodate standard NLP components that build on one layer of analysis to derive one or several possible representations at another level. For instance, a part-of-speech tagger can output several likely alternatives for adding parts of speech to a given sequence of words. The prototype resulting from substantial engineering efforts integrates eight levels of linguistic modeling and is capable of performing the required reasoning for German learner language examples. To characterize the empirical hypothesis space, the architecture includes a very general interface for plugging in arbitrary surface-level variant generation modules. In the project, we extended tools for spelling and grammar correction for German, integrating corrections extracted from Wikipedia edits (Boyd, 2018b).

As gold-standard normalization, we extended the CREG corpus with form-meaning target hypothesis (FMTH) annotations corresponding to the minimal number of form corrections needed to obtain well-formed utterances. The detailed FMTH annotation scheme supports the identification of normal forms taking the task context into account. The new CREG-MeanT corpus resource with 2574 annotated student answers is freely available to researchers as indicated on the project home page (http://purl.org/coalla). Given the relevance of the task contexts, we also produced a version of CREG-5K including the condensed context information: the questions, the target answers, the correct student answers, as well as short and long reading context passages extracted from the reading comprehension texts.
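To give a concrete impression of the kind of reasoning involved, the following minimal Python sketch illustrates how weighted bottom-up tagger evidence and top-down task expectations can be combined for a single token. It is an illustration only, not the project's implementation: all tag inventories, scores, weights, and function names are invented for expository purposes.

```python
# Illustrative sketch only: a toy, pure-Python stand-in for the weighted-rule
# combination that the PSL inference backbone performs jointly over all levels.
# All names, tags, scores, and weights below are hypothetical.

def combine_evidence(tagger_scores, task_preference, w_bottom_up=1.0, w_top_down=2.0):
    """Combine per-tag soft truth values from a statistical tagger (bottom-up)
    with preferences derived from the task context (top-down)."""
    tags = set(tagger_scores) | set(task_preference)
    scored = {
        tag: w_bottom_up * tagger_scores.get(tag, 0.0)
        + w_top_down * task_preference.get(tag, 0.0)
        for tag in tags
    }
    best = max(scored, key=scored.get)
    return best, scored


# Bottom-up: the tagger is uncertain between a noun and a finite-verb reading
# for a token in the learner answer (STTS tags, soft truth values).
tagger_scores = {"NN": 0.55, "VVFIN": 0.45}

# Top-down: the reading comprehension question makes a verbal answer expected.
task_preference = {"VVFIN": 0.8}

best_tag, scores = combine_evidence(tagger_scores, task_preference)
print(best_tag, scores)  # the task context tips the decision towards VVFIN
```

In the actual architecture, such preferences are not hard-coded but expressed declaratively as weighted rules over the eight levels of linguistic modeling, and the PSL backbone jointly infers the most coherent overall analysis.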
Project-related Publications (Selection)
- Detmar Meurers. 2015. Learner corpora and natural language processing. In Gaëtanelle Gilquin, Sylviane Granger and Fanny Meunier, editors, The Cambridge Handbook of Learner Corpus Research, pages 537–566. Cambridge University Press.
- Detmar Meurers and Markus Dickinson. 2017. Evidence and interpretation in language learning research: Opportunities for collaboration with computational linguistics. Language Learning, 67(S1):66–95.
- Adriane Boyd. 2018. Normalization in context: Inter-annotator agreement for meaning-based target hypothesis annotation. In Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL), pages 10–22, Stockholm, Sweden.
- Adriane Boyd. 2018b. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.
- Adriane Boyd and Franziska Linnenschmidt. 2020. CREG form-meaning target hypothesis annotation manual. Technical report, 38 pp.
- Johannes Dellert. 2020. Exploring Probabilistic Soft Logic as a Framework for Integrating Top-down and Bottom-up Processing of Language in a Task Context. Technical report, 32 pp.