LangBank: Digital Infrastructure to Support the Study of Latin and Historical German
Final Report Abstract
The project demonstrates that a range of annotation concepts (tokens, ranges, trees, pointers) and distinct diplomatic and normalized text representations can be integrated in a multi-layer architecture for classical and historical languages. It also illustrates how manual and automatic data processing can be combined to compile language resources for research and teaching. In the process, the project tackled a number of complex challenges. We addressed the foundational issue of normalization by designing robust normalization guidelines for Early New High German (ENHG) that make it possible to prepare data so that it becomes systematically retrievable for research, teaching, and the application of NLP tools. Using these guidelines, we also created a resource with which OCR and automatic normalization models can be trained, which supports the creation of new corpora and the extension of existing ones. To facilitate higher-level NLP steps such as parsing, we addressed the crucial issue of identifying sentence boundaries by formulating systematic ENHG sentence segmentation guidelines based on a syntactic sentence definition. Defining ENHG sentences systematically and automating the segmentation accordingly turned out to be a complex but scientifically very fruitful endeavor. On the empirical side, we established the well-foundedness and robustness of our annotation guidelines through inter-annotator agreement testing. Based on these steps, we investigated the applicability of NLP models to normalized ENHG data by applying an elaborate NLP pipeline for the analysis of linguistic complexity. We substantially broadened the analysis capabilities of the existing technology by developing additional linguistic complexity features and extending the existing output options.
We tested the application of NLP parsing tools with modern language models on historical corpora, which only became possible through the sophisticated normalization and sentence segmentation layers. The automatically derived NLP annotations complement the manual linguistic analysis and are of sufficiently high quality to be useful additions for querying ENHG corpora. Alongside the broad analysis of linguistic complexity, we generate a variety of syntactic, morphological, and lexical annotations and encode them in a format that readily supports merging them with other manual or automatic annotations in one multi-layer corpus annotation. The resulting corpus has been made freely available via the ANNIS search and visualization tool and the LAUDATIO repository. Finally, we developed multiple reading views that provide flexible access to the corpus by visualizing different linguistic annotations depending on the practical needs of different types of users. For example, our analytic text visualization supports the specific needs of language learners of ENHG by combining diplomatic and normalized texts enriched with linguistic information. Building on this basis, we are continuing the work by developing teaching materials for ENHG.
Publications
-
(2016) CLARIN Resources for Classical Latin and Historical German. Proceedings of the CLARIN Annual Conference 2016, Aix-en-Provence
MacWhinney, Brian, John Kowalski, Anke Lüdeling, Uwe Springmann, Detmar Meurers, and Zarah Weiss
-
(2016) LatMor: A Latin Finite-State Morphology Encoding Vowel Quantity. Open Linguistics 2 (1): 386–92
Springmann, Uwe, Helmut Schmid, and Dietmar Najock
-
(2016) Early New High German Sentence Segmentation Annotation Guidelines. Version 4.0. Humboldt-Universität zu Berlin and Universität Tübingen
Weiss, Zarah and Gohar Schnelle
-
(2017) Annotation of an Early New High German Corpus: The LangBank Pipeline. 39. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, AG 4: Encoding language and linguistic information in historical corpora. Workshop Presentation/Abstract. Saarbrücken, Germany
Weiss, Zarah and Gohar Schnelle
-
(2017) OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus. Digital Humanities Quarterly 11 (2)
Springmann, Uwe and Anke Lüdeling
-
(2017) Profiling of OCR'ed Historical Texts Revisited. Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. ACM
Fink, Florian, Klaus U. Schulz, and Uwe Springmann
-
(2017) RIDGES Herbology – Designing a Diachronic Multi-Layer Corpus. Language Resources and Evaluation. 51 (3), 695–725
Odebrecht, Carolin, Malte Belz, Amir Zeldes, Anke Lüdeling, and Thomas Krause
-
(2017) Evidence and Interpretation in Language Learning Research: Opportunities for Collaboration with Computational Linguistics. Language Learning 67 (S1), 67–96
Meurers, Detmar and Markus Dickinson
-
(2018) Modeling the Readability of German Targeting Adults and Children: An empirically broad analysis and its cross-corpus validation. Proceedings of the 27th International Conference on Computational Linguistics (COLING). 303–317
Weiss, Zarah and Detmar Meurers