LangBank: Digital Infrastructure to Support the Study of Latin and Historical German

Applicants Professor Dr. Anke Lüdeling; Professor Dr. Detmar Meurers
Funding Funded from 2015 to 2018
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 273970187
 
Year of creation 2019

Summary of Project Results

The project demonstrates that a range of annotation concepts (tokens, ranges, trees, pointers) and distinct diplomatic and normalized text representations can be integrated in a multi-layer architecture for classical and historical languages. It also illustrates how manual and automatic data processing approaches can be combined to compile language resources for research and teaching. In the process, the project tackled a number of complex challenges. We addressed the foundational issue of normalization by designing robust normalization guidelines for Early New High German (ENHG) that make it possible to prepare data so that it becomes systematically retrievable for research, teaching, and the application of NLP tools. Using these guidelines, we also created a resource with which OCR and automatic normalization models can be trained successfully, which in turn supports the creation of new corpora and the extension of existing ones. To facilitate higher-level NLP steps such as parsing, we addressed the crucial issue of identifying sentence boundaries: we formulated systematic ENHG sentence segmentation guidelines based on a syntactic sentence definition. Defining ENHG sentences systematically and automating the segmentation accordingly turned out to be a complex but scientifically very fruitful endeavor. On the empirical side, we established the well-foundedness and robustness of our annotation guidelines using inter-annotator agreement testing. Based on these steps, we investigated the applicability of NLP models to normalized ENHG data by applying an elaborate NLP pipeline for the analysis of linguistic complexity. We substantially broadened the analysis capabilities of the existing technology by developing additional linguistic complexity features and extending the existing output options.
We tested the application of NLP parsing tools with modern language models on historical corpora, which only became possible through the sophisticated normalization and sentence segmentation layers. The automatically derived NLP annotations, which complement the manual linguistic analysis, are of high enough quality to be useful additions for querying ENHG corpora. Complementing the broad analysis of linguistic complexity, we generate a variety of syntactic, morphological, and lexical annotations and encode them in a format that readily supports merging them with other manual or automatic annotations in one multi-layer corpus annotation. The resulting corpus has been made freely available via the ANNIS search and visualization tool and the LAUDATIO repository. Finally, we developed multiple reading views that give flexible access to the corpus by visualizing different linguistic annotations depending on the practical needs of different types of users. For example, our analytic text visualization supports the specific needs of learners of ENHG by combining diplomatic and normalized texts enriched with linguistic information. Building on this basis, we are continuing the work by developing teaching materials in ENHG.
