
Perceptions of the Other in Travelogues 1500-1875 - A Computer-Assisted Analysis (Fremdwahrnehmungen in Reiseberichten 1500-1875 - eine computergestützte Analyse)

Subject area Modern and Contemporary History (incl. European History of the Modern Era and Non-European History)
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Funding from 2018 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 398697847
 
Year of creation 2021

Summary of project results

In many European libraries, historical corpora remain in printed or handwritten form, which leaves them largely inaccessible to digital analysis. Algorithms that exploit increased computational capacity promise to speed up large-scale historical corpus analysis significantly and to uncover hidden links across corpora, time, and locations, while backing their findings with statistical evidence. The Travelogues project aimed to develop novel automatic and semi-automatic methods for the serial analysis of large-scale historical text corpora from the Austrian Books Online (ABO) project (ca. 3,000 - 3,500 books) while retaining in-depth evaluation from the historian's perspective. Specifically, one aim of the project was to gain insight into the perception of the Other.

For such historical analyses, the essential tools are data-driven statistical models, i.e., machine learning models. Two challenges stand out. First, training data for historical material is scarce, so effective and robust machine learning models have to be built to work with small, hand-annotated training datasets. Second, computerized analysis requires digital scans of every page, or OCRed output for text-related analysis. When the OCRed texts are noisy, as they are in practice, in-depth analyses are hard to carry out. For instance, we found it challenging to extract text pairs with intertextual relationships for the second research question, because heavy OCR errors can severely distort sentence structure, meaning, and named entities.

In this project, we proposed novel algorithms, datasets, and annotated corpora that improve the quality of digital historical analyses:

1. We created a manually verified corpus of travelogues from the 16th–19th centuries, together with trained models that identify potential travelogue books.
2. We produced the corresponding corrected travelogues corpus using a post-hoc OCR correction model, together with the trained models that perform the OCR correction.
3. As a first in the literature, we created a German text-correction dataset of aligned sentence pairs, along with the methods to generate such customized datasets.

We also encountered fundamental challenges caused by the lack of labeled datasets, which led to slight course corrections in our objectives. Specifically, instead of pursuing research questions three and four, most of the effort went into improving data quality through post-hoc OCR correction. We strongly believe that our results and our state-of-the-art approaches to OCR correction greatly improve digitization capabilities and enable large-scale historical analysis. All source code has been made publicly accessible for the benefit of the community.

