Fremdwahrnehmungen in Reiseberichten 1500-1875 - eine computergestützte Analyse (Perceptions of the Other in Travelogues 1500-1875: A Computer-Assisted Analysis)

Applicant Professor Dr. Wolfgang Nejdl

Subject Area Modern and Contemporary History (incl. European History of the Modern Era and Non-European History)
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics

Funding Funded from 2018 to 2021

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 398697847

Year of creation 2021

Summary of Project Results

In many European libraries, historical corpora survive only in printed or handwritten form, which leaves them largely inaccessible to digital analysis. Algorithms that exploit increased computational capacity promise to speed up large-scale analysis of historical corpora significantly and to uncover hidden links across corpora, time, and locations, while backing their claims with solid statistical evidence. The Travelogues project aimed to develop novel automatic and semi-automatic methods for the serial analysis of large-scale historical text corpora from the Austrian Books Online (ABO) project (ca. 3,000–3,500 books) while retaining in-depth evaluation from the historian's perspective. Specifically, one aim of the project was to gain insight into perceptions of the Other.

For such historical analyses, the essential tools are data-driven statistical models, that is, machine learning models. Two challenges arise. First, training data for historical material is scarce, so effective and robust machine learning models must be built to operate on small hand-annotated training datasets. Second, computerized analysis requires digital scans of every page or, for text-related analysis, OCR output. When the OCRed text is noisy, as it is in practice, in-depth analysis is hard to carry out. For instance, we found it challenging to extract text pairs with intertextual relationships for the second research question, because the heavy OCR errors can severely distort sentence structure, meaning, and named entities.

In this project, we proposed novel algorithms, datasets, and annotated corpora that improve the quality of digital historical analyses:

1. We created a manually verified corpus of travelogues from the 16th–19th centuries, together with trained models to identify potential travelogue books.
2. We produced the corresponding corrected travelogues corpus using a post-hoc OCR correction model, together with the trained models used to perform the correction.
3. As a first in the literature, we created a German text-correction dataset of aligned sentence pairs, along with methods for generating such customized datasets.

We also encountered fundamental challenges, caused by the lack of labeled datasets, that led to slight course corrections in our objectives. Specifically, instead of pursuing Research Questions three and four, most of the effort went into improving data quality through post-hoc OCR correction. We strongly believe that our results and our state-of-the-art approaches to OCR correction greatly improve digitization capabilities and enable large-scale historical analysis. All source code has been made publicly accessible for the benefit of the community.

Project-Related Publications (Selection)

Traveling through Space and Time, or: Making Historical Travelogues Accessible. NKOS Workshop 2018
Jan Rörden, Bernhard Haslhofer, Rainer Simon, Sven Schlarb
Combining Convolution and Recurrent Neural Models for Post-Hoc OCR Correction of Low Resource Historical Corpora. EurNLP, London, 2019
Lijun Lyu, Besnik Fetahu
Austrian Books Online – Eight Years of Digitization of the Austrian National Library’s Historical Book Collection with Google. Bibliothek Forschung und Praxis 44/1 (2020), 89–99
Fritze, Christiane & Krickl, Martin
Identifying Historical Travelogues in Large Text Corpora Using Machine Learning. International Conference on Information. Springer, Cham, 2020 [Winner of the Lee Dirks Award for Best Full Research Paper]
Rörden, Jan; Gruber, Doris; Krickl, Martin & Haslhofer, Bernhard
Semi-Automatic Identification of Travelogues. Book of Abstracts, DH2020 Conference, Ottawa, July 22–24, 2020
Doris Gruber, Martin Krickl, Jan Rörden, and Rainer Simon
Travelogues: Fremdwahrnehmungen in Reiseberichten 1500–1876. Austrian Academy of Sciences Press 2020, 62–66
Doris Gruber, Martin Krickl, Lijun Lyu, Jan Rörden, Arno Strohmeyer
Europeans Encounter the World in Travelogues: 1450–1900. Europäische Geschichte Online (EGO), edited by the Leibniz Institute of European History (IEG), Mainz, 2021
Doris Gruber
Neural OCR Post-Hoc Correction of Historical Corpora. Transactions of the Association for Computational Linguistics 2021: 479-493
Lyu, Lijun; Koutraki, Maria; Krickl, Martin & Fetahu, Besnik
On the Way to the "(Un)Known"? The Ottoman Empire in Travelogues (c. 1450–1900). Studies on Modern Orient, De Gruyter 2022
Doris Gruber, Arno Strohmeyer (eds.)
