Perceptions of the Other in Travelogues 1500-1875 - A Computerized Analysis
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Final Report Abstract
In many European libraries, historical corpora exist only in printed or handwritten form, which offers little access for digital analysis. Algorithms that exploit increased computational capacity promise to accelerate the large-scale analysis of historical corpora significantly and to uncover hidden links across corpora, time, and locations, while backing their findings with solid statistical evidence. The Travelogues project aimed to develop novel automatic and semi-automatic methods for the serial analysis of large-scale historical text corpora from the Austrian Books Online (ABO) project (ca. 3,000–3,500 books) while retaining in-depth evaluation from the historian's perspective. Specifically, one aim of the project was to gain insight into the perception of the Other. For such historical analyses, the essential tools are data-driven statistical models, that is, machine learning models. This raises two challenges. First, training data for historical material is scarce, so effective and robust machine learning models have to be trained on small, hand-annotated datasets. Second, computerized analysis requires digital scans of every page or, for text-related analysis, OCR output. When the OCRed texts are noisy, as they are in practice, in-depth analysis becomes difficult. For instance, we found it challenging to extract text pairs with intertextual relationships for the second research question, because pervasive OCR errors can heavily distort sentence structure, meaning, and named entities. In this project, we proposed novel algorithms, datasets, and annotated corpora that improve the quality of digital historical analyses:
1. We created a manually verified corpus of travelogues from the 16th to the 19th century, together with trained models that identify potential travelogues.
2. We produced the corresponding corrected travelogue corpus using a post-hoc OCR correction model, together with the trained models that perform the OCR correction.
3. As a first in the literature, we created a German text-correction dataset of aligned sentence pairs, along with methods for generating such customized datasets.
We also encountered fundamental challenges stemming from the lack of labeled datasets, which led to slight course corrections in our objectives. Specifically, instead of pursuing research questions three and four, most of the effort went into improving data quality through post-hoc OCR correction. We strongly believe that our results and our state-of-the-art approaches to OCR correction greatly improve the capability of digitization and enable large-scale historical analysis. All source code has been made publicly accessible for the benefit of the community.
Publications
- Traveling through Space and Time, or: Making Historical Travelogues Accessible. NKOS Workshop 2018
Jan Rörden, Bernhard Haslhofer, Rainer Simon, Sven Schlarb
- Combining Convolution and Recurrent Neural Models for Post-Hoc OCR Correction of Low Resource Historical Corpora. EurNLP, London, 2019
Lijun Lyu, Besnik Fetahu
- Austrian Books Online – Eight Years of Digitization of the Austrian National Library’s Historical Book Collection with Google. Bibliothek Forschung und Praxis 44/1 (2020), 89–99
Christiane Fritze, Martin Krickl
(See online at https://doi.org/10.1515/bfp-2020-0008)
- Identifying Historical Travelogues in Large Text Corpora Using Machine Learning. International Conference on Information. Springer, Cham, 2020 [Winner of the Lee Dirks Award for Best Full Research Paper]
Jan Rörden, Doris Gruber, Martin Krickl and Bernhard Haslhofer
(See online at https://doi.org/10.1007/978-3-030-43687-2_67)
- Semi-Automatic Identification of Travelogues. Book of Abstracts, DH2020 Conference, Ottawa, July 22–24, 2020
Doris Gruber, Martin Krickl, Jan Rörden, and Rainer Simon
(See online at https://doi.org/10.17613/d74t-2h79)
- Travelogues: Fremdwahrnehmungen in Reiseberichten 1500–1876. Austrian Academy of Sciences Press 2020, 62–66
Doris Gruber, Martin Krickl, Lijun Lyu, Jan Rörden, Arno Strohmeyer
(See online at https://doi.org/10.1553/dha-proceedings2018s62)
- Europeans Encounter the World in Travelogues: 1450–1900. Europäische Geschichte Online (EGO), published by the Leibniz Institut für Europäische Geschichtsforschung (IEG), Mainz, 2021
Doris Gruber
- Neural OCR Post-Hoc Correction of Historical Corpora. Transactions of the Association for Computational Linguistics 2021: 479-493
Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu
(See online at https://doi.org/10.1162/tacl_a_00379)
- On the Way to the “(Un)Known”? The Ottoman Empire in Travelogues (c. 1450–1900). Studies on Modern Orient, De Gruyter 2022
Doris Gruber, Arno Strohmeyer (eds.)
(See online at https://doi.org/10.1515/9783110698046)