Project Details
Font Group Recognition for Improved OCR
Applicants
Dr.-Ing. Vincent Christlein; Professor Dr. Nikolaus Weichselbaumer, since 11/2021
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
German Literary and Cultural Studies (Modern German Literature)
Term
from 2021 to 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 460605811
Although OCR-D made huge progress in the last project phase in providing OCR for early printed books, it still faces two major problems. The huge variety of the material makes it extremely challenging to use generic OCR models. Yet selecting specific models is not possible, as the sheer amount of material prevents a fully automatic workflow. This situation is further complicated by the lack of appropriate OCR training data. Current data sets consist overwhelmingly of texts in Fraktur, especially from the 19th century. This completely neglects the large typographic variety displayed by printing in the three previous centuries. Therefore, and in response to the demand from SLUB Dresden and ULB Halle, we propose to improve the current situation significantly by: 1) fine-tuning our font group recognition system to such a degree that it can be used at character level; 2) transcribing more specific OCR training data for the 16th-18th centuries, including popular fonts such as Schwabacher, other bastarda types and old Fraktur styles; 3) training font-specific OCR models as well as integrated models that recognise both typeface and text simultaneously. In other contexts, such joint training has made the network perform better on both individual tasks, because it reduces overfitting during training. This project will improve OCR quality significantly, especially for books in non-Fraktur fonts. It will also provide a training data set of very high quality that can be reused in the long term. Finally, the project will provide a more fine-grained font recognition tool that, beyond enabling font-specific OCR, also has important applications in text attribute recognition and layout analysis.
DFG Programme
Research data and software (Scientific Library Services and Information Systems)
Former Applicant
Privatdozent Dr. Christoph Reske, until 10/2021