Project Details
Robust and high-performance methods for layout analysis in OCR-D
Subject Area
History of Science
Term
since 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 517459941
The project aims to improve the quality and robustness of layout analysis for historical documents and thus ensure that it is suitable for mass digitization. To achieve this, existing approaches will be optimized and extended, and promising new methods will be integrated. First, a sample-based analysis of the bibliography of books printed in the German-speaking countries in the 16th, 17th and 18th centuries (VD) will identify and quantify those classes of documents for which the results of existing layout analysis methods are still insufficient. Likewise, suitable training data will be identified and harmonized, and its preparation and generation will be made more efficient. The main focus of the work is the further development of complementary methods for layout analysis. On the one hand, generic methods and models will be optimized to cover as many documents in the VD as possible. On the other hand, this will be complemented by approaches that specifically address identified weaknesses by significantly improving the adaptability of the methods and models to new materials and challenges. Furthermore, heuristics will be developed and refined in order to optimize the results of different deep learning methods in a rule-based manner. The developments will be accompanied by a detailed evaluation, for which standard scientific metrics will be implemented and tools for layout evaluation integrated into OCR-D. Finally, all procedures are to be provided as modular components with OCR-D interfaces for individual processing steps. This will allow the procedures to be combined flexibly to achieve the best possible results and will ensure adaptability and sustainability with regard to new developments.
DFG Programme
Research data and software (Scientific Library Services and Information Systems)