Project Details
Development of a web-based system for the postcorrection of historical OCR'ed texts
Applicant
Professor Dr. Klaus U. Schulz
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
German Medieval Studies (Medieval German Literature)
Term
from 2016 to 2017
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 314731081
This project aims to develop a web-based tool and system for the postcorrection of OCR-recognized historical texts. As any OCR result has its share of errors, the usefulness of OCR text output for many applications in the humanities crucially depends on a postcorrection facility. A standalone (non-web-based) open source Java version of such a postcorrection tool, named PoCoTo (Post-Correction Tool), was developed at CIS. It features advanced language technology by which whole error series in documents with historical spellings can be displayed in a concordance view of OCR output and original image. PoCoTo has already been made known to a wide public and is being used for postcorrection in Digital Humanities projects in Germany. In response to specific demands from its users, we want to develop and distribute it as an open source, web-based, multi-user system. Instead of needing to be installed locally on one's own computer, it will be developed into a component of a server-based infrastructure that supports an institutional OCR workflow. Apart from this infrastructural change, the following additional goals will be pursued:
1. User corrections will be used in the background to calculate ever better statistical error profiles of putative error series, thus speeding up the correction.
2. Some simple augmentations of a Latin full-form lexicon will provide the foundation for making the language technology (statistical error profiling) fully applicable to the postcorrection of the OCR output of Latin texts.
3. The flexibility of the system with regard to OCR engines will be increased by making it possible to process the output of the open source OCR system OCRopus in addition to Abbyy and Tesseract output.
Should the need arise, further improvements of the system and cooperation with other groups will be planned in a follow-up proposal within the context of the forthcoming DFG initiative for historical OCR.
DFG Programme
Research data and software (Scientific Library Services and Information Systems)