Ancient Arabic Document Analysis

Subject Area: Islamic Studies, Arabian Studies, Semitic Studies
Term: from 2009 to 2020
Project identifier: Deutsche Forschungsgemeinschaft (DFG), project number 142173438

Final Report Year: 2015

Final Report Abstract

The project aimed to develop a system for the analysis of historical handwritten Arabic documents. This system, called HADARA, supports the automatic identification, classification, and script and content analysis of historical Arabic manuscripts. Each of the four partners contributed their own expertise to the goals of the project: historians and linguists identified typical documents that constitute the project database, created data labels in a time-consuming process, and shaped major parts of the requirements of the HADARA system, which was in turn developed by electrical engineers and computer scientists.

The system consists of two parts: mobile hardware and software. The mobile hardware allows historical documents to be digitized in remote places that are usually difficult to access. The most important outcome of the software part is an easy-to-use tool chain for processing digitized historical Arabic documents and searching for arbitrary keywords in such document images. Historians can use it to systematically process and index historical Arabic manuscripts. The HADARA system has a modular design. From its first implementation onward, historians were able to use the system for research purposes; only the degree of automation and the complexity of the supported tasks depended on the project's progress.

Several major applications have been developed on top of the HADARA system: (1) The text recognition technique can be used in our annotation application for automatic or semi-automatic transcription of a scanned document. In a semi-automatic transcription, a user and the system collaborate to obtain a full text transcription: the system learns from the user's manual corrections and iteratively improves its results, while the user saves time compared to a fully manual transcription.
(2) The word spotting application detects occurrences of selected words within a document without performing text recognition at all. (3) The writer identification application is based on recognition methods similar to those of the text recognition module, but with different feature sets. These features are more sensitive to shape variations, since different writers have to be told apart. The entire HADARA system, including all applications, is published under an OSI-approved open source license and is hence freely available for use as well as for modification and further development.

Preprocessing of historical documents was another important goal, needed to satisfy the requirements of the aforementioned applications or to improve their results. Therefore, new methods were developed for the denoising, binarization, and text line segmentation of historical Arabic documents, supported by several theses. Furthermore, several alternatives to binarization were developed. These studies resulted in a color segmentation technique that preserves important information because it can distinguish between the background and several text colors. This matters for historical Arabic manuscripts, which make extensive use of red text, for example.

During the project, several datasets were created to develop and evaluate the applications. For word spotting, we developed the HADARA80P dataset, which is based on a historical Arabic manuscript and provides extensive word-level details. The complete chain of tools created in this project was used to build this dataset. The HADARA80P dataset has already been published and is freely available to research institutions. For text recognition purposes, such as the simulation of a semi-automatic transcription approach, we annotated all 250 pages of the same historical manuscript at text line level using our annotation tool and created the HADARA250P dataset.
This dataset is likewise freely available to the research community. For the writer identification task, we created a subset of the dataset from the Islamic heritage project containing all of its manuscripts that are suitable for this task, i.e., those whose metadata was known and verified.

The final HADARA system, as well as the path to its creation, can be seen as the basis of the cultural vision of this project, which is deeply shared by all project partners. For Arabic countries in particular, considerable social and cultural impact is expected from making an important part of their historical background, buried in books, accessible for research and to the public. Likewise, since Arabic history is intertwined with the history of European societies, better access to Arabic historical manuscripts may also bring up new aspects of and views on European history.
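
The difference between binarization and the color segmentation mentioned in the abstract can be illustrated with a small sketch. It is not the HADARA method, which the abstract does not specify; it is a minimal per-pixel classifier under assumed thresholds. The function names, label constants, and all threshold values are hypothetical and chosen only for illustration. Where binarization would merge red and dark ink into a single "text" class, this sketch keeps them apart.

```python
from colorsys import rgb_to_hsv

# Hypothetical class labels for illustration only.
BACKGROUND, DARK_INK, RED_INK = 0, 1, 2


def classify_pixel(rgb, dark_max=0.45, red_sat_min=0.45, red_hue_width=0.06):
    """Assign one RGB pixel (0-255 per channel) to background, dark ink, or red ink.

    All thresholds are assumed values, not taken from the HADARA project.
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = rgb_to_hsv(r, g, b)
    # Red hues sit at both ends of the hue circle (near 0.0 and near 1.0).
    is_reddish = h <= red_hue_width or h >= 1.0 - red_hue_width
    if is_reddish and s >= red_sat_min and v > 0.25:
        return RED_INK
    # Anything dark that is not clearly red is treated as ordinary ink.
    if v <= dark_max:
        return DARK_INK
    return BACKGROUND


def segment(image):
    """Label every pixel of an image given as a list of rows of RGB tuples."""
    return [[classify_pixel(pixel) for pixel in row] for row in image]


if __name__ == "__main__":
    parchment, black_ink, red_ink = (230, 215, 180), (40, 35, 30), (170, 40, 35)
    page = [[parchment, black_ink], [red_ink, parchment]]
    for row in segment(page):
        print(row)
```

Fixed global thresholds like these would be fragile on real manuscripts with stains, fading, and uneven lighting; the sketch only shows why a multi-class output retains information that a two-class binarization discards.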
