Ancient Arabic Document Analysis
Final Report Abstract
The project aimed to develop a system for the analysis of historical handwritten Arabic documents. This system, called HADARA, supports the automatic identification, classification, and script and content analysis of historical Arabic manuscripts. Each of the four partners contributed their own expertise to the goals of the project: historians and linguists identified typical documents that constitute the project database, created data labels in a time-consuming process, and shaped major parts of the requirements of the HADARA system, which was in turn developed by electrical engineers and computer scientists.
The system consists of two parts: a mobile hardware part and a software part. The mobile hardware part allows historical documents to be digitized in remote places that are usually difficult to access. The most important outcome of the software part is an easy-to-use tool chain for historical document processing that allows users to process digitized historical Arabic documents and to search for arbitrary keywords in such document images. Historians can use it to systematically process and index historical Arabic scriptures.
The HADARA system has a modular design. From its first implementation onward, historians were able to use the system for research purposes; only the degree of automation and the complexity of the supported tasks depended on the project’s progress.
Several major applications have been developed on top of the HADARA system: (1) The text recognition technique can be used in our annotation application for automatic or semi-automatic transcription of a scanned document. In a semi-automatic transcription, the user and the system collaborate to obtain a full text transcription: the system learns from the user’s manual corrections and iteratively improves its results, while the user saves time compared to a fully manual transcription.
(2) The word spotting application can detect the occurrence of selected words within a document without performing text recognition at all. (3) The writer identification application is based on recognition methods similar to those used in the text recognition module, but with different feature sets. These features are more sensitive to shape variations, since the application must differentiate between writers. The entire HADARA system, including all applications, is published under an OSI-approved open source license and is hence freely available for use, modification, and further development.
Preprocessing of historical documents was another important goal, pursued to satisfy the requirements of the aforementioned applications or to improve their results. New methods were therefore developed for the denoising, binarization, and text line segmentation of historical Arabic documents, supported by several theses. Furthermore, several alternatives to binarization techniques were developed. These studies resulted in a color segmentation technique that preserves important information because it can distinguish between the background and several text colors. This is very important for historical Arabic manuscripts, which make extensive use of red text, for example.
During the project, several datasets were created to develop and evaluate the applications. For word spotting, we developed the HADARA80P dataset, which is based on a historical Arabic manuscript and provides extensive detail at word level. The complete chain of tools created in this project was used to build this dataset. The HADARA80P dataset has already been published and is freely available to research institutions. For text recognition purposes, such as the simulation of a semi-automatic transcription approach, we annotated all 250 pages of the same historical manuscript at text line level using our annotation tool and created the HADARA250P dataset.
This dataset is likewise freely available to the research community. For the writer identification task, we created a subset of the dataset from the Islamic heritage project containing all of its manuscripts that are suitable for this task, i.e., those whose metadata was known and verified.
The final HADARA system, as well as the path to its creation, can be seen as the basis of the cultural vision of this project, which is deeply shared by all project partners. For Arabic countries in particular, considerable social and cultural impact is expected from making an important part of their historical background, buried in books, accessible for research and to the public. Likewise, since Arabic history is intertwined with the history of European societies, better access to Arabic historical scriptures may also open up new aspects of and views on European history.
Publications
-
“Text Line Segmentation for Gray Scale Historical Document Images,” in Proc. of the 2011 Workshop on Historical Document Imaging and Processing, Beijing, China, September 2011, pp. 120–126 (Best student paper award)
A. Asi, R. Saabni, and J. El-Sana
-
“Hierarchical Scheme for Arabic Text Recognition,” in Proc. of Int. Conf. on Information Science, Signal Processing and their Applications (ISSPA). Montreal, Canada: IEEE, July 2012, pp. 1266–1271
A. Asi, J. El-Sana, and V. Märgner
-
“Layout Analysis for Arabic Historical Document Images using Machine Learning,” in Proc. of Int. Conf. on Frontiers in Handwriting Recognition (ICFHR 2012), Bari, Italy, September 2012, pp. 635–640 (Best student paper award)
S. S. Bukhari, A. Asi, T. Breuel, and J. El-Sana
-
“Efficient Word Image Retrieval Using Earth Movers Distance Embedded to Wavelets Coefficients Domain,” in Proc. of Int. Conf. on Document Analysis and Recognition (ICDAR 2013), Washington, DC, USA, August 2013, pp. 1300–1304
R. Saabni
-
“HADARA – A Software System for Semi-Automatic Processing of Historical Handwritten Arabic Documents,” in Proc. of IS&T Conference Archiving, Washington DC, USA, April 2013, pp. 161–166
W. Pantke, V. Märgner, D. Fecker, T. Fingscheidt, A. Asi, O. Biller, J. El-Sana, R. Saabni, and M. Yehia
-
“On Evaluation of Segmentation-Free Word Spotting Approaches without Hard Decisions,” in Proc. of Int. Conf. on Document Analysis and Recognition (ICDAR 2013), Washington, DC, USA, August 2013, pp. 1300–1304
W. Pantke, V. Märgner, and T. Fingscheidt
-
“WebGT: An Interactive Web-Based System for Historical Document Ground Truth Generation,” in Proc. of Int. Conf. on Document Analysis and Recognition (ICDAR 2013), Washington, DC, USA, August 2013, pp. 305–308
O. Biller, A. Asi, K. Kedem, and I. Dinstein
-
“A Coarse-to-Fine Approach for Layout Analysis of Ancient Manuscripts,” in Proc. of Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), Crete Island, Greece, September 2014, pp. 140–145
A. Asi, R. Cohen, K. Kedem, J. El-Sana, and I. Dinstein
-
“An Historical Handwritten Arabic Dataset for Segmentation-free Word Spotting–HADARA80P,” in Proc. of Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), Crete Island, Greece, September 2014, pp. 15–20
W. Pantke, M. Dennhardt, D. Fecker, V. Märgner, and T. Fingscheidt
-
“Color Segmentation for Historical Documents Using Markov Random Fields,” in Proc. of Int. Conf. on Soft Computing and Pattern Recognition (SoCPaR 2014), Tunis, Tunisia, August 2014, pp. 151–156
W. Pantke, A. Haak, and V. Märgner
-
“Document Writer Analysis with Rejection for Historical Arabic Manuscripts,” in Proc. of 14th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), Crete Island, Greece, September 2014, pp. 743–748
D. Fecker, A. Asi, W. Pantke, V. Märgner, J. El-Sana, and T. Fingscheidt
-
Vom Zeichen zur Schrift. Mit Mustererkennung zur automatisierten Schreiberhanderkennung in mittelalterlichen und frühneuzeitlichen Handschriften. In: Grenzen und Möglichkeiten der Digital Humanities. Hg. von Constanze Baum / Thomas Stäcker. 2015 (= Sonderband der Zeitschrift für digitale Geisteswissenschaften, 1)
D. Fecker, V. Märgner, and T. Schaßan