A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching
Camelia Lemnaru, Andreea Sin-Neamțiu, Mihai-Andrei Vereș, Rodica Potolea
2012
Abstract
Information contained in historical sources is highly important for historians' research; yet extracting it manually from documents written in difficult scripts is often expensive and time-consuming. This paper proposes a modular system for transcribing documents written in a challenging script (German Kurrent Schrift). The solution comprises three main stages (Document Processing, Word Processing and Word Selector) chained together in a linear pipeline. The system is currently under development, with several modules in each stage already implemented and evaluated. The main focus so far has been on the character recognition module, for which a hierarchical classifier is proposed. Preliminary evaluations of the character recognition module have yielded an overall character recognition rate of approximately 82% and have identified several groups of confusable characters, for which an additional identification model is currently being investigated. Word composition based on a dictionary matching approach using the Levenshtein distance is also presented.
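The dictionary matching step mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names `levenshtein` and `closest_words`, the `max_distance` threshold, and the tiny German word list are hypothetical and are not taken from the paper. The idea shown is only the general one: a word assembled from recognized characters is compared against dictionary entries, and the entries with the smallest edit distance are kept as candidates.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insertions, deletions and
    substitutions each cost 1."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_words(candidate: str, dictionary: list[str],
                  max_distance: int = 2) -> list[tuple[str, int]]:
    """Return (word, distance) pairs within max_distance of the candidate,
    nearest first. max_distance is an illustrative threshold."""
    scored = [(word, levenshtein(candidate, word)) for word in dictionary]
    kept = [(word, dist) for word, dist in scored if dist <= max_distance]
    return sorted(kept, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical recognizer output with one missing character.
    words = ["Kirche", "Kirsche", "Kinder"]
    print(closest_words("Kirhe", words))
```

A threshold such as `max_distance` trades recall for precision: a larger value tolerates more recognition errors per word but returns more competing candidates for a later selection stage to rank.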
Paper Citation
in Harvard Style
Lemnaru C., Sin-Neamțiu A., Vereș M. and Potolea R. (2012). A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 353-357. DOI: 10.5220/0004143003530357
in BibTeX Style
@conference{kdir12,
author={Camelia Lemnaru and Andreea Sin-Neamțiu and Mihai-Andrei Vereș and Rodica Potolea},
title={A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={353-357},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004143003530357},
isbn={978-989-8565-29-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching
SN - 978-989-8565-29-7
AU - Lemnaru C.
AU - Sin-Neamțiu A.
AU - Vereș M.
AU - Potolea R.
PY - 2012
SP - 353
EP - 357
DO - 10.5220/0004143003530357