Authors:
Mohammad Mohsin Reza 1; Md. Ajraf Rakib 1; Syed Saqib Bukhari 2 and Andreas Dengel 3
Affiliations:
1 Department of Computer Science, University of Kaiserslautern, Germany
2 Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany
3 Department of Computer Science, University of Kaiserslautern, Germany, and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany
Keyword(s):
Document Analysis, Document Pre-processing, Complex Historical Document Analysis, 16th-19th Century German Books, Document Noise Removal, Page Frame.
Related Ontology Subjects/Areas/Topics:
Applications; Data Engineering; Information Retrieval; Ontologies and the Semantic Web; Pattern Recognition; Software Engineering
Abstract:
Document layout analysis is the most important part of converting scanned page images into searchable full text. A great deal of research addresses structured and semi-structured documents (journal articles, books, magazines, invoices), but far less addresses historical documents. Digitizing historical documents is more challenging than digitizing regular structured documents because of poor image quality, damaged characters, and large amounts of textual and non-textual noise. In the scientific community, extraneous symbols from the neighboring page are considered textual noise, while black borders, speckles, rulers, and various kinds of images along the border of the document are considered non-textual noise. Existing historical document analysis methods cannot handle all of this noise, which is a major cause of undesired text in the output of Optical Character Recognition (OCR); this text then has to be removed afterward with considerable extra effort. This paper presents a new perspective, aimed especially at historical document image cleanup, based on detecting the page frame of the document. The goal of the method is to find the actual content area of the document and to ignore noise along the page border. We use morphological transforms, the line segment detector, and a geometric matching algorithm to find an ideal page frame of the document. After implementing the page frame method, we evaluate our approach on 16th-19th century printed historical documents. The results show that OCR performance on the historical documents increased by 4.49% after applying our page frame detection method. In addition, we were able to increase OCR accuracy by around 6.69% for contemporary documents.
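To make the idea of a page frame concrete, the following is a minimal sketch only. It does not reproduce the method of the paper (morphological transforms, the line segment detector, and geometric matching); instead it swaps in a much simpler projection-profile heuristic on a binarized page. All names (`estimate_page_frame`, `close_gaps`, `make_demo_page`), the thresholds `min_density`, `max_density`, and `max_gap`, and the tiny synthetic page are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # binarized page: 1 = ink, 0 = background


def close_gaps(flags: Sequence[bool], max_gap: int) -> List[bool]:
    """1-D closing: fill False runs of length <= max_gap that lie between True values."""
    out = list(flags)
    last_true = None
    for i, flag in enumerate(flags):
        if flag:
            if last_true is not None and 0 < i - last_true - 1 <= max_gap:
                for j in range(last_true + 1, i):
                    out[j] = True
            last_true = i
    return out


def longest_run(flags: Sequence[bool]) -> Tuple[int, int]:
    """Return (start, end) of the longest run of True values, end exclusive."""
    best = (0, 0)
    start = None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel ends a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def estimate_page_frame(img: Image, min_density: float = 0.2,
                        max_density: float = 0.9, max_gap: int = 1
                        ) -> Tuple[int, int, int, int]:
    """Estimate (top, left, bottom, right) of the content area, ends exclusive.

    Rows and columns that are almost empty (speckles) or almost solid
    (black scanner borders) are treated as noise. Thresholds are assumed values.
    """
    height, width = len(img), len(img[0])
    row_density = [sum(row) / width for row in img]
    col_density = [sum(img[r][c] for r in range(height)) / height
                   for c in range(width)]
    row_ok = [min_density <= d <= max_density for d in row_density]
    col_ok = [min_density <= d <= max_density for d in col_density]
    top, bottom = longest_run(close_gaps(row_ok, max_gap))
    left, right = longest_run(close_gaps(col_ok, max_gap))
    return top, left, bottom, right


def crop(img: Image, frame: Tuple[int, int, int, int]) -> Image:
    """Keep only the pixels inside the frame, e.g. before passing the page to OCR."""
    top, left, bottom, right = frame
    return [row[left:right] for row in img[top:bottom]]


def make_demo_page() -> Image:
    """Synthetic 10x12 page: left black border, three text lines, one speckle."""
    height, width = 10, 12
    img = [[0] * width for _ in range(height)]
    for r in range(height):
        img[r][0] = 1              # black scanner border along the left edge
    for r in (3, 5, 6):
        for c in range(4, 10):
            img[r][c] = 1          # text lines, with an empty gap at row 4
    img[8][11] = 1                 # isolated speckle near the page border
    return img


frame = estimate_page_frame(make_demo_page())
print(frame)
```

On the synthetic page, the solid left border and the lone speckle fall outside the estimated frame, while the blank row between text lines stays inside it because of the gap-closing step. A real scanned page would need more robust cues, which is where the line detection and geometric matching named in the abstract come in.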