Authors: Mohammad Mohsin Reza 1 ; Md. Ajraf Rakib 1 ; Syed Saqib Bukhari 2 and Andreas Dengel 3

Affiliations: 1 Department of Computer Science, University of Kaiserslautern, Germany ; 2 Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany ; 3 Department of Computer Science, University of Kaiserslautern, Germany, and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany

Keyword(s): Document Analysis, Document Pre-processing, Complex Historical Document Analysis, 16th-19th Century German Books, Document Noise Removal, Page Frame.

Related Ontology Subjects/Areas/Topics: Applications ; Data Engineering ; Information Retrieval ; Ontologies and the Semantic Web ; Pattern Recognition ; Software Engineering

Abstract: Document layout analysis is the most important part of converting scanned page images into searchable full text. Structured and semi-structured documents (journal articles, books, magazines, invoices) are the subject of intensive research, but historical documents have received far less attention. Digitizing historical documents is more challenging than digitizing regular structured documents because of poor image quality, damaged characters, and large amounts of textual and non-textual noise. In the scientific community, extraneous symbols from the neighboring page are considered textual noise, while black borders, speckles, rulers, and various kinds of images along the border of the document are considered non-textual noise. Existing historical document analysis methods cannot handle all of this noise, which is a major reason why the output of Optical Character Recognition (OCR) contains undesired text that must be removed afterward with considerable extra effort. This paper presents a new approach to historical document image cleanup based on detecting the page frame of the document. The goal of the method is to find the actual content area of the document and ignore noise along the page border. We use morphological transforms, a line segment detector, and a geometric matching algorithm to find an ideal page frame of the document. We evaluate our approach on 16th-19th century printed historical documents. OCR performance on the historical documents increased by 4.49% after applying our page frame detection method. In addition, OCR accuracy on contemporary documents increased by around 6.69%.
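The idea of a page frame can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses only a 3x3 morphological opening on a tiny binary image to suppress isolated speckles, then takes the bounding box of the surviving ink as a crude frame. The line segment detector and the geometric matching step named in the abstract are omitted, and all function names (`erode`, `dilate`, `opening`, `bounding_frame`) and the toy page are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # 1 = ink (foreground), 0 = background


def erode(img: Image) -> Image:
    """3x3 binary erosion; pixels on the image border are treated as background."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def dilate(img: Image) -> Image:
    """3x3 binary dilation: every ink pixel spreads to its 8 neighbours."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out


def opening(img: Image) -> Image:
    """Erosion followed by dilation; removes blobs smaller than 3x3."""
    return dilate(erode(img))


def bounding_frame(img: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of all ink, or None if the image is empty."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])


# Hypothetical 12x12 page: a solid 6x6 "text block" plus three border speckles.
page = [[0] * 12 for _ in range(12)]
for y in range(3, 9):
    for x in range(4, 10):
        page[y][x] = 1
for y, x in [(0, 0), (1, 10), (11, 2)]:
    page[y][x] = 1

print(bounding_frame(page))           # (0, 0, 10, 11): speckles inflate the box
print(bounding_frame(opening(page)))  # (4, 3, 9, 8): only the text block remains
```

On real scans an opening this simple would also erase thin character strokes and would leave thick black borders untouched, which is presumably why the abstract combines morphological transforms with line detection and geometric matching rather than relying on morphology alone.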

CC BY-NC-ND 4.0

Paper citation in several formats:
Reza, M. ; Rakib, M. ; Bukhari, S. and Dengel, A. (2019). A Robust Page Frame Detection Method for Complex Historical Document Images. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - ICPRAM; ISBN 978-989-758-351-3; ISSN 2184-4313, SciTePress, pages 556-564. DOI: 10.5220/0007382405560564

@conference{icpram19,
author={Mohammad Mohsin Reza and Md. Ajraf Rakib and Syed Saqib Bukhari and Andreas Dengel},
title={A Robust Page Frame Detection Method for Complex Historical Document Images},
booktitle={Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - ICPRAM},
year={2019},
pages={556-564},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007382405560564},
isbn={978-989-758-351-3},
issn={2184-4313},
}

TY - CONF

JO - Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - ICPRAM
TI - A Robust Page Frame Detection Method for Complex Historical Document Images
SN - 978-989-758-351-3
IS - 2184-4313
AU - Reza, M.
AU - Rakib, M.
AU - Bukhari, S.
AU - Dengel, A.
PY - 2019
SP - 556
EP - 564
DO - 10.5220/0007382405560564
PB - SciTePress