Paper: A Robust Page Frame Detection Method for Complex Historical Document Images

Authors: Mohammad Mohsin Reza 1 ; Md. Ajraf Rakib 1 ; Syed Saqib Bukhari 2 and Andreas Dengel 3

Affiliations: 1 Department of Computer Science, University of Kaiserslautern, Germany ; 2 Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany ; 3 Department of Computer Science, University of Kaiserslautern, Germany and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany

ISBN: 978-989-758-351-3

Keyword(s): Document Analysis, Document Pre-processing, Complex Historical Document Analysis, 16th-19th Century German Books, Document Noise Removal, Page Frame.

Abstract: Document layout analysis is the most important step in converting scanned page images into searchable full text. Intensive research has been devoted to structured and semi-structured documents (journal articles, books, magazines, invoices), but far less to historical documents. Historical document digitization is more challenging than processing regular structured documents due to poor image quality, damaged characters, and large amounts of textual and non-textual noise. In the scientific community, extraneous symbols from the neighboring page are considered textual noise, while black borders, speckles, rulers, various kinds of images, etc. along the border of the document are considered non-textual noise. Existing historical document analysis methods cannot handle all of this noise, which leads to undesired text in the output of Optical Character Recognition (OCR) that must be removed afterward with a lot of extra effort. This paper presents a new perspective on historical document image cleanup by detecting the page frame of the document. The goal of this method is to find the actual content area of the document and ignore noise along the page border. We use morphological transforms, the line segment detector, and a geometric matching algorithm to find an ideal page frame for the document. We evaluate our approach on 16th-19th century printed historical documents. The results show that OCR performance on historical documents increases by 4.49% after applying our page frame detection method. In addition, OCR accuracy on contemporary documents increases by around 6.69% as well.

License: CC BY-NC-ND 4.0


Paper citation in several formats:
Reza, M.; Rakib, M.; Bukhari, S. and Dengel, A. (2019). A Robust Page Frame Detection Method for Complex Historical Document Images. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-351-3, pages 556-564. DOI: 10.5220/0007382405560564

@conference{icpram19,
author={Mohammad Mohsin Reza and Md. Ajraf Rakib and Syed Saqib Bukhari and Andreas Dengel},
title={A Robust Page Frame Detection Method for Complex Historical Document Images},
booktitle={Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2019},
pages={556-564},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007382405560564},
isbn={978-989-758-351-3},
}

TY - CONF

JO - Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A Robust Page Frame Detection Method for Complex Historical Document Images
SN - 978-989-758-351-3
AU - Reza, M.
AU - Rakib, M.
AU - Bukhari, S.
AU - Dengel, A.
PY - 2019
SP - 556
EP - 564
DO - 10.5220/0007382405560564
ER -
