Authors: Jiří Martínek and Pavel Král

Affiliation: Faculty of Applied Sciences, University of West Bohemia, Czech Republic

Keyword(s): Czech, Error Correction, Fulltext, Language Model, OCR.

Related Ontology Subjects/Areas/Topics: Applications ; Artificial Intelligence ; Computational Intelligence ; Evolutionary Computing ; Knowledge Discovery and Information Retrieval ; Knowledge Engineering and Ontology Development ; Knowledge-Based Systems ; Machine Learning ; Natural Language Processing ; Pattern Recognition ; Soft Computing ; Symbolic Systems

Abstract: This paper proposes a novel system for information retrieval over a set of scanned documents in the Czech language. The documents are raster images, so they are first converted into text by optical character recognition (OCR). The OCR errors are then corrected, and the corrected texts are indexed and stored in a fulltext database, which makes the documents searchable. This paper describes all components of the above-mentioned system, with a particular focus on the proposed OCR correction method. We show experimentally that the proposed approach is effective, because it corrects a significant number of errors. We also create a small Czech corpus for evaluating OCR error correction methods, which represents another contribution of this paper.
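
The pipeline outlined in the abstract (OCR output, error correction, indexing, search) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the correction algorithm, so this example swaps in simple closest-match lookup against a tiny hypothetical vocabulary using Python's standard `difflib`, and a plain in-memory inverted index stands in for the fulltext database. The names `VOCABULARY`, `correct_token`, `build_index`, and `search` are assumptions made for illustration only.

```python
import difflib
from collections import defaultdict

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would rely on a much larger lexicon or a Czech language model.
VOCABULARY = ["dokument", "text", "chyba", "oprava", "hledání"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(ocr_text):
    """Lowercase the OCR output and correct it token by token."""
    return " ".join(correct_token(t) for t in ocr_text.lower().split())


def build_index(documents):
    """Build an inverted index (token -> set of document ids) over corrected text."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in correct_text(text).split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Simulated OCR output with typical confusions ("rn" for "m", "0" for "o").
documents = {"a": "dokurnent text", "b": "chyba 0prava"}
index = build_index(documents)
```

With this index, `search(index, "dokument")` finds document `a` even though the OCR output contained `dokurnent`, which is the effect the correction step is meant to have on retrieval.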

CC BY-NC-ND 4.0

Paper citation in several formats:
Martínek, J. and Král, P. (2018). Error Correction for Information Retrieval of Czech Documents. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART; ISBN 978-989-758-275-2; ISSN 2184-433X, SciTePress, pages 630-634. DOI: 10.5220/0006661906300634

@conference{icaart18,
author={Ji\v{r}í Martínek and Pavel Král},
title={Error Correction for Information Retrieval of Czech Documents},
booktitle={Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2018},
pages={630-634},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006661906300634},
isbn={978-989-758-275-2},
issn={2184-433X},
}

TY - CONF
JO - Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - Error Correction for Information Retrieval of Czech Documents
SN - 978-989-758-275-2
IS - 2184-433X
AU - Martínek, J.
AU - Král, P.
PY - 2018
SP - 630
EP - 634
DO - 10.5220/0006661906300634
PB - SciTePress