sure to choose the closest word as a replacement. The third group of methods usually combines statistical methods with machine learning. The above-mentioned approaches are briefly described, for instance, in the survey (Kukich, 1992). Next, we describe some interesting methods in more detail.
Zhidong et al. propose in (Zhidong et al., 1999) a language-independent OCR system which recognizes text from most of the world's languages. Their approach uses hidden Markov models (HMMs) to model each character. The authors employ unsupervised adaptation techniques to achieve language independence. The paper also describes the relationship between speech recognition and OCR.
Perez-Cortes et al. describe in (Perez-Cortes et al., 2000) an interesting method to post-process OCR results in order to improve accuracy. The authors propose a solution based on a finite-state Markov model and a modified Viterbi algorithm.
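To illustrate the general idea behind such decoders (a minimal sketch, not the exact model of Perez-Cortes et al.), the following Python fragment finds the most likely character sequence by combining per-position OCR candidate scores with a bigram character language model via the Viterbi algorithm; the candidate lists and bigram probabilities are assumed to be given:

import math

def viterbi_correct(candidates, bigram_logprob, start='^'):
    """candidates: one dict {char: OCR log-probability} per position.
    bigram_logprob(prev, cur): log P(cur | prev) from a character LM."""
    # Trellis layer: character -> (best score so far, best path so far).
    layer = {c: (bigram_logprob(start, c) + lp, [c])
             for c, lp in candidates[0].items()}
    for position in candidates[1:]:
        new_layer = {}
        for cur, ocr_lp in position.items():
            # Choose the best previous character to extend with `cur`.
            prev = max(layer,
                       key=lambda p: layer[p][0] + bigram_logprob(p, cur))
            score, path = layer[prev]
            new_layer[cur] = (score + bigram_logprob(prev, cur) + ocr_lp,
                              path + [cur])
        layer = new_layer
    best = max(layer.values(), key=lambda sp: sp[0])
    return ''.join(best[1])

# Toy usage: the OCR hesitates between "0" and "O" after "T"; the
# bigram LM prefers "TO", so the decoder outputs "TO".
cands = [{'T': math.log(0.9)}, {'0': math.log(0.5), 'O': math.log(0.5)}]
lm = lambda p, c: math.log(0.9) if (p, c) == ('T', 'O') else math.log(0.1)
print(viterbi_correct(cands, lm))  # -> TO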
Another approach (Pal et al., 2000) focuses on an Indian language and non-word errors. The authors use morphological parsing for OCR error correction and present a set of rules for the morphological analysis. Unfortunately, it is not clear whether this approach is applicable to OCR of other languages.
The authors of (Afli et al., 2016) use language models and statistical machine translation (SMT). This work focuses on historical texts. The purpose of SMT is to translate words in a source language into words in a target language. The main idea is thus to translate the erroneous OCR output, treated as the source language, into the corrected text as the target language.
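Stated generically (this is the standard noisy-channel formulation of SMT, not notation taken from Afli et al.), the corrected text is chosen as

t̂ = arg max_t P(s | t) · P(t),

where s is the noisy OCR output, the translation model P(s | t) captures how correct text gets distorted into OCR errors, and P(t) is a language model over corrected texts.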
Kissos and Dershowitz propose in (Kissos and Dershowitz, 2016) a method involving a lexical spellchecker, a confusion matrix and a regression model. The confusion matrix and the regression model are used to choose good correction candidates.
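As a rough sketch of this candidate-ranking idea (the features and training data below are illustrative placeholders, not those of the paper), a regression model can score each correction candidate from a few simple features:

from sklearn.linear_model import LogisticRegression

X_train = [
    # [confusion-matrix prob, LM prob of candidate, is in lexicon]
    [0.80, 0.60, 1.0],   # good candidate
    [0.05, 0.10, 0.0],   # bad candidate
    [0.70, 0.40, 1.0],
    [0.10, 0.05, 0.0],
]
y_train = [1, 0, 1, 0]   # 1 = correct replacement, 0 = wrong

model = LogisticRegression().fit(X_train, y_train)

def rank(candidates):
    """candidates: list of (word, feature_vector); best-scored first."""
    return sorted(candidates,
                  key=lambda c: model.predict_proba([c[1]])[0][1],
                  reverse=True)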
3 SYSTEM ARCHITECTURE
The proposed system has a modular architecture, as depicted in Figure 1, and is composed of three main modules.
The first module performs the OCR conversion of documents in raster image form. The open source Tesseract OCR engine (https://github.com/tesseract-ocr) is used as the core of our OCR analysis. The input of this module is a set of raster images and the output is a so-called confidence matrix which contains the possible recognized characters together with their confidence values.
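As a concrete illustration, per-character alternatives with confidences can be read from Tesseract through, e.g., the tesserocr Python bindings; the snippet below is a minimal sketch (the file name is a placeholder), not our production code:

from tesserocr import PyTessBaseAPI, RIL, iterate_level

# Build a confidence matrix: one row per recognized symbol, holding the
# alternative characters Tesseract considered with their confidences.
confidence_matrix = []
with PyTessBaseAPI() as api:
    api.SetImageFile('page.png')               # placeholder input image
    api.SetVariable('save_blob_choices', 'T')  # keep per-symbol choices
    api.Recognize()
    it = api.GetIterator()
    for symbol in iterate_level(it, RIL.SYMBOL):
        row = [(choice.GetUTF8Text(), choice.Confidence())
               for choice in symbol.GetChoiceIterator()]
        confidence_matrix.append(row)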
The second module is dedicated to the correction of the OCR errors. Its input is the confidence matrix provided by the previous module and its output is the corrected text. This module combines the probabilities of a character language model with the values from the confidence matrix. A rule-based approach with Levenshtein distance is also implemented in this module. The methods integrated in this module are described in more detail in the following section.
The last module is used for document storage, indexing and retrieval. The open source search engine Apache Solr (http://lucene.apache.org/solr/) is used for this task. Its input is the corrected text obtained from the previous module. This module provides the possibility of searching over the PDF data.
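For illustration, indexing and searching the corrected text through Solr can look like the following minimal sketch using the pysolr client (the URL, core name and field names are placeholders):

import pysolr

# Connect to a Solr core (URL and core name are placeholders).
solr = pysolr.Solr('http://localhost:8983/solr/documents', timeout=10)

# Index the corrected text produced by the previous module.
corrected_text = 'Text of the corrected document ...'  # placeholder
solr.add([{'id': 'doc-001', 'text': corrected_text}])
solr.commit()

# Full-text search over the indexed corrected documents.
for hit in solr.search('text:Praha'):
    print(hit['id'])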
Figure 1: Modular architecture of the proposed system.
4 PROPOSED METHOD
The proposed error correction method works at the character level. It first uses a rule-based approach to correct regular errors. Then, we use a statistical algorithm which combines the output of Tesseract with language models. The last step consists in using a dictionary-based Levenshtein method as post-processing of the previous step.
4.1 Rule-based Approach
This approach employs a set of manually defined rules to replace some characters with others. For example, the in-word character “0” (zero) is replaced by the character “O”, the character “1” (one) is replaced by the character “l”, etc. The result is then checked against a manually defined dictionary. This approach can reduce the set of incorrect words and speed up the whole correction process.
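A minimal sketch of this idea follows; the rule set and the dictionary are small illustrative placeholders, not the full ones used in our system:

# Illustrative in-word substitution rules; the real rule set is
# defined manually and is larger.
RULES = {'0': 'O', '1': 'l'}

DICTIONARY = {'Olomouc', 'slovo'}   # placeholder dictionary

def apply_rules(word):
    """Replace suspicious in-word characters and keep the result
    only if the corrected word appears in the dictionary."""
    if word.isdigit():          # pure numbers are left untouched
        return word
    fixed = ''.join(RULES.get(ch, ch) for ch in word)
    return fixed if fixed in DICTIONARY else word

print(apply_rules('s1ovo'))     # -> 'slovo'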