context-dependent error correction. In order to correct such errors, powerful language processing tools are needed. Examples of such attempts are presented in (Meknavin et al., 1998; Tong and Evans, 1996), where sequences of parts of speech are evaluated for their likelihood of occurrence and unlikely sequences are marked as possible errors.
3 A STATISTICAL APPROACH FOR SOLVING THE OCR GAPS PROBLEM
Unlike most of the research, which focuses on improving the character detection rate, in this paper we focus on a different aspect: the recovery of text that cannot be recognized, either because it is too damaged or because it is simply missing. This paper tackles the reconstruction of damaged documents by predicting the most plausible word sets that could fill in the missing areas left where the original words could not be recognized. From now on, these missing areas will be referred to as “gaps”. Every gap has one property that most strongly influences the accuracy of the recovery process: its dimension, usually expressed as a number of characters or words if we treat the text under analysis as a continuous stream.
The solution we propose in this paper is intended for the recovery of text chunks representing pieces of phrases from the original document, and it is based on two assumptions. The first is related to intra-document similarity: we assume that a model of the document can be built from the existing text and that the missing text also respects this model. We consider the document model to have two components: the style model, representing the structure of the text, and the language model, capturing the vocabulary used by the author, the n-grams built from these words, and the frequencies of those n-grams. These two models are combined in order to identify the word sets that could fit in the gaps. Two heuristics have been developed to allow us to benefit from the style model. Regarding the language model, a problem arises: the gaps may contain words that do not occur anywhere else in the document, and such words cannot be discovered using the document's language model alone, since they are simply missing from it. This problem leads us to the Google corpus and to the second assumption: the corpus is large enough to subsume most of the language models of the documents posted on the Internet, and, at the same time, any word that does not appear in this corpus should not be considered a candidate to fill in the gaps.
Considering these two assumptions to be true, our solution starts from the identified gaps and follows a few steps in order to identify the missing words. First, the style model of the document is used to estimate the dimension of the gap. To this end, we consider two heuristics: the estimated character count and the estimated word count. The estimated character count is a numeric value determined from the margins and indentation of the recovered document format, from the correctly identified characters in the gap's vicinity, and from statistical information about the document under analysis (the mean and deviation of the number of characters per phrase). This value is used to determine a maximum and a minimum number of characters that could fill in the gap. The estimated word count is also a numeric value; it uses the estimated character count together with statistical information about the mean and deviation of the number of characters per word and of the number of words per phrase observed in the document. This value is used to determine a range for the number of words we are looking for in order to fill in the gap.
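For illustration, the word count estimation can be sketched as follows (a minimal sketch: the function name and the one-deviation bounds are illustrative choices of ours, and the per-word statistics are assumed to be collected from the correctly recognized part of the document):

    import statistics

    def word_count_range(est_chars_min, est_chars_max, word_lengths):
        # Mean and deviation of word length in the recognized text.
        mean_len = statistics.mean(word_lengths)
        dev_len = statistics.pstdev(word_lengths)
        # The +1 accounts for the whitespace separating consecutive words.
        longest = mean_len + dev_len + 1             # pessimistic word length
        shortest = max(1.0, mean_len - dev_len) + 1  # optimistic word length
        min_words = max(1, round(est_chars_min / longest))
        max_words = max(min_words, round(est_chars_max / shortest))
        return min_words, max_words

For instance, a gap estimated at 30 to 45 characters in a document whose word length averages 5 characters with a deviation of 2 yields a search range of roughly 4 to 11 words.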
Having estimated the number of words we are looking for, we can start using the language model. At this point, a couple of heuristics can be applied. First of all, the gaps do not usually start or end at whitespace characters marking the boundary between distinct words, so one can scan the document for partial words at the beginning or at the end of a gap. Using both the n-gram corpus and the words correctly identified before and after the gap, it is easier to recover whole words from the character fragments that represent parts of them.
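A minimal sketch of this fragment expansion, assuming the corpus vocabulary is available as a set of words (the helper name is illustrative, and ranking the returned candidates against the n-gram context is left to the lookup described next):

    def expand_fragment(fragment, vocabulary, before_gap):
        # A fragment just before the gap is the visible prefix of a word
        # whose remainder is hidden; a fragment just after the gap is a
        # suffix. The candidates are later ranked using the n-gram context.
        if before_gap:
            return sorted(w for w in vocabulary if w.startswith(fragment) and w != fragment)
        return sorted(w for w in vocabulary if w.endswith(fragment) and w != fragment)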
Since the maximum n-gram order in the corpus is 5, the detection starts from the four words preceding the gap in order to identify the first missing word. We consider these four words to be the first four words of a 5-gram and try to identify the most probable word to follow this combination.
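For illustration, and assuming the 5-gram counts have been loaded into a mapping from 5-word tuples to corpus frequencies (the corpus itself is distributed as flat text files, so this in-memory view is a simplification):

    from collections import defaultdict

    def most_probable_next_word(context, fivegram_counts):
        # context: the four correctly recognized words preceding the gap.
        scores = defaultdict(int)
        for gram, count in fivegram_counts.items():
            if gram[:4] == tuple(context):
                scores[gram[4]] += count
        return max(scores, key=scores.get) if scores else None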
The same method is applied to the four words following the gap in order to determine the last word missing from it: considering these words to be the last four words of a 5-gram, we try to detect the most probable word to precede them.
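The backward direction admits the same sketch, matching the tail of each 5-gram instead of its head:

    def most_probable_previous_word(context, fivegram_counts):
        # context: the four correctly recognized words following the gap.
        scores = {}
        for gram, count in fivegram_counts.items():
            if gram[1:] == tuple(context):
                scores[gram[0]] = scores.get(gram[0], 0) + count
        return max(scores, key=scores.get) if scores else None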