The rest of the paper is organised as follows.
Section 2 discusses related work, Section 3 describes
our method, Section 4 presents the results, and
Section 5 concludes.
2 RELATED WORK
Several studies have addressed retrieval from OCRed
text. Among the earliest, Taghva et al. (K. Taghva and
Condit, 1994) applied probabilistic IR to OCRed text,
with error correction performed using a domain-specific
dictionary. Singhal et al. (A. Singhal and Buckley, 1996)
showed that linear document normalization models were
better suited to collections containing OCR errors than
quadratic (cosine normalization) models. TREC devoted
significant effort to studying the effect of OCR errors
on retrieval through two tasks: the Confusion Track and
the Legal Track. The Confusion Track was part of TREC 4
(1995) (Harman, 1995) and TREC 5 (1996) (Kantor and
Voorhees, 1996). In the TREC 4 Confusion Track, random
character insertions, deletions and substitutions were
used to model degradations. For the TREC 5 Confusion
Track, 55,000 government announcement documents were
printed, scanned and OCRed before use. Electronic text
for the same documents was available for comparison.
Participants experimented with error modelling techniques
that used character n-gram matches to alleviate the
effect of OCR errors.
A similar track, RISOT (Garain et al., 2013), was
offered at the Forum for Information Retrieval Evaluation
(FIRE) (www.isical.ac.in/∼fire) in 2011. It aimed at
improving retrieval performance from OCRed text in Indic
scripts. Here, a FIRE Bangla collection of 62,825
documents from a leading Bangla newspaper, Anandabazar
Patrika, was available as the “TEXT” or “clean”
collection. Each document of the collection was scanned
at a resolution of 300 dots per inch. Then, each scanned
document was converted to electronic text using a Bangla
OCR system that had about 92.5% accuracy. Ghosh et al.
(Ghosh and Parui, 2013) applied a two-fold error
modelling technique for OCR errors in Bangla script.
In RISOT 2012, in addition to the Bangla collection
pair, a Hindi collection pair was also offered. The
error-free Hindi document collection was created from
the leading Hindi newspapers Dainik Jagaran and Amar
Ujala. The OCRed Hindi collection was created using
a Hindi OCR system which also had 92.5% accuracy.
There is, however, substantial work in the literature
on OCR error modelling and correction. Kolak and Resnik
(Kolak and Resnik, 2002) applied a pattern recognition
approach to detecting OCR errors. Magdy and Darwish
(Magdy and Darwish, 2006) used character segment
correction, language modelling, and shallow morphology
techniques for error correction on OCRed Arabic text.
On error detection and correction for Indic scripts,
B.B. Chaudhuri and U. Pal produced the very first report
in 1996 (Chaudhuri and Pal, 1996). Their paper used
morphological parsing to detect and correct OCR errors,
with separate lexicons of root words and suffixes.
Fataicha et al. (Fataicha et al., 2006) located confused
characters in erroneous words and used them to create a
collection of error-grams. Finally, they generated
additional query terms, identified appropriate matching
terms, and determined the degree of relevance of
retrieved document images to the user’s query, based on
a vector space IR model.
3 OUR APPROACH
3.1 Key Terms
3.1.1 Word Cooccurrence
We say that two words, say, $w_1$ and $w_2$, cooccur if
they appear in a window of size $s$ ($s > 0$) words in
the same document $d$. Suppose the words $w_1$ and $w_2$
cooccur in a window of size 5 in a document $d$. This
means that there is at least one instance in the document
where at most 4 words (distinct from $w_1$ and $w_2$)
occur between $w_1$ and $w_2$ or between $w_2$ and $w_1$.
Let $\mathrm{cooccurFreq}_{(d,s)}(w_1, w_2)$ denote the
number of times $w_1$ and $w_2$ cooccur in $d$ in a
window of size $s$. Then, we call
$\mathrm{cooccurFreq}_{(d,s)}(w_1, w_2)$ the cooccurrence
frequency of $w_1$ and $w_2$ in document $d$ for a window
of size $s$. However, it is a common practice to calculate
$\mathrm{cooccurFreq}_{(d,s)}(w_1, w_2)$ over all the
documents in a collection. This is likely to give a more
robust measure of co-location of the words $w_1$ and $w_2$.
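The definition above can be illustrated with a minimal Python sketch. It is
not the implementation used in this paper; the function names
cooccur_freq and collection_cooccur_freq, the whitespace tokenisation, and
the choice to count every pair of positions separated by at most s - 1
intervening words are assumptions made for illustration. In particular, the
sketch does not check that the intervening words are distinct from the two
target words.

```python
def cooccur_freq(tokens, w1, w2, s):
    """Count position pairs (i, j) with tokens[i] == w1, tokens[j] == w2
    and at most s - 1 words lying between them (assumed counting rule)."""
    if s <= 0:
        raise ValueError("window size s must be positive")
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [j for j, t in enumerate(tokens) if t == w2]
    # |i - j| <= s means at most s - 1 intervening words, in either order.
    return sum(1 for i in pos1 for j in pos2 if 0 < abs(i - j) <= s)


def collection_cooccur_freq(docs, w1, w2, s):
    """Sum the document-wise cooccurrence frequencies over a collection,
    where each document is a list of tokens."""
    return sum(cooccur_freq(doc, w1, w2, s) for doc in docs)


# Toy documents, tokenised on whitespace (an assumption of this sketch).
doc_a = "the ocr system reads the scanned page and the ocr output is noisy".split()
doc_b = "each page is scanned before ocr".split()

freq_a = cooccur_freq(doc_a, "ocr", "page", 5)
freq_total = collection_cooccur_freq([doc_a, doc_b], "ocr", "page", 5)
```

For large collections, one would normally index word positions once rather
than rescanning each document for every word pair; the quadratic pair scan
here is kept only for readability.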
Word cooccurrence gives a reliable measure of
association between two words as it reflects the degree
of context match between the two words. Usually, the
total cooccurrence between word pairs is calculated over
a collection of documents by summing up the document-wise
cooccurrence frequencies. High cooccurrence between a
pair of words is an indicator of a high degree of
relatedness between the words. This association measure
becomes stronger when it is used in conjunction with a
string matching measure. For example, two words sharing
a long stem (prefix) are likely to be variants of each
other if they share the same context, as indicated by a
high cooccurrence value between them. The word industrious shares a