corresponding image template and using pattern
classification techniques to find all of its instances in
the digitized image corpus, a technique commonly
known as ‘word spotting’ in the document analysis
community. This ability to search ancient
historical documents will further enhance the
importance and utility of such digital libraries by
providing users with instant access to the required
information.
Word spotting, especially for the Latin alphabet,
has been a focus of research in the last few years,
and a variety of techniques have been proposed to
improve recognition rates. The field, however,
remains challenging and inviting, especially when the
document base comprises low-quality images, which
are the subject of our research. Rath et al.
(Rath 2007) introduced an approach for information
retrieval and image indexing which involves
grouping word images into clusters of similar words
using word image matching. Four profile features
are computed for each word image and then
matched using different methods. Khurshid et al.
(Khurshid 2008) have argued that features at
character level give better results for word spotting
than word-level features. In (Rothfeder
2003), correspondences between corner features
are used to rank word images by similarity.
The Harris corner detector is used to detect corners in
the word images. Correspondences between these points
are established by comparing local context windows
using the Sum of Squared Differences (SSD)
measure. The Euclidean distance between
corresponding points then gives a similarity
measure. The Telugu script, used for a language
native to India, has been characterized by wavelet
representations of the words in (Pujari 2002).
Variations in the image at different scales are
studied using wavelets. The wavelet representation
exploits the inherent characteristics of Telugu
characters but does not give good results for
Latin letters. Adamek et al. introduced word contour
matching in holistic word recognition for
information extraction. The closed word contours
are extracted and matched using an elastic contour
matching technique (Adamek 2007).
In this paper, we propose an effective method for
information retrieval from historical printed
documents using word spotting. The proposed
approach is based on matching the features of
character images using multistage dynamic time
warping (DTW). We first present an overview of the
proposed scheme, followed by a detailed discussion
of the indexing and retrieval phases. We then discuss
the results obtained with the proposed methodology
and compare them with existing methods. Finally, we
present our concluding remarks and some interesting
future research directions.
2 PROPOSED METHOD
The basis of our model for word spotting is the
extraction of a feature set from character images. As
opposed to (Rath 2007), where features are extracted
at word level, we first find all the characters in
each word and then extract features from these
character images, which gives more precise word
spotting than (Rath 2007), as the results later show. To
begin, the document image is binarized and a
horizontal Run Length Smoothing Algorithm
(RLSA) is applied to separate text in the image from
graphics and to extract the words. For each word, the
characters are segmented using connected
component analysis, with a set of heuristics applied
to the extracted components to find the actual
characters. Once the characters are segmented, a set
of six features is extracted for each character.
The features of the query word's character images
are then compared with those of candidate word
characters using multistage DTW.
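The horizontal smoothing step can be illustrated with a minimal
sketch. It assumes a binary image stored as rows of 0 (background)
and 1 (foreground), and it fills only background runs that are
bounded by foreground on both sides. The function names and the
max_gap threshold are illustrative assumptions, not values taken
from our system.

```python
def rlsa_horizontal(image, max_gap):
    """Fill short background runs between foreground pixels, row by row.

    image:   list of rows, each a list of 0 (background) / 1 (foreground).
    max_gap: assumed threshold; a run of 0s no longer than this that lies
             between two foreground pixels is set to 1.
    """
    smoothed = []
    for row in image:
        out = list(row)
        last_fg = None  # column of the most recent foreground pixel
        for x, pixel in enumerate(row):
            if pixel == 1:
                if last_fg is not None:
                    gap = x - last_fg - 1
                    if 0 < gap <= max_gap:
                        for k in range(last_fg + 1, x):
                            out[k] = 1
                last_fg = x
        smoothed.append(out)
    return smoothed


def foreground_spans(row):
    """Return (start, end) column spans of the foreground runs in one row."""
    spans, start = [], None
    for x, pixel in enumerate(row):
        if pixel == 1 and start is None:
            start = x
        elif pixel == 0 and start is not None:
            spans.append((start, x - 1))
            start = None
    if start is not None:
        spans.append((start, len(row) - 1))
    return spans
```

With a suitable max_gap, the gaps between characters of a word are
bridged while the wider gaps between words are left open, so each
remaining foreground span approximates one word.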
Document processing is carried out offline,
and an index file is created for each document
image. The coordinates of each word, the number and
position of characters in the word, and the
features of each character image are stored in the
index files. One advantage of building the index files
beforehand is that the user can select
the query word simply by clicking on it in the
graphical interface of our document processing
system. The same processing is applied to this query
word to extract its character features. The features
of the query word's characters are then
matched against the stored features of character
images in the corpus. Words for which the
matching distance is less than a pre-defined
threshold are returned as spotted
instances of the query word, which constitute the
retrieved information.
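The retrieval loop can be sketched as follows. The sketch uses the
classic single-stage DTW recurrence with a Euclidean local cost as a
stand-in; it is not the multistage DTW of our method. The index
layout (the keys "bbox" and "char_features"), the length
normalisation and the threshold value are all assumptions made for
illustration.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost
            cost[i][j] = local + min(cost[i - 1][j],      # step in a only
                                     cost[i][j - 1],      # step in b only
                                     cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def spot_word(query_features, index, threshold):
    """Return (bbox, distance) for indexed words below the threshold.

    index: list of dicts with hypothetical keys "bbox" (word coordinates)
           and "char_features" (one feature vector per character).
    """
    hits = []
    for entry in index:
        candidate = entry["char_features"]
        if not query_features or not candidate:
            continue  # nothing to align
        distance = dtw_distance(query_features, candidate)
        # Assumed normalisation so words of different lengths are comparable.
        distance /= len(query_features) + len(candidate)
        if distance < threshold:
            hits.append((entry["bbox"], distance))
    return sorted(hits, key=lambda hit: hit[1])
```

Because the index already holds the character features, answering a
query only requires the distance computation, not any further image
processing on the corpus.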
3 INDEXING
The first step towards the extraction of features from
a document image is the separation of text from
background, i.e. binarization. Binarization of
historical documents poses additional problems since
the quality of these collections is generally quite
INFORMATION RETRIEVAL FROM HISTORICAL DOCUMENT IMAGE BASE