High-Level Shape Representation in Printed Gujarati Characters

Mukesh M. Goswami

and Suman K. Mitra

Faculty of Technology, Dharmsinh Desai University, Nadiad, Gujarat, India

Dhirubhai Ambani Inst. of Information and Communication Tech., Gandhinagar, Gujarat, India

mgoswami.it@live.com, suman_mitra@daiict.ac.in

Keywords:

Pattern Recognition, Character Shape Representation, Shape Similarity, Character Recognition, Gujarati

Characters.

Abstract:

This paper presents the extraction and identification of high-level strokes (HLS) from printed Gujarati characters. The HLS feature describes a character as a sequence of predefined high-level strokes. Such a high-level shape representation enables approximate shape similarity computation between characters and can easily be extended to the word level. Shape similarity based character and word matching has extensive applications in word-spotting based document image retrieval and character classification. Therefore, the proposed features were tested on a printed Gujarati character database consisting of 12000 samples from 42 different symbol classes. The classification is performed using a k-nearest neighbor classifier with a shape similarity measure. Also, a shape similarity based printed Gujarati word matching experiment is reported on a small word image database, and the initial results are encouraging.

1 INTRODUCTION

India, being a multilingual country, has more than 22 officially listed languages written in 12 different scripts. Substantial work in character classification, OCR, and word-spotting is reported in the literature for dominant Indian scripts like Devanagari, Bengali, Tamil, and Telugu. However, many scripts, such as Gujarati, still lack the attention of researchers. Despite years of effort, the word-level accuracy of OCR systems for many Indian scripts has remained low compared to Western text, mainly due to the large and complex character set, which includes base characters, modifiers, and conjunct symbols (Kompalli et al., 2005). Therefore, researchers are motivated to explore the recognition-free approach for document image retrieval in many Indian scripts (Srihari et al., 2006; Hassan et al., 2009; Tarafdar et al., 2010; Jawahar et al., 2004). As a result, shape similarity based character and word recognition for Indian scripts has gained considerable interest in recent times.

The majority of recognition-free systems depend on shape-based features for the matching of characters and words (Doermann, 1998). Such a system demands features that are compact yet efficient in describing the high-level shape of characters. Also, it should be easy to compute the shape similarity between characters and words (Yang et al., 2008). Therefore, the current paper investigates a technique for compact, high-level shape representation of characters using a sequence of predefined high-level strokes (HLS). Such a sequential representation facilitates efficient shape similarity matching between characters and words using a dynamic programming based algorithm.

The rest of the paper is organized as follows: we

start with a brief discussion on the related work in

Section 2. Section 3 outlines the representation, ex-

traction, and identiﬁcation of high-level stroke. Sec-

tion 4 discusses shape similarity computation be-

tween characters. Section 5 describes the experimen-

tal setup for classiﬁcation of printed Gujarati char-

acters using k-nearest neighbor and shape similarity

measure as well as the shape similarity based word-

matching experiments. Finally, the paper is concluded

in Section 6.

2 RELATED WORK

Much work is found in the literature for the classiﬁ-

cation of characters from both North as well as South

Indian script families. The character classiﬁcation for

prominent North Indian scripts like Devanagari, Ben-

gali, Gurmukhi, and Oriya is discussed in (Chaud-

huri and Pal, 1998; Chaudhuri et al., 2001; Lehal

Goswami, M. and Mitra, S.

High Level Shape Representation in Printed Gujarati Character.

DOI: 10.5220/0006191104180425

In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2017), pages 418-425

ISBN: 978-989-758-222-6

and Singh, 2000), whereas the work on major Dra-

vidian languages such as Tamil, Telugu, and Kannada

is available in (Jawahar et al., 2003; Aparna and Ra-

makrishnan, 2002; Lakshmi and Patvardhan, 2002).

Less work is found on printed Gujarati character classification than on many other Indian scripts. Some of the early contributions include

work by Antani and Agnihotri (Antani and Agnihotri,

1999), who used moment features with minimum

Hamming distance and k-Nearest Neighbor (kNN)

classiﬁer to claim an accuracy of 67% on a small

database of 800 samples. The most noticeable con-

tributions include work by Dholakia et al. (Dholakia

et al., 2007) that uses wavelet features with Neural

Network and kNN classiﬁers. The overall accuracy

claimed was 96-97% on a database of 4173 symbols of 119 classes. Goswami et al. (Goswami et al., 2011) have used Self-Organizing Map (SOM) projection with the k-NN classifier and reported 84% accuracy on a moderately sized database. Hassan et al. (Hassan et al., 2014) have used a Multiple Kernel Learning based Support Vector Machine (MKL-SVM) classifier with multiple features, such as the fringe distance map (FM), shape descriptor (SD), and histogram of gradients (HoG). The accuracy claimed was

97-98% on a database of 16000 symbols including

modiﬁers, base characters, and conjuncts. Recently,

Goswami and Mitra (Goswami and Mitra, 2015) have

used low-level stroke features with the k-NN classi-

ﬁer for printed Gujarati character classiﬁcation and

claimed an accuracy of 95-98%.

Some initial work reported on word image retrieval for Devanagari, Bengali, and Sanskrit scripts includes (Chaudhury et al., 2003; Srihari et al., 2006; Bhardwaj et al., 2008), which use Geometric Graph, GSC, and Moment features, respectively. Kumar et al. (2007) and Meshesha and Jawahar (2008) make major contributions to document image retrieval for Telugu script. Their experiments use multiple features such as Fourier descriptors, projection profiles, and moments, and the word images are compared using dynamic time warping (DTW) (Rath and Manmatha, 2003). The experiments were carried out on a large word image database extracted from 1800 pages of 7 machine-printed Telugu books. Another notable contribution is (Hassan et al., 2009), which uses shape descriptor features with hierarchical locality sensitive indexing for word image retrieval from Devanagari, Bengali, and Malayalam scripts. Tarafdar et al. (2010) have used a sequence of shape codes with string edit distance for word image matching in Devanagari, Bengali, and Gurmukhi scripts. To the best of our knowledge, no work is reported in the literature on word image matching and retrieval for Gujarati script.

Figure 1: Formation of character as a set of high-level

strokes where each high-level stroke is described as some

sequence of shape primitives (i.e. lines, curves, and points).

The majority of the work reported until now uses

transform domain, geometrical, or statistical features

that give only a local view of the character shape and

also generate a large feature vector. The global shape

of the character can well be described using the native

high-level strokes used to form a character. However,

the decomposition of characters into native strokes has not been explored much for Indian scripts.

3 HIGH-LEVEL STROKE FEATURES

In the case of off-line text, an HLS can be described as a sequence of object pixels between two feature points in a one-pixel-wide thinned character image (as shown in Figure 1). In total, 48 major and minor HLS are identified that are visually non-redundant and sufficient to describe any middle zone symbol in Gujarati script (see Figure 2). Each HLS, in turn, is described as a sequence of shape primitives such as lines, curves, junction points, and endpoints. Figure 3 shows the outline of the proposed method. The process begins with the Low-Level Stroke (LLS) matrix computed using the template matching algorithm proposed in (Goswami and Mitra, 2015). The HLS are extracted from the LLS matrix using a junction point based stroke scanning algorithm (discussed in Section 3.2). The extracted HLS are identified using a linear chain Conditional Random Field (CRF) (Charles and McCallum, 2011) (discussed in Section 3.3).

3.1 Representation of High-Level Stroke

The HLS used to describe a character in Gujarati

script are highly cursive and elongated (see Figure 2)

Figure 2: The set of major and minor high-level strokes

present in Gujarati characters.

Figure 3: Outline of the high-level stroke extraction from

printed Gujarati characters.

and hence a direct mathematical representation is infeasible. However, any complex HLS shape can be described as a sequence of shape primitives such as points, lines, and curves (Figure 1). A template matching based algorithm was proposed in (Goswami and Mitra, 2015) to extract such shape primitives, called low-level strokes (LLS). The algorithm takes an M×N skew-corrected, binarized, thinned character image as input and generates an M×N matrix of LLS as output (as shown in Figure 4).

Figure 4: Output of the Low-Level Stroke extraction algorithm (Goswami and Mitra, 2015).

The LLS features can be used as elementary building blocks to represent an HLS (as shown in Figure 1). Apart from the sequence of LLS, the direction of each LLS is also needed. For example, as shown in Figure 5(a), the sequence of LLS is the same but the direction is different, resulting in two distinct HLS. Thus, an HLS can be defined as a sequence of Direction Encoded Low-Level Strokes (DLLS) between two feature points (i.e. junction points or endpoints). Figure 5(b) shows how the 10 basic LLS combine with 8 direction codes to obtain 18 different DLLS (Figure 5(c)).

3.2 Extraction of High-Level Stroke

A junction point based stroke extraction algorithm is proposed to obtain the direction and sequence information. The algorithm takes the M×N matrix of LLS obtained from the input character image (shown in Figure 4) and extracts the high-level strokes present in the character. Each LLS in the sequence is replaced by the corresponding DLLS depending on the direction information obtained while scanning. The steps of the junction point based scanning algorithm are as follows.

1. STEP 1: If a character has more than one stroke, it has at least one junction point. During the first step, the middle region of the LLS matrix is scanned in left-to-right order to obtain the list of junction points present in the matrix (Figure 6(a)).

2. STEP 2: The 3x3 neighborhood of each junction

point in the list is scanned in clockwise order to

obtain the starting point of each HLS originating

from the given junction point. The clockwise scan

resembles left to right and top to bottom writ-

ing order of Gujarati script. The touching junc-

tion points are handled by recursively invoking the

scanning algorithm for each junction in the 3x3

neighborhood of current junction point.

3. STEP 3: A contour tracing algorithm is used to extract the HLS, starting from the start point (obtained in Step 2) and continuing until either an endpoint or a junction point is reached. Each LLS in the contour has exactly two neighbors: the already visited previous point and the next unvisited point. The direction of an LLS is obtained by finding the relative position of the next point with respect to the current LLS. The LLS is then combined with the direction information to obtain the DLLS used to generate the output sequence. The HLS so extracted are deleted from the LLS matrix to avoid duplication.

4. STEP 4: Repeat Steps 2 and 3 for each junction point in the list obtained in Step 1.

Figure 5: (a) Different HLS with the same sequence of LLS but different direction (b) Direction encoding of LLS (c) Set of Direction Encoded Low-Level Strokes (DLLS).

5. STEP 5: Many characters in the Gujarati script are formed using a single high-level stroke and hence do not have any junction point. In such a scenario, the starting point is obtained by finding the first endpoint in the top-left quadrant of the LLS matrix using a zigzag scan. If no endpoint exists in the top-left quadrant, then the top-right, bottom-left, and bottom-right quadrants are scanned, in that order, to obtain the starting endpoint. If the character contains neither a junction point nor an endpoint (for example, the symbol "zero"), then the first LLS obtained in zigzag order from the top-left quadrant is selected as the starting point (see Figure 6(c)). Once the starting point is obtained, the contour tracing algorithm discussed in Step 3 is used to extract the HLS.
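The contour tracing of Step 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the LLS matrix is assumed to be a plain list of lists in which 0 means background and any non-zero value is an LLS code, the 8-direction numbering in `DIRECTIONS` is hypothetical (the paper's own codes are given in Figure 5(b)), and the `junctions` and `origin` arguments are illustrative names. Where a pixel has more than one unvisited neighbor, the sketch simply takes the first one in clockwise order.

```python
# Hypothetical direction codes 0..7, clockwise starting from "up".
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def neighbors(grid, r, c):
    """Return (direction code, row, col) for every non-zero 8-neighbor of (r, c)."""
    found = []
    for code, (dr, dc) in enumerate(DIRECTIONS):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]) and grid[rr][cc]:
            found.append((code, rr, cc))
    return found


def trace_stroke(grid, start, junctions=frozenset(), origin=None):
    """Follow a one-pixel-wide stroke from `start` until an endpoint or a junction.

    Returns a list of (LLS code, direction code) pairs, one per step taken.
    The last pixel reached has no "next point", so it contributes no pair.
    """
    visited = {start}
    if origin is not None:
        visited.add(origin)  # the junction the stroke leaves from, if any
    r, c = start
    sequence = []
    while True:
        candidates = [n for n in neighbors(grid, r, c) if (n[1], n[2]) not in visited]
        if not candidates:
            break  # endpoint reached
        code, rr, cc = candidates[0]  # first unvisited neighbor in clockwise order
        sequence.append((grid[r][c], code))
        visited.add((rr, cc))
        r, c = rr, cc
        if (r, c) in junctions:
            break  # junction reached
    return sequence


# Toy LLS matrix: a horizontal run of LLS code 1 followed by a vertical run of code 2.
example = [
    [1, 1, 1],
    [0, 0, 2],
    [0, 0, 2],
]
print(trace_stroke(example, (0, 0)))  # [(1, 2), (1, 2), (1, 4), (2, 4)]
```

Deleting the traced pixels from the matrix, as described in Step 3, and the quadrant-wise zigzag search for a starting point in Step 5 are left out to keep the sketch short.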

3.3 Identification of High-Level Stroke (HLS)

The stroke extraction algorithm discussed in the previous section represents every character sample having $k$ HLS as a set of $k$ ordered pairs $(y_1, X_1), (y_2, X_2), \ldots, (y_k, X_k)$, where $y_i$ gives the $i$-th HLS label and $X_i$ gives the corresponding DLLS vector. The identification of HLS refers to the problem of predicting the sequence of values $y_1, y_2, \ldots, y_k$ given the sequence of DLLS vectors $X_1, X_2, \ldots, X_k$, where each $X_i = \langle x_{i1}, x_{i2}, \ldots, x_{in} \rangle$. The stroke label $y_i \in \{1, 2, \ldots, 48\}$ and the elements of the DLLS vector $x_{ij} \in \{1, 2, \ldots, 18\}$. Since each HLS $y_i$ in the character depends on

Figure 6: High-Level Stroke Extraction using (a) Junction

Point (b) Endpoint or (c) First LLS in zigzag order from

top-left corner.

the corresponding DLLS vector $X_i$ as well as on the other $y$'s present in the character, the Conditional Random Field (CRF) (Charles and McCallum, 2011), an undirected probabilistic graphical model, is employed next to identify the HLS from the DLLS vector. The selection of CRF is justified in this context because it not only captures the dependency between HLS $y_i$ and DLLS vector $X_i$ but also considers the dependency between the current HLS $y_i$ and the other HLS present in the character. In the simplest case, known as the linear

chain CRF, the current HLS $y_i$ depends on the DLLS vector $X_i$ and the previous HLS $y_{i-1}$. The conditional probability of the set of HLS $Y = \{y_1, y_2, \ldots, y_k\}$ given the set of corresponding DLLS vectors $X = \{X_1, X_2, \ldots, X_k\}$ can be computed using Equation 1.

$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{k} \exp\left\{ \sum_{j} \lambda_j f_j(y_i, y_{i-1}, X_i) \right\} \qquad (1)$$

where $Z(X)$ gives the normalizing factor, $f_j$ gives the feature functions, and $\lambda_j$ gives the parameters of the CRF, which need to be learned from a labeled training database using a standard gradient optimization algorithm such as L-BFGS. The HLS $y_i$ in each character sample are labeled manually using a stroke labeling tool to generate the labeled stroke database required for training the CRF model.
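To make the linear chain form concrete, the following Python sketch evaluates the conditional probability of a label sequence by brute-force enumeration of all label sequences. It is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for the weighted sum of feature functions (it is not a learned model), the two-label set and the toy DLLS vectors are invented, and a practical CRF would compute the normalizer with the forward algorithm rather than by enumeration.

```python
import itertools
import math


def sequence_score(labels, xs, score):
    """Sum the per-position scores over the chain (the exponent of the product)."""
    total, prev = 0.0, None  # prev is None at the first position
    for y, x in zip(labels, xs):
        total += score(y, prev, x)
        prev = y
    return total


def crf_probability(labels, xs, label_set, score):
    """P(Y | X) for a linear chain, normalized over every possible label sequence."""
    z = sum(
        math.exp(sequence_score(candidate, xs, score))
        for candidate in itertools.product(label_set, repeat=len(xs))
    )
    return math.exp(sequence_score(labels, xs, score)) / z


def toy_score(y, y_prev, x):
    """Hypothetical score: reward a label equal to the first DLLS code,
    and penalize repeating the previous label."""
    s = 1.0 if y == x[0] else 0.0
    if y_prev is not None and y_prev == y:
        s -= 0.5
    return s


label_set = (1, 2)             # two stroke labels instead of 48, for brevity
xs = [(1, 3), (2,), (1,)]      # three toy DLLS vectors
p = crf_probability((1, 2, 1), xs, label_set, toy_score)
print(round(p, 3))
```

Enumeration costs grow exponentially with the number of strokes, which is acceptable here only because the example has three positions and two labels.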

4 SHAPE SIMILARITY COMPUTATION

The CRF model discussed in the previous section takes as input the sequence of DLLS vectors $\{X_1, X_2, \ldots, X_k\}$ and labels the corresponding HLS $\{y_1, y_2, \ldots, y_k\}$ for each character. Thus, a character, after HLS identification, is represented as a sequence of HLS $C = \langle y_1, y_2, \ldots, y_k \rangle$. Therefore, the approximate shape similarity between characters can be obtained by finding the similarity between the HLS sequences of the given characters. Since the entire HLS sequences must be compared (end-to-end alignment) to match the characters, the global sequence alignment technique proposed by Needleman and Wunsch (1970) can be used to find the regions of similarity between two HLS sequences.

The algorithm, shown in Figure 7, gives the length of the maximum matching subsequence between two sequences. However, unlike the traditional Longest Common Subsequence (LCS) algorithm, which does not assign a penalty to a mismatch, the Needleman-Wunsch (NW) algorithm assigns a penalty of -1 to a mismatch as well as to a gap. The similarity score is obtained by dividing the length of the maximum matching subsequence by the minimum of the lengths of the two sequences. The similarity score attains its highest value of 1 if both sequences are the same and is close to 0 if they are dissimilar. The set of all high-level strokes can be divided into two groups based on their importance in describing a character class, namely major and minor strokes (see Figure 2). Therefore, the indicator function, I, in the original algorithm is replaced by a customized similarity score matrix, $S_{48 \times 48}$, which gives more weight to major strokes than to minor strokes while computing the similarity between the HLS sequences.
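The similarity computation described above can be sketched in Python as follows. This is a minimal illustration, not the algorithm of Figure 7 verbatim: it uses a score of +1 for a match and -1 for a mismatch or a gap, the optional `match` callable is a hypothetical hook where a custom matrix such as $S$ could be plugged in, and clamping negative scores to 0 is an assumption made here so that the result stays in the range 0 to 1.

```python
def nw_score(a, b, match=None, penalty=-1.0):
    """Needleman-Wunsch global alignment score between two label sequences."""
    if match is None:
        # Default scoring: +1 for identical labels, the penalty otherwise.
        match = lambda p, q: 1.0 if p == q else penalty
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * penalty          # leading gaps in b
    for j in range(1, cols):
        dp[0][j] = j * penalty          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + match(a[i - 1], b[j - 1]),  # align two labels
                dp[i - 1][j] + penalty,                        # gap in b
                dp[i][j - 1] + penalty,                        # gap in a
            )
    return dp[-1][-1]


def shape_similarity(a, b, match=None):
    """Alignment score divided by the shorter length, clamped at 0 (assumption)."""
    if not a or not b:
        return 0.0
    return max(0.0, nw_score(a, b, match) / min(len(a), len(b)))


print(shape_similarity([3, 7, 12], [3, 7, 12]))  # 1.0
print(shape_similarity([3, 7, 12], [3, 12]))     # 0.5
print(shape_similarity([1, 2, 3], [4, 5, 6]))    # 0.0
```

The table has one cell per pair of positions, so the cost is proportional to the product of the two sequence lengths, which stays small because an HLS sequence contains only a handful of strokes.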

5 EXPERIMENTS AND RESULTS

Two experiments, printed character classification and word matching, are reported in this section to show the effectiveness of the shape similarity measure computed using the HLS representation.

5.1 Printed Gujarati Character Classification

The experiment for printed Gujarati character classification is performed using a k-NN classifier with the shape similarity measure as the distance function, to demonstrate the applicability of the shape similarity measure discussed above. The optimum value of k in k-NN depends on the distribution of samples in the database (Murphy, 2012). However, for any given database, not all classes have the same distribution (i.e. the samples of some classes are denser than others). Therefore, a single value of k may not give an optimum result for all the classes. The current experiment uses a simple heuristic to handle this issue. It first finds all the neighbors within a tight radius of the unknown sample (i.e. the 80% similarity region in this case) and predicts the class label using majority voting. However, if no data sample is found within the 80% similarity region (i.e. the class has a sparse distribution of samples), then the 1st nearest neighbor is used to predict the class label. Three-fold cross-validation is used to make the results more reliable. The average test accuracy over all three runs is used as the primary performance measure.
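The neighborhood heuristic described above can be sketched in Python as follows. The function and variable names are illustrative, and `overlap` is a hypothetical stand-in similarity used only to keep the example self-contained; in the experiment the similarity would be the HLS shape similarity measure of Section 4. The class labels in the toy data are invented.

```python
from collections import Counter


def classify(query, train, similarity, radius=0.8):
    """Majority vote among samples with similarity >= radius; else 1-nearest neighbor.

    `train` is a list of (sequence, label) pairs.
    """
    scored = [(similarity(query, seq), label) for seq, label in train]
    near = [label for score, label in scored if score >= radius]
    if near:
        return Counter(near).most_common(1)[0][0]
    # Sparse neighborhood: fall back to the single most similar sample.
    return max(scored, key=lambda pair: pair[0])[1]


def overlap(a, b):
    """Stand-in similarity: fraction of aligned positions with equal labels."""
    n = min(len(a), len(b))
    return sum(p == q for p, q in zip(a, b)) / n if n else 0.0


train = [((1, 2, 3), "ka"), ((1, 2, 4), "ka"), ((5, 6, 7), "ga")]
print(classify((1, 2, 3), train, overlap))  # ka  (one sample inside the 80% region)
print(classify((5, 6, 9), train, overlap))  # ga  (no sample inside, 1-NN fallback)
```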

The database used in the experiment consists of 12000 samples of 42 middle zone character symbols from Gujarati script. The samples are collected from three different sources, namely machine-printed books (BOOKDB), newspapers (NEWSDB), and laser-printed documents (LASERDB), to ensure variety in terms of font type, style, size, ink thickness, etc.

The results of the experiment are shown in Ta-

ble 1. The average test accuracy obtained on the com-

bined database is 94.97%. Table 2 shows the com-

parison of the results obtained with existing work. It

is evident that the results obtained are 2-3% lower

than the best-reported work (Hassan et al., 2014) in

the literature. The drop in the accuracy is mainly

due to the compactness of features since the size of

the feature vector in (Hassan et al., 2014) is almost

Figure 7: Compute shape similarity between characters using approximate string matching algorithm.

Table 1: Average test accuracy on BOOKDB, LASERDB,

and NEWSDB using k-NN classiﬁer with shape similarity

measure.

Database        1st Run    2nd Run    3rd Run    Average
BOOKDB          98.43%     98.77%     98.50%     98.57%
LASERDB         94.73%     94.67%     90.64%     93.35%
NEWSDB          93.51%     94.10%     89.08%     92.23%
ALL COMBINED    97.72%     95.07%     92.12%     94.97%

Table 2: Comparison of methods for classiﬁcation of

printed Gujarati characters.

Author                         Features / Classifier                                         Accuracy
Hassan et al. (2014)           HoG, Fringe Map and Shape Descriptor / MKL-SVM                97-98%
Dholakia et al. (2009)         Daubechies D4 Wavelet Feature / GRNN                          97.59%
Goswami and Mitra (2015)       Histogram of Low-Level Stroke / K-NN                          95.35%
Proposed Method                High-Level Strokes / K-NN with Shape Similarity measure       94.97%
Goswami et al. (2011)          SOM projection / k-NN                                         84%
Antani and Agnihotri (1999)    Moment features / K-NN with Minimum Hamming Distance          67%

100 times the size in the proposed approach. Unlike the features used in other character classification methods, the HLS provides a high-level view of the character shape and enables only approximate, rather than exact, matching between characters. Therefore, similar-looking characters, as shown in Figure 8, are very often misclassified. Since the HLS features are extracted from the thinned character image, the structural noise added by the thinning algorithm also affects the performance. Accordingly, the accuracy reported on BOOKDB (98.43%) is higher than on LASERDB (94.73%) and NEWSDB (93.51%), because the newspaper and laser-printed symbols tend to have higher structural noise introduced by thinning than the BOOKDB symbols (Suthar et al., 2014).

Figure 8: Characters with similar shape that are most com-

monly misclassiﬁed.

In summary, the HLS feature provides a compact representation of the high-level shape of the character. This compactness allows efficient comparison between character shapes. However, the features provide only inexact matching and are hence not well suited for character classification and OCR applications. Such approximate shape matching is desirable for word-spotting experiments, where the objective is to match a query word image with all of its morphological variants. Moreover, the sequential representation enables fast shape similarity computation using dynamic programming based sequence matching algorithms. Thus, the features could be useful in shape similarity based word-matching applications.

5.2 Printed Gujarati Word Matching

The idea of shape similarity computation using HLS can easily be extended to the word level. As shown in Figure 9, the skew-corrected, binarized, and thinned word image is first segmented into character symbols using connected component analysis. The HLS are extracted from each symbol using the stroke extraction algorithm (Section 3.2) and identified using the CRF

Figure 9: High-level stroke representation of Word Image.

(Section 3.3). The HLS sequences of all symbols are

concatenated to generate a single HLS sequence rep-

resenting the word-image. Finally, the shape simi-

larity between word-images is computed by ﬁnding

the global alignment score between the HLS sequences

representing the word-images (Section 4).

Figure 10: Shape Similarity matrix computed using HLS

representation of 280 Gujarati word images of 48 different

word groups.

The effectiveness of the shape similarity measure for word images was verified on a small word-group database consisting of 280 word-images of 48 different

word-groups. Each word-image in the database was

represented as HLS sequence and shape similarity be-

tween each pair of word-images was computed using

shape similarity measure discussed in Section 4. The

results were represented using a similarity matrix of

size 280 × 280 (shown visually in Figure 10). It is evident from Figure 10 that the HLS based shape similarity score between a pair of word-images in the same group is higher than that between word-images in different groups.

Figure 11: Similarity Threshold vs Precision and Recall

graph for Gujarati word-group database.

In the word image retrieval experiment, a ran-

dom query image is selected from each word group,

and all matching word images are retrieved from the

database based on the shape similarity threshold. The precision and recall values were computed from the number of relevant and retrieved images for each query and averaged over all queries. Figure 11 shows precision and recall as a function of the similarity threshold. The optimum values of precision and recall were 77.61% and 80.91%, respectively, at a similarity threshold of 0.62.
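The per-query evaluation can be sketched in Python as follows. This is an illustrative sketch: `sim_row` is assumed to hold the similarity of one query to every other image in the database (one row of the similarity matrix with the query itself left out), `relevant` is the set of indices belonging to the query's word group, and the numbers in the example are invented rather than taken from the experiment.

```python
def precision_recall(sim_row, relevant, threshold):
    """Precision and recall for one query at a given similarity threshold."""
    retrieved = {i for i, s in enumerate(sim_row) if s >= threshold}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(queries, threshold):
    """Average the per-query values; `queries` is a list of (sim_row, relevant)."""
    pairs = [precision_recall(row, rel, threshold) for row, rel in queries]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


# Toy query: images 0, 1 and 2 are relevant; images 0, 1 and 3 pass the threshold.
sim_row = [0.9, 0.7, 0.5, 0.65]
relevant = {0, 1, 2}
print(precision_recall(sim_row, relevant, 0.62))
```

Sweeping `threshold` over a range of values and plotting the two averages gives a curve of the kind shown in Figure 11.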

6 CONCLUSION AND FUTURE WORK

The paper discusses the representation, extraction, and identification of high-level strokes from printed Gujarati characters. The salient characteristics of HLS features are compactness, high-level shape description, ease of shape similarity computation, and extensibility to the word level. The features are tested on a moderately sized symbol-level database of printed Gujarati characters with font, size, style, and ink thickness variations. Experiments were also performed for shape similarity based word matching on a small Gujarati word-group database, and the initial results are encouraging. In the future, more extensive experiments can be carried out on a large word image

database. Also, the HLS features can be combined

with other features based on the detailed analysis of

error and misclassiﬁcation to improve the retrieval re-

sults.

REFERENCES

Antani, S. and Agnihotri, L. (1999). Gujarati character

recognition. In Proc. of the 5th Int. Conf. on Document Analysis and Recognition (ICDAR’99), pages

418–421.

Aparna, K. and Ramakrishnan, A. (2002). A complete tamil

optical character recognition system. In Lopresti, D.,

Hu, J., and Kashi, R., editors, Document Analysis Sys-

tems V, pages 53–57. Springer Berlin / Heidelberg.

Bhardwaj, A., Damien, J., and Govindaraju, V. (2008).

Script independent word spotting in multilingual doc-

uments. In Proc. of 2nd Int. Workshop on Cross Lin-

gual Information Access, pages 48–54.

Charles, S. and McCallum, A. (2011). Introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373.

Chaudhuri, B. and Pal, U. (1998). A complete printed

bangla ocr system. Pattern Recognition, 31(5):531–

549.

Chaudhuri, B., Pal, U., and Mitra, M. (2001). Automatic

recognition of printed oriya script. In Proc. of the 6th

Int. Conf. on Document Analysis and Recognition (ICDAR’01), pages 795–799. IEEE.

Chaudhury, S., Sethi, G., Vyas, A., and Harit, G. (2003).

Devising interactive access techniques for indian lan-

guage document images. In Proc. of the Int. Conf. on

Document Analysis and Recognition (ICDAR), pages

885–889.

Dholakia, J., Yajnik, A., and Negi, A. (2007). Wavelet fea-

ture based confusion character sets for gujarati script.

In Proc. of the Int. Conf. on Computational Intelli-

gence and Multimedia Applications, pages 366–370.

Doermann, D. (1998). The indexing and retrieval of docu-

ment images: A survey. Computer Vision and Image

Understanding, 70(3):287–298.

Goswami, M. and Mitra, S. K. (2015). Classiﬁcation of

printed gujarati characters using low-level stroke fea-

tures. ACM Trans. Asian Low-Resour. Lang. Inf. Pro-

cess., 15(4):25:1–26.

Goswami, M., Prajapati, H., and Dabhi, V. (2011). Classi-

ﬁcation of printed gujarati characters using som based

k-nearest neighbor classiﬁer. In Proc. of the Int. Conf.

on Image Information Processing, pages 1–5. IEEE.

Hassan, E., Chaudhury, S., and Gopal, M. (2009). Shape de-

scriptor based document image indexing and symbol

recognition. In Proc. of the 10th Int. Conf. on Doc-

ument Analysis and Recognition (ICDAR’09), pages

206–210.

Hassan, E., Chaudhury, S., and Gopal, M. (2014). Feature

combination for binary pattern classiﬁcation. Interna-

tional Journal of Document Analysis and Recognition

(IJDAR), 17(4):375–392.

Jawahar, C., Kumar, P., and Kiran, S. (2003). A bilingual

ocr for hindi-telugu documents and its applications. In

Proc. of the 7th Int. Conf. on Document Analysis and

Recognition (ICDAR’03), pages 408–412.

Jawahar, C. V., Balasubramanian, A., and M., M. (2004).

Word-level access to document image datasets. In

Proceedings of the workshop on computer vision,

graphics and image processing.

Kompalli, S., Setlur, S., and Govindaraju, V. (2005). Chal-

lenges in ocr of devanagari documents. In Proc. of the

8th Int. Conf. on Document Analysis and Recognition

(ICDAR’05), pages 1–5. IEEE.

Kumar, A., Jawahar, C., and Manmatha, R. (2007). Efﬁcient

search in document image collections. In Yagi, Y.,

editor, ACCV:LNCS, volume 1 of 4843, pages 586–

595. Springer-Verlag Berlin / Heidelberg.

Lakshmi, C. and Patvardhan, C. (2002). A multi-font ocr

system for printed telugu text. In Proc. of the Language Engineering Conference, pages 7–17.

Lehal, G. and Singh, C. (2000). A gurmukhi script recogni-

tion system. In Proc. of the 15th Int. Conf. on Pattern

Recognition (ICPR’00), pages 557–560.

Meshesha, M. and Jawahar, C. (2008). Matching of word

image for content-based retrieval from printed doc-

ument images. International Journal of Document

Analysis and Recognition (IJDAR), 11(1):29–38.

Murphy, K. (2012). Machine Learning: A Probabilistic Per-

spective. The MIT Press, Cambridge, Massachusetts

London, England.

Needleman, S. B. and Wunsch, C. D. (1970). A gen-

eral method applicable to the search for similarities

in the amino acid sequence of two proteins. Journal

of Molecular Biology, 48(3):443–453.

Rath, T. and Manmatha, R. (2003). Word image matching using dynamic time warping. In Proc. of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 521–527.

Srihari, S., Srinivasan, H., Huang, C., and Shetty, S. (2006).

Spotting words in latin, devanagari and arabic scripts.

Vivek, 16(3):2–9.

Suthar, S., Goswami, M., and Thakkar, A. (2014). Empir-

ical study of thinning algorithms on printed gujarati

characters and handwritten numerals. In Meenakshi,

N., editor, Proc. of the 2nd Int. Conf. on Emerging Re-

search in Computing, Information, Communication,

and Applications (ERCICA’14), volume 2, pages 104–

110. ELSEVIER.

Tarafdar, A., Mondal, R., Pal, S., Pal, U., and Kimura, F.

(2010). Shape code based word-image matching for

retrieval of indian multi-lingual documents. In Proc.

of the Int. Conf. on Pattern Recognition (ICPR), pages

1989–1992.

Yang, M., Kpalma, K., and Ronsin, J. (2008). A survey of

shape feature extraction techniques. In Yin, P., editor,

Pattern Recognition, pages 43–90. IN-TECH.
