Document Image Classification Via AdaBoost and ECOC Strategies Based on SVM Learners

Mehmet Ahat (1,2), Cagdas Ulas and Onur Agin

(1) R&D and Special Projects Department, Yapi Kredi Bank, Gebze, Kocaeli, Turkey

(2) Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Istanbul, Turkey

Keywords:

Document Image Retrieval and Classiﬁcation, SVM, One-Versus-All, AdaBoost, ECOC, BoVW Model.

Abstract:

In this paper, we describe easily extractable features and an approach for document image retrieval and classification at the spatial level. The approach is based on the content of the image and, by utilizing visual similarity, provides high-speed classification of noisy text document images without optical character recognition (OCR). Our method involves a bag-of-visual-words (BoVW) model on the designed descriptors and a Random-Window (RW) technique to capture the structural relationships of the spatial layout. Using features based on this information, we analyze different multiclass classification methods as well as an ensemble classifier method with the Support Vector Machine (SVM) as a base learner. The results demonstrate that the proposed method for obtaining structural relations is competitive for noisy document image categorization.

1 INTRODUCTION

Document image retrieval is a crucial research area dealing with the problem of retrieving structurally similar document images from a large heterogeneous collection given a relevant image, which is useful for document database management, information extraction and document routing (Hu et al., 1999). Today, large quantities of paper documents are converted into electronic form and stored as document images in digital libraries. Storing scanned document images alone does not suffice; it is beneficial only when relevant documents can be retrieved efficiently. However, the number of relevant documents provided for retrieval is usually much lower than the number of irrelevant documents, which creates a class imbalance in the data and makes the retrieval problem more challenging (Zheng et al., 2004).

Several methods have been applied to document image retrieval and categorization. Most of these methods are based on either the layout (structure) or the content of the documents. Content-based approaches are highly dependent on the quality of OCR, and obtaining an OCR result for the entire document is excessively expensive in terms of time (Kumar et al., 2012). As shown in Figure 1, images from different types of documents often have quite distinct spatial layout styles. These distinctions remain identifiable even at very low resolutions, which allows us to develop faster algorithms for document classification than content-based techniques (Hu et al., 1999). One of the well-known methods for layout similarity is based on block segmentation, where the image is divided into several structural blocks (Fan et al., 2001) and these blocks are then compared to their counterparts for the given type of document. Another popular method creates spatial-pyramid features (Lazebnik et al., 2006) by partitioning the image into smaller grids and computing the density of features in each region. Finding efficient ways to capture structural relationships at the local level is an important research problem, and several methods (Yang and Newsam, 2011; Kumar and Doermann, 2013) have been proposed to address it.

In this work, we propose a method for the classification of noisy document images. All the documents that we consider are extracted from our real-world bank dataset, which consists of various types of forms (see Figure 1) mostly used in loan applications and scanned by bank branch employees for conversion to digital format. Our approach employs different multiclass classification methods and AdaBoost-based ensemble classifiers with SVM as a base learner to predict the document type, using a set of features that represent the structural information at the spatial layout level. Our work differs from previous approaches in several ways: (1) We represent each small region of the image with a very small number of feature descriptors, which results in fast feature extraction. (2) Images are represented with fewer visual codewords than in other methods that use SIFT features (Smith and Harvey, 2011) as the key descriptor. (3) Without any use of irrelevant images in the training phase, the proposed method can achieve a high recall rate for the correct detection of irrelevant images. (4) We compare the performance of different multiclass classification strategies as well as ensemble classifiers and demonstrate that the results are especially competitive when SVM-based ensemble classifiers are used for this kind of imbalanced dataset problem.

Ahat M., Ulas C. and Agin O. Document Image Classification Via AdaBoost and ECOC Strategies Based on SVM Learners. DOI: 10.5220/0005131502500255. In Proceedings of the International Conference on Neural Computation Theory and Applications (NCTA-2014), pages 250-255. ISBN: 978-989-758-054-3. Copyright © 2014 SCITEPRESS (Science and Technology Publications, Lda.)

The remainder of this paper is structured as fol-

lows: In Section 2, the proposed noisy document im-

age classiﬁcation method is presented. We provide

a description of the experimental setup and demon-

strate the classiﬁcation results in Section 3 and ﬁnally

we give concluding remarks in Section 4.

2 PROPOSED METHOD

The proposed method consists of the following steps: the definition of the feature descriptors, the use of the BoVW model with the Random-Window approach, and the analysis of SVM-based multiclass classification strategies and ensemble classifiers. The following sections provide the details of these steps.

2.1 Feature Descriptors

The document images in our business form database are binary (monochrome) images whose pixels can take only one of two values (0 or 1), so they carry less information than color and gray-scale images. Hence, features specifically designed for this type of image are needed.

In this work, we divide each training and test image into small square patches to obtain a 60 × 40 image layout with 2400 image patches, and represent each patch with 4 feature descriptors based on structural variations in small local areas.

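Figure 1: Sample business forms in the dataset. Each form is an example from one of the relevant classes.

If $P_w$ denotes the width (number of columns) of the image patch, $P_h$ denotes the height (number of rows) of the patch, and $w(i, j)$ represents the value of the pixel at the $i$th row and $j$th column, the feature descriptors are calculated as follows:

1. Column Standard Deviation ($\sigma_c$)

$$\sigma_c = \sqrt{\mathrm{Variance}(X_c)} \qquad (1)$$

where $X_c = [X_1 \ldots X_{P_w}]$ and each $X_i = \sum_{j=1}^{P_h} w(j, i)$

2. Row Standard Deviation ($\sigma_r$)

$$\sigma_r = \sqrt{\mathrm{Variance}(X_r)} \qquad (2)$$

where $X_r = [X_1 \ldots X_{P_h}]$ and each $X_i = \sum_{j=1}^{P_w} w(i, j)$

3. Patch Mean Value ($m_p$)

$$m_p = \frac{1}{P_w P_h} \sum_{j=1}^{P_w} \sum_{i=1}^{P_h} w(i, j) \qquad (3)$$

4. Pixel Transition Intensity ($t_p$)

$$t_p = \frac{1}{P_h (P_w - 1)} \sum_{j=1}^{P_h} \sum_{i=2}^{P_w} \mathrm{trans}(j, i) \qquad (4)$$

where $\mathrm{trans}(j, i)$ is defined as follows:

$$\mathrm{trans}(j, i) = \begin{cases} 1 & \text{if } w(j, i-1) = 1 \text{ and } w(j, i) = 0 \\ 0 & \text{otherwise} \end{cases}$$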
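As a minimal sketch of Equations (1)-(4), the following Python function computes the four descriptors for a single binary patch. The function name `patch_descriptors`, the use of the population standard deviation, and the normalization of the transition count by the number of horizontally adjacent pixel pairs are assumptions made for illustration; the text above does not fix these details.

```python
from statistics import pstdev


def patch_descriptors(patch):
    """Compute the four descriptors of a binary patch.

    `patch` is a list of rows, each row a list of 0/1 pixel values.
    The patch is assumed to be at least two pixels wide.
    """
    p_h = len(patch)        # patch height (number of rows)
    p_w = len(patch[0])     # patch width (number of columns)

    # Sum of the pixel values in every column and every row.
    col_sums = [sum(patch[j][i] for j in range(p_h)) for i in range(p_w)]
    row_sums = [sum(row) for row in patch]

    # Assumption: population standard deviation of the sums.
    sigma_c = pstdev(col_sums)
    sigma_r = pstdev(row_sums)

    # Mean pixel value of the patch.
    mean = sum(row_sums) / (p_w * p_h)

    # Count 1 -> 0 transitions between horizontally adjacent pixels.
    transitions = sum(
        1
        for row in patch
        for i in range(1, p_w)
        if row[i - 1] == 1 and row[i] == 0
    )
    # Assumption: normalize by the number of adjacent pixel pairs.
    t = transitions / (p_h * (p_w - 1))

    return sigma_c, sigma_r, mean, t
```

Calling the function on every patch of an image would yield the 2400 four-dimensional descriptors that are later quantized into visual words.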

2.2 Bag of Visual Words (BoVW) Model

In computer vision, the BoVW model (Csurka et al., 2004) can be applied to image classification and related tasks by treating image descriptors as words. A bag of visual words is a sparse vector of occurrence counts (or presence indicators) of the visual words from a vocabulary of local image features. A vocabulary (or codebook) of visual words is obtained by clustering local image descriptors extracted from training images, which is also described as vector quantization of image features into visual words. The vector quantization is generally done by a hard or soft assignment (clustering). Visual words (codewords) are then defined as the centers of the learned clusters.

DocumentImageClassificationViaAdaBoostandECOCStrategiesBasedonSVMLearners

251

In this work, we use k-means clustering (Winn et al., 2005) to determine the codebook of visual words. The number of clusters is empirically set to 4, based on the intuition that the structural variations inside the local patches of text-document images are small. After obtaining the visual words, each training and test image is represented as a sequence of visual words in a 60 × 40 layout. This layout is given as the input to the “Random Window” generator, where a predefined number of windows inside the layout is selected and each window is represented with a set of features based on structural relations.
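The quantization step can be illustrated with a small, dependency-free k-means sketch. The helper names (`nearest`, `kmeans`, `to_layout`), the random initialization, the fixed number of iterations and the choice of 40 columns per layout row are assumptions for illustration only; any standard k-means implementation could be substituted.

```python
import random


def nearest(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centers[c], x)),
    )


def kmeans(points, k=4, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers (visual words)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return centers


def to_layout(descriptors, centers, n_cols=40):
    """Map patch descriptors (row-major order) to a grid of word indices."""
    words = [nearest(centers, d) for d in descriptors]
    return [words[r:r + n_cols] for r in range(0, len(words), n_cols)]
```

With 2400 descriptors per image and `n_cols=40`, `to_layout` returns 60 rows of 40 visual-word indices, each in the range 0 to 3 when `k=4`.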

2.3 Random Window (RW) Approach

As mentioned in Section 2.2, we represent our document images by visual words. However, visual words alone are not enough to discriminate the structure of different documents. Usually each document type contains particular image patterns inside sub-images whose coordinates and sizes are unknown. In order to capture spatial relationships, after converting all document images into a sequence of visual words in a 60 × 40 layout, we randomly select rectangular windows inside the layout and extract layout features using the following approach:

Let $RW$ be the set of coordinates of the randomly selected windows, where the $i$th window is represented as

$$RW_i = \{(m'_i, m''_i), (n'_i, n''_i)\}$$

where $0 \le m'_i \le m''_i \le 60$ and $0 \le n'_i \le n''_i \le 40$.

Let $V$ be the number of visual words available in the vocabulary and $S$ be the set of occurrence counts of the visual words for the given windows. Hence, $S_i = \{s^i_1, \ldots, s^i_V\}$, where $s^i_j$ is the occurrence count of the $j$th visual word in $RW_i$. For a given $RW_i$, the feature vector $F_i$ is defined as

$$F_i = [s^i_1/\eta,\; s^i_2/\eta,\; \ldots,\; s^i_V/\eta]$$

where $\eta$ is the normalization constant calculated as

$$\eta = \sqrt{(s^i_1)^2 + (s^i_2)^2 + \ldots + (s^i_V)^2}$$

for the set $S_i$.
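The window sampling and the normalized histogram can be sketched as follows. This is only an illustration: the function names, the interpretation of the first coordinate pair as row bounds on a 60-row by 40-column grid, the Euclidean form of the normalization constant, and the restriction to non-empty windows are assumptions that the description above does not settle.

```python
import math
import random


def random_windows(n_windows, rows=60, cols=40, seed=0):
    """Sample window bounds (m1, m2, n1, n2) with m1 < m2 and n1 < n2."""
    rng = random.Random(seed)
    windows = []
    for _ in range(n_windows):
        # Two distinct bounds per axis, so every window is non-empty.
        m1, m2 = sorted(rng.sample(range(rows + 1), 2))
        n1, n2 = sorted(rng.sample(range(cols + 1), 2))
        windows.append((m1, m2, n1, n2))
    return windows


def window_feature(layout, window, vocab_size=4):
    """Normalized visual-word histogram of one window of the layout."""
    m1, m2, n1, n2 = window
    counts = [0] * vocab_size
    for row in layout[m1:m2]:
        for word in row[n1:n2]:
            counts[word] += 1
    # Assumption: eta is the Euclidean norm of the count vector.
    eta = math.sqrt(sum(s * s for s in counts))
    if eta == 0:
        return [0.0] * vocab_size
    return [s / eta for s in counts]
```

The same list of windows has to be reused for every training and test image so that corresponding feature entries describe the same region of the layout.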

2.4 Support Vector Machine (SVM)

SVM is basically conceived for binary classiﬁcation.

The idea is to separate two classes by calculating

the maximum margin hyperplane between the train-

ing examples (Vapnik, 1998). The decision function

of SVM for a binary classiﬁcation problem is

$$f(x) = \langle w, \phi(x) \rangle + b \qquad (5)$$

where $\phi(x)$ is a mapping of sample $x$ from the input space to a high-dimensional feature space and $\langle \cdot, \cdot \rangle$ denotes the dot product in the feature space. The optimal values of $w$ and $b$ can be determined by solving the following optimization problem:

$$\begin{aligned} \text{minimize} \quad & g(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} \quad & y_i(\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \end{aligned} \qquad (6)$$

where $\xi_i$ is the $i$th slack variable and $C$ is the regularization parameter. The minimization problem in (6) can be written as

$$\begin{aligned} \text{minimize} \quad & W(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j k(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{N} y_i \alpha_i = 0, \quad \forall i : 0 \le \alpha_i \le C \end{aligned} \qquad (7)$$

where $\alpha_i$ is the Lagrange multiplier corresponding to sample $x_i$ and $k(\cdot, \cdot)$ is a kernel function which implicitly maps the input vectors into a suitable feature space. In this space, an optimal separating hyperplane is constructed by the support vectors.

$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \qquad (8)$$

In this work, we use the RBF kernel, $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. The performance of RBF-SVM is mainly affected by the kernel parameter σ and the regularization parameter $C$. By using model selection techniques such as k-fold or leave-one-out cross-validation (CV), a single best σ and $C$ can be found (Li et al., 2008).
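To make the role of σ concrete, the RBF kernel can be evaluated directly, as in the short sketch below. The function names are chosen here for illustration and are not part of any particular SVM library.

```python
import math


def rbf_kernel(x, z, sigma):
    """RBF kernel exp(-||x - z||^2 / (2 * sigma^2)) for two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(samples, sigma):
    """Kernel matrix K[i][j] = k(x_i, x_j) over a list of samples."""
    return [[rbf_kernel(x, z, sigma) for z in samples] for x in samples]
```

For a very large σ all kernel values approach 1, so the samples become nearly indistinguishable in the feature space; for a small σ the kernel value drops quickly with distance, which allows a more flexible decision boundary.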

The following briefly describes the notation used in this paper:

• $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$: a training set, where $x_i \in \mathbb{R}^d$ and each label $y_i$ is an integer value belonging to $Y = \{l_1, \ldots, l_{N_c}\}$, where $N_c$ is the number of classes.

• $h = \{h_1, \ldots, h_n\}$: a set of $n$ binary classifiers.

2.5 Multiclass SVM Classiﬁers

2.5.1 One-Versus-All (OVA)

The one-versus-all method constructs $n$ binary classifiers, one for each class. The $i$th classifier, $h_i$, is trained with the data from class $i$ as positive instances and all data from the other classes as negative instances, so that it discriminates between the patterns of that class and the patterns of the remaining classes (Bagheri et al., 2012). A new instance is assigned to the class whose corresponding classifier output has the largest value (probability). Hence, the ensemble decision function is defined as:

$$y = \operatorname*{argmax}_{i \in \{1, 2, \ldots, n\}} h_i(x) \qquad (9)$$
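Equation (9) amounts to a single argmax over the per-class outputs. In the sketch below, `scores` is a hypothetical list holding the outputs of the binary classifiers for one instance, in class order.

```python
def ova_predict(scores):
    """Return the 1-based index of the classifier with the largest output."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best + 1
```

If several classifiers share the largest output, this sketch returns the first of them; the text does not state how such ties are resolved.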

NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications

252

2.5.2 Error Correcting Output Codes (ECOC)

The ECOC framework is widely used for multiclass classification problems. It is based on combining binary classifiers and designing a codeword for each class. Since each class is coded by a different codeword, the scheme may exhibit error-correcting capabilities, which increases the accuracy on the multiclass problem (Dietterich and Bakiri, 1995). Let $M$ be a coding matrix of dimension $N_c \times L$, where $N_c$ is the number of classes, whose elements $m_{i,j}$ take values in $\{-1, +1\}$ for dense ECOC models. $L$ is the length of the codewords used for class assignment. Each column of $M$ defines a binary classifier that separates the classes labeled +1 from the classes labeled −1. Later on, the ECOC method was extended so that elements can take values in $\{-1, 0, +1\}$, where classes labeled “0” in a column of the coding matrix are not considered during the training phase of the corresponding binary classifier (Allwein et al., 2001).

$$M = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,L} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ m_{N_c,1} & m_{N_c,2} & \cdots & m_{N_c,L} \end{bmatrix}$$

When an instance $x$ is tested, each binary classifier predicts −1 or +1 for $x$, and these predictions form an output code vector of length $L$. The instance $x$ is assigned to the class whose codeword has the minimum distance to this vector. The distance is usually the Hamming distance between the output code vector and the class codewords.
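Hamming decoding can be sketched in a few lines. The function name and the treatment of zero entries (positions where the codeword is 0 are skipped when counting disagreements) are assumptions; other conventions for zero entries exist.

```python
def hamming_decode(code_matrix, outputs):
    """Return the row index of the codeword closest to `outputs`.

    `code_matrix` is a list of codewords with entries in {-1, 0, +1};
    `outputs` is the list of -1/+1 predictions of the binary classifiers.
    """
    def distance(codeword):
        # Assumption: zero entries do not contribute to the distance.
        return sum(1 for m, o in zip(codeword, outputs) if m != 0 and m != o)

    distances = [distance(row) for row in code_matrix]
    return min(range(len(distances)), key=distances.__getitem__)
```

For example, with the three codewords `[1, -1, -1]`, `[-1, 1, -1]` and `[-1, -1, 1]`, the output vector `[-1, 1, -1]` has distance 0 to the second codeword, so row index 1 is returned.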

2.6 AdaBoost SVM

Although the use of SVM as a component (base) classifier in AdaBoost may not seem to be in accordance with the Boosting principle, since SVM itself is not a weak classifier, the AdaBoostSVM proposed in (Li et al., 2008) demonstrates that it can show better generalization performance than SVM on imbalanced classification problems such as the one we consider in this study. The key idea of AdaBoostSVM is that, for a sequence of trained SVM component classifiers, starting with large σ values (which imply weak learning), the σ values are reduced progressively as the Boosting iteration proceeds. This effectively produces a set of component classifiers whose model parameters are adaptively different, resulting in better generalization compared to using a fixed (optimal) σ value.

The AdaBoostSVM method (Li et al., 2008), which was proposed for binary classification problems, can be easily modified for the OVA multiclass approach. The pseudo-code of OVA-AdaBoostSVM is provided in Algorithm 1.

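Algorithm 1: OVA-AdaBoostSVM.

Input: $T$, $Y$; the number of training samples, $N$; the initial σ, $\sigma_{ini}$; the minimum σ, $\sigma_{min}$; the step of σ, $\sigma_{step}$

for each $l_k \in Y$ do

    Apply the class binarization for class $l_k$ on $T$.

    Initialize the weights of training samples: $w_i^1 = 1/N$, for all $i = 1, \ldots, N$.

    while $\sigma > \sigma_{min}$ do

        (1) Train an RBF-SVM component classifier, $h_t$, on the weighted $T$.

        (2) Calculate the training error of $h_t$: $\varepsilon_t = \sum_{i=1}^{N} w_i^t$, $y_i \ne h_t(x_i)$.

        (3) If $\varepsilon_t > 0.5$, decrease the σ value by $\sigma_{step}$ and go to step (1).

        (4) Set the weight of $h_t$: $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$.

        (5) Update the weights of training samples: $w_i^{t+1} = \frac{w_i^t \exp\{-\alpha_t y_i h_t(x_i)\}}{C_t}$, where $C_t$ is a normalization constant and $\sum_{i=1}^{N} w_i^{t+1} = 1$.

    end while

    $f_k(x) = \mathrm{sign}\left(\sum_{t=1} \alpha_t h_t(x)\right)$

end for

Output: $y(x) = l_k \in Y$, where $f_k(x) = 1$.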

3 EXPERIMENTAL RESULTS

In this work, we consider the classification problem of document images from 8 types (classes) of business forms, as shown in Figure 1. We use our own data set to evaluate the performance of the proposed classification approach. Our training data consists of 50 image samples for each class. The test data consists of 2066 image samples: 1267 of these are irrelevant images, whereas 799 are relevant images. Most of the images in the training and test datasets were contaminated with marginal (clutter) noise and salt-and-pepper noise during scanning, transmission or conversion to digital form.

Three different methods are compared in this study for this multiclass classification problem: OVA-SVM, OVA-AdaBoost-SVM and Sparse ECOC-SVM. For OVA and ECOC based SVM modeling, 5-fold CV is conducted for both parameter tuning and assessing generalization capability. For AdaBoost-SVM, the regularization parameter $C$ is empirically set to 10 for all experiments. $\sigma_{min}$ is computed as the average minimum distance between any two training samples inside the subset of training data, and $\sigma_{ini}$ is set to the $L_2$ norm of the average of the training samples in the input space. Lastly, $\sigma_{step}$ is set to 2. For ECOC-SVM, we use Sparse ECOC models with Hamming decoding. We initialize the codebook (coding matrix) with the OVA class binarization and randomly generate the rest of the codebook, each row (codeword) of which can have zero elements with probability 0.3. $15 \times \log(N_c)$ binary SVMs are trained, where $N_c$ is the number of classes. Moreover, a fresh codebook is generated with the same settings in each run.

Table 1: Performance values for each method when the number of windows is 400 (Irrelevant HR = irrelevant class average hit rate (recall); Relevant HR = relevant classes average hit rate).

Method        Precision  Accuracy  F-score  Irrelevant HR  Relevant HR
OVA           0.9293     0.9433    0.9176   0.9335         0.9414
AdaBoost      0.9351     0.9520    0.9261   0.9962         0.9144
Sparse ECOC   0.9209     0.9347    0.9062   0.9089         0.9176
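One way to build such a sparse codebook is sketched below. The function name, the rejection of random columns that lack either a +1 or a −1 entry, and the use of a base-2 logarithm when choosing the number of columns in the usage note are assumptions; the description above only fixes the OVA initialization, the zero probability of 0.3 and the $15 \times \log(N_c)$ count.

```python
import random


def sparse_codebook(n_classes, n_columns, p_zero=0.3, seed=0):
    """Coding matrix: OVA columns first, then random sparse columns.

    Requires n_columns >= n_classes. Entries are in {-1, 0, +1}.
    """
    rng = random.Random(seed)
    # OVA part: column c separates class c (+1) from all others (-1).
    matrix = [[1 if c == r else -1 for c in range(n_classes)]
              for r in range(n_classes)]
    while len(matrix[0]) < n_columns:
        column = [0 if rng.random() < p_zero else rng.choice((-1, 1))
                  for _ in range(n_classes)]
        # Assumption: keep only columns that define a valid binary
        # problem, i.e. that contain both a +1 and a -1 entry.
        if 1 in column and -1 in column:
            for r in range(n_classes):
                matrix[r].append(column[r])
    return matrix
```

Under the base-2 assumption, 8 classes would give `math.ceil(15 * math.log2(8))`, i.e. 45 columns, of which the first 8 are the OVA columns.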

Because the results vary between runs, we run each method several times using the training and test data. The final performance of each algorithm on the data set is the average of the results over all runs. The decision procedure of each method is as follows: If the predicted label of a test image is “-1” for all binary classifiers, the test image is assigned to class “0”, which is also called the irrelevant class. If only one of the OVA classifiers returns the label “+1”, the class label of the test image is the class associated with that OVA classifier. Lastly, if more than one of the OVA classifiers returns “+1”, the class with the highest probability estimate or the smallest distance (for ECOC-SVM) is chosen as the classification decision.
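The three-way decision rule for the OVA case can be written compactly as follows. The function name and argument layout are hypothetical, and restricting the tie-break to the classifiers that returned “+1” is an assumption; for ECOC-SVM the probability would be replaced by a (negated) codeword distance.

```python
def decide(labels, scores):
    """Apply the decision procedure to one test image.

    `labels` holds the -1/+1 outputs of the per-class classifiers and
    `scores` their probability estimates, both in class order.
    Returns 0 for the irrelevant class or a 1-based class index.
    """
    positives = [k for k, label in enumerate(labels) if label == 1]
    if not positives:
        return 0                      # rejected by every classifier
    if len(positives) == 1:
        return positives[0] + 1       # exactly one classifier accepts
    # Several classifiers accept: pick the most confident one.
    return max(positives, key=lambda k: scores[k]) + 1
```

Because an image is labeled irrelevant only when every classifier rejects it, no irrelevant images are needed during training.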

The performance of the proposed algorithm with the 3 different SVM-based approaches is evaluated using the following criteria: precision, recall, accuracy, F-score, irrelevant class hit rate (recall) and relevant classes hit rate. The macro-averaged precision, recall, accuracy and F-score values are calculated as in (Sokolova and Lapalme, 2009). The irrelevant class hit rate is calculated by dividing the number of correctly classified irrelevant images by the total number of irrelevant images in the test data, using the aforementioned decision procedure. The relevant classes hit rate is determined in the same way, but assuming that there is no irrelevant class: as opposed to the aforementioned decision procedure, the class decision for relevant images is made only according to the highest probability estimate or the smallest distance.
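A hit rate is the recall of a single class, and macro-averaging takes the unweighted mean of the per-class values. The sketch below illustrates both; the function names are chosen here, and classes with no predicted or no true samples are given a value of 0, which is an assumption.

```python
def hit_rate(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that are predicted as `cls`."""
    total = sum(1 for t in y_true if t == cls)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / total if total else 0.0


def macro_precision_recall(y_true, y_pred, classes):
    """Unweighted mean of per-class precision and recall."""
    precisions, recalls = [], []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        actual = sum(1 for t in y_true if t == cls)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n
```

With class 0 denoting the irrelevant class, `hit_rate(y_true, y_pred, 0)` corresponds to the irrelevant class hit rate described above.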

The performance values for each method when the number of randomly generated windows is fixed to 400 are shown in Table 1. The results in Table 1 demonstrate that the AdaBoost-SVM method outperforms the others on our data set in terms of all criteria except the relevant classes hit rate. This method achieves a classification accuracy of 95.2% and an irrelevant hit rate of 99.6%, which implies almost perfect detection of irrelevant images. The Sparse ECOC-SVM based classification method exhibits the worst performance among the 3 classification approaches, with an F-score of 0.906. One possible reason is that the performance of the ECOC method is drastically affected by the association between a class label and its codeword representation (Crammer and Singer, 2000). Generating a random code matrix with fixed settings can create sub-optimal codeword representations that lead to the lowest performance. Figure 2 shows the average recall values over all runs versus the number of randomly selected windows, in the range [50, 600]. Generally, the recall increases with the number of windows because more information on the statistics of the visual words becomes available. The best value is achieved when the number of windows is 400, where AdaBoost performs best with a 92.8% recall rate. Next, McNemar's statistical test (Li et al., 2008) is employed to determine the significance of the results presented in Table 1. Table 2 shows the McNemar's test results for each method pair. The results are obtained by averaging the test statistics over all runs for each method pair. The test results illustrate that the performance of AdaBoost differs significantly from that of the other two methods on this data set ({7.46, 21.69} > 3.8414). The underlying reason is the Boosting mechanism, which forces several SVM component classifiers on imbalanced data sets to focus on the misclassified samples from the minority class and prevents them from being wrongly classified (Li et al., 2008).
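For reference, McNemar's statistic for a pair of classifiers depends only on the test samples on which exactly one of the two is correct. The sketch below uses the continuity-corrected form of the statistic; whether the reported values were computed with this exact variant is an assumption, and the function name is hypothetical.

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic for two classifiers."""
    # b: samples that A classifies correctly and B misclassifies.
    b = sum(1 for t, a, c in zip(y_true, pred_a, pred_b)
            if a == t and c != t)
    # c: samples that B classifies correctly and A misclassifies.
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b)
            if a != t and p == t)
    if b + c == 0:
        return 0.0                 # the classifiers never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)
```

The statistic is compared against 3.8414, the critical value of the chi-squared distribution with one degree of freedom at the 0.05 level; larger values indicate a significant difference between the two classifiers.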

Our results in Table 1 indicate that none of the three methods achieves the best performance in terms of both Irrelevant HR and Relevant HR at the same time. A good future direction for this work would be to use another strong classifier, such as Random Decision Forest (RDF) (Yao et al., 2011), combined with SVM for better prediction of both irrelevant and relevant classes.

Table 2: McNemar's statistics between all method pairs when the number of windows is fixed to 400.

Method Pairs              McNemar's statistic (χ²)
OVA - AdaBoost            7.46
AdaBoost - Sparse ECOC    21.69
OVA - Sparse ECOC         21.02

NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications

254

0 100 200 300 400 500 600

100

Number of randomly selected window

Average Recall (%)

OVA−SVM

OVA−AdaBoost−SVM

Sparse−ECOC−SVM

Figure 2: Average Recall values versus the number of ran-

domly selected window for each method.

4 CONCLUSION

In this paper, we have proposed a method for the classification and retrieval of business-form document images. In our method, we incorporate the BoVW model using a set of features based on structural variations in local image patches and present an approach to learn the histogram of visual words at the layout level. Using real-world bank data, we analyze different multiclass classification strategies and an ensemble classifier (Boosting) method with SVM as a base learner. Although the initial results in this study seem promising, we believe that the proposed document image classification approach should also be investigated on benchmark datasets. Furthermore, the effectiveness of the proposed local feature descriptors should be compared with that of existing descriptors in the literature, e.g., SIFT and SURF. Both of these issues remain as future work to validate the robustness of the proposed approach.

ACKNOWLEDGEMENTS

This work was partially supported by the Scientiﬁc

and Technological Research Council of Turkey under

Grant 3120918 and by Yapi Kredi Bank under Grant

62609.

REFERENCES

Allwein, E., Schapire, R., and Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113–141.

Bagheri, M., Montazer, G., and Escalera, S. (2012). Error correcting output codes for multiclass classification: Application to two image vision problems. In CSI International Symposium on Artificial Intelligence and Signal Processing (AISP), pages 508–513.

Crammer, K. and Singer, Y. (2000). On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 35–46.

Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray, C. (2004). Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22.

Dietterich, T. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res., 2(1):263–286.

Fan, K., Wang, Y., and Chang, M. (2001). Form document identification using line structure based features. In Proc. of the 6th Int. Conf. on Document Anal. & Recognition, pages 704–708.

Hu, J., Kashi, R., and Wilfong, G. (1999). Document image layout comparison and classification. In Proc. of the 6th Int. Conf. on Document Anal. & Recognition, pages 285–288.

Kumar, J. and Doermann, D. (2013). Unsupervised classification of structurally similar document images. In Proc. of the 12th Int. Conf. on Document Anal. & Recognition, pages 1225–1229.

Kumar, J., Ye, P., and Doermann, D. (2012). Learning Document Structure for Retrieval and Classification. In International Conference on Pattern Recognition (ICPR 2012), pages 1558–1561.

Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, IEEE Conf, pages 2169–2178.

Li, X., Wang, L., and Sung, E. (2008). Adaboost with svm-based component classifiers. Eng. Appl. Artif. Intell., 21(5):785–795.

Smith, D. and Harvey, R. (2011). Document retrieval using sift image features. 17(1):3–15.

Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Inf. Process. Manage., 45(4):427–437.

Vapnik, V. (1998). Statistical Learning Theory. John Wiley and Sons, Inc., New York.

Winn, J., Criminisi, A., and Minka, T. (2005). Object categorization by learned universal visual dictionary. In ICCV, pages 1800–1807.

Yang, Y. and Newsam, S. (2011). Spatial pyramid co-occurrence for image classification. In Computer Vision (ICCV), International Conference on, pages 1465–1472.

Yao, B., Khosla, A., and Fei-Fei, L. (2011). Combining randomization and discrimination for fine-grained image categorization. In Proc. CVPR.

Zheng, Z., Wu, X., and Srihari, R. (2004). Feature selection for text categorization on imbalanced data. SIGKDD Explor. Newsl., 6(1):80–89.

DocumentImageClassificationViaAdaBoostandECOCStrategiesBasedonSVMLearners

255