responding visual words and then by building a his-
togram of these words. From this point, the image can
be categorized in a similar way to a text document.
To categorize the images, a multi-class classifier can
be employed, using visual word histograms as feature
vectors (Winn et al., 2005; Csurka et al., 2004; Cai
et al., 2010).
This paper is organized as follows. Section 2 de-
scribes the proposed methodology which consists of
keypoint identification, visual dictionary construction
and categorization using Support Vector Machines
(SVM) (Boser et al., 1992; Chang and Lin, 2011).
Section 3 presents the details of the
experiments. Section 4 concludes the paper.
2 METHODOLOGY
The following section presents the details of the im-
age categorization process, based on BoVW represen-
tation. Our approach consists of three main phases.
Firstly, the SIFT algorithm was applied to identify
keypoints in the imagery datasets. Secondly, these
keypoints were used to create instances of visual
words by means of an unsupervised learning technique,
namely the k-means clustering algorithm. For each
image, a vector representation was then obtained
under different normalization schemes, with the
components of that vector corresponding to "visual
words" from the dictionary. During the last phase,
the datasets were classified by means of the SVM
method.
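As a concrete illustration of the final phase, a linear SVM can be trained on visual-word histograms with scikit-learn; the toy histograms below are purely illustrative and are not data from this paper:

```python
import numpy as np
from sklearn.svm import SVC

# Toy "visual word" histograms (illustrative only): class 0 images use
# mostly word 0, class 1 images use mostly word 1.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y_train = np.array([0, 0, 1, 1])

# A linear-kernel SVM in the spirit of Boser et al. (1992) / LIBSVM.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

predictions = clf.predict([[0.85, 0.15], [0.15, 0.85]])
```

In practice each histogram would have as many components as there are visual words (k), and a multi-class scheme such as one-vs-one would be used for more than two categories.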
2.1 Keypoint Identification
For keypoint identification, the SIFT algorithm
described in (Lowe, 2004) was chosen. This method
has proven to be resistant to changes in image
scale, rotation, illumination and 3D viewpoint (Kleek;
Mikolajczyk and Schmid, 2005). Regardless of how
an image is transformed, each descriptor found by
the SIFT algorithm retains its original features.
This makes it possible to find corresponding points
in images containing similar objects, but at a
different scale, from a different perspective or with
different light intensity.
The process of keypoint identification is divided
into four phases. Initially, "Scale-space extrema
detection" is performed: all scales and image
locations are searched for potential interest points
using a difference-of-Gaussians technique. In the
second stage, "Keypoint localization", the keypoint
candidates with the worst stability measure are
discarded. During the third phase, "Orientation
assignment", each keypoint is enriched with
information about its relative orientation based on
local image gradients. Finally, in the "Keypoint
descriptor" phase, descriptors that are robust to
local distortion and changes in illumination are
created from the local image regions around the
keypoints.
The result of running the SIFT algorithm is a
set of keypoints that captures important details of
the image. Each keypoint contains information about
scale, orientation and location, and its descriptor is
represented as a numerical vector. The size of this
vector is fixed in advance and depends on the choice
of the local region size: the vector usually has 128
dimensions, corresponding to a 4x4 grid of 8-bin
orientation histograms. Depending on the image size
and complexity, the number of obtained keypoints
varies from a hundred to a few thousand.
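The first stage, scale-space extrema detection, can be sketched as follows; the sigma values and contrast threshold are illustrative assumptions, and a full SIFT implementation would additionally refine keypoint locations and compute the orientation histograms and descriptors:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Simplified scale-space extrema detection via difference-of-Gaussians.

    Returns (row, col, sigma) triples where a pixel is an extremum of the
    DoG stack over its 3x3x3 scale-space neighborhood.
    """
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    dogs = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    keypoints = []
    for k in range(1, dogs.shape[0] - 1):
        cube = dogs[k - 1:k + 2]
        # A pixel is a candidate if it is the max or min of its neighborhood.
        is_max = dogs[k] == maximum_filter(cube, size=3)[1]
        is_min = dogs[k] == minimum_filter(cube, size=3)[1]
        # Discard low-contrast responses (threshold chosen for illustration).
        strong = np.abs(dogs[k]) > 0.03
        ys, xs = np.nonzero((is_max | is_min) & strong)
        keypoints += [(y, x, sigmas[k]) for y, x in zip(ys, xs)]
    return keypoints
```

For example, applying this to an image containing a single Gaussian blob yields an extremum near the blob center at the matching scale.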
2.2 Visual Dictionary Construction
Because many keypoints retrieved by the SIFT
algorithm are similar, it is necessary to generalize
and group the points into clusters which represent
"visual words". For this purpose, k-means, an
unsupervised learning algorithm, was used due to its
simplicity and satisfactory performance. The idea of
k-means is to divide the observations into a
predefined number of subsets so that the sum of the
squared distances from each keypoint to the center of
its cluster is minimized. This can be formalized
using the following formula:
\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2 , \qquad (1)
where (x_1, x_2, \ldots, x_n) are the observation vectors,
\mu_i is the centroid (mean) of the i-th cluster S_i, and
k is the number of clusters.
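A minimal sketch of Lloyd's algorithm, which iteratively minimizes the objective in Eq. (1); the function name, iteration count and random initialization scheme are illustrative choices:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centroids with k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: each descriptor joins its nearest centroid's
        # cluster S_i (minimizing the inner sum of Eq. (1)).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid mu_i becomes the mean of its cluster.
        for i in range(k):
            members = points[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return centroids, labels
```

In a BoVW pipeline, `points` would be the pool of 128-dimensional SIFT descriptors collected from the training images, and the k resulting centroids would form the visual dictionary.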
As a result of the clustering process, k "visual
words" are obtained, which allows a particular
"visual word" to be assigned to each descriptor. An
important issue is the choice of the parameter k,
which affects both performance and accuracy. If the
number of clusters is too small, the algorithm will
assign distinct keypoints to the same "visual word",
and classification accuracy will be significantly
reduced. On the other hand, too large a k leads to
over-representation, so that similar keypoints are
represented by different "visual words", which also
decreases performance and accuracy. Tests for
different values of the parameter k were performed;
the details are described in the experimental section.
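Once each descriptor has been mapped to its nearest visual word, a per-image histogram can be built and normalized; the L1 and L2 schemes below are common options, shown as a sketch rather than as the paper's exact normalization schemes:

```python
import numpy as np

def bovw_histogram(word_ids, k, norm="l1"):
    """Turn an image's visual-word assignments into a fixed-length vector.

    word_ids: index of the nearest visual word for each keypoint descriptor.
    k:        dictionary size (number of clusters from k-means).
    """
    hist = np.bincount(word_ids, minlength=k).astype(float)
    if norm == "l1":        # components sum to 1 (word frequencies)
        hist /= max(hist.sum(), 1.0)
    elif norm == "l2":      # unit Euclidean length
        hist /= max(np.linalg.norm(hist), 1e-12)
    return hist
```

The resulting vectors have the same length k for every image, regardless of how many keypoints it produced, which is what makes them usable as SVM feature vectors.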
ICAART 2012 - International Conference on Agents and Artificial Intelligence