responding visual words and then by building a his-
togram of these words. From this point, the image can
be categorized in a similar way to a text document.
To categorize the images, a multi-class classifier can
be employed, using visual word histograms as feature
vectors (Winn et al., 2005; Csurka et al., 2004; Cai
et al., 2010).
This paper is organized as follows. Section 2 de-
scribes the proposed methodology which consists of
keypoint identification, visual dictionary construction
and categorization using Support Vector Machines
(SVM) (Boser et al., 1992; Chang and Lin, 2011).
Section 3 presents the details of the
experiments. Section 4 concludes the paper.
2 METHODOLOGY
The following section presents the details of the im-
age categorization process, based on BoVW represen-
tation. Our approach consists of three main phases.
Firstly, the SIFT algorithm was applied to identify
keypoints in the imagery datasets. Secondly, these
keypoints were used to create instances of visual
words by means of an unsupervised learning technique,
namely the k-means clustering algorithm. For each
image, a vector representation was then obtained
under different normalization schemes, with the
components of that vector corresponding to "visual
words" from the dictionary. During the last phase,
the datasets were classified by means of the SVM
method.
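As a concrete illustration of the final phase, a linear SVM can be trained on visual-word histograms with scikit-learn; the toy histograms below are purely illustrative and are not data from this paper:

```python
import numpy as np
from sklearn.svm import SVC

# Toy "visual word" histograms (illustrative only): class 0 images use
# mostly word 0, class 1 images use mostly word 1.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y_train = np.array([0, 0, 1, 1])

# A linear-kernel SVM in the spirit of Boser et al. (1992) / LIBSVM.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

predictions = clf.predict([[0.85, 0.15], [0.15, 0.85]])
```

In practice each histogram would have as many components as there are visual words (k), and a multi-class scheme such as one-vs-one would be used for more than two categories.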
2.1 Keypoint Identification
For keypoint identification, the SIFT algorithm
described in (Lowe, 2004) was chosen. This method
has proven to be resistant to changes in image
scale, rotation, illumination and 3D viewpoint (Kleek;
Mikolajczyk and Schmid, 2005). Regardless of how
an image is transformed, each descriptor found by
the SIFT algorithm retains its original features.
This makes it possible to find corresponding points
in images containing similar objects, but at a
different scale, from a different perspective or with
different light intensity.
The process of keypoint identification is divided
into four phases. Initially, "Scale-space extrema
detection" is performed: all scales and image
locations are searched for potential interest points
using a difference-of-Gaussians technique. In the
second stage, "Keypoint localization", the keypoint
candidates with the worst stability measure are
discarded. During the third phase, "Orientation
assignment", each keypoint is enriched with
information about its relative orientation based on
local image gradients. Finally, in the "Keypoint
descriptor" phase, descriptors that are robust to
local distortion and changes in illumination are
created from the local image regions around the
keypoints.
The result of running the SIFT algorithm is a
set of keypoints that captures important details of
the image. Each keypoint contains information about
scale, orientation and location, and its descriptor is
represented as a numerical vector. The size of this
vector is fixed in advance and depends on the choice
of the local region size: the vector usually has 128
dimensions, corresponding to a 4x4 grid of 8-bin
orientation histograms. Depending on the image size
and complexity, the number of obtained keypoints
varies from a hundred to a few thousand.
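The first stage, scale-space extrema detection, can be sketched as follows; the sigma values and contrast threshold are illustrative assumptions, and a full SIFT implementation would additionally refine keypoint locations and compute the orientation histograms and descriptors:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Simplified scale-space extrema detection via difference-of-Gaussians.

    Returns (row, col, sigma) triples where a pixel is an extremum of the
    DoG stack over its 3x3x3 scale-space neighborhood.
    """
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    dogs = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    keypoints = []
    for k in range(1, dogs.shape[0] - 1):
        cube = dogs[k - 1:k + 2]
        # A pixel is a candidate if it is the max or min of its neighborhood.
        is_max = dogs[k] == maximum_filter(cube, size=3)[1]
        is_min = dogs[k] == minimum_filter(cube, size=3)[1]
        # Discard low-contrast responses (threshold chosen for illustration).
        strong = np.abs(dogs[k]) > 0.03
        ys, xs = np.nonzero((is_max | is_min) & strong)
        keypoints += [(y, x, sigmas[k]) for y, x in zip(ys, xs)]
    return keypoints
```

For example, applying this to an image containing a single Gaussian blob yields an extremum near the blob center at the matching scale.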
2.2 Visual Dictionary Construction
Because many keypoints retrieved by the SIFT
algorithm are similar, it is necessary to generalize
and group the points into clusters which represent
"visual words". For this purpose, k-means, an
unsupervised learning algorithm, was used due to its
simplicity and satisfactory performance. The idea of
k-means is to divide the observations into a
predefined number of subsets so that the sum of the
squared distances from each keypoint to the center of
its cluster is minimized. This can be formalized
using the following formula:
\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2 , \qquad (1)
where (x_1, x_2, \ldots, x_n) are the observation vectors,
\mu_i is the centroid (mean) of the i-th cluster S_i, and
k is the number of clusters.
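A minimal sketch of Lloyd's algorithm, which iteratively minimizes the objective in Eq. (1); the function name, iteration count and random initialization scheme are illustrative choices:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centroids with k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: each descriptor joins its nearest centroid's
        # cluster S_i (minimizing the inner sum of Eq. (1)).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid mu_i becomes the mean of its cluster.
        for i in range(k):
            members = points[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return centroids, labels
```

In a BoVW pipeline, `points` would be the pool of 128-dimensional SIFT descriptors collected from the training images, and the k resulting centroids would form the visual dictionary.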
As a result of the clustering process, k "visual
words" are obtained, which allows a particular
"visual word" to be assigned to each descriptor. An
important issue is the choice of the parameter k,
which affects both performance and accuracy. If the
number of clusters is too small, the algorithm will
assign distinct keypoints to the same "visual word",
and classification accuracy will be significantly
reduced. On the other hand, too large a k leads to
over-representation, so that similar keypoints are
represented by different "visual words", which also
decreases performance and accuracy. Tests for
different values of the parameter k were performed;
the details are described in the experimental section.
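Once each descriptor has been mapped to its nearest visual word, a per-image histogram can be built and normalized; the L1 and L2 schemes below are common options, shown as a sketch rather than as the paper's exact normalization schemes:

```python
import numpy as np

def bovw_histogram(word_ids, k, norm="l1"):
    """Turn an image's visual-word assignments into a fixed-length vector.

    word_ids: index of the nearest visual word for each keypoint descriptor.
    k:        dictionary size (number of clusters from k-means).
    """
    hist = np.bincount(word_ids, minlength=k).astype(float)
    if norm == "l1":        # components sum to 1 (word frequencies)
        hist /= max(hist.sum(), 1.0)
    elif norm == "l2":      # unit Euclidean length
        hist /= max(np.linalg.norm(hist), 1e-12)
    return hist
```

The resulting vectors have the same length k for every image, regardless of how many keypoints it produced, which is what makes them usable as SVM feature vectors.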
ICAART 2012 - International Conference on Agents and Artificial Intelligence