this work our classifier is formulated as a linear regression and it is defined as

$f^C_w = \operatorname*{argmin}_f \; \frac{1}{n} \sum_x \big| x^T f - \bar{y}^C \big| + \lambda | f |^2$. (3)
Here, $x$ is chosen only from the features with $l(x) = w$, and $\bar{y}^C$ represents the binary labeling of these features with respect to class $C$. The value of $\lambda$ can be obtained through cross-validation.
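As a rough illustration of how such a per-word, per-class hyperplane might be fitted, the following sketch assumes scikit-learn and approximates the absolute-error data term of Eq. 3 with the squared loss of ridge regression; the function name, label convention and $\lambda$ grid are illustrative and not taken from the paper.

```python
from sklearn.linear_model import RidgeCV

def fit_word_classifier(X_w, y_w, lambdas=(0.01, 0.1, 1.0, 10.0)):
    """Fit the per-word hyperplane f_w^C of Eq. 3 for one (word, class) pair.

    X_w : (n, d) array of features assigned to word w, i.e. l(x) = w
    y_w : (n,)   binary labels (+1 if the feature comes from class C, -1 otherwise)

    Note: Eq. 3 uses an absolute-error data term; RidgeCV uses a squared
    error, so this is only an approximation.  The regularization weight
    (lambda) is selected by cross-validation, as suggested in the text.
    """
    model = RidgeCV(alphas=lambdas, fit_intercept=False)
    model.fit(X_w, y_w)
    return model.coef_  # the hyperplane f_w^C
```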
Figure 2 shows how more detailed information can be extracted from the features that were treated equally by the bag-of-words model. In this figure we can see that the two features $x_i$ and $x_j$, although both labeled as $w_3$, behave differently with respect to the hyperplanes $f^{C_1}_{w_3}$, $f^{C_2}_{w_3}$ and $f^{C_3}_{w_3}$, which encode class properties in this region of the space. To estimate the quality of a feature (the likelihood of belonging to class $C$ while assigned to the word $w$), we use the logistic function
$P^C_w(x) = \dfrac{1}{1 + \exp(-a(x^T f^C_w))}$. (4)
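In code, Eq. 4 is simply a logistic applied to the signed response $x^T f^C_w$. The sketch below assumes NumPy; the steepness parameter $a$ is not specified in this section, so its default here is only illustrative.

```python
import numpy as np

def word_class_quality(x, f_wC, a=1.0):
    """Quality P^C_w(x) of Eq. 4: logistic response of feature x to the
    hyperplane f_wC learned for word w and class C.

    The steepness parameter `a` is not specified in this section;
    a = 1.0 is only an illustrative default.
    """
    return 1.0 / (1.0 + np.exp(-a * np.dot(x, f_wC)))
```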
For any set of features extracted from an image we wish to build a descriptor based on their quality rather than their visual word quantity. To do so, the features are initially labeled using a vocabulary D. As previously argued, each word in D captures a certain structure in the image. Hence, the role of the function $P^C_w(x_i)$, Eq. 4, is to measure the quality of the discovered structures assigned to the word $w$ with respect to class $C$. This is a one-dimensional measurement corresponding to the model's confidence. To that end it is possible to construct an $(N \cdot M)$-dimensional descriptor D, with $N$ being the size of the vocabulary and $M$ the number of classes. Each dimension of this vector corresponds to the responses associated with a certain word ($w_n$) with respect to a certain class ($C_m$). The question here is how one can summarize these values into a number that can capture the qualitative properties of the features seen in the image. Here we analyze the max descriptor, defined as
$D_{\max}[i] = \max\{\, P^{C_m}_{w_n}(x) : x \in I,\ l(x) = w_n \,\}$, (5)
which focuses on pooling the features with the highest likelihood rather than relying on the quantitative properties of their labeling. This can also be seen as a feature selection problem, where the highest-likelihood features are used for describing the image. The max pooling depends on the accuracy of the $P^C_w(x)$ functions, and increasing their accuracy will result in a better description of the image. Similar to the max descriptor, it is also possible to define the mean descriptor $D_{\text{mean}}$ by replacing the max operation in Eq. 5 with a mean operation.
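A minimal sketch of how both descriptors of Eq. 5 could be assembled for one image is given below; it assumes NumPy, and the layout of the per-(class, word) quality functions as a nested list of callables is an illustrative choice, not the paper's implementation.

```python
import numpy as np

def qualitative_descriptor(features, word_ids, quality_fns, pool="max"):
    """Build the (N*M)-dimensional descriptor of Eq. 5 for one image.

    features    : (n, d) local features extracted from the image
    word_ids    : (n,)   visual-word label l(x) of each feature
    quality_fns : quality_fns[m][w] is a callable returning P^{C_m}_w(x)
                  (Eq. 4) for class m and word w
    pool        : "max" for D_max, "mean" for D_mean
    """
    n_classes = len(quality_fns)
    n_words = len(quality_fns[0])
    D = np.zeros(n_classes * n_words)
    reduce_fn = np.max if pool == "max" else np.mean
    for m in range(n_classes):
        for w in range(n_words):
            scores = [quality_fns[m][w](x)
                      for x, wid in zip(features, word_ids) if wid == w]
            if scores:  # words that do not occur in the image keep a zero response
                D[m * n_words + w] = reduce_fn(scores)
    return D
```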
4 EXPERIMENTS
In this section we compare the performance of the
proposed descriptor with the standard bag-of-words
histogram as the baseline. For both descriptors the
same vocabulary is used for summarizing the image.
To compare the performance we use vocabularies of
different sizes, since the size of a vocabulary is usu-
ally associated with the performance of the bag-of-
words histogram as a descriptor. In this experiment we use SIFT features (Vedaldi and Fulkerson, 2008) as base features, densely sampled from an image pyramid with eight levels. The visual vocabularies are computed using the standard k-means algorithm.
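The paper extracts dense SIFT with the VLFeat toolbox (Vedaldi and Fulkerson, 2008); as a rough stand-in, the sketch below uses OpenCV for the descriptors of a single pyramid level and scikit-learn's k-means for the vocabulary. Grid step, patch size and vocabulary size are illustrative parameters, not the paper's settings.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, size=16.0):
    """Densely sample SIFT descriptors on a regular grid (one pyramid level)."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), size)
           for y in range(0, gray.shape[0], step)
           for x in range(0, gray.shape[1], step)]
    _, desc = sift.compute(gray, kps)
    return desc  # (num_points, 128)

def build_vocabulary(all_descriptors, n_words=1000, seed=0):
    """Cluster the training descriptors with standard k-means into visual words."""
    km = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    km.fit(all_descriptors)
    return km  # km.cluster_centers_ are the words; km.predict assigns labels l(x)
```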
The experiments of this paper are conducted on
the MSRCv2 dataset (Winn et al., 2005). Although
this dataset is relatively small compared to other datasets, it is considered challenging due to its high intra-class variability. In this work we have followed the experimental setup used in (Zhang and Chen, 2009; Morioka and Satoh, 2010), with a denser sampling of SIFT features from different
scale levels. In our experiment nine of fifteen classes
are chosen ({cow, airplanes, faces, cars, bikes, books, signs, sheep, chairs}), with each class containing 30
images. For each experiment, the images of each class
were randomly divided into 15 training and 15 test-
ing images and no background was removed from the
images. The random sampling of training and testing images was repeated 5 times. In our experiment a one-against-all linear SVM (Chang and Lin, 2011) was learned for each class, and each test image was assigned to the class with the highest probability.
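A possible realization of this classification stage, using scikit-learn's libsvm wrapper in place of the LIBSVM tools cited above, is sketched below; the regularization parameter C would normally be tuned by cross-validation and is left at its default here.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_and_classify(train_desc, train_labels, test_desc):
    """Train one-against-all linear SVMs and assign each test image to the
    class with the highest estimated probability.

    This is only a sketch: hyperparameters (e.g. C) are left at their
    scikit-learn defaults rather than cross-validated.
    """
    clf = OneVsRestClassifier(SVC(kernel="linear", probability=True))
    clf.fit(train_desc, train_labels)
    probs = clf.predict_proba(test_desc)           # (num_test, num_classes)
    return clf.classes_[np.argmax(probs, axis=1)]  # predicted class per image
```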
To compare the performance of the bag-of-words histogram with the proposed descriptor, visual vocabularies of different sizes {50, 100, 200, 300, 400, 500, 1000, 1500} were computed over the training subset using the standard k-means algorithm. Figure 3 shows that the max descriptor (Eq. 5) either outperforms the bag-of-words histogram or achieves comparable accuracy across all vocabulary sizes. Tables 1
and 2 compare the confusion matrices of the best classifications using a vocabulary of size 1500 for both the max descriptor and the bag-of-words histogram. These tables show that a richer descriptor is obtained when the quality of the words is measured rather than their quantity. We also evaluate the mean descriptor, created by replacing the max operator in Eq. 5 with the mean operator. As can be seen in Figure 3, the mean descriptor performs poorly when the vocabulary size is small. This low accuracy is due to the fact that the background was not removed and many low-quality instances of visual words were