Qualitative Vocabulary based Descriptor
Heydar Maboudi Afkham, Carl Henrik Ek and Stefan Carlsson
Computer Vision and Active Perception Lab., KTH, Stockholm, Sweden
Keywords:
Bag-of-Words Model, Image Classification.
Abstract:
Creating a single feature descriptor from a collection of feature responses is a frequently occurring task. The bag-of-words descriptor has been very successful at this and has been applied to data from a large range of domains. Central to this approach is the association of features to words. In this paper we present a novel approach to the feature-to-word association problem. The proposed method creates a more robust representation when the data is noisy and requires fewer words than traditional methods while retaining similar performance. We experimentally evaluate the method on a challenging image classification data-set and show a significant improvement over the state of the art.
1 INTRODUCTION
Most learning and inference algorithms require data to be presented as points in a vector space. However, in many scenarios data does not naturally lend itself to such representations. One such example is when data is represented as a collection of feature responses, as is common in computer vision and natural language processing. To be able to access and benefit from traditional learning techniques, a common approach is to translate the set of points and create a vector representing the full collection of responses. When the number of responses is fixed and can be assigned a specific order, the naïve approach would be to concatenate the responses into one large vector. However, this does not acknowledge that the responses come from the same domain. Neither is it likely that we can induce a consistent order nor consistently recover the same number of responses. A simple yet powerful method that overcomes these issues is the bag-of-words model. The method requires that a notion of similarity relating the elements in the collection exists. Using this measure, the space of features can be discretized as a mixture of representative responses referred to as words. By associating each response with a word, a single feature vector can be constructed as the distribution of associations for the constellation. This means that a single vector of fixed dimension can be constructed from a constellation with a varying number of elements. The standard bag-of-words model used in computer vision is inspired by models from text and document processing (Russell et al., 2006; Sivic et al., 2005; Lazebnik et al., 2006). These models usually assume that the effect of noise with respect to word association and discovery is negligible. While this might be a reasonable assumption in text processing (Wang et al., 2005), it is widely known that visual word discovery can be challenging due to the low level noise that exists in images. The visual dictionary used to describe images is often calculated using a clustering algorithm. Due to the ambiguity that exists in clustering algorithms, these words are not as well-defined as the words used in text processing, which are usually selected from a text dictionary. Since the bag-of-words model relies on the frequency of the different words seen in the data, resolving the ambiguity in visual word discovery can be very beneficial.
In this paper we present a novel approach to constructing a fixed dimensional descriptor from a constellation of feature responses within the bag-of-words framework. Specifically, we address issues relating to the feature-response-to-word association that commonly present themselves for vision data, by acknowledging that word discovery is uncertain. The proposed approach is related to feature pooling methods such as (Boureau et al., 2010; Jarrett et al., 2009), with the difference that in our work feature pooling is done not on the statistics of occurrences of the words but rather on the feature responses that are related to these words. These feature responses resolve the ambiguity of features being assigned to different words, in the sense that each feature has a unique response toward its associated word.
Figure 1: This figure compares the steps required for computing the bag-of-words histogram and the proposed descriptor. Both methods use the same vocabulary for summarizing the image. Unlike the bag-of-words model, which relies on the frequency of the words in the image, our method uses the labeling produced by the vocabulary for a local evaluation of the features and uses the responses from this evaluation to summarize the image.
In Sections 2 and 3 we motivate and describe the proposed descriptor. In Section 4 we present a thorough experimental evaluation, and Section 5 concludes the paper.
2 RELATED WORK
For a better understanding of the problem, let us assume that $I$ is an image and $\{x_1, \ldots, x_n\}$ is a collection of features (Vedaldi and Fulkerson, 2008) extracted from it. Throughout this work it is assumed that an already trained visual vocabulary $D = \{w_1, \ldots, w_N\}$ is provided. Given this vocabulary, the mapping

$$l(x_i) = \arg\min_{w \in D} \|x_i - w\|^2, \quad (1)$$

assigns each $x_i$ to its closest visual word in $D$. Given this mapping, each image is described using an $N$-bin histogram $H$, where the value of its $i$-th bin is determined by

$$H[i] = \sum_{x_k \in I} \mathbb{1}\left(l(x_k) = w_i\right). \quad (2)$$
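As a concrete illustration, the following is a minimal NumPy sketch of the hard assignment of Eq. 1 and the histogram of Eq. 2; the function names `assign_words` and `bow_histogram` and the array conventions are our own illustrative choices, not part of any published implementation.

```python
import numpy as np

def assign_words(X, D):
    """Eq. 1: assign each feature to its closest visual word.

    X : (n, d) array of local features; D : (N, d) array of words.
    Returns an (n,) array of word indices."""
    # Squared Euclidean distance from every feature to every word.
    dists = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def bow_histogram(X, D):
    """Eq. 2: the N-bin histogram counting features per word."""
    labels = assign_words(X, D)
    return np.bincount(labels, minlength=len(D))
```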
As can be seen, every $x_k$ with the same $l(x_k)$ is treated equally in this formulation. The downside of this treatment is that the differences between the features assigned to the same visual word are neglected. It should be mentioned that this problem does not appear in bag-of-words models built on text datasets, since the words in the dictionary are well-defined and are not the result of a generic clustering algorithm. This ambiguity in the assignment of visual words has been addressed in many studies (Zhang and Chen, 2009; Morioka and Satoh, 2010) since the introduction of visual words in computer vision. Among the most influential is the work of (Savarese et al., 2006), in which a more well-defined visual vocabulary is built by introducing relational spatial constraints in the calculation of the vocabulary. They show that building such a vocabulary significantly improves the results. In other approaches, such as (Bouachir et al., 2009), the differences between the features assigned to the same words are highlighted by using the distance to the closest cluster as a weight in the calculation of Eq. 2, or by softly assigning each feature to several visual words.
In this paper we take a slightly different approach to bag-of-words models. While using the same dictionary $D$, our approach is based on gathering statistics about the quality of the features assigned to a certain visual word rather than their quantity. The fundamental principle underpinning a bag-of-words approach is that the elements of the dictionary $D$ capture the local structures of the image. Here the goal is to measure the quality of these structures with respect to different target classes in a discriminative manner and use this information to describe the image. In other words, the question being asked in this paper is "How representative of its associated word is the feature?" rather than "How often is a word seen?". The assumption behind this work is that structures labeled as a certain visual word appear on different objects. This means there can be a significant difference between them, due to the fact that they have appeared on completely different objects. The difference between our method and the standard bag-of-words histogram can be seen in Figure 1.
QualitativeVocabularybasedDescriptor
189
Figure 2: (Left) In a standard bag-of-words model, two features $x_i$ and $x_j$ are assigned to the word $w_3$ and their differences are neglected, while it is possible (Right) to pull out their differences after they are assigned to $w_3$, using the $f^C_w$ hyperplanes. These differences are later used for constructing a feature descriptor.
Table 1: Confusion Matrix for max operator (Accuracy 94%) using a vocabulary of size 1500.
1 2 3 4 5 6 7 8 9
1-Cow 12 0 0 0 0 0 0 3 0
2-Plane 0 15 0 0 0 0 0 0 0
3-Face 0 0 15 0 0 0 0 0 0
4-Car 0 0 0 15 0 0 0 0 0
5-Bike 0 0 0 0 15 0 0 0 0
6-Book 0 0 0 1 0 14 0 0 0
7-Sign 0 0 0 0 0 2 13 0 0
8-Sheep 1 0 0 0 0 0 0 14 0
9-Chair 1 0 0 0 0 0 0 0 14
Table 2: Confusion matrix for bag-of-words histogram (Accuracy 88%) using a vocabulary of size 1500.
1 2 3 4 5 6 7 8 9
1-Cow 13 0 0 0 0 0 0 2 0
2-Plane 0 15 0 0 0 0 0 0 0
3-Face 0 0 14 0 0 0 1 0 0
4-Car 0 0 0 14 0 0 1 0 0
5-Bike 0 0 0 0 15 0 0 0 0
6-Book 0 0 0 0 0 15 0 0 0
7-Sign 1 0 0 0 0 1 11 0 2
8-Sheep 2 0 0 0 0 0 0 13 0
9-Chair 0 0 0 0 3 0 2 0 10
3 METHODOLOGY
To measure the quality of the features assigned to the different visual words, let us assume that $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ is a set of labeled features extracted from an image dataset, with $y_i \in \{C_1, \ldots, C_M\}$, and that $D$ is an already trained vocabulary with $N$ words. The goal here is to train class-specific classifiers, $f^C_w$, for the features that have been assigned to each visual word. These classifiers are trained by selecting the assigned features and creating a binary labeling, assigning features with $y_i = C$ to $1$ and the others to $-1$.
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
190
In this work our classifier is formulated as a linear regression and is defined as

$$f^C_w = \arg\min_f \frac{1}{n} \sum_x \left| x^T f - \bar{y}_C \right| + \lambda \|f\|^2. \quad (3)$$
Here $x$ is chosen only from the features with $l(x) = w$, and $\bar{y}_C$ represents the binary labeling of these features with respect to class $C$. The value of $\lambda$ can be obtained through cross-validation. Figure 2 shows how more detailed information can be extracted from the features that are treated equally by the bag-of-words model. In this figure we can see that the two features $x_i$ and $x_j$, both labeled as $w_3$, have different behavior with respect to the hyperplanes $f^{C_1}_{w_3}$, $f^{C_2}_{w_3}$ and $f^{C_3}_{w_3}$, which encode class properties in this section of the space. To estimate the quality of a feature (the likelihood of it belonging to class $C$ while assigned to the word $w$), we use the logistic function

$$P^C_w(x) = \frac{1}{1 + \exp\left(-a\, x^T f^C_w\right)}. \quad (4)$$
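The following sketch illustrates training the per-(word, class) hyperplanes and evaluating the quality measure of Eq. 4. Note that, so the minimizer has a closed form, it substitutes a squared loss for the absolute loss of Eq. 3 (standard ridge regression); the function names and this substitution are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def train_word_classifiers(X, y, labels, n_words, classes, lam=1.0):
    """Train one hyperplane f_w^C per (word, class) pair, after Eq. 3.

    X : (n, d) features; y : (n,) class labels; labels : (n,) word
    indices from Eq. 1.  Returns a dict (word, class) -> (d,) weights."""
    F = {}
    d = X.shape[1]
    for w in range(n_words):
        Xw, yw = X[labels == w], y[labels == w]
        if len(Xw) == 0:           # no training feature landed on this word
            continue
        for C in classes:
            t = np.where(yw == C, 1.0, -1.0)           # binary labeling y_C
            A = Xw.T @ Xw + lam * len(Xw) * np.eye(d)  # regularized normal eqs.
            F[(w, C)] = np.linalg.solve(A, Xw.T @ t)
    return F

def quality(x, f, a=1.0):
    """Eq. 4: logistic confidence P_w^C(x) that x belongs to class C."""
    return 1.0 / (1.0 + np.exp(-a * (x @ f)))
```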
For any set of features extracted from an image we wish to build a descriptor based on their quality rather than their visual word quantity. To do so, the features are initially labeled using the vocabulary $D$. As previously argued, each word in $D$ captures a certain structure in the image. Hence, the role of the function $P^C_w(x_i)$, Eq. 4, is to measure the quality of the discovered structures assigned to the word $w$ with respect to class $C$. This is a one dimensional measurement corresponding to the model's confidence. To that end it is possible to construct an $(N \cdot M)$-dimensional descriptor $\mathcal{D}$, with $N$ being the size of the vocabulary and $M$ the number of classes. Each dimension of this vector corresponds to the responses associated with a certain word ($w_n$) with respect to a certain class ($C_m$). The question here is how one can summarize these values into a number that captures the qualitative properties of the features seen in the image. Here we analyze the max descriptor, defined as

$$\mathcal{D}_{\max}[i] = \max \left\{ P^{C_m}_{w_n}(x) : x \in I,\ l(x) = w_n \right\}, \quad (5)$$

which focuses on pooling the features with the highest likelihood rather than relying on the quantitative properties of their labeling. This can also be seen as a feature selection problem, where the highest likelihood features are used for describing the image. The max pooling depends on the accuracy of the $P^C_w(x)$ functions, and increasing their accuracy will result in a better description of the image. Similar to the max descriptor, it is also possible to define the mean descriptor $\mathcal{D}_{\text{mean}}$ by replacing the max operation in Eq. 5 with the mean operation.
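Continuing the earlier sketches (it reuses the hypothetical `assign_words` and `quality` helpers), the following shows how the pooled descriptor of Eq. 5 could be assembled; filling absent (word, class) responses with zero is our assumption, as the paper does not specify this case.

```python
import numpy as np

def qualitative_descriptor(X, D, F, classes, op=np.max):
    """Eq. 5: pool per-(word, class) qualities into an (N*M)-dim vector.

    Passing op=np.mean yields the mean descriptor instead."""
    labels = assign_words(X, D)        # Eq. 1, from the earlier sketch
    desc = []
    for w in range(len(D)):
        Xw = X[labels == w]
        for C in classes:
            if len(Xw) == 0 or (w, C) not in F:
                desc.append(0.0)       # no response for this word (assumption)
            else:
                desc.append(op([quality(x, F[(w, C)]) for x in Xw]))
    return np.array(desc)
```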
4 EXPERIMENTS
In this section we compare the performance of the proposed descriptor with the standard bag-of-words histogram as the baseline. For both descriptors the same vocabulary is used for summarizing the image. To compare the performance we use vocabularies of different sizes, since the size of the vocabulary is usually associated with the performance of the bag-of-words histogram as a descriptor. In this experiment we use SIFT features (Vedaldi and Fulkerson, 2008) as base features; they are densely sampled from an image pyramid consisting of eight levels. The visual vocabularies are computed using the standard k-means algorithm.
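As a rough sketch of this feature pipeline, the following uses OpenCV's SIFT on a dense grid and scikit-learn's k-means as stand-ins for the VLFeat features and the clustering used in the paper; the step size, patch scale, and single pyramid level shown here are illustrative choices only.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, scale=16.0):
    """Densely sampled SIFT on one pyramid level (the paper uses eight).

    gray : uint8 grayscale image.  Returns an (n, 128) descriptor array."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    grid = [cv2.KeyPoint(float(x), float(y), scale)
            for y in range(0, h, step) for x in range(0, w, step)]
    _, desc = sift.compute(gray, grid)
    return desc

def build_vocabulary(stacked_descriptors, n_words=1500, seed=0):
    """Standard k-means vocabulary over all training features."""
    km = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    km.fit(stacked_descriptors)
    return km.cluster_centers_         # the words w_1, ..., w_N
```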
The experiments in this paper are conducted on the MSRCv2 dataset (Winn et al., 2005). Although this dataset is relatively small compared to other datasets, it is considered challenging and difficult due to its high intra-class variability. In this work we have followed the experimental setup used in (Zhang and Chen, 2009; Morioka and Satoh, 2010), with a denser sampling of SIFT features from different scale levels. In our experiment nine of the fifteen classes are chosen ({cow, airplanes, faces, cars, bikes, books, signs, sheep, chairs}), with each class containing 30 images. For each experiment, the images of each class were randomly divided into 15 training and 15 testing images, and no background was removed from the images. The random sampling of training and testing images was repeated 5 times. In our experiment a one-against-all linear SVM (Chang and Lin, 2011) was learnt for each class and each test image was assigned to the class with the highest probability.
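A minimal sketch of this classification stage, assuming the pooled descriptors are already stacked into arrays; we use scikit-learn's one-vs-rest wrapper and raw decision values in place of the LIBSVM probability outputs used in the paper.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# train_descriptors / test_descriptors are assumed arrays with one pooled
# descriptor per image; train_labels holds the class of each training image.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(train_descriptors, train_labels)
predicted = clf.predict(test_descriptors)  # argmax over per-class scores
```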
To compare the performance of the bag-of-words histogram with the proposed descriptor, visual vocabularies of different sizes {50, 100, 200, 300, 400, 500, 1000, 1500} were computed over the training subset using the standard k-means algorithm. Figure 3 shows that the max descriptor (Eq. 5) either outperforms or has comparable accuracy to the bag-of-words histogram at all vocabulary sizes. Tables 1 and 2 compare the confusion matrices of the best classifications using a vocabulary of size 1500 for the max descriptor and the bag-of-words histogram. These tables show how a richer descriptor is obtained when the quality of words is measured as opposed to their quantity. We also evaluate the mean descriptor, which is created by replacing the max operator in Eq. 5 with the mean operator. As can be seen in Figure 3, the mean descriptor has very low performance when the size of the vocabulary is low. This low accuracy is due to the fact that the background was not removed and many low quality instances of visual words were found in the test images.
QualitativeVocabularybasedDescriptor
191
[Figure 3: plot of classification accuracy against vocabulary size for the bag-of-words histogram, the max descriptor, and the mean descriptor.]
Figure 3: This figure compares the average performance of the different descriptors with respect to the size of the vocabulary. Here the max descriptor shows better performance than the bag-of-words histogram at almost all vocabulary sizes. Since no background was removed from the test images, the performance of the mean descriptor is expected to be low when the size of the vocabulary is small. It can be seen that with larger vocabulary sizes, where more sophisticated words appear in the vocabulary, the mean descriptor outperforms the bag-of-words histogram by a large margin.
Meanwhile, as the size of the vocabulary increases, the accuracy of this descriptor increases significantly. This increase is due to the fact that with a larger vocabulary more sophisticated words are discovered.
The MSRCv2 dataset has appeared in a variety of studies as a testing platform for different algorithms. Among these, one can focus on (Zhang and Chen, 2009; Morioka and Satoh, 2010), where the authors build a more sophisticated vocabulary by spatially combining local features into higher order features. These higher order features either consist of several visual words coupled together (Zhang and Chen, 2009) or are joint feature representations (Morioka and Satoh, 2010). The difference between their approach and ours is that our approach summarizes the images using first order statistics in a more discriminative manner. Table 3 shows how our method outperforms the previously published methods on this dataset.
5 CONCLUSIONS
We proposed a method that looks at bag-of-words models from a qualitative rather than a quantitative perspective. We experimentally show that describing images based on the quality of the
Table 3: Comparison between the classification rates obtained by the proposed method and previously published methods on the MSRCv2 dataset.
Method                                       Acc %
2nd-order spatial (Zhang and Chen, 2009)     78.3 ± 2.6%
10th-order spatial (Zhang and Chen, 2009)    80.4 ± 2.5%
QPC (Morioka and Satoh, 2010)                81.8 ± 3.4%
LPC (Morioka and Satoh, 2010)                83.9 ± 2.9%
Multi-Scale BOW                              85.3 ± 3.2%
Mean Descriptor                              88.5 ± 4.3%
Max Descriptor                               89.2 ± 3.9%
visual words provides a better descriptor for image classification. In this work a series of linear regressions was used for measuring the quality of the local features assigned to the different visual words. Although the performance of these local regressions is not discussed in this paper, it is clear that their performance has a direct effect on the resulting descriptor. This fact provides a new tool for improving the performance of vocabulary based models. Studies such as (Afkham et al., 2012) have shown that it is possible to improve the performance of these local classifiers by combining several local features. Due to its complexity, combining our method with such local classifiers is left to future work.
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
192
ACKNOWLEDGEMENTS
This work was supported by The Swedish Foundation
for Strategic Research in the project “Wearable Visual
Information Systems”.
REFERENCES
Afkham, H. M., Carlsson, S., and Sullivan, J. (2012). Im-
proving feature level likelihoods using cloud features.
In ICPRAM (2), pages 431–437.
Bouachir, W., Kardouchi, M., and Belacel, N. (2009). Im-
proving bag of visual words image retrieval: A fuzzy
weighting scheme for efficient indexation. In Pro-
ceedings of the 2009 Fifth International Conference
on Signal Image Technology and Internet Based Sys-
tems, SITIS ’09, pages 215–220, Washington, DC,
USA. IEEE Computer Society.
Boureau, Y.-L., Ponce, J., and LeCun, Y. (2010). A theoret-
ical analysis of feature pooling in visual recognition.
In ICML, pages 111–118.
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y.
(2009). What is the best multi-stage architecture for
object recognition? In Proc. International Conference
on Computer Vision (ICCV’09). IEEE.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond
bags of features: Spatial pyramid matching for rec-
ognizing natural scene categories. In Proceedings
of the 2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition - Volume 2,
CVPR ’06, pages 2169–2178, Washington, DC, USA.
IEEE Computer Society.
Morioka, N. and Satoh, S. (2010). Building compact local
pairwise codebook with joint feature space clustering.
In Proceedings of the 11th European conference on
Computer vision: Part I, ECCV’10, pages 692–705,
Berlin, Heidelberg. Springer-Verlag.
Russell, B. C., Efros, A. A., Sivic, J., Freeman, W. T.,
and Zisserman, A. (2006). Using multiple segmen-
tations to discover objects and their extent in image
collections. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition.
Savarese, S., Winn, J., and Criminisi, A. (2006). Discrimi-
native object class models of appearance and shape by
correlatons. In Proceedings of the 2006 IEEE Com-
puter Society Conference on Computer Vision and
Pattern Recognition - Volume 2, CVPR ’06, pages
2033–2040, Washington, DC, USA. IEEE Computer
Society.
Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and
Freeman, W. T. (2005). Discovering object categories
in image collections. In Proceedings of the Interna-
tional Conference on Computer Vision.
Vedaldi, A. and Fulkerson, B. (2008). VLFeat: An open
and portable library of computer vision algorithms.
http://www.vlfeat.org/.
Wang, X., Mohanty, N., and Mccallum, A. (2005). Group
and topic discovery from relations and text. In KDD
Workshop on Link Discovery: Issues, Approaches and
Applications (LinkKDD).
Winn, J., Criminisi, A., and Minka, T. (2005). Object cat-
egorization by learned universal visual dictionary. In
Proceedings of the Tenth IEEE International Confer-
ence on Computer Vision - Volume 2, ICCV ’05, pages
1800–1807, Washington, DC, USA. IEEE Computer
Society.
Zhang, Y. and Chen, T. (2009). Efficient kernels for iden-
tifying unbounded-order spatial features. In CVPR,
pages 1762–1769.
QualitativeVocabularybasedDescriptor
193