BRINGING ORDER IN THE BAG OF WORDS

Shihong Zhang, Rahat Khan, Damien Muselet and Alain Tr´emeau

Universit´e de Lyon, F-42023, Saint-

Etienne, France

CNRS, UMR 5516, Laboratoire Hubert Curien, F-42000, Saint-

Etienne, France

Universit´e de Saint-

Etienne, Jean-Monnet, F-42000, Saint-

Etienne, France

Keywords:

Bag-of-words, Object Categorization, Spatial Information.

Abstract:

This paper presents a method to infuse spatial information in the bag of words (BOW) framework for object

categorization. The main idea is to account the local spatial distribution of the visual words. Rather than

ﬁnding rigid local patterns, we consider the visual words in close spatial proximity as a pouch of words and

we represent the image as a bag of word-pouches. For this purpose, sub-windows are extracted from the

images and characterized by local bags of words. Then a clustering step is applied in the local bag of words

space to construct the word-pouches. We show that this representation is complementary to the classical BOW.

Thus a concatenation of these two representations is used as the ﬁnal descriptor. Experiments are conducted

on two very well known image datasets.

1 INTRODUCTION

In this paper, we deal with the problem of category-

level classiﬁcation in the images. This is a chal-

lenging problem in computer vision and one of the

successful solutions is the Bag-of-Words (BOW) ap-

proach (Csurka et al., 2004), which employs the his-

togram of particular image patterns (the visual words)

in a given image. However, one major limitation of

the BOW model is that, it does not retain the spa-

tial relationship among the visual words. Different

methods have been proposed to take advantage of

the spatial distribution of visual words to improve

classiﬁcation accuracy. For example, Lazebnik et

al. employed the pyramid match kernel proposed

by (Grauman and Darrell, 2005) into BOW frame-

work to account the global distribution of the visual

words among the image and achieved very high clas-

siﬁcation accuracy (Lazebnik et al., 2006). Among

local approaches, Zhang et al. (Zhang and Mayo,

2008) improved the classiﬁcation performance of the

BOW model by discovering intermediate represen-

tations for each object class. Speciﬁcally, their ap-

proach includes the spatial relationships between all

the frequent and informative image keypoints in the

smaller regions of the image. A group of works in-

tents to model the co-occurrence patterns of visual

words. Among them, (Sivic et al., 2005) extended the

BOW model using spatial information in their work.

The spatial information, which they term as ”dou-

blets”, is formed from spatially neighboring word

pairs. In (Bhatti and Hanbury, 2010), Bhatti et al.

introduced the pair-wise relations between image fea-

tures. In their work, the image is represented by a con-

catenation of independentvisual words with pair-wise

visual words. Yuan et al. (Yuan et al., 2007) deﬁned

co-occurrence pattern occurring in local proximity as

visual phrase and use this information for classiﬁca-

tion.

Most of the local approaches only consider pairs

of visual words and we argue that we should not

restrict the number of words accounted in the local

neighborhoods. Unfortunately, increasing the number

of words considered in each neighborhoodtends to in-

crease the dimension of the ﬁnal descriptor. Hence,

we propose an alternative that considers the visual

words in close spatial proximity as a pouch of words

and represents the image as a bag of word-pouches.

The originality of this approach is that it applies a

clustering step in the BOW space in order to ex-

tract the most representative pouches. Bag of word-

pouches is also an orderless representation but in-

terestingly it encodes some spatial information be-

cause each pouch is representative of a group of words

which reside close to each other in the image space.

Unlike the classical methods that introduce spatial in-

formation in the BOW, our approach accounts the spa-

tial distribution of the visual words without increasing

the dimension of the ﬁnal descriptor. Furthermore our

method, detailed in next Section, is complementary

723

Zhang S., Khan R., Muselet D. and Trémeau A..

BRINGING ORDER IN THE BAG OF WORDS.

DOI: 10.5220/0003859307230726

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 723-726

ISBN: 978-989-8565-03-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

to BOW representation and when concatenated with,

provides superior classiﬁcation accuracy.

2 BAG OF WORD-POUCHES

For image and object classiﬁcation, the bag of words

approach is the most widely used. The idea consists

in characterizing each image by a histogram of quan-

tized descriptors (Csurka et al., 2004), that are ex-

tracted from the descriptor space thanks to a cluster-

ing algorithm (e.g. k-means). Figure 1 illustrates the

bag of words construction of 3 different images. In

this example, we can see that we have 3 representative

clusters in the descriptor space, each one associated

with one visual word (square, circle and star). Then,

each image is characterized by the histogram of these

”visual words”. This toy example underlines that no

spatial information is accounted in the ﬁnal represen-

tation. Indeed, whereas the spatial distributions of the

visual words are different between the 3 images, their

respective bag of words are the same.

Image 1 Image 2

Image 3

BOW 1 BOW 2

BOW 3

3 visual words

after kmeans

clustering

Descriptor space

Figure 1: The bag of words (BOW) representation of 3 dif-

ferent images.

In order to infuse some spatial information in the

bag of words approach, we propose to account the

spatial distributions of the visual words in local ar-

eas of the images. The idea is to check if some sets

of visual words often occur in the same neighborhood

for some object classes and not for some others in or-

der to add this discriminative information in the ﬁnal

representation of the images. Therefore, we propose

to evaluate local bags of words from sub-windows ex-

tracted from the images and to characterize each im-

age by a bag of local bags of words. Following the

metaphor of the bag of words, our approach consists

in bringing some order by adding some pouches to

the bag, so that we put the words into these pouches

(that are inside the bag) instead of mixing them in the

bag. Consequently, the representatives of the most

frequent local bags of words are called word-pouches.

The original part of this work is in the creation of

the word-pouches. These word-pouches are the his-

tograms of visual words that often occur in the same

neighborhood. In order to deﬁne these word-pouches,

we extract several sub-windows from all the images

and for each of these sub-windows, we evaluate their

bag of words (local BOW). Then, we create the local

BOW space whose dimension is the number of words

and in which each local BOW is one point. Finally,

we apply clustering in the local BOW space and the

cluster representatives are called the word-pouches.

Once the word-pouches have been deﬁned, for one

image, we extract several sub-windows, evaluate the

local BOW of each of these sub-windows and asso-

ciate it with the nearest word-pouch. Hence, the im-

age is characterized by its bag of word-pouches.

Figure 2 displays the image 2 (see ﬁgure 1) from

which we have extracted 3 sub-windows. For each of

these sub-windows, we have evaluated its BOW and

shown the corresponding position in the local BOW

space.

Image 2

Local BOW space

Figure 2: 3 sub-windows extracted from image 2 of ﬁgure 1,

their respective local BOW and the corresponding points in

the local BOW space.

Figure 3 shows the BOWP representations ob-

tained for the 3 images of ﬁgure 1. In ﬁgure 3, we

can see, in the local BOW space, the points associ-

ated with the sub-windows extracted from the 3 im-

ages. After clustering, we obtain 5 word-pouches that

are the bins of the bags of word-pouches. We note

that the BOWP is more representative of the image

contents than the classical BOW of ﬁgure 1since the

BOWP of the images 2 and 3 are similar to each other

while being different from this of the image 1.

However, since the clustering step applied in the

local BOW space tends to loose details about the vi-

sual words and since spatial information is more or

less discriminative, depending on the considered cate-

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

724

Image 1 Image 2

Image 3

BOWP 1

LBOW space

sliding window

BOWP 2

BOWP 3

5 word-pouches after

kmeans clustering

Figure 3: The bag of word-pouches (BOWP) representation

of the 3 images from ﬁgure 1.

gory, we propose not to restrict the image description

to the BOWP and to create a descriptor BOW&WP

that is a concatenation of the BOW and the BOWP.

This follows the intuition that rather than putting D

words in a single bag without pouches, it is better

to use less words (D

< D) and put them in 2 bags,

the ﬁrst bag without pouches (BOW representation)

and the second one with pouches (BOWP represen-

tation). The number of pouches in the second bag is

equal to D

and the relative numbers of words (D

)

and pouches (D

) in the two bags are discussed in the

next Section.

3 EXPERIMENTAL RESULTS

For the experiments, we have used two datasets. The

ﬁrst one is the Caltech101 image dataset (Fei-fei,

2004) from which we consider only the 10 most fre-

quent categories (Caltech10). The second dataset is

the Graz01 image dataset (Opelt et al., 2004) that con-

tains two object categories namely bikes and persons

and a background class. In this section, we present

the average classiﬁcation accuracy over 10 individual

runs for both datasets.

16 × 16 patches are densely extracted, with 8 pix-

els overlap, and SIFT is used as descriptor (Lowe,

1999). We run a K-means clustering on a random

subset of 50, 000 descriptors to construct the visual

vocabulary. To create the bag of word-pouches, we

empirically ﬁnd that a square window size of 48 ×

48 pixels with 8 pixels overlap works the best for the

considered datasets. This size of window can accom-

modate 25 visual words given the parameter we have

used in the dense sampling step. A SVM classiﬁer

with intersection kernel (Lazebnik et al., 2006) is used

for all the experiments with the cost parameter (C)

value set to 1.

For the ﬁrst experiment, the used descriptor

BOW&WP is constituted by D values and is a con-

catenation of a BOW and a BOWP. Then, we evaluate

the average accuracy of this descriptor while vary-

ing the relative numbers of words (D

) and word-

pouches (D

) within it, i.e. the k’s in the two k-means

algorithms. Since we want to compare descriptors

with the same size, we choose D

and D

so that

+ D

= D. Hence, we show the result evolution

while increasing D

from 0 to 0.75 × D (D = 800)

in ﬁgure 4. The aim of this experiment is to de-

ﬁne the best relative amounts of words and word-

pouches for the considered dataset and for a constant-

size descriptor. We can see that, whatever the consid-

ered dataset, accounting spatial information increases

the average accuracy even if the number of words

is decreasing. Since the best overall trade-off is ob-

tained for equal numbers of words and word-pouches,

for the rest of the experiments, we choose these ra-

tios (0.5 ∗ D

+ 0.5 ∗ D

) and we call this descriptor

BOW&WP.

81,5

82,5

83,5

84,5

85,5

86,5

87,5

1*Dw

0*Dp

0.75*Dw

0.25*Dp

0.5*Dw

0.5*Dp

0.25*Dw

0.75*Dp

Caltech

Graz-person

Graz-bike

Figure 4: Average accuracy obtained with 800 dimensional

descriptors on Caltech101 and Graz01 datasets.

Then, we propose to compare the results of our

BOW&WP descriptor and the classical BOW for dif-

ferent values of D. We recall that the dimensions of

BOW&WP and BOW are both equal to D. Figure 5

shows the results for D varying from 200 to 1000. We

can see that for the 3 tested datasets and whatever the

value of D the proposed BOW&WP outperforms the

classical BOW.

Finally, we propose to analyze the relative aver-

age accuracy improvement (RAAI) of BOW&WP re-

BRINGING ORDER IN THE BAG OF WORDS

725

82,5

83,5

84,5

85,5

86,5

200

400

600

800

1000

80,5

81,5

82,5

83,5

84,5

200

400

600

800

1000

83,5

84,5

85,5

86,5

87,5

200

400

600

800

1000

BOW

BOW&WP

Caltech10 Graz - Bikes Graz - persons

Figure 5: Average accuracies on Caltech10 and Graz01.

gards to classical BOW for each category of CAL-

TECH10. The higher the value of RAAI the more the

BOW&WP outperformsthe BOW. Figure 6 shows the

mean RAAI over all the values of D from 200 to 1000

for each category. The categories are ranked in the

order of decreasing RAAI, so that BOW&WP per-

forms better (compared to BOW) on the categories

on the left (in green) than on the categories on the

right (in red). This ﬁgure shows that, depending on

the category, the improvement provided by the pro-

posed BOW&WP varies a lot. It can be very high

(more than 10%) for some categories such as chan-

deliers or bonsais that are characterized by rigid and

stable structures and can be negative for some oth-

ers such as leopards that are non-rigid objects whose

poses and viewpoints are highly varying between the

images.

-11

-9

-7

-5

-3

-1

ketch

acc

ord

s_e

hawksbil

_si

leop

rds

bonsai

tch

motorbikes

Figure 6: Relative average accuracy improvements of

BOW&WP regards to classical BOW for the 10 most fre-

quent CALTECH101 categories.

4 CONCLUSIONS AND FUTURE

WORKS

In this paper, we have proposed an original and efﬁ-

cient way to account spatial distribution of the visual

words in the image representations. The idea consists

in accounting the way the visual words are locally or-

ganized. We have shown that adding this information

in the bag of words can help in decreasing the number

of visual words accounted for the construction of the

descriptor while improving the average accuracies.

REFERENCES

Bhatti, N. A. and Hanbury, A. (2010). Co-occurrence bag

of words for object recognition. In Proceedings of the

15th Computer Vision Winter Workshop (CVWW).

Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and

Bray, C. (2004). Visual categorization with bags of

keypoints. In Workshop on Statistical Learning in

Computer Vision, ECCV.

Fei-fei, L. (2004). Learning generative visual models from

few training examples: an incremental bayesian ap-

proach tested on 101 object categories. In Workshop

on Generative-Model Based Vision, CVPR.

Grauman, K. and Darrell, T. (2005). The pyramid match

kernel: discriminative classiﬁcation with sets of im-

age features. In International Conference of Computer

Vision, pages 1458–1465.

Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond

bags of features: Spatial pyramid matching for recog-

nizing natural scene categories. In Computer Vision

and Pattern Recognition (CVPR).

Lowe, D. G. (1999). Object recognition from local scale-

invariant features. In International Conference on

Computer Vision (ICCV).

Opelt, A., Fussenegger, M., Pinz, A., and Auer, P. (2004).

Weak hypotheses and boosting for generic object de-

tection and recognition. In Pajdla, T. and Matas, J.,

editors, ECCV (2), volume 3022 of Lecture Notes in

Computer Science, pages 71–84. Springer.

Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and

Freeman, W. T. (2005). Discovering objects and their

location in images. In IEEE Intl. Conf. on Computer

Vision.

Yuan, J., Wu, Y., and Yang, M. (2007). Discovery of collo-

cation patterns: from visual words to visual phrases.

In Computer Vision and Pattern Recognition (CVPR).

Zhang, E. and Mayo, M. (2008). Pattern discovery for

object categorization. In 23rd International Con-

ference Image and Vision Computing New Zealand

2008(IVCNZ 2008).

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

726