A Survey of Extended Methods to the Bag of Visual Words for Image Categorization and Retrieval

Mouna Dammak, Mahmoud Mejdoub and Chokri Ben Amar

REGIM-REsearch Groups on Intelligent Machines, National Engineering School of Sfax,

University of Sfax, BP 1173, 3038 Sfax, Tunisia

Keywords:

Image Representation, Spatial Neighboring Relation, Bag of Visual Words, Encoding and Pooling, Graph Representation, Image Categorization.

Abstract:

The semantic gap is a central issue in computer vision. Users want to retrieve images at a semantic level, but image descriptors only provide low-level similarity. Building an intermediate representation between high-level semantic concepts and low-level visual features is therefore a challenging task. The Bag of visual Words (BoW) model has emerged to address this difficulty in greater generality through techniques that learn semantic vocabularies. Despite its simplicity and effectiveness, building the codebook-based representation is a critical step, ordinarily performed through a coding step and a pooling step. It is still difficult to build a compact codebook at a reduced computational cost, and several approaches try to overcome these difficulties and to improve the image representation. In this paper, we present a survey that addresses the lack of a comprehensive description of the most important published approaches for image categorization and retrieval.

1 INTRODUCTION

The Bag of visual Words (BoW) is a method that provides Mid-Level Descriptors (MLD), which help reduce the semantic gap (Smeulders et al., 2000) between the Low-Level Descriptors (LLD) extracted from an image and the High-Level Descriptors (HLD), i.e. the concepts to be classified. The construction of the BoW model can be divided into two chained stages, encoding and pooling. The encoding step assigns the local descriptors to the codebook elements, while the pooling step aggregates the assigned words into a vector. We can distinguish three problems in the standard visual word representation, which may be the main causes of its limited descriptive power:

1. Visual vocabulary construction based on K-means does not lead to a very efficient and compact set of visual words;

2. An individual visual word carries limited information, so it is not effective in describing the features of objects and scenes;

3. Spatial information is ignored.

Several methods have been proposed to improve the performance of the conventional BoW paradigm and to integrate spatial information. We can classify these methods into two main categories: the first category attempts to improve the generation of the visual vocabulary (Farquhar et al., 2005; Avrithis and Kalantidis, 2012); the second category contains techniques that add spatial information to the BoW, which has been shown to enhance the performance of scene classification and retrieval (Xie et al., 2012; Jiang et al., 2012). The aim of this paper is therefore to review the most advanced BoW-based work for image categorization and retrieval.

The paper starts by describing the general process of building the bag of visual words. Subsequent sections discuss advanced approaches for the bag of visual words model. The first part of the review presents recent approaches based on the coding step, and the second part presents recent approaches based on the pooling step. The last section presents our conclusions.

Dammak M., Mejdoub M. and Ben Amar C..

A Survey of Extended Methods to the Bag of Visual Words for Image Categorization and Retrieval.

DOI: 10.5220/0004750506760683

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 676-683

ISBN: 978-989-758-004-8

© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

2 THE BASELINE SYSTEMS OF BAG OF WORDS REPRESENTATION

The standard pipeline to obtain the Bag of visual Words consists first in gathering a large sample of low-level descriptors from a collection of training images. The second stage clusters these descriptors using K-means clustering. The K cluster centers define the visual words. The value of K is a user-supplied parameter and represents the size of the vocabulary. Once the codebook is built, a new image is processed as follows: extraction of low-level descriptors, assignment of the descriptors to the codebook learned on the training collection, and computation of a histogram that counts the number of occurrences of each visual word of the codebook (see Figure 1). In this section, we detail these stages and the different approaches to each.

Figure 1: General schematic overview of the Bag of visual Words framework
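The last two stages of the baseline pipeline (assignment to the codebook and histogram computation) can be sketched as follows. This is a minimal illustration with toy 2-D descriptors; the function names `nearest_word` and `bow_histogram` and the data are assumptions made for the example, not part of any surveyed system.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the visual word closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def bow_histogram(descriptors, codebook):
    """Count how often each visual word is the nearest one."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(codebook))]


# Toy data (hypothetical): three visual words and five local descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8), (0.2, 0.1)]
histogram = bow_histogram(descriptors, codebook)
```

In a real system the descriptors would be, for instance, 128-dimensional SIFT vectors and the codebook would come from clustering a training set.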

2.0.1 Point/Region Detection

In the literature, we can distinguish two types of patch-based image representations: Interest Points (IP) and dense sampling. On the one hand, IP focus on interesting locations in the image, such as corners or blobs, whose scale and shape are determined by a feature detection algorithm; they offer various degrees of viewpoint and illumination invariance, which improves repeatability. Dense sampling, on the other hand, uses patches of fixed size and shape placed on a regular grid, possibly repeated over several scales; it provides better coverage of the image, a constant number of features per image area, and simple spatial relations between features. Combining both criteria, the author (Tuytelaars, 2010) has proposed a hybrid scheme called dense interest points, which starts from densely sampled patches and then refines their location and scale parameters locally.

2.1 Feature Extraction

2.1.1 Local Descriptor

Various feature descriptors have recently been widely used in general visual recognition, such as the Scale Invariant Feature Transform (SIFT) (Lowe, 2004), Speeded Up Robust Features (SURF) (Bay et al., 2006), Binary Robust Independent Elementary Features (BRIEF) (Su and Jurie, 2011; Calonder et al., 2012), etc. Owing to the success of SIFT, local image features have been widely applied in a variety of computer vision and image processing applications.

The SIFT descriptor proposed by Lowe describes the local shape of a region using edge orientation histograms. The gradient of an image is shift-invariant: taking the derivative cancels out offsets. Under light intensity changes, i.e. a scaling of the intensity channel, the gradient direction and the relative gradient magnitude remain the same. Because the SIFT descriptor is normalized, gradient magnitude changes have no effect on the final descriptor. In particular, various recent works have taken advantage of SIFT to develop advanced object classifiers. The SIFT descriptor is not invariant to light color changes, because the intensity channel is a combination of the R, G and B channels. Yet color features provide powerful information for object and scene classification, indexing and retrieval. For these two reasons, several color extensions of the SIFT descriptor have been proposed, including HSV-SIFT (Bosch et al., 2008), OpponentSIFT (van de Sande et al., 2010) and RGB-SIFT (van de Sande et al., 2010). Furthermore, SIFT is not flip invariant. As a consequence, the descriptors extracted from two identical but flipped local patches can be completely different in feature space. To address this, several flip-invariant descriptors rely on an improved partitioning scheme of the local region, including Mirror and Invert invariant SIFT (MI-SIFT) (Ma et al., 2010) and the Neat Flip Invariant Descriptor (FIND) (Guo and Cao, 2010).
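The invariance argument for gradient-based descriptors can be checked numerically. The sketch below is a deliberately simplified 1-D analogue, not SIFT itself: it takes a finite-difference gradient of an intensity profile and L2-normalizes it, then applies a hypothetical illumination change (scale by 2, offset by 30) and verifies that the normalized gradient is unchanged.

```python
import math


def gradient(signal):
    """Finite-difference derivative of a 1-D intensity profile."""
    return [b - a for a, b in zip(signal, signal[1:])]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


profile = [10.0, 12.0, 15.0, 11.0, 9.0]
# Hypothetical illumination change: intensity scaled by 2 and shifted by 30.
changed = [2.0 * v + 30.0 for v in profile]

original_descriptor = l2_normalize(gradient(profile))
changed_descriptor = l2_normalize(gradient(changed))
```

The derivative removes the offset and the normalization removes the scale factor, which mirrors the two properties stated above for the normalized SIFT descriptor.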

2.2 Codebook Generation

The authors (Sivic and Zisserman, 2003; Csurka et al., 2004) proposed clustering the low-level features with K-means, which remains the dominant method, to obtain the Bag of visual Words. Given a set $x_1, x_2, \ldots, x_N \in \mathbb{R}^D$ of $N$ training descriptors, K-means searches for $K$ vectors $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^D$ and data-to-means assignments $q_1, q_2, \ldots, q_N \in \{1, 2, \ldots, K\}$ such that the cumulative approximation error $\sum_{i=1}^{N} \| x_i - \mu_{q_i} \|^2$ is minimized.

An extended clustering based on a generative model, the Gaussian Mixture Model (GMM), has been proposed by (Farquhar et al., 2005). This model yields a continuous histogram representation, in contrast to a discrete one, because features are assigned to all words probabilistically. A GMM is a linear combination of Gaussian densities, each with its own mean and covariance. The number of clusters corresponds to the number of Gaussians. Clustering can thus be accomplished by estimating the parameters of the mixture, with the component assignments treated as latent variables. The Expectation-Maximization (EM) algorithm (G. Mclachlan, 2000) can be used to determine maximum likelihood estimates in Gaussian mixtures with latent variables. The GMM captures both cluster population and shape, but it considers the pairwise interaction of all data points with all clusters and is slower to converge.

To overcome this limitation, the authors (Avrithis and Kalantidis, 2012) have proposed an approximate version, called Approximate Gaussian Mixtures (AGM), for large scale visual vocabulary learning. In this case, contrary to the usual use of a GMM, descriptors of indexed images are assigned only to their nearest visual word so that the index remains sufficiently sparse. They propose a variant of EM that converges rapidly while dynamically estimating the number of components. They employ approximate nearest neighbor search to speed up the Expectation step and exploit its iterative nature to make the search incremental, boosting both speed and precision.
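To make the K-means objective used for codebook generation concrete, the following is a minimal sketch of Lloyd's algorithm in plain Python. The initial means are passed in explicitly so that the example is deterministic; the function name `kmeans`, the fixed iteration count and the toy points are assumptions for illustration only, and real vocabularies are learned with optimized implementations on far larger data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, initial_means, iterations=20):
    """Lloyd's algorithm: alternate the assignment step and the mean update."""
    means = [list(m) for m in initial_means]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: q_i is the index of the mean closest to x_i.
        assignments = [
            min(range(len(means)), key=lambda k: squared_distance(p, means[k]))
            for p in points
        ]
        # Update step: each mean becomes the centroid of its members.
        for k in range(len(means)):
            members = [p for p, q in zip(points, assignments) if q == k]
            if members:  # an empty cluster keeps its previous mean
                means[k] = [sum(c) / len(members) for c in zip(*members)]
    # Cumulative approximation error: sum_i ||x_i - mu_{q_i}||^2.
    error = sum(squared_distance(p, means[q]) for p, q in zip(points, assignments))
    return means, assignments, error


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, assignments, error = kmeans(points, initial_means=[points[0], points[3]])
```

On this toy set the two groups are recovered after the first iteration, and the resulting means play the role of the visual words.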

2.3 Encoding and Pooling Phase

Once the keypoints are detected, the features extracted and the visual words generated, the final step of extracting the image representation consists of two successive stages, coding and pooling, which should preserve the discriminative power of the local descriptors. The coding step assigns the local descriptors to the visual words, while the pooling step aggregates the assigned words into a vector. Let $X$ be a set of $D$-dimensional local descriptors extracted from an image, i.e. $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$. Let $\mathbf{D} = [d_1, d_2, \ldots, d_K] \in \mathbb{R}^{D \times K}$ be a visual dictionary with $K$ visual words. We use $\alpha_n$ to denote the code vector of descriptor $x_n$. The dimension of $\alpha_n$ equals the dictionary size $K$, except for the Fisher kernel representation.

The coding step can be modeled by an activation function over the codebook, which activates each visual word according to the local descriptor. In the BoW model, the coding function activates only the codeword closest to the descriptor:

$$\alpha_{n,k} = 1 \quad \text{iff} \quad k = \arg\min_{j \in \{1, \ldots, K\}} \| x_n - d_j \|_2^2 \tag{1}$$

where $\alpha_{n,k}$ is the $k$-th element of the encoded vector $\alpha_n$, and $\alpha_{n,k} = 0$ otherwise. This scheme corresponds to hard coding, or Vector Quantization (VQ), over the dictionary. The resulting binary vector is very sparse, but it is unstable when the descriptor lies near the boundary between the neighbourhoods of several visual words (Gemert et al., 2010).
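The boundary sensitivity of hard coding can be seen on a toy case. The sketch below implements the one-hot rule of Eq. (1) for a hypothetical 1-D codebook with two codewords; two descriptors that differ by only 0.02 but lie on opposite sides of the boundary receive completely different codes.

```python
def hard_code(descriptor, codebook):
    """One-hot code of Eq. (1): 1 for the nearest codeword, 0 elsewhere."""
    distances = [sum((x - d) ** 2 for x, d in zip(descriptor, word))
                 for word in codebook]
    nearest = distances.index(min(distances))
    return [1 if k == nearest else 0 for k in range(len(codebook))]


# Hypothetical 1-D codebook: two codewords, decision boundary at 0.5.
codebook = [(0.0,), (1.0,)]
code_left = hard_code((0.49,), codebook)
code_right = hard_code((0.51,), codebook)
```

Soft and locality-based coding schemes are designed to reduce exactly this kind of discontinuity.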

As a result, new coding methods have recently emerged. Sparse coding (Yang et al., 2009) modifies the optimization function by jointly considering the reconstruction error and the sparsity of the code vector, using the well-known property that regularization with the $\ell_1$-norm, for a large enough regularization parameter $\lambda$, yields sparsity:

$$\alpha_n = \arg\min_{\alpha \in \mathbb{R}^K} \| x_n - \mathbf{D}\alpha \|_2^2 + \lambda \| \alpha \|_1$$

where $\lambda$ weights the $\ell_1$-norm regularizer, which controls the sparsity of $\alpha$. Efficient tools have been proposed to obtain tractable solutions (Mairal et al., 2010). Another approach, called soft coding (Gemert et al., 2010), is based on a soft assignment to each visual word. It assigns weights according to the similarities between descriptors and codewords. In the pooling step, soft coding leads to ambiguities because of the superposition of the elements, and it results in dense code vectors, which is undesirable. Hence, several intermediate approaches, called semi-soft coding (Liu et al., 2011), have been proposed; they typically perform the soft assignment only over the few nearest neighbors of the input feature. In contrast to sparse coding, Locality-constrained Linear Coding (LLC), proposed by Wang et al. (Wang et al., 2010), enforces locality instead of sparsity, which leads to smaller coefficients for the basis vectors far away from the local feature $x_n$. The coding coefficients are obtained by solving the following optimization problem:

$$\alpha_n = \arg\min_{\alpha \in \mathbb{R}^K} \| x_n - \mathbf{D}\alpha \|_2^2 + \lambda \| \beta_n \odot \alpha \|_2^2 \tag{2}$$

where $\odot$ denotes element-wise multiplication and $\beta_n$ is the locality adaptor, which weights each basis vector according to its distance from the input descriptor $x_n$:

$$\beta_n = \exp\left( \frac{\mathrm{dist}(x_n, \mathbf{D})}{\sigma} \right)$$

where $\mathrm{dist}(x_n, \mathbf{D}) = [\mathrm{dist}(x_n, d_1), \mathrm{dist}(x_n, d_2), \ldots, \mathrm{dist}(x_n, d_K)]$, $\mathrm{dist}(x_n, d_k)$ is the Euclidean distance between $x_n$ and $d_k$, and $\sigma$ adjusts the weight decay speed of the locality adaptor.
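The difference between soft and semi-soft coding discussed above can be sketched as follows. The example assumes a Gaussian-kernel weighting of the squared distances, which is one common choice; the parameter names `beta` (kernel bandwidth) and `neighbours` (number of codewords kept) are hypothetical and introduced only for this illustration.

```python
import math


def soft_code(descriptor, codebook, beta=1.0, neighbours=None):
    """Soft-assignment code; keep only the nearest words if `neighbours` is set."""
    distances = [sum((x - d) ** 2 for x, d in zip(descriptor, word))
                 for word in codebook]
    # Codeword indices sorted from nearest to farthest.
    keep = sorted(range(len(codebook)), key=lambda k: distances[k])
    if neighbours is not None:
        keep = keep[:neighbours]
    weights = {k: math.exp(-beta * distances[k]) for k in keep}
    total = sum(weights.values())
    # Codewords that were not kept receive an exact zero.
    return [weights.get(k, 0.0) / total for k in range(len(codebook))]


codebook = [(0.0,), (1.0,), (5.0,)]
dense_code = soft_code((0.5,), codebook)                 # soft coding
local_code = soft_code((0.5,), codebook, neighbours=2)   # semi-soft coding
```

The soft code gives every codeword a non-zero weight, whereas the semi-soft code is exactly zero outside the selected neighbours, which restores sparsity.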

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

678

The distance regularization of LLC effectively performs feature selection: in practice, only the bases close to $x_n$ in feature space have non-zero coefficients. This suggests a fast approximation of LLC that removes the regularization completely and instead uses the $K'$ nearest neighbours of $x_n$ ($K' < D < K$) as a set of local bases $\mathbf{D}_n$:

$$\alpha_n = \arg\min_{\alpha \in \mathbb{R}^{K'}} \| x_n - \mathbf{D}_n \alpha \|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top \alpha = 1, \ \forall n \tag{3}$$

This reduces the computational complexity from $O(K^2)$ to $O(K + K'^2)$, and the nearest neighbours can be found using approximate nearest neighbour (ANN) methods such as kd-trees.

Perronnin et al. (Csurka and Perronnin, 2010) have presented the Fisher Vector (FV), which extends the BoW. They assume that the descriptors can be modeled by a probability density function, and the visual vocabulary is learnt with a GMM. The FV captures the average first and second order differences between the image descriptors and the centres of the GMM. The first and second order statistics are concatenated for all $K$ Gaussian components, giving an encoding of size $2DK$, where $D$ is the local descriptor dimension. The FV thus describes how the set of descriptors deviates from an average distribution of descriptors, modeled by a parametric generative model.

In (Zhou et al., 2010), the authors have proposed a Super Vector (SV) approach that extends VQ by a function approximation scheme and is similar to the Fisher encoding. There are two variants of SV, based on hard assignment to the nearest codeword or soft assignment to several near neighbours. For the hard SV, a feature $x_n$ is assigned to its nearest visual word $d_k$, obtained by the K-means clustering algorithm; for the soft SV, the method aggregates the difference vectors $x_n - d_k$ around each visual word $d_k$. This provides an encoding of size $K(D + 1)$. Compared to the Fisher encoding, the super vector encoding: (1) considers only the first order differences $x_n - d_k$ between features and cluster centres; (2) accumulates the elements $s_k$, which represent the weight of each cluster; (3) normalizes each cluster by the square root of the posterior probability instead of the prior probability.
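The Fisher Vector encoding described above can be illustrated with a small sketch. It assumes a diagonal-covariance GMM whose parameters are already given, and it uses the commonly used normalized gradient statistics with respect to the means and standard deviations; this particular normalization and the function name `fisher_vector` are assumptions of the example, and the power and L2 normalizations often applied afterwards are omitted.

```python
import math


def fisher_vector(descriptors, weights, means, sigmas):
    """First- and second-order Fisher statistics for a diagonal GMM."""
    n, k_count, dim = len(descriptors), len(weights), len(means[0])
    first = [[0.0] * dim for _ in range(k_count)]
    second = [[0.0] * dim for _ in range(k_count)]
    for x in descriptors:
        # Weighted likelihood of x under each Gaussian component.
        likelihoods = []
        for w, mu, sigma in zip(weights, means, sigmas):
            log_p = sum(-0.5 * ((xi - m) / s) ** 2
                        - math.log(s * math.sqrt(2.0 * math.pi))
                        for xi, m, s in zip(x, mu, sigma))
            likelihoods.append(w * math.exp(log_p))
        total = sum(likelihoods)
        for k in range(k_count):
            gamma = likelihoods[k] / total  # posterior (soft assignment)
            for j in range(dim):
                z = (x[j] - means[k][j]) / sigmas[k][j]
                first[k][j] += gamma * z
                second[k][j] += gamma * (z * z - 1.0)
    encoding = []
    for k in range(k_count):
        encoding += [v / (n * math.sqrt(weights[k])) for v in first[k]]
        encoding += [v / (n * math.sqrt(2.0 * weights[k])) for v in second[k]]
    return encoding


# Hypothetical GMM with K = 2 components in D = 2 dimensions.
weights = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 4.0)]
sigmas = [(1.0, 1.0), (1.0, 1.0)]
descriptors = [(0.5, -0.2), (3.8, 4.1), (0.0, 0.3)]
encoding = fisher_vector(descriptors, weights, means, sigmas)
```

The output has $2DK$ entries, and a set of descriptors that matches the model statistics exactly (for example two points at one standard deviation on either side of a single Gaussian mean) is encoded as a zero vector, which reflects the "deviation from the average distribution" interpretation.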

The BoW approach involves a large codebook of several thousand visual words. However, clustering high-dimensional feature spaces at large scale is not an easy task. To tackle this problem, Jégou et al. (H.Jeou et al., 2012) have proposed a vector representation called the Vector of Locally Aggregated Descriptors (VLAD), which uses smaller codebooks. Instead of using a GMM to model the feature distribution, VLAD uses the K-means clustering algorithm to build the codebook. The descriptors are then assigned to their nearest codewords. For each visual word, a vector of size $D$ accumulates the differences between the descriptors assigned to it and the word itself. The final image representation concatenates the $K$ vectors of the codebook, giving a vector of size $KD$.
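The VLAD aggregation can be sketched in a few lines. The example assumes the codebook has already been learned with K-means and leaves out the normalization that is usually applied to the final vector; the function name `vlad` and the toy data are hypothetical.

```python
def vlad(descriptors, codebook):
    """Concatenate, per visual word, the summed residuals of its descriptors."""
    dim = len(codebook[0])
    residuals = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        distances = [sum((xi - ci) ** 2 for xi, ci in zip(x, word))
                     for word in codebook]
        k = distances.index(min(distances))  # nearest codeword
        for j in range(dim):
            residuals[k][j] += x[j] - codebook[k][j]
    # Flatten the K blocks of size D into one vector of size K * D.
    return [value for block in residuals for value in block]


codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]
signature = vlad(descriptors, codebook)
```

With $K = 2$ words and $D = 2$ dimensions the signature has $KD = 4$ entries: the first block sums the residuals of the two descriptors near the first word, the second block holds the residual of the descriptor near the second word.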

The authors (Picard and Gosselin, 2013) have extended the VLAD approach into Vectors of Locally Aggregated Tensors (VLAT) by adding an aggregation of the tensor products of descriptors. As in VLAD, the visual codebook is built by clustering the descriptors with K-means. The representation comprises several terms. The first term, as in VLAD, is the sum of the differences between the descriptors assigned to a visual word and that word. The second term is the sum of the self tensor products of the descriptors assigned to a visual word. In general, tensor products of higher order $p$ on centred descriptors can be computed to add further terms to the signature. In practice, the number of terms is limited to the second order. The final image representation is the vector concatenating all the terms, flattened into vectors, for all visual words. Since images with different numbers of descriptors have to be compared, an $\ell_2$ normalization of the signatures is then applied. Note that the first-order term alone, as in VLAD, yields a $DK$-dimensional vector representation.

3 INVESTIGATIONS OF INTEGRATING SPATIAL INFORMATION INTO BAG OF VISUAL WORDS METHODS

3.1 Approaches based on Pairwise Features

Recent works (Zhang and Mayo, 2010; Morioka and Satoh, 2010b; Wang et al., 2012; Khan et al., 2012; Morioka and Satoh, 2010a; Herve and Boujemaa, 2009; Morioka and Satoh, 2011) are based on pairs of visual words. A codebook of size $K$, $\{c_1, \ldots, c_K\}$, is learned by unsupervised learning from a randomly sampled collection of descriptors. Then, every descriptor $d_i$ is assigned to the closest cluster $c_k$ in the feature space. The approaches differ in how they integrate the spatial information. In (Herve and Boujemaa, 2009), Herve et al. first construct a base vocabulary containing $K$ words and then build a vocabulary of $K(K+1)/2$ word pairs, called Quadratic Pairwise Codebook (QPC), to capture spatial information between words. They consider the pairs for which the distance between the two patch centers is below a given radius.

ASurveyofExtendedMethodstotheBagofVisualWordsforImageCategorizationandRetrieval

679

They simply accumulate the pairs in a histogram. To overcome the quadratic number of possible pairs of visual words, Morioka and Satoh (Morioka and Satoh, 2010a) have proposed a compact codebook called the Local Pairwise Codebook (LPC). In contrast to the previous approach, which first quantizes the descriptors to learn a set of visual words, they start from a joint feature space built from pairs of spatially close descriptors. They then apply a clustering algorithm to build a compact codebook and compute a histogram over it. Finally, they combine it with the spatial pyramid matching kernel to demonstrate that local and global spatial information complement each other.
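The quadratic word-pair vocabulary can be illustrated with a short sketch. It maps each unordered pair of visual words to one of the $K(K+1)/2$ bins and counts the pairs whose patch centres lie closer than a given radius. The helper names `pair_index` and `pair_histogram`, the row-wise bin layout and the toy positions are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pair_index(a, b, vocabulary_size):
    """Map an unordered word pair to an index in [0, K(K+1)/2)."""
    lo, hi = min(a, b), max(a, b)
    # Pairs are laid out row by row: (0,0), (0,1), ..., (1,1), (1,2), ...
    return lo * vocabulary_size - lo * (lo - 1) // 2 + (hi - lo)


def pair_histogram(words, positions, vocabulary_size, radius):
    """Histogram of word pairs whose patch centres are closer than `radius`."""
    counts = Counter()
    for i, j in combinations(range(len(words)), 2):
        if math.dist(positions[i], positions[j]) < radius:
            counts[pair_index(words[i], words[j], vocabulary_size)] += 1
    size = vocabulary_size * (vocabulary_size + 1) // 2
    return [counts.get(p, 0) for p in range(size)]


# Hypothetical image: four patches with their visual words and centres.
words = [0, 1, 1, 2]
positions = [(0, 0), (1, 0), (2, 0), (10, 10)]
histogram = pair_histogram(words, positions, vocabulary_size=3, radius=1.5)
```

The histogram length grows quadratically with $K$, which is the cost that the compact pairwise codebooks aim to avoid.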

The same authors (Morioka and Satoh, 2010b) have extended LPC by adding directional information to the representation, producing two new approaches called Directional Local Pairwise Bases (DLPB) and Directional Local Pairwise Codebook (DLPC). In DLPB, they use sparse coding to learn a compact set of bases capturing the correlation between descriptors. Moreover, these bases are learned for each quantized direction, which adds explicit directional information to the representation. In DLPC, a variant of DLPB, K-means replaces sparse coding to build direction-specific codebooks. For each directional codebook, the authors compute a spatial pyramid matching kernel and then take the average of the kernels.

LPC yields a compact codebook of pairs of spatially close local descriptors. However, it is not scale invariant and is only appropriate for densely sampled local features. In contrast, the Proximity Distribution Kernel (PDK) method provides a scale-invariant and robust representation. It captures rich spatial proximity information between local features, but the number of visual words increases quadratically. Inspired by these two techniques, the authors (Morioka and Satoh, 2011) have unified LPC and PDK into a new method called Compact Correlation Coding (CCC), which combines the strengths of both. Compared to PDK, CCC produces a more general and compact codebook. Moreover, it captures a robust, scale-invariant spatial proximity distribution of local features, which cannot be achieved with LPC.

3.2 Approaches based on High-Order Features (Visual Phrases)

Although second order features have been studied intensively, some works (Zheng et al., 2008; Zhang and Chen, 2009; Bingbing et al., 2013) focus on modeling high-order local features in order to capture more spatial information. In (Zheng et al., 2008), the local spatial neighborhood is extracted for each local region. The FP-growth algorithm is applied to perform the Frequent Itemset Mining (FIM) task over the visual words. In order to integrate both the local proximity of visual words and co-occurrence information, the authors define the visual synset as a probabilistic concept of visual words, which is learned through supervised learning.

In (Zhang et al., 2011a), the authors have proposed descriptive visual words and descriptive visual phrases as the visual analogues of text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. The co-occurrence is computed between two visual words within a short distance. For each image category, they define the Descriptive Visual Word (DVW) candidates as the visual words contained in it, and they generate the Descriptive Visual Phrase (DVP) candidates using a rotation-invariant spatial histogram. The co-occurrence frequency is obtained by counting how often two visual words co-occur within the spatial distance in the same category.

In (Xie et al., 2012), Xie et al. extract SIFT and Edge-SIFT descriptors and combine them to build the codebook. Then, they generate geometric visual phrases by taking a phrase as an unordered set of neighboring visual words. A max-pooling step is performed over the whole phrase, and spatial weighting based on a smoothed edge map is applied.

The authors (Cao et al., 2010) have proposed an ordered bag of features based on projecting features onto certain lines or circles, which captures basic geometric information in images. These representations are the basis of the spatial bag of features. They apply operations on the histogram features, i.e. calibration, equalization and decomposition, to handle typical image transformations including translation, rotation and scaling. Then, they adopt the RankBoost algorithm (Freund et al., 2003) to select the most effective configuration.

In (Zhang et al., 2011b), the authors have integrated the algorithm proposed in (Zhang and Chen, 2009) to identify the co-occurring geometry-preserving visual phrases (GVP) in two images. In addition to co-occurrences, the GVP method captures the local and long-range spatial layouts of the visual words. To measure the GVP similarity of two images, they compute the offset $(\Delta x, \Delta y)$ for each pair of identical words in these images. Then, a vote is cast in the offset space at $(\Delta x, \Delta y)$. In the offset space, $K$ votes located at the same place correspond to a co-occurring GVP of length $K$. After obtaining the dot product, the similarity of the two images is the dot product divided by the L2-norms.
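The offset-space voting used to detect co-occurring GVPs can be sketched as follows. The example assumes that keypoint positions have already been quantized to integer grid cells, so that equal offsets fall into the same bin; the function name `offset_votes` and the toy images are hypothetical, and the subsequent similarity computation is not reproduced.

```python
from collections import Counter


def offset_votes(words_a, positions_a, words_b, positions_b):
    """Vote at (dx, dy) for every pair of identical words across two images."""
    votes = Counter()
    for wa, (xa, ya) in zip(words_a, positions_a):
        for wb, (xb, yb) in zip(words_b, positions_b):
            if wa == wb:
                votes[(xb - xa, yb - ya)] += 1
    return votes


# Hypothetical images: image B contains the words of image A shifted by (5, 2),
# plus one extra occurrence of word 3 at the origin.
words_a, positions_a = [3, 7, 9], [(0, 0), (1, 0), (0, 1)]
words_b, positions_b = [3, 7, 9, 3], [(5, 2), (6, 2), (5, 3), (0, 0)]
votes = offset_votes(words_a, positions_a, words_b, positions_b)
```

Here three votes accumulate at the offset (5, 2), which signals a group of three words that keeps its spatial layout in both images, while the isolated match only produces a single vote elsewhere.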

The authors (Jiang et al., 2012) have proposed a new visual phrase selection approach based on random partition of images. After extracting local invariant features, they randomly split the image multiple times to form a pool of overlapping image patches. Each patch groups the local features inside it and is described by a visual phrase. Thus, for each local descriptor, they obtain a number of Randomized Visual Phrases (RVP) of varying shapes and sizes depending on its spatial context. For each RVP, they independently compute a matching score between the test image and the query image, and treat it as the voting weight of the corresponding patch. The final confidence score of each pixel is computed as the expectation of the voting weights of all patches that contain this pixel. From the pixel-wise voting map, the similar image region can finally be identified. In (Bingbing et al., 2013), the authors have proposed a model of high-order local spatial context called Spatialized Random Forest (SRF). SRF can explore much more complicated and informative local spatial patterns in a randomized manner, applying spatially random neighbor selection and random histogram-bin partition during tree construction. A set of informative high-order local spatial patterns is derived thanks to the discriminative capability test applied to the random partition at each tree node split. New images are then encoded by counting the occurrences of these discriminative local spatial patterns.

3.3 Approaches based on Graph and Graph Matching

To take into account the spatial constraints in images, several authors (Quack et al., 2007; Bowen et al., 2012; Kisku et al., 2010; Jaechul Kim, 2010; Duchenne et al., 2011) have used graph matching to establish correspondences between images. Visual graphs provide powerful structural models, but their use in image classification has been limited by the difficulty of matching graphs, a problem known to be NP-complete. In (Jaechul Kim, 2010), Kim et al. proposed a dense feature matching method. To match two images, they segment one and leave the other unsegmented. Then, they find correspondences between points within each region of the segmented image and some subset of the points within the unsegmented image. Layout consistency is efficiently enforced in each region-to-image match group via an objective solvable with dynamic programming. This method was extended by Duchenne et al. (Duchenne et al., 2011), who formulated image graph matching as an energy optimization problem. The graph nodes and edges represent the regions associated with a coarse image grid and their adjacency relationships.

In (Pham et al., 2012), an image is first represented by several types of features: SIFT descriptors for each extracted keypoint, color histograms and edge descriptors for regions defined by a grid partition, and HSV color values for regions defined by pixel sampling. The second step represents each image as a graph made of a set of weighted concepts and a set of weighted relations. To this end, they build a visual vocabulary for each type of image representation by k-means clustering. Each region is described by a visual word, and two relation sets, left_of and top_of, are extracted from each pair of connected regions to encode the relationships between regions. The third step retrieves the images relevant to a given query: the different types of image representations and spatial relations are taken into account during matching by computing the likelihood of two graphs using a language model framework. In order to relax the graph matching condition, (Quack et al., 2007) and (Wu et al., 2013) proposed measuring the similarity between two image graphs by comparing subgraphs extracted from them rather than by full graph matching. Quack et al. (Quack et al., 2007) have used FIM to discover a set of distinctive spatial configurations of visual words to learn different object categories. In (Wu et al., 2013), the authors divide an image into a set of spatial grids at several levels. Then, they define a directed graph to describe the relationships between these grids, in which the grids are represented by nodes and the relations between grids by edges. After that, they construct a histogram on each node, reflecting the occurrence of features in a block, and a histogram on each edge, reflecting the occurrence of features that lie in one block and tend to shift into another.
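As an illustration of the grid-based visual graph with "left of" and "top of" relations described above for (Pham et al., 2012), the sketch below counts relation triples between adjacent regions of a grid of visual words. It is a simplified reading of that construction: the function name `grid_relations`, the relation labels and the unweighted counts are assumptions of this example, and the language-model matching step is not reproduced.

```python
from collections import Counter


def grid_relations(grid):
    """Count (relation, word_a, word_b) triples between adjacent grid regions."""
    relations = Counter()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # region (r, c) is left of region (r, c + 1)
                relations[("left_of", grid[r][c], grid[r][c + 1])] += 1
            if r + 1 < rows:  # region (r, c) is on top of region (r + 1, c)
                relations[("top_of", grid[r][c], grid[r + 1][c])] += 1
    return relations


# Hypothetical 2 x 2 grid: each cell holds the visual word of its region.
grid = [[0, 1],
        [0, 2]]
relations = grid_relations(grid)
```

The resulting counts, together with the per-region word counts, form the concept and relation sets that a graph-based matching scheme can compare between images.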

4 CONCLUSIONS

The Bag of Visual Words has been successfully applied to various computer vision applications, including image categorization and retrieval. The construction of the BoW starts with the extraction of local features. After that, two steps are necessary: encoding and pooling. Despite its simplicity, the model ignores spatial information. In this paper, we have introduced the methods that improve the construction of the BoW, such as LLC, Fisher vector, VLAD and VLAT, and the methods that integrate spatial information, such as approaches based on pairwise features (LPC, DLPC, CCC, ...), approaches based on visual phrases (Phreselet), and approaches based on graphs.

REFERENCES

Avrithis, S. and Kalantidis, Y. (2012). Approximate gaus-

sian mixtures for large scale vocabularies. In Euro-

pean Conference on Computer Vision, volume 7574,

pages 15–28. Springer.

Bay, H., Tuytelaars, T., and Gool, L. (2006). Surf speeded

up robust features. In European Conference on Com-

puter Vision.

Bingbing, N., Shuicheng, Y., Meng, W., Kassim, A., and Qi,

T. (2013). High order local spatial context modeling

by spatialized random forest. IEEE Transactions on

Image Processing, 22(2).

Bosch, A., Zisserman, A., and Muoz., X. (2008). Scene

classiﬁcation using a hybrid generative/discriminative

approach. IEEE Transactions Pattern Analysis and

Machine Intelligents, 30:712–727.

Bowen, F., Du, E. Y., and Hu, J. (2012). A novel graph-

based invariant region descriptor for image matching.

In EIT.

Calonder, M., Lepetit, V., Ozuysal, M., Trzcinski, T.,

Strecha, C., and Fua, P. (2012). Computing a local

binary descriptor very fast. IEEE Transactions on Pat-

tern Analysis and Machine Intelligence, 34(7):1281–

1298.

Cao, Y., Wang, C., Li, Z., Zhang, L., and Zhang, L. (2010).

Spatial bag of features. In CVPR.

Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray,

C. (2004). Visual categorization with bags of key-

points. In In Workshop on Statistical Learning in

Computer Vision, ECCV, pages 1–22.

Csurka, G. and Perronnin, F. (2010). Fisher vectors: Be-

yond bag-of-visual-words image representations. In

VISIGRAPP, pages 28–42.

Duchenne, O., Joulin, A., and Ponce, J. (2011). A graph-

matching kernel for object categorization. In ICCV.

Farquhar, J., Szedmak, S., Meng, H., and Shawe-Taylor, J. (2005). Improving bag-of-keypoints image categorisation. Technical report, University of Southampton.

Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003).

An efﬁcient boosting algorithm for combining prefer-

ences. J. Mach. Learn. Res., 4:933–969.

McLachlan, G. and Peel, D. (2000). Finite mixture models.

Gemert, J., Veenman, C., and Geusebroek, J. (2010). Visual word ambiguity. TPAMI.

Guo, X. and Cao, X. (2010). Find: A neat ﬂip invariant de-

scriptor. In 20th International Conference on Pattern

Recognition, pages 515–518.

Herve, N. and Boujemaa, N. (2009). Visual word pairs for

automatic image annotation. In Proceedings of the

2009 IEEE international conference on Multimedia

and Expo, ICME 09.

Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., and Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9).

Kim, J. and Grauman, K. (2010). Asymmetric region to image matching for comparing images with generic object categories. In CVPR.

Jiang, Y., Meng, J., and Yuan, J. (2012). Randomized visual

phrases for object search. In CVPR.

Khan, R., Barat, C., Muselet, D., and Ducottet, C. (2012).

Spatial orientations of visual word pairs to improve

bag-of-visual-words model. In BMVC.

Kisku, D. R., Rattani, A., Grosso, E., and Tistarelli, M.

(2010). Face identiﬁcation by sift-based complete

graph topology. CoRR.

Leordeanu, M. and Hebert, M. (2005). A spectral tech-

nique for correspondence problems using pairwise

constraints. In Tenth IEEE International Conference

on Computer Vision, pages 1482–1489.

Liu, L., Wang, L., and Liu, X. (2011). In defense of soft-

assignment coding. In International Conference on

Computer Vision, ICCV ’11, pages 2486–2493.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2).

Ma, R., Chen, J., and Su, Z. (2010). Mi-sift: mirror and in-

version invariant generalization for sift descriptor. In-

ternational Conf. on Image and Video Retrieval, pages

228–236.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). On-

line learning for matrix factorization and sparse cod-

ing. Journal of Machine Learning Research, 11:19–

60.

Morioka, N. and Satoh, S. (2010a). Building compact lo-

cal pairwise codebook with joint feature space cluster-

ing. In 11th European conference on Computer vision,

ECCV10.

Morioka, N. and Satoh, S. (2010b). Learning directional

local pairwise bases with sparse coding. In BMVC.

Morioka, N. and Satoh, S. (2011). Compact correlation cod-

ing for visual object categorization. In ICCV.

Pham, T., Mulhem, P., Maisonnasse, L., Gaussier, E., and

Lim, J. (2012). Visual graph modeling for scene

recognition and mobile robot localization. Multime-

dia Tools Appl., 60(2).

Picard, D. and Gosselin, P. (2013). Efﬁcient image signa-

tures and similarities using tensor products of local de-

scriptors. Computer Vision and Image Understanding,

117(6):680–687.

Quack, T., Ferrari, V., Leibe, B., and Gool, L. V. (2007). Efficient mining of frequent and distinctive feature configurations. In International Conference on Computer Vision.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

682

Sivic, J. and Zisserman, A. (2003). Video google: A text re-

trieval approach to object matching in videos. In Inter-

national Conference on Computer Vision, volume 2,

pages 1470–1477.

Smeulders, A., Worring, M., Santini, S., Gupta, A., and Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380.

Su, Y. and Jurie, F. (2011). Semantic contexts and Fisher vectors for the ImageCLEF 2011 photo annotation task. In CLEF (Notebook Papers/Labs/Workshop).

Tuytelaars, T. (2010). Dense interest points. In Computer

Vision and Pattern Recognition, pages 2281–2288.

van de Sande, K. E. A., Gevers, T., and Snoek, C. G. M.

(2010). Evaluating color descriptors for object and

scene recognition. IEEE Transactions on Pattern

Analysis and Machine Intelligence, 32(9):1582–1596.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y.

(2010). Locality-constrained linear coding for image

classiﬁcation. In CVPR, pages 3360–3367.

Wang, L., Song, D., and Elyan, E. (2012). Improving bag-

of-visual-words model with spatial-temporal correla-

tion for video retrieval. In 21st ACM international

conference on Information and knowledge manage-

ment, CIKM 12.

Wu, Z., Huang, Y., Wang, L., and Tan, T. (2013). Spatial

graph for image classiﬁcation. In 11th Asian confer-

ence on Computer Vision, ACCV 12.

Xie, L., Tian, Q., and Zhang, B. (2012). Spatial pooling of

heterogeneous features for image applications. In 20th

ACM international conference on Multimedia, MM

12.

Yang, J., Yu, K., Gong, Y., and Huang, T. (2009). Lin-

ear spatial pyramid matching using sparse coding for

image classiﬁcation. In Computer Vision and Pattern

Recognition.

Zhang, E. and Mayo, M. (2010). Improving bag-of-words

model with spatial information. In 25th Interna-

tional Conference of Image and Vision Computing

New Zealand.

Zhang, S., Tian, Q., Hua, G., Huang, Q., and Gao, W. (2011a). Generating descriptive visual words and visual phrases for large scale image applications. IEEE Transactions on Image Processing, 20(9).

Zhang, Y. and Chen, T. (2009). Efﬁcient kernels for identi-

fying unbounded-order spatial features. In CVPR.

Zhang, Y., Jia, Z., and Chen, T. (2011b). Image retrieval

with geometry-preserving visual phrases. In CVPR.

Zheng, Y., Zhao, M., Neo, S., Chua, T., and Tian, Q. (2008).

Visual synset: Towards a higher level visual represen-

tation. In Computer Vision and Pattern Recognition.

Zhou, X., Yu, K., Zhang, T., and Huang, T. (2010). Im-

age classiﬁcation using super-vector coding of local

image descriptors. In 11th European conference on

Computer vision, ECCV’10, pages 141–154.

ASurveyofExtendedMethodstotheBagofVisualWordsforImageCategorizationandRetrieval

683