USING ASSOCIATION RULES AND SPATIAL WEIGHTING FOR AN
EFFECTIVE CONTENT-BASED IMAGE RETRIEVAL
Ismail Elsayad, Jean Martinet, Thierry Urruty, Taner Danisman, Haidar Sharif and Chabane Djeraba
LIFL-CNRS, Lille 1 University, Villeneuve d’Ascq, France
Keywords:
SURF, Bag-of-visual-words, Visual phrases, Gaussian mixture model, Spatial weighting.
Abstract:
With the huge amount of digital images available nowadays, effective methods for accessing the desired images are essential. The aim of this paper is to build a meaningful mid-level representation of visual documents, to be used later for matching a query image against the other images in the target database. The approach
is based firstly on constructing different visual words using local patch extraction and fusion of descriptors.
Then, we represent the spatial constitution of an image as a mixture of n Gaussians in the feature space.
Finally, we extract different association rules between frequent visual words in the local context of the image
to construct visual phrases. Experimental results show that our approach outperforms traditional image retrieval techniques.
1 INTRODUCTION
In typical Content-Based Image Retrieval (CBIR)
systems, it is always important to select an appro-
priate representation for documents (Baeza-Yates and
Ribeiro-Neto, 1999). Indeed, the quality of the re-
trieval depends on the quality of the internal repre-
sentation for the content of the documents.
A popular approach (bag-of-visual-words) that
appeared recently is to consider images as a col-
lection of quantized local patches. This approach
achieves good results in representing variable object
appearances caused by changes in pose, scale and
translation, etc. Despite the success of the bag-of-
visual-words approach in recent studies (Sivic and
Zisserman, 2003; Willamowski et al., 2004; Jurie and
Triggs, 2005), there are still three important draw-
backs, which this paper aims to resolve.
Firstly, most of the local descriptors are based on
the intensity or gradient information of images, so
neither shape nor color information is used. In the
proposed approach, in addition to the SURF descrip-
tor (Bay et al., 2008), we introduce a novel descriptor
(edge context) that is based on the distribution of edge
points.
Secondly, since the bag-of-visual-words approach
represents an image as a collection of local descrip-
tors, ignoring their order within the image, the re-
sulting model provides little information about the spatial structure of the image. In this paper
we propose a new spatial weighting scheme that con-
sists of weighting visual words according to the prob-
ability of each visual word belonging to one of the n
Gaussians in the 5 dimensional color-spatial feature
space.
Thirdly, the low discrimination power of visual
words leads to low correlations between the image
features and their semantics. In our work, we build
a higher-level representation, namely visual phrases, from groups of adjacent words using association
rules extracted with the Apriori algorithm (Agrawal
et al., 1993). Having a higher-level representation,
from mining the occurrence of groups of lower-level
features (visual words), enhances the image represen-
tation with more discriminative power since structural
information will be added.
The remainder of the article is structured as fol-
lows: in Section 2, we describe the method for con-
structing visual words from images and mining vi-
sual phrases from visual words to obtain the final im-
age representation. In Section 3, we present an image
similarity method based on visual words and visual
phrases. We report on the experimental results in Sec-
tion 4, and we give a conclusion to this paper in Sec-
tion 5.
2 IMAGE REPRESENTATION
In this section, we describe different components of
the chain of processes in constructing the image rep-
resentation (see Figure 1).
Figure 1: Flow of information in the visual document rep-
resentation model.
2.1 Visual Word Construction
We use the fast Hessian detector (Bay et al., 2008) to
extract interest points. In addition, the Canny edge
detector (Canny, 1986) is used to detect edge points.
From both sets of interest and edge points, we use a
clustering algorithm to group these points into differ-
ent clusters in the 5 Dimensional color-spatial feature
space (see the visual construction part in Figure 1).
The clustering result is necessary to extract our edge
context descriptor (to be discussed later in this paper)
and to estimate the spatial weighting scheme for the
visual words.
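As an illustration, the interest and edge point extraction described above could be sketched as follows with OpenCV. This is a minimal sketch under our own assumptions: the SURF (fast-Hessian) detector may require the opencv-contrib build, and the Hessian and Canny thresholds shown here are placeholder values, not the ones used in the paper.

```python
import cv2
import numpy as np

def extract_points(image_path, canny_low=100, canny_high=200):
    """Return the color image, the fast-Hessian interest points and the Canny
    edge points as (x, y) coordinates, to be embedded later in the 5-D space."""
    bgr = cv2.imread(image_path)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # Fast-Hessian interest points via the SURF detector (opencv-contrib build).
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints = surf.detect(gray, None)
    interest_xy = np.array([kp.pt for kp in keypoints])   # shape (m1, 2)

    # Canny edge points.
    edges = cv2.Canny(gray, canny_low, canny_high)
    ys, xs = np.nonzero(edges)
    edge_xy = np.stack([xs, ys], axis=1)                   # shape (m2, 2)

    return bgr, interest_xy, edge_xy
```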
2.1.1 Gaussian Mixture Model
In our approach, based on a Gaussian mixture model (GMM) (Bilmes, 1997), we model the color and position feature space to cluster the set of interest and edge points into different clusters. The objective is to cluster the salient structure of the image based on information extracted from the image other than the intensity and appearance information used in the description process. In addition, by using the GMM, we present a novel spatial weighting scheme for visual words as follows.
Firstly, a 5 Dimensional color-spatial feature vector, built from the 3 Dimensional RGB color features plus the 2 Dimensional (x, y) spatial position, is created to represent each interest and edge point. In an image with m interest/edge points, a total of m feature vectors Z_1, ..., Z_m can be extracted.
The set of points is assumed to be a mixture of n
Gaussians in the 5 Dimensional color-spatial feature
space and the Expectation-Maximization (EM) algo-
rithm is used to iteratively estimate the parameter set
of the Gaussians. The parameter set of the Gaussian mixture is θ = {µ_i, Σ_i, P_i}, i = 1, ..., n, where µ_i is the mean of the i-th Gaussian cluster, Σ_i its covariance, and P_i its prior probability.
By applying Bayes' theorem at each E-step, we can estimate the expected value of the log-likelihood function with respect to the conditional distribution of β_i (which denotes the Gaussian that Z_j comes from), under the current estimate of the parameters θ(t):

P(β_i | Z_j, θ(t)) = P(Z_j | β_i, θ(t)) P(β_i | θ(t)) / P(Z_j)    (1)

P(Z_j) = Σ_{k=1}^{n} P(Z_j | β_k, θ(t)) P(β_k | θ(t))    (2)
At each M-step, the parameter set θ of the n Gaussians is updated to maximize the log-likelihood:

Q(θ) = Σ_{j=1}^{m} Σ_{i=1}^{n} P(β_i | Z_j, θ(t)) ln[ P(Z_j | β_i, θ(t)) P(β_i | θ(t)) ]    (3)
At the final step of the EM algorithm, we obtain all
the parameters needed to construct our set of Gaus-
sians. Then each point is assigned to one of the Gaus-
sians.
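A minimal sketch of this clustering step is given below, assuming the interest and edge points extracted earlier. It uses scikit-learn's EM-based GaussianMixture in place of a hand-written EM loop; the number of Gaussians n is a free parameter here, and the helper names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_spatial_gmm(bgr_image, points_xy, n_gaussians=5, seed=0):
    """Cluster interest/edge points in the 5-D (R, G, B, x, y) space with a
    GMM fitted by EM; return the model, the feature vectors Z, the posteriors
    P(beta_i | Z_j) and the hard assignment of each point to a Gaussian."""
    xs = points_xy[:, 0].astype(int)
    ys = points_xy[:, 1].astype(int)
    rgb = bgr_image[ys, xs, ::-1].astype(float)       # BGR -> RGB pixel values
    Z = np.hstack([rgb, points_xy.astype(float)])     # (m, 5) feature vectors

    gmm = GaussianMixture(n_components=n_gaussians,
                          covariance_type='full',
                          random_state=seed).fit(Z)
    posteriors = gmm.predict_proba(Z)   # P(beta_i | Z_j), reused for weighting
    labels = posteriors.argmax(axis=1)  # assign each point to one Gaussian
    return gmm, Z, posteriors, labels
```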
2.1.2 Extracting and Describing Local Features
In our approach, we use the SURF low-level feature
descriptor that describes how the pixel intensities are
distributed within a scale-dependent neighborhood of
each interest point detected by the Fast-Hessian. This
approach is similar to the SIFT one (Lowe, 2004), but
integral images (Viola and Jones, 2001) are used in
conjunction with filters known as Haar wavelets in or-
der to increase robustness and decrease the computa-
tion time. Haar wavelets are simple filters which can
be used to find gradients in the x and y directions. The
extraction of the descriptor can be divided into two
distinct tasks. First, each interest point is assigned a reproducible orientation. Then a scale-dependent win-
dow is constructed in which a 64 Dimensional vector
is extracted. In order to achieve scale-invariant re-
sults, it is important that all calculations for the de-
scriptor are based on measurements relative to the de-
tected scale. In addition to the SURF descriptor, we
introduce a novel Edge context descriptor at each
interest point detected by the Fast-Hessian, based on
the distribution of the edge points in the same Gaus-
sian (by returning to the 5 Dimensional color-spatial
feature space).
Our descriptor is inspired by the shape context
descriptor proposed by (Belongie et al., 2002), with regard to extracting information from the distribution of edge points. Describing the distribution of these points enriches our descriptor with information beyond the intensity described by SURF. More-
over, the distribution over relative positions is a ro-
bust, compact, and highly discriminative descriptor.
As shown in Figure 2, vectors are drawn in the 2D spatial image space from each interest point to all other edge points that lie within the same cluster in the 5 Dimensional color-spatial feature space. Then the edge context descriptor for each interest point is represented as a histogram with 6 bins for R (the magnitude of the vector drawn from the interest point to an edge point) and 4 bins for θ (its orientation angle). Several invariance properties hold for this novel descriptor:
Firstly, invariance to translation is intrinsic to the
edge context definition since the distribution of the
edge points is measured with respect to fixed interest
points.
Secondly, invariance to scale can be achieved by normalizing the radial distance by the mean distance between the whole set of points within the same Gaussian in the 5 Dimensional color-spatial feature space.
Thirdly, invariance to rotation is achieved by
measuring all angles relative to the tangent angle of
each interest point.
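A sketch of how the 24-dimensional edge context histogram could be computed for one interest point is given below, under the invariance choices listed above. The bin layout, the reference angle and the default normalization are illustrative assumptions of ours, not the exact values used in the paper.

```python
import numpy as np

def edge_context(interest_xy, edge_xy, ref_angle=0.0, mean_dist=None,
                 n_r_bins=6, n_theta_bins=4):
    """24-D histogram (6 radial x 4 angular bins) of edge point positions
    relative to one interest point; `edge_xy` holds only the edge points that
    fall in the same Gaussian cluster as the interest point."""
    d = edge_xy - np.asarray(interest_xy, dtype=float)   # vectors interest -> edges
    r = np.hypot(d[:, 0], d[:, 1])
    if mean_dist is None:
        # simplification: normalize by the mean distance to these edge points
        # (the paper normalizes by the mean distance within the whole Gaussian)
        mean_dist = r.mean()
    r = r / (mean_dist + 1e-8)                           # scale invariance
    theta = (np.arctan2(d[:, 1], d[:, 0]) - ref_angle) % (2 * np.pi)  # rotation invariance

    # assumed bin layout: log-spaced radial bins, uniform angular bins
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r_bins + 1)
    t_edges = np.linspace(0.0, 2 * np.pi, n_theta_bins + 1)
    hist, _, _ = np.histogram2d(r, theta, bins=[r_edges, t_edges])
    return hist.flatten()                                # 6 x 4 = 24 dimensions
```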
Following the visual construction part in Figure
1, after the extraction of the edge context feature, fu-
sion between this descriptor and the SURF descriptor
is performed. This mixed feature vector is composed
of 88 dimensions (64 from SURF + 24 from the edge
context descriptor). The new feature vector thus describes both the intensity distribution and the distribution of edge points of the image. This en-
riches our image representation with more local infor-
mation.
Figure 2: Extraction of the edge context descriptor in the 2D spatial space, where the points have already been clustered in the 5 Dimensional color-spatial Gaussian space.

Visual words are created by clustering the mixed feature vectors (SURF + edge context feature vectors) in order to form a visual vocabulary. We quantize the 88 Dimensional feature vector space by assigning each observed feature to the closest visual word.
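The vocabulary construction and quantization step might then look as follows. This is a sketch only: the paper does not name the clustering algorithm, so k-means is assumed here, and k = 2750 matches the vocabulary size reported in Section 4.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(training_descriptors, k=2750, seed=0):
    """Cluster 88-D fused descriptors (64 SURF + 24 edge context) into k visual words."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10)
    km.fit(training_descriptors)              # (N, 88) array of training descriptors
    return km.cluster_centers_                # vocabulary: (k, 88)

def quantize(descriptors, vocabulary):
    """Assign each 88-D descriptor to its closest visual word (index in 0..k-1)."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```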
2.1.3 Spatial Weighting for the Visual Words
For the spatial weighting, we use a scheme adapted from the one of (Chen et al., 2009), which differs from the tf-idf weighting scheme. Suppose that, in an image, local descriptors obtained from the interest point set {1, 2, ..., n_i} belong to the same Gaussian and are assigned to a visual word w_l, where 1 ≤ l ≤ k and k is the number of visual words in the visual vocabulary. Then the sum of the probabilities of occurrence of these salient points indicates the contribution of the visual word w_l to the Gaussian β_i. Therefore, the weighted term frequency Tf_{w_l, β_i} of a visual word w_l with respect to the Gaussian β_i is defined as follows:

Tf_{w_l, β_i} = Σ_{m=1}^{n_i} P(β_i | Z_m)    (4)
The average weighted term frequency Tf_{w_l} of w_l with respect to an image I, where w_l occurs in n_{w_l} Gaussians, is defined as follows:

Tf_{w_l} = ( Σ_{i=1}^{n_{w_l}} Tf_{w_l, β_i} ) / n_{w_l}    (5)

The weighted inverse Gaussian frequency of w_l with respect to an image I with n Gaussians is defined as follows:

If_{w_l} = ln( n / n_{w_l} )    (6)

The final spatial weight of the visual word w_l is defined by the following formula:

Sw_{w_l} = Tf_{w_l} × If_{w_l}    (7)
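Putting equations (4)-(7) together, the spatial weight of each visual word in an image could be computed as in the following minimal sketch, assuming the per-point Gaussian posteriors P(β_i | Z_m) from Section 2.1.1 and the visual word assignments from the quantization step; the function and variable names are ours.

```python
import numpy as np

def spatial_weights(word_ids, gaussian_labels, posteriors, vocab_size):
    """Spatial weight Sw (eq. 7) for every visual word of one image.

    word_ids        : (m,) visual word index of each interest point
    gaussian_labels : (m,) index of the Gaussian each point belongs to
    posteriors      : (m, n) P(beta_i | Z_m) for every point and Gaussian
    """
    n = posteriors.shape[1]
    sw = np.zeros(vocab_size)
    for w in np.unique(word_ids):
        tf_per_gaussian = []
        for i in range(n):
            mask = (word_ids == w) & (gaussian_labels == i)
            if mask.any():
                tf_per_gaussian.append(posteriors[mask, i].sum())  # eq. (4)
        n_wl = len(tf_per_gaussian)           # number of Gaussians containing w
        tf = np.mean(tf_per_gaussian)         # eq. (5)
        idf = np.log(n / n_wl)                # eq. (6)
        sw[w] = tf * idf                      # eq. (7)
    return sw
```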
2.2 Visual Phrase Construction
Before proceeding to the construction phase of visual
phrases for the set of images, let us examine phrases
in text. A phrase can be defined as a group of words
functioning as a single unit in the syntax of a sentence
and sharing a common meaning. For example, from
the sentence ”James Gordon Brown is the Prime Min-
ister of the United Kingdom and leader of the Labour
Party”, we can extract a shorter phrase ”Prime Minis-
ter”. The meaning shared by these two phrases is the
governmental career of James Gordon Brown.
Images are particular arrangements of patches in
a 2D space. Such patches in an image are not inde-
pendent but are likely to belong to the same physical object and, consequently, to have the same conceptual interpretation. The
inter-relationships among patches encode important
information for our perception. Applying association
rules, we used both the patches themselves and their
inter-relationships to obtain a higher-level representa-
tion of the data, known as visual phrases.
We are not alone in applying the association rules
to images. (Martinet and Satoh, 2007) adapted the
definition of association rules to the context of per-
ceptual objects in order to merge strongly associated
features and get a more compact representation of the
data. We apply an adapted version of this to the fre-
quent, consecutive visual words that share the strong
association rules and are located within the same lo-
cal context. All local patches are within the same
context whenever the distance between their centers
is less than or equal to a given threshold. Consider the set of all visual words (the visual vocabulary) W = {w_1, w_2, ..., w_k}; let D be a database (a set of images I), and let T = {t_1, t_2, ..., t_n} be the set of all the different sets of visual words located in the same context (see Figure 3).
An association rule is an expression X ⇒ Y, where X and Y are sets of items. The properties that characterize association rules are:
The rule X ⇒ Y holds in the transaction set T with support s if s% of the transactions in T contain both X and Y.
The rule X ⇒ Y holds in the transaction set T with confidence c if c% of the transactions in T that contain X also contain Y.
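In standard notation, and consistently with the description above, these two measures can be written as:

support(X ⇒ Y) = |{ t ∈ T : X ∪ Y ⊆ t }| / |T|

confidence(X ⇒ Y) = |{ t ∈ T : X ∪ Y ⊆ t }| / |{ t ∈ T : X ⊆ t }|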
Given a set of documents D, the problem of min-
ing association rules is to discover all strong rules,
i.e., those whose support and confidence exceed the pre-defined minimum support (minsupport) and minimum confidence (minconfidence). Although a
number of algorithms have been proposed to improve
various aspects of association rule mining, Apriori
(Agrawal et al., 1993) remains the most commonly
used.
Since our aim is to discover the inter-relationships
between different visual words, we consider the fol-
lowing:
W denotes the set of items.
T denotes the set of transactions.
X and Y can be sets of one or more frequent
visual words that are within the same context.
Figure 3: Two sample images where 10 randomly chosen
visual words are represented in the local context for each
one. The square represents a local patch, which denotes one
of the visual words, and the circle around the center of the
patch denotes the local context for this visual word.
After mining all transactions and finding the association rules, all visual words that are located in the same context and involved in at least one strong association rule form the visual phrases.
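A compact sketch of the phrase mining step under these conventions is given below: one transaction is built per interest point from the visual words whose patch centers lie within a distance threshold, and strong rules between pairs of single words are kept (a simplified two-item Apriori pass; a full Apriori implementation could replace it). The thresholds and names are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def build_transactions(word_ids, centers, radius):
    """One transaction per interest point: the visual words whose patch
    centers lie within `radius` of that point (its local context)."""
    transactions = []
    for c in centers:
        near = np.linalg.norm(centers - c, axis=1) <= radius
        transactions.append(frozenset(word_ids[near]))
    return transactions

def mine_phrases(transactions, min_support=0.01, min_confidence=0.3):
    """Keep word pairs (a, b) involved in at least one strong rule a -> b or
    b -> a; each such co-located pair forms a visual phrase."""
    n_t = len(transactions)
    item_count = Counter(w for t in transactions for w in t)
    pair_count = Counter(p for t in transactions
                         for p in combinations(sorted(t), 2))
    phrases = set()
    for (a, b), c in pair_count.items():
        if c / n_t < min_support:
            continue
        if c / item_count[a] >= min_confidence or c / item_count[b] >= min_confidence:
            phrases.add((a, b))
    return phrases
```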
3 IMAGE SIMILARITY
MATCHING AND RETRIEVAL
Given the proposed image representation discussed
in Section 2, we describe here how documents are
matched, by estimating a similarity value from the 2-
faceted representation. The traditional Vector Space
Model of Information Retrieval (Salton et al., 1975)
is adapted to our representation, and used for simi-
larity matching and retrieval of images. The doublet
represents each image in the model:
d = (W_d, P_d)    (8)
where W_d and P_d are the vectors for the word and phrase representations of a document, respectively:

W_d = (w_{1,d}, ..., w_{n_w,d}),   P_d = (p_{1,d}, ..., p_{n_p,d})    (9)
Note that the vectors for each level of representa-
tion lie in a separate space. In the above vectors, each
component represents the weight of the correspond-
ing dimension. We used the spatial weight scheme
defined in Section 2.1 for the words, and the standard tf-idf weighting scheme for the phrases. We
have designed a simple measure that allows evaluat-
ing the contribution of words and phrases. The simi-
larity measure between a query q and a document d is
estimated with:
sim(q, d) = (1 − α) RSV(W_d, W_q) + α RSV(P_d, P_q)    (10)
The Retrieval Status Value (RSV) of two vectors is estimated with the cosine similarity. The non-negative parameter α is set across the experimental runs in order to evaluate each representation independently, as well as a combination of the two representations.
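The two-level matching of equation (10) could be sketched as follows, with cosine similarity as the RSV and α balancing the word and phrase vectors; the doublets are assumed to be built with the spatial weights (words) and tf-idf weights (phrases) described above, and the function names are ours.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity used as the Retrieval Status Value (RSV)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def similarity(query, doc, alpha=0.5):
    """Eq. (10): sim(q, d) = (1 - alpha) RSV(W_d, W_q) + alpha RSV(P_d, P_q).
    `query` and `doc` are (word_vector, phrase_vector) doublets."""
    wq, pq = query
    wd, pd = doc
    return (1 - alpha) * cosine(wd, wq) + alpha * cosine(pd, pq)

def retrieve(query, documents, alpha=0.5, top_k=20):
    """Rank the document doublets by similarity to the query doublet."""
    scores = [(similarity(query, d, alpha), idx) for idx, d in enumerate(documents)]
    return sorted(scores, reverse=True)[:top_k]
```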
4 EXPERIMENTS
This section describes the set of experiments we per-
formed to explore the performance of the proposed
methodology.
4.1 Data Set and Experimental Setup
The image data set used for these experiments is the
Caltech101 dataset (Fei-Fei et al., 2007). It con-
tains 8707 images, which include objects belonging
to 101 classes. For the various experiments, we con-
struct our test data set by randomly selecting 10 im-
ages from each class (1010 images). We randomly
select 30 images (different from the test dataset) from
each class to build the visual vocabulary (3030 im-
ages). We manually set the size of the vocabulary to k = 2750.
Firstly, we run experiments with a similarity
matching parameter α=0 in order to compare our spa-
tial weighting scheme with other approaches. Then,
we evaluate the respective contributions of words and
phrases by running the experiments several times with
different values of α. All our experiments have been
run on a 3GHz Intel Xeon machine with 3GB memory
running under Microsoft Windows XP.
4.2 Evaluation for the Spatial
Weighting Performance
We compare the proposed spatial weighting scheme
to two other approaches, the ’blobworld’ approach (Be-
longie et al., 1998) and the ’bag of visual words’
approach (Sivic and Zisserman, 2003). The ’blob-
world’ approach is a well known image representation
method, which simply represents images by the pa-
rameter sets of Gaussian mixture models. The bag-of-
visual-words approach is based on local image patch
extraction using SIFT-like region descriptors.

Figure 4: Comparison of image retrieval effectiveness (average precision per image class) between our spatial weighting scheme and the bag-of-visual-words and blobworld approaches. For a clear presentation, the 101 classes are arranged from left to right in ascending order of their average retrieval precision.

For each category, we compute the average precision for
the top 20 retrieved images. It is clear from the re-
sults displayed in Figure 4 that the spatial weight-
ing scheme generally outperforms the other two ap-
proaches. In categories in which the image content
is highly heterogeneous, exhibiting a lot of textures,
and thus being more complicated (such as brain or
watch images), our scheme outperforms the others.
In categories in which image content is relatively ho-
mogeneous (like human face or dolphin images), the
bag-of-visual-words approach performs as well as our
approach. We noticed that the blobworld approach
shows similar results to other approaches only when
the image colors are uniform.
4.3 Evaluation of the Contribution of
Words and Phrases
In the previous section, we demonstrated the good
performance of our spatial weighting approach. We are
now interested in combining visual phrase and visual
word approaches by varying the parameter α used in
the similarity matching approach. Figure 5 plots the
average precision for different values of α over all 101
classes.
Figure 5: Contribution of words and phrases (mean average precision as a function of the α value).

When considering only visual phrases in the similarity matching (α = 1), the mean average precision (MAP) is better than when only visual words (α = 0) are taken into consideration. However, the combination of both yields better results than using words or phrases separately.
The explanation is that some images that are not texture-rich, like human face, stop sign or umbrella pictures, yield only a small number of detected interest points. From this study, we conclude that visual phrases alone cannot capture all the similarity information between images; visual word similarity is still required.
5 CONCLUSIONS
A new spatial weighting technique has been devel-
oped which enhances the basic bag-of-visual-words
approach by using spatial relations. We also devised
methods to construct visual phrases based on the as-
sociation rule technique. Our experimental studies
showed that a combined use of words and phrases
could perform better than using them separately. It
also showed good performance when compared to
similar recent approaches.
In future work, we will further study the inter-relationships between different visual words in order to investigate higher representation levels and to further improve the discrimination power of the visual words.
REFERENCES
Agrawal, R., Imielinski, T., and Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Buneman, P. and Jajodia, S., editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 26-28, 1993. ACM Press.
Baeza-Yates, R. A. and Ribeiro-Neto, B. A. (1999). Modern
Information Retrieval. ACM Press / Addison-Wesley.
Bay, H., Ess, A., Tuytelaars, T., and Gool, L. J. V. (2008).
Speeded-up robust features (SURF). Computer Vision
and Image Understanding, 110(3):346–359.
Belongie, S., Carson, C., Greenspan, H., and Malik, J.
(1998). Color- and texture-based image segmentation
using the expectation-maximization algorithm and its
application to content-based image retrieval. In ICCV,
pages 675–682.
Belongie, S., Malik, J., and Puzicha, J. (2002). Shape
matching and object recognition using shape contexts.
IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509–
522.
Bilmes, J. A. (1997). A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models.
Canny, J. (1986). A computational approach to edge de-
tection. IEEE Trans. Pattern Anal. Mach. Intell.,
8(6):679–698.
Chen, X., Hu, X., and Shen, X. (2009). Spatial weighting
for bag-of-visual-words and its application in content-
based image retrieval. In PAKDD ’09, pages 867–874.
Fei-Fei, L., Fergus, R., and Perona, P. (2007). Learning gen-
erative visual models from few training examples: An
incremental bayesian approach tested on 101 object
categories. Comput. Vis. Image Underst., 106(1):59–
70.
Jurie, F. and Triggs, B. (2005). Creating efficient codebooks
for visual recognition. In ICCV, pages 604–610.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
Martinet, J. and Satoh, S. (2007). A study of intra-modal
association rules for visual modality representation. In
CBMI ’07.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector
space model for automatic indexing. Commun. ACM,
18(11):613–620.
Sivic, J. and Zisserman, A. (2003). Video Google: A text
retrieval approach to object matching in videos. In
ICCV, pages 1470–1477. IEEE Computer Society.
Viola, P. and Jones, M. (2001). Rapid object detection using
a boosted cascade of simple features. In CVPR 2001,
volume 1, pages I–511–I–518 vol.1.
Willamowski, J., Arregui, D., Csurka, G., Dance, C. R., and
Fan, L. (2004). Categorizing nine visual classes using
local appearance descriptors. In In ICPR Workshop on
Learning for Adaptable Visual Systems.