Using Visual Attention in a CBIR System
Experimental Results on Landmark and Object Recognition Tasks
Franco Alberto Cardillo, Giuseppe Amato and Fabrizio Falchi
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
Keywords:
Biologically-Inspired Vision, Visual Attention, Visual Saliency, Landmark Recognition, Object Recognition,
Content-based Image Retrieval.
Abstract:
Many novel applications in the field of object recognition and pose estimation have been built relying on
local invariant features extracted from keypoints that lie on high-contrast regions of the images. The visual
saliency of those regions is not considered by state-of-the-art detection algorithms, which assume the user
is interested in the whole image. In this paper we present the experimental results of the application of a
biologically-inspired model of visual attention to the problem of local feature selection in landmark and object
recognition tasks. The results show that the approach improves the accuracy of the classifier in the object
recognition task and preserves a good accuracy in the landmark recognition task.
1 INTRODUCTION
Given an image as query, a Content-Based Image Re-
trieval (CBIR) system returns a list of images ranked
according to their visual similarity with the query im-
age. The ranking is based on a comparison between
the visual features extracted from the query image and
those extracted from the images stored in the index. Many CBIR sys-
tems support general visual similarity searches based
on global features, such as color and edge histograms.
The adoption of descriptions based on local features
(e.g., SIFT and SURF) provided multimedia informa-
tion systems with the possibility to build applications
able to exploit local image similarities. The number
of local visual features extracted from cluttered, real-
world images is usually in the order of thousands.
When the number is ‘too’ large, the overall perfor-
mance of the CBIR system may decline, and, if too
many features are extracted from irrelevant regions,
the matching accuracy may suffer as well. The reduction of
the number of visual features used in the image de-
scriptions can thus be considered a central point in
reaching a good overall performance in a CBIR sys-
tem. In this work we present an approach concerning
the application of a biologically-inspired visual atten-
tion model for filtering out a subset of the features ex-
tracted from an image. The basic assumption of our
experimental work is that the user selects the query
image according to its most salient areas. In order to
assess quantitatively the performance of the approach,
we tested it on a landmark recognition task and an
object recognition task using two publicly available
datasets. The results show that the feature filtering
based on the image saliency is able to drastically re-
duce the number of keypoints with an improvement
or only a slight decrease in the accuracy of the clas-
sifier in, respectively, the object recognition task and
the landmark recognition task.
2 PREVIOUS WORKS
(Marques et al., 2007) propose a segmentation
method that exploits visual attention in selecting re-
gions of interest in a CBIR dataset to be used in the
similarity computation. The salient regions are used
to segment the image with a region growing approach.
They compute the visual saliency using the model
in (Itti et al., 1998), while another visual attention
model, Stentiford’s, is used to guide the segmentation
step. They evaluated their method on a
dataset containing 110 images of road signs, red soda
cans, and emergency triangles. Since that dataset is
well known and has been used in other published experiments,
we also used it to test our filtering approach.
(Gao and Yang, 2011) propose a method for filter-
ing SIFT keypoints using saliency maps. The authors
use two different algorithms for computing the im-
age saliency, the Itti-Koch model (for local-contrast
analysis) and a frequency-based method (for global-
contrast analysis) (Hou and Zhang, 2007). The fi-
nal saliency, corresponding to the sum of the saliency
maps computed by the two methods, is used to start
a segmentation algorithm based on fuzzy growing.
They evaluated their method on a dataset composed
of 10 classes with more than 10 images per
class, extracted from the ALOI image dataset and the
Caltech 256 photo gallery. The authors show that
their method has a precision that is lower than stan-
dard SIFT and comparable to PCA-SIFT, but that it is
also much faster, making it suitable for use in CBIR
systems.
3 BIOLOGICAL INSPIRATION
When we open our eyes we see a colorful and mean-
ingful three-dimensional world surrounding us. Such a
visual experience results from a sequence of trans-
formations performed on the light stimuli that starts
in our eyes. The light is focused on the retinal sur-
face, then processed and transferred to our thalamus,
and finally routed to the cerebral cortex. Earlier com-
putational steps extract basic and non-structured fea-
tures, while later steps are able to compute complex
features, such as lines at various orientations or
different color layouts computed using center-surround
receptive fields. However, the amount of in-
formation contained in the patterns of neural activity
is still too large for our brain to process in a reason-
able amount of time. Evolution has thus endowed hu-
mans with a series of attentional filters able to reduce
the incoming information.
Several psychological theories have been pro-
posed to explain how a unitary visual perception can
arise from distinct computational flows filtered by vi-
sual attention. In the “Feature Integration Theory”
(FIT), the parallel, preattentive processes build an im-
age representation with respect to a single feature and
encode the information in feature maps (color, ori-
entation, spatial frequency, . . . ). The maps are then
combined and their peaks of activity are used to
choose the areas to inspect. One of the most influen-
tial and detailed models was proposed in (Koch and
Ullman, 1985). The computational model used in this
work is an extension of the model in (Itti et al., 1998),
built upon the theory of (Koch and Ullman, 1985).
4 THE MODEL
The input image is first encoded in the Lab color
model. The raw L, a, and b values are then used
to build the color channels $I_I$, $I_R$, $I_G$, $I_B$, and $I_Y$,
which correspond, respectively, to intensity, red,
green, blue, and yellow. Each channel is encoded
in an image pyramid, which allows the model to
perform a multiresolution analysis of the input image.
The channels are merged into Feature Contrast Maps (FCMs)
using a center-surround representation. The model uses
single opponent channels, meaning that it builds a pyramid
for each opponent pair: $(R, G)$, $(G, R)$, $(B, Y)$, $(Y, B)$,
$(I_{on}, I_{off})$, $(I_{off}, I_{on})$, where $(f, f')$ denotes a map
encoding a center-surround difference with feature $f$ in the
center and feature $f'$ in the surround. For example, the map
$BY$, obtained from $I_B$ and $I_Y$, is computed as follows:

$BY = \left[ (B \ast G_0) - \max(Y \ast G_1, Y \ast G_2) \right]^{+}$,

where $\ast$ denotes convolution, $G_0$, $G_1$, and $G_2$ are
Gaussian kernels with different sizes ($G_0$ corresponds to the
central part of the receptive field, $G_1$ and $G_2$ to the surround),
and the function $[\cdot]^{+}$ sets negative values to zero. The FCMs are
then merged into Feature Maps (FMs) that encode
the strength of each pair of opponent features. For
example, the maps $(R, G)$ and $(G, R)$ lead to the FM
$RG$. The two FCMs are normalized in order to re-
duce the values in maps with diffuse activity and
enhance the values in maps with only a few small
spots of activity. The normalization is based on the
values of Summed Area Tables indicating how the
activity is spread over the map. Local orientation
maps are computed on the intensity pyramid by
convolving the intensity image in each layer with
a set of oriented Gabor filters at four orientations
$\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$. The filters used in the
model implementation are expressed as in (Daugman,
1985), with biologically plausible parameter values
(Serre et al., 2007). The two color FMs and the
intensity FM are merged together using the same nor-
malization strategy described before. The four ori-
entation maps are merged into a single FM with the
same normalization method. All of the maps are then
normalized again using the method above and merged
together by simply computing the pixel-wise maxi-
mum among the maps. For each level of the pyra-
mids, the maps are merged into Level Saliency Maps
(LSMs). The final saliency map is obtained at the
lowest resolution level by taking the maximum value
in the corresponding area in all the LSMs.
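To make the center-surround computation concrete, the following minimal sketch shows how a single opponent FCM, such as the $BY$ map above, and the pixel-wise maximum merge could be implemented with NumPy and SciPy. It is only an illustration of the formula under stated assumptions: the Gaussian widths, the function names, and the omission of the normalization step are choices made for the example, not the actual parameters or code of the model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def opponent_fcm(center, surround, sigma_c=1.0, sigma_s=(4.0, 8.0)):
    """Center-surround Feature Contrast Map for one opponent pair, e.g. (B, Y).

    Implements BY = [(B * G0) - max(Y * G1, Y * G2)]^+ with Gaussian kernels;
    the sigma values are illustrative, not those used in the paper.
    """
    c = gaussian_filter(center, sigma_c)                     # B * G0 (center)
    s = np.maximum(gaussian_filter(surround, sigma_s[0]),    # max(Y * G1,
                   gaussian_filter(surround, sigma_s[1]))    #     Y * G2) (surround)
    return np.maximum(c - s, 0.0)                            # [.]^+ rectification

def merge_max(maps):
    """Merge (already normalized) maps by pixel-wise maximum, as the model does."""
    return np.maximum.reduce(list(maps))

# Example usage on one pyramid level, with 2-D float arrays `blue` and `yellow`:
#   by_map = opponent_fcm(blue, yellow)
#   yb_map = opponent_fcm(yellow, blue)
```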
5 EXPERIMENTS
We tested the proposed VA-based filtering approach
on a landmark recognition task and an object recognition
task using two different datasets. The first
one is the publicly available dataset containing 1227
UsingVisualAttentioninaCBIRSystem-ExperimentalResultsonLandmarkandObjectRecognitionTasks
469
Figure 1: Accuracy on the two datasets. Left: PISA-DATASET; Right: STIM-DATASET. Solid line: filtering based on
the saliency map; dashed line: random filtering. The maximum accuracy obtained without applying any filter is shown by the
horizontal dotted line.
photos of 12 landmarks (object classes) located in
Pisa (also used in (Amato et al., 2011; Amato and
Falchi, 2011)), hereafter named PISA-DATASET.
The dataset is divided into a training set (Tr) consisting
of 226 photos (20% of the dataset) and a test set (Te)
consisting of 921 photos (80% of the dataset). The
second dataset contains 258 photos belonging to
three classes (cans, road signs, and emergency trian-
gles), hereafter named STIM-DATASET. The dataset
is similarly split into a training and a test set con-
taining, respectively, 206 and 52 photos. The experi-
ments were conducted using the Scale Invariant Fea-
ture Transform (SIFT) (Lowe, 2004) algorithm
that represents the visual content of an image using
scale-invariant local features extracted from regions
around selected keypoints. Such keypoints usually lie
on high-contrast regions of the image, such as ob-
ject edges. Image matching is performed by com-
paring the description of the keypoints in two images
searching for matching pairs. The candidate pairs for
matches are verified to be consistent with a geomet-
ric transformation (e.g., affine or homography) using
the RANSAC algorithm (Fischler and Bolles, 1981).
The percentage of verified matches is used to decide
whether or not the two images contain the very same
rigid object. The number of local features in the de-
scription of the images is typically in the order of
thousands. This results in efficiency issues when com-
paring the content of two images described with
SIFT descriptors. For this reason we applied a fil-
tering strategy selecting only the SIFT keypoints ex-
tracted from regions with a high saliency. Each image
in the dataset was processed by the VA model produc-
ing a saliency map. Since the resolution of the final
saliency map is low, each saliency map was resized
to the dimensions of the input image.
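As an illustration of the filtering step just described, the following sketch keeps only the SIFT keypoints that fall on sufficiently salient pixels. It assumes OpenCV's SIFT implementation, a saliency map rescaled to [0, 1], and bilinear resizing; the function name, the variable names, and the default threshold are choices made for the example, not the original implementation.

```python
import cv2
import numpy as np

def filter_keypoints_by_saliency(image_gray, saliency_map, threshold=0.4):
    """Detect SIFT keypoints and keep only those lying on salient regions.

    The low-resolution saliency map is resized to the image dimensions,
    normalized to [0, 1], and thresholded at a fraction of its maximum,
    mirroring the procedure described in the text.
    """
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image_gray, None)
    if descriptors is None:
        return [], None

    # Resize the (low-resolution) saliency map to the input image size.
    h, w = image_gray.shape[:2]
    saliency = cv2.resize(saliency_map.astype(np.float32), (w, h))
    saliency /= saliency.max()

    kept_kp, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        x = min(int(round(kp.pt[0])), w - 1)
        y = min(int(round(kp.pt[1])), h - 1)
        if saliency[y, x] >= threshold:      # keep keypoints on salient pixels
            kept_kp.append(kp)
            kept_desc.append(desc)
    return kept_kp, np.array(kept_desc)
```

Sweeping the threshold over a range of fractions of the maximum saliency then yields the different percentages of kept keypoints discussed in the following paragraphs.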
PISA-DATASET. In order to study how many SIFT
keypoints could be filtered out of the index, we ap-
plied several thresholds to the saliency levels stored
in the saliency maps. The thresholds range from 0.3
to 0.7 times the maximum saliency value (normalized to 1).
The 0.3 threshold did not modify any of the
saliency maps, meaning that all of the saliency maps
had values larger than 0.3. SIFT keypoints were fil-
tered out only when they corresponded to points in
the saliency map with a value below the given thresh-
old. In order to see how effective the filtering by the
VA model was, we compared it against random fil-
tering. In this second case, we kept from 10% to
90% of the original SIFT keypoints by incrementally remov-
ing keypoints chosen randomly. We used accuracy in
assigning the correct landmark to the test images (in
the previously mentioned dataset) as the measure of
performance. For each test image, the best candidate
match among the training images is selected using
the SIFT description and verifying the matches for an
affine transformation using the RANSAC algorithm.
The results of the experimentation are shown in fig-
ure 1. The x-axis shows the percentage of SIFT key-
points kept after filtering. The y-axis corresponds to
the accuracy reached by the classifier after the filter-
ing. The maximum accuracy is reached by not remov-
ing any keypoint and is equal to 0.935. The accuracy
does not vary much until about 40% of the keypoints are removed, when it starts
decreasing. When all the saliency values are used,
the filtering performed using the visual saliency maps
reaches a 0.89 accuracy when it removes almost 57%
of the original keypoints. The performance of the
VA-based filter is very similar to the random-based
one when 30% of the keypoints are kept. However, when
the percentage of removed keypoints increases, the
VA-based filtering algorithm outperforms the random
filtering. The results of the model with aggressive fil-
tering levels are quite encouraging. The model is in
fact able to preserve regions that are relevant for the
recognition of the specific object. There is a decrease
in the overall accuracy with respect to the unfiltered SIFT clas-
sifier, but the time needed to perform the classifica-
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
470
tion is significantly lower. In fact, when the classifi-
cation uses 100% of the SIFT keypoints (no filtering),
the average time for classifying a single test image
is 7.2 seconds. When we use only 30% or 20% of
the original SIFT keypoints (VA-based filtering) the
time needed for the classification of an image is, re-
spectively, 0.78 and 0.6 seconds per image on aver-
age. Even when the random filter and the VA-based
filter have the same accuracy, the saliency-based filter
provides better keypoints. When only 40% of the
original keypoints are kept, the average time
needed to classify a single image is 1.07 and 0.97 sec-
onds for, respectively, images preprocessed using the
random filter and the VA-based filter. However, the
experimentation has also shown a relevant limitation
of filtering approaches based on bottom-up visual at-
tention. In fact, many test images misclassified by the
classifier contain salient regions that are radically dif-
ferent from those of the other images in the same category. For
example, since many pictures contain people in front
of monuments, the visual attention filter is prone to
remove (i.e., assign a low saliency to) the monument
in the background and preserve the people as the most
salient areas.
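For reference, the match-and-verify classification step described above (selecting the best candidate among the training images and checking the matches for geometric consistency with RANSAC) could be sketched as follows. The sketch assumes OpenCV's brute-force matcher with Lowe's ratio test and cv2.estimateAffine2D for the affine check; the ratio and the RANSAC reprojection threshold are illustrative values, not those used in these experiments.

```python
import cv2
import numpy as np

def verified_match_fraction(kp_q, desc_q, kp_t, desc_t,
                            ratio=0.75, ransac_thresh=3.0):
    """Fraction of ratio-test matches consistent with an affine transformation."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(desc_q, desc_t, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if len(good) < 3:                  # at least 3 points are needed for an affine fit
        return 0.0
    src = np.float32([kp_q[m.queryIdx].pt for m in good])
    dst = np.float32([kp_t[m.trainIdx].pt for m in good])
    _, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=ransac_thresh)
    return 0.0 if inliers is None else float(inliers.sum()) / len(good)

# The test image is assigned the class of the training image that maximizes
# verified_match_fraction over the (filtered) keypoint descriptions.
```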
STIM-DATASET. In the case of the STIM-
DATASET the saliency maps were thresholded using
values ranging from 0.1 to 0.9 times the maximum value
in the map. The percentage of SIFT keypoints kept
and used by the classifier ranges from 11% to 77%
(on average) of the number of keypoints originally ex-
tracted from images. In this dataset, the relevant ob-
jects are well separated from the background in almost
every image. Furthermore, since they never fill the en-
tire frame, their features are not considered too ‘com-
mon’ and are not suppressed by the attentional mech-
anism. From the graph shown in Fig. 1 it is clear that
the VA-based filtering is able both to improve the ac-
curacy and to decrease the time needed for the classi-
fication. By using only half the keypoints selected
by the VA model, the classifier reaches 81% accu-
racy, which is much greater than those obtained us-
ing 100% of the original keypoints or 90% randomly
selected, that are equal to, respectively, 0.77 and 0.74.
6 CONCLUSIONS
In this paper we have presented a filtering approach
based on a visual attention model that can be used to
improve the performance of CBIR systems and ob-
ject recognition algorithms. The model uses a richer
image representation than other common and well-
known models and is able to process a single im-
age in a short time thanks to many approximations
used in various processing steps. The results show
that a VA-based filtering approach makes it possible to reach a
better accuracy on object recognition tasks where the
objects stand out clearly from the background, like
in the STIM-DATASET. The results on the PISA-
DATASET are encouraging since a faster response
in the classification step is obtained with only a mi-
nor decrease in accuracy. However, the results need
a deeper inspection in order to gain a better under-
standing of the model on cluttered scenes where the
object (or landmark) to be detected does not corre-
spond to the most salient image areas and usually fills
the frame.
REFERENCES
Amato, G. and Falchi, F. (2011). Local feature based image
similarity functions for kNN classification. In Proc. of
the 3rd Int’l Conf. on Agents and Artificial Intelligence
(ICAART 2011), pages 157–166. SciTePress. Vol. 1.
Amato, G., Falchi, F., and Gennaro, C. (2011). Geometric
consistency checks for kNN based image classification
relying on local features. In SISAP ’11: 4th Int’l Conf.
on Similarity Search and Applications, pages 81–88.
ACM.
Daugman, J. (1985). Uncertainty relations for resolution in
space, spatial frequency, and orientation optimized by
two-dimensional visual cortical filters. Journal of the
Optical Society of America A, 2:1160–1169.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: A paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Commun. ACM, 24(6):381–395.
Gao, H.-p. and Yang, Z.-q. (2011). Integrated visual
saliency based local feature selection for image re-
trieval. In Intelligence Information Processing and
Trusted Computing (IPTC), 2011 2nd International
Symposium on, pages 47–50.
Hou, X. and Zhang, L. (2007). Saliency detection: A spec-
tral residual approach. In Computer Vision and Pat-
tern Recognition, 2007. CVPR ’07. IEEE Conference
on, pages 1–8.
Itti, L., Koch, C., and Niebur, E. (1998). A model of
saliency-based visual attention for rapid scene anal-
ysis. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 20(11):1254–1259.
Koch, C. and Ullman, S. (1985). Shifts in selective visual
attention: towards the underlying neural circuitry. Hu-
man Neurobiology, 4:219–227.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
Marques, O., Mayron, L. M., Borba, G. B., and Gamba,
H. R. (2007). An attention-driven model for group-
ing similar images with image retrieval applications.
EURASIP J. Appl. Signal Process., 2007(1):116–116.
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Pog-
gio, T. (2007). Robust object recognition with cortex-
like mechanisms. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 29(3):411–426.
UsingVisualAttentioninaCBIRSystem-ExperimentalResultsonLandmarkandObjectRecognitionTasks
471