tion is significantly lower. In fact, when the classification uses 100% of the SIFT keypoints (no filtering), the average time for classifying a single test image is 7.2 seconds. When only 30% or 20% of the original SIFT keypoints are kept (VA-based filtering), the average classification time drops to 0.78 and 0.6 seconds per image, respectively. Even when the random filter and the VA-based filter reach the same accuracy, the saliency-based filter selects keypoints that lead to faster classification: when 40% of the original keypoints are kept, the average time needed to classify a single image is 1.07 seconds with the random filter and 0.97 seconds with the VA-based filter.
However, the experiments have also revealed a relevant limitation of filtering approaches based on bottom-up visual attention: many of the test images misclassified by the classifier contain salient regions that are radically different from those of the other images in the same category. For example, since many pictures show people in front of monuments, the visual attention filter tends to suppress (i.e., assign a low saliency to) the monument in the background and to preserve the people as the most salient areas.
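To make the comparison above concrete, the following is a minimal sketch of the two filtering strategies, assuming a bottom-up saliency map normalised to [0, 1] is already available (here produced by a hypothetical compute_saliency function standing in for the VA model); the helper names and the OpenCV-based pipeline are illustrative assumptions, not the exact system used in the experiments.

```python
import numpy as np
import cv2

def saliency_filter(keypoints, descriptors, saliency, threshold):
    # Keep only the SIFT keypoints whose location falls in a salient region
    # (saliency value at the keypoint position >= threshold).
    keep = np.array([saliency[int(round(kp.pt[1])), int(round(kp.pt[0]))] >= threshold
                     for kp in keypoints])
    return [kp for kp, k in zip(keypoints, keep) if k], descriptors[keep]

def random_filter(keypoints, descriptors, fraction, seed=0):
    # Baseline: keep a random fraction of the keypoints, ignoring saliency.
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(len(keypoints), int(fraction * len(keypoints)), replace=False))
    return [keypoints[i] for i in idx], descriptors[idx]

# Usage sketch on a single test image.
img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
kps, descs = cv2.SIFT_create().detectAndCompute(img, None)
saliency = compute_saliency(img)              # hypothetical VA model, values in [0, 1]
va_kps, va_descs = saliency_filter(kps, descs, saliency, threshold=0.5)
rnd_kps, rnd_descs = random_filter(kps, descs, fraction=len(va_kps) / len(kps))
```

With the same number of keypoints kept, only the classification time spent on matching the retained descriptors differs between the two strategies, which is the comparison reported above.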
STIM-DATASET. In the case of the STIM-
DATASET, the saliency maps were thresholded at values ranging from 0.1 to 0.9 times the maximum value in the map. The percentage of SIFT keypoints kept and used by the classifier ranges, on average, from 11% to 77% of the keypoints originally extracted from the images. In this dataset, the relevant objects are well separated from the background in almost every image. Furthermore, since they never fill the entire frame, their features are not considered too 'common' and are not suppressed by the attentional mechanism. The graph in Fig. 1 shows that VA-based filtering both improves the accuracy and decreases the time needed for classification. When only half of the keypoints are kept, as selected by the VA model, the classifier reaches an accuracy of 81%, noticeably higher than the 77% obtained with 100% of the original keypoints and the 74% obtained with 90% of the keypoints selected at random.
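Under the same assumptions as the sketch above, the threshold sweep used for this dataset can be illustrated as follows: the saliency map is thresholded at a fraction (0.1 to 0.9) of its maximum value and the share of surviving keypoints is measured. The kept_fraction helper is a hypothetical name; the reported percentages (11% to 77% on average) come from the experiments, not from this sketch.

```python
import numpy as np

def kept_fraction(keypoints, saliency, rel_threshold):
    # Fraction of keypoints surviving a threshold set at
    # rel_threshold * max(saliency), as in the STIM-DATASET experiments.
    thr = rel_threshold * saliency.max()
    kept = sum(saliency[int(round(kp.pt[1])), int(round(kp.pt[0]))] >= thr
               for kp in keypoints)
    return kept / max(len(keypoints), 1)

# Sweep the relative thresholds used in the experiments (0.1 to 0.9).
for t in np.arange(0.1, 1.0, 0.1):
    frac = kept_fraction(kps, saliency, t)    # kps, saliency from the sketch above
    print(f"threshold {t:.1f}: {100 * frac:.0f}% of keypoints kept")
```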
6 CONCLUSIONS
In this paper we have presented a filtering approach
based on a visual attention model that can be used to
improve the performance of CBIR systems and ob-
ject recognition algorithms. The model uses a richer image representation than other common and well-known models, yet it can process a single image in a short time thanks to the approximations adopted in its various processing steps. The results show that a VA-based filtering approach makes it possible to reach higher accuracy on object recognition tasks in which the objects stand out clearly from the background, as in the STIM-DATASET. The results on the PISA-DATASET are also encouraging, since a faster response in the classification step is obtained with only a minor decrease in accuracy. However, the results need a deeper inspection in order to better understand the behaviour of the model on cluttered scenes, where the object (or landmark) to be detected does not correspond to the most salient image areas and usually fills the entire frame.