multi-class scenario and learned a dictionary for each
object class. Then, we used local descriptors encoded
with the learned atoms to guide the pooling stage: we
designed a pooling operator making use of weights
directly obtained from the coded descriptors.
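As an illustration only, the idea of pooling with weights derived from the coded descriptors can be sketched in a few lines of Python. This is not the operator defined in the paper: the weighting rule (each descriptor's share of the total absolute activation) and the names `weighted_pool` and `image_feature` are assumptions made for this example, and the codes are assumed to be already computed against each class dictionary.

```python
def weighted_pool(codes):
    """Pool N coded descriptors (each a list of K coefficients) into one K-vector.

    Assumption for this sketch: a descriptor's weight is its share of the
    total absolute activation over all descriptors in the image region.
    """
    if not codes:
        raise ValueError("need at least one coded descriptor")
    k = len(codes[0])
    # Raw weight of each descriptor: l1 norm of its code.
    raw = [sum(abs(c) for c in code) for code in codes]
    total = sum(raw)
    if total == 0:
        # No atom was activated by any descriptor.
        return [0.0] * k
    weights = [r / total for r in raw]
    pooled = [0.0] * k
    for w, code in zip(weights, codes):
        for j, c in enumerate(code):
            pooled[j] += w * abs(c)
    return pooled


def image_feature(codes_per_class):
    """Concatenate the pooled vectors obtained with each class dictionary."""
    feature = []
    for codes in codes_per_class:
        feature.extend(weighted_pool(codes))
    return feature


# Two descriptors coded over a hypothetical 2-atom dictionary.
example = weighted_pool([[1.0, 0.0], [0.0, 3.0]])
```

In this toy input the second descriptor has the larger activation, so it contributes more to the pooled vector; a final feature built with `image_feature` has one block per object class and can be passed to a linear classifier.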
We performed an extensive evaluation of the method
on both single-instance object recognition and object
categorization problems, and stress-tested the proposed
representation in a classical image retrieval scenario,
using the popular Caltech 101 dataset, as well as on a
typical robot vision task, with data acquired by the
iCub humanoid robot. The results favor our approach,
showing that the proposed dictionary-based pooling
strategy outperforms previous approaches. Our method is
also computationally efficient, thanks to the compactness
of the description and its usability with linear kernels.
ACKNOWLEDGEMENTS
This work was supported by the European FP7 ICT
project No. 270490 (EFAA), project No. 270273
(Xperience) and project No. 288382 (Poeticon++).
VISAPP 2014 - International Conference on Computer Vision Theory and Applications