Unsupervised and Transfer Learning under Uncertainty - From Object Detections to Scene Categorization

Grégoire Mesnil, Salah Rifai, Antoine Bordes, Xavier Glorot, Yoshua Bengio, Pascal Vincent


Classifying scenes (e.g. into “street”, “home” or “leisure”) is an important but complicated task, because images exhibit variability, ambiguity, and a wide range of illumination and scale conditions. Standard approaches build an intermediate representation of the global image and learn classifiers on it. Recently, it has been proposed to depict an image as an aggregation of the objects it contains: the representation on which classifiers are trained is composed of many heterogeneous feature vectors derived from various object detectors. In this paper, we study different approaches to efficiently combine the data extracted by these detectors. We use the features provided by Object-Bank (Li-Jia Li and Fei-Fei, 2010a) (177 different object detectors producing 252 attributes each), and show on several benchmarks for scene categorization that careful combinations, which take into account the structure of the data, greatly improve over the original results (from +5% to +11%) while drastically reducing the dimensionality of the representation by 97% (from 44,604 to 1,000).
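
The dimensionality figures quoted in the abstract can be checked with a few lines of Python. This is only an illustrative sketch: the random vector stands in for real Object-Bank features, and the detector-major layout used in the reshape is an assumption, not something the abstract specifies.

```python
import numpy as np

# Figures taken from the abstract.
N_DETECTORS = 177     # object detectors in Object-Bank
N_ATTRIBUTES = 252    # attributes produced by each detector
TARGET_DIM = 1000     # size of the reduced representation

# Full representation size and relative reduction.
full_dim = N_DETECTORS * N_ATTRIBUTES
reduction = 1.0 - TARGET_DIM / full_dim

# Hypothetical stand-in for one image's feature vector (random data).
rng = np.random.default_rng(0)
flat = rng.standard_normal(full_dim)

# Assumed detector-major layout: one row of attributes per detector.
per_detector = flat.reshape(N_DETECTORS, N_ATTRIBUTES)

print(full_dim)            # 44604
print(per_detector.shape)  # (177, 252)
print(round(reduction, 3))
```

Viewing the flat vector as a 177 x 252 array makes the per-detector structure explicit, which is the kind of structure the abstract says a careful combination should exploit. The computed reduction is slightly above the 97% quoted in the abstract.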


  1. Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53-58.
  2. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127. Also published as a book. Now Publishers, 2009.
  3. Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Adv. Neural Inf. Proc. Sys. 19, pages 153-160.
  4. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
  5. Bosch, A., Zisserman, A., and Muñoz, X. (2006). Scene classification via pLSA. In Proc. ECCV, pages 517-530.
  6. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  7. Espinace, P., Kollar, T., Soto, A., and Roy, N. (2010). Indoor scene recognition through object detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK.
  8. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874.
  9. Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009). Describing objects by their attributes. IEEE Conference on Computer Vision and Pattern Recognition, pages 1778-1785.
  10. Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Volume 2, pages 524-531. IEEE Computer Society.
  11. Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. CVPR.
  12. Gao, S., Tsang, I., Chia, L., and Zhao, P. (2010). Local features are not lonely: Laplacian sparse coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition.
  13. Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS'09, pages 646-654.
  14. Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.
  15. Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn., 42:177-196.
  16. Hoiem, D., Efros, A., and Hebert, M. (2005). Automatic photo pop-up. SIGGRAPH, 24(3):577-584.
  17. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 498-520.
  18. Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09), pages 2146-2153. IEEE.
  19. Kavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proc. CVPR'09, pages 1605-1612. IEEE.
  20. Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. JMLR, 10:1-40.
  21. Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition.
  22. LeCun, Y., Haffner, P., Bottou, L., and Bengio, Y. (1999). Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319-345. Springer.
  23. Li, L.-J. and Fei-Fei, L. (2007). What, where and who? classifying events by scene and object recognition. ICCV.
  24. Li-Jia Li, Hao Su, E. P. X. and Fei-Fei, L. (2010a). Object bank: A high-level image representation for scene classification and semantic feature sparsification. Proceedings of the Neural Information Processing Systems (NIPS).
  25. Li-Jia Li, Hao Su, Y. L. and Fei-Fei, L. (2010b). Objects as attributes for scene classification. In European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece.
  26. Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2012). Unsupervised and transfer learning challenge: a deep learning approach. In Guyon, I., Dror, G., Lemaire, V., Taylor, G., and Silver, D., editors, JMLR W&CP: Proceedings of the Unsupervised and Transfer Learning challenge and workshop, volume 27, pages 97-110.
  27. Oliva, A. and Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Visual Perception, Progress in Brain Research, 155.
  28. Pandey, M. and Lazebnik, S. (2011). Scene recognition and weakly supervised object localization with deformable part-based models. ICCV.
  29. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559-572.
  30. Quattoni, A. and Torralba, A. (2009). Recognizing indoor scenes. CVPR.
  31. Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06.
  32. Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011a). Higher order contractive auto-encoder. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).
  33. Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011b). Contracting auto-encoders: Explicit invariance during feature extraction. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML'11).
  34. Russell, B. C., Torralba, A., Murphy, K. P., and Freeman, W. T. (2008). Labelme: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77:157-173.
  35. Serre, T., Wolf, L., and Poggio, T. (2005). Object recognition with features inspired by visual cortex. IEEE Conference on Computer Vision and Pattern Recognition.
  36. Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., and Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22:1349-1380.
  37. Torralba, A. (2003). Contextual priming for object detection. International Journal of Computer Vision, 53(2):169-191.
  38. Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Cohen, W. W., McCallum, A., and Roweis, S. T., editors, ICML'08, pages 1096-1103. ACM.
  39. Vogel, J. and Schiele, B. (2004). Natural scene retrieval based on a semantic modeling step. In Proceedings of the International Conference on Image and Video Retrieval CIVR 2004, Dublin, Ireland, LNCS, volume 3115.
  40. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485-3492. IEEE.

Paper Citation

in Harvard Style

Mesnil G., Rifai S., Bordes A., Glorot X., Bengio Y. and Vincent P. (2013). Unsupervised and Transfer Learning under Uncertainty - From Object Detections to Scene Categorization. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8565-41-9, pages 345-354. DOI: 10.5220/0004227803450354

in Bibtex Style

author={Grégoire Mesnil and Salah Rifai and Antoine Bordes and Xavier Glorot and Yoshua Bengio and Pascal Vincent},
title={Unsupervised and Transfer Learning under Uncertainty - From Object Detections to Scene Categorization},
booktitle={Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},

in EndNote Style

JO - Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Unsupervised and Transfer Learning under Uncertainty - From Object Detections to Scene Categorization
SN - 978-989-8565-41-9
AU - Mesnil G.
AU - Rifai S.
AU - Bordes A.
AU - Glorot X.
AU - Bengio Y.
AU - Vincent P.
PY - 2013
SP - 345
EP - 354
DO - 10.5220/0004227803450354