Trained 3D Models for CNN based Object Recognition

Kripasindhu Sarkar, Kiran Varanasi, Didier Stricker


We present a method for 3D object recognition in 2D images that uses 3D models as the only source of training data. Our method is particularly useful when a 3D CAD object or a scan needs to be identified in a catalogue from a given query image, where it significantly cuts down the overhead of manual labeling. We take virtual snapshots of the available 3D models with a computer graphics pipeline and fine-tune existing pretrained CNN models for our object categories. Experiments show that our method achieves higher recognition recall than the existing local-feature-based recognition system.
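
The virtual-snapshot step can be illustrated with a minimal sketch. It only plans the snapshots: it lists the (label, camera position) pairs a renderer would need, so that every rendered image is labeled automatically by the model it came from. The abstract does not say how viewpoints are chosen, so the Fibonacci-sphere sampling, the function names `sample_viewpoints` and `build_training_list`, the `radius` parameter, and the example labels are all assumptions made for illustration, not the authors' pipeline. Rendering and CNN fine-tuning are left out.

```python
import math


def sample_viewpoints(n_views, radius=2.0):
    """Return n_views camera positions spread roughly evenly on a sphere.

    Assumption: a Fibonacci (golden-angle) lattice is used here only as a
    convenient way to cover the view sphere; the paper's abstract does not
    specify a sampling scheme.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = []
    for i in range(n_views):
        # z moves from near +1 to near -1 in equal steps.
        z = 1.0 - (2.0 * i + 1.0) / n_views
        # Radius of the horizontal ring at height z on the unit sphere.
        ring = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        positions.append((
            radius * ring * math.cos(theta),
            radius * ring * math.sin(theta),
            radius * z,
        ))
    return positions


def build_training_list(model_labels, n_views, radius=2.0):
    """Pair every model label with every viewpoint.

    Each (label, camera_position) pair stands for one snapshot to render;
    the label comes from the 3D model itself, so no manual labeling is needed.
    """
    views = sample_viewpoints(n_views, radius)
    return [(label, cam) for label in model_labels for cam in views]


# Hypothetical catalogue with two object categories.
samples = build_training_list(["chair", "mug"], n_views=12)
```

Each entry of `samples` would be handed to a renderer that looks at the model's center from the given position, and the resulting labeled images would form the fine-tuning set for a pretrained CNN.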


Paper Citation

in Harvard Style

Sarkar K., Varanasi K. and Stricker D. (2017). Trained 3D Models for CNN based Object Recognition. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017) ISBN 978-989-758-226-4, pages 130-137. DOI: 10.5220/0006272901300137

in Bibtex Style

author={Kripasindhu Sarkar and Kiran Varanasi and Didier Stricker},
title={Trained 3D Models for CNN based Object Recognition},
booktitle={Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)},

in EndNote Style

JO - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)
TI - Trained 3D Models for CNN based Object Recognition
SN - 978-989-758-226-4
AU - Sarkar K.
AU - Varanasi K.
AU - Stricker D.
PY - 2017
SP - 130
EP - 137
DO - 10.5220/0006272901300137