Table 1: Comparison of the models proposed in this work to SoTA works (accuracy in %). Red and blue indicate the best and second-best performance, respectively.

Method                                 RGB          Depth        RGB-D
Nonlinear SVM (Lai et al., 2011)       74.5 ± 3.1   64.7 ± 2.2   83.9 ± 3.5
CNN-RNN (Socher et al., 2012)          80.8 ± 4.2   78.9 ± 3.8   86.8 ± 3.3
FusionNet (Eitel et al., 2015)         84.1 ± 2.7   83.8 ± 2.7   91.3 ± 1.4
CNN+Fisher (Li et al., 2015)           90.8 ± 1.6   81.8 ± 2.4   93.8 ± 0.9
DepthNet (Carlucci et al., 2016)       88.4 ± 1.8   83.8 ± 2.0   92.2 ± 1.3
CIMDL (Wang et al., 2016)              87.3 ± 1.6   84.2 ± 1.7   92.4 ± 1.8
DCNN-GPC (Sun et al., 2017)            88.4 ± 2.1   80.3 ± 2.7   91.8 ± 1.1
STEM-CaRFs (Asif et al., 2017)         88.8 ± 2.0   80.8 ± 2.1   92.2 ± 1.3
This work                              89.5 ± 1.9   84.5 ± 2.9   93.5 ± 1.1
6 CONCLUSION
The FusionNet model for object recognition proposed by Eitel et al. (2015) showed promising results using a two-stream CNN architecture based on the 8-layer CaffeNet and a simple Jet color-map encoding of the depth values. In this work, we have shown that the FusionNet model can be improved by encoding the depth values as colorized surface normals and by using the deeper 16-layer VGGNet for the RGB stream. The improvement in recognition performance is mainly due to the larger capacity of the VGGNet, but also to the surface-normal encoding, which better captures the structural and curvature information of objects. When evaluated on the Washington RGB-D object dataset, these changes were found to result in an accuracy of 93.5%, which is 2.2 percentage points higher than the original FusionNet proposed by Eitel et al. (2015), and competitive with current SoTA works.
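As a concrete illustration of the depth encoding discussed above, the following is a minimal NumPy sketch of colorizing surface normals. The function name, the first-order gradient-based normal approximation, and the channel mapping are illustrative choices made here for clarity, not necessarily the exact pipeline used in this work.

```python
import numpy as np

def colorize_surface_normals(depth):
    """Encode an HxW depth map as a three-channel uint8 image.

    The normal at each pixel is approximated from the depth gradient as
    the normalized direction (-dz/dx, -dz/dy, 1), a common first-order
    estimate; each normal component is then mapped to one color channel.
    """
    # np.gradient on a 2D array returns derivatives along rows (y), then columns (x).
    dzdy, dzdx = np.gradient(depth.astype(np.float32))
    normals = np.dstack((-dzdx, -dzdy, np.ones_like(dzdx)))
    # Normalize each per-pixel normal to unit length (guarding against division by zero).
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    normals /= np.maximum(norm, 1e-8)
    # Map components from [-1, 1] to [0, 255] so the encoding fills the
    # three channels expected by an RGB-pretrained CNN.
    return ((normals + 1.0) * 127.5).astype(np.uint8)
```

The resulting three-channel image can then be fed to the depth stream of the two-stream network in place of a Jet-colorized depth map.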
REFERENCES
Asif, U., Bennamoun, M., and Sohel, F. A. (2017). RGB-D object recognition and grasp detection using hierarchical cascaded forests. IEEE Transactions on Robotics, PP(99):1–18.
Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359.
Carlucci, F. M., Russo, P., and Caputo, B. (2016). A deep representation for depth images from synthetic data. ArXiv e-prints.
Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., and Burgard, W. (2015). Multimodal deep learning for robust RGB-D object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany.
Guo, Y., Bennamoun, M., Sohel, F., Lu, M., and Wan, J. (2014). 3D object recognition in cluttered scenes with local surface features: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2270–2287.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR, abs/1512.03385.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.
Lai, K., Bo, L., Ren, X., and Fox, D. (2011). A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, pages 1817–1824. IEEE.
Li, W., Cao, Z., Xiao, Y., and Fang, Z. (2015). Hybrid RGB-D object recognition using convolutional neural network and Fisher vector. In 2015 Chinese Automation Congress (CAC), pages 506–511.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. CoRR, abs/1403.6382.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
Socher, R., Huval, B., Bath, B., Manning, C. D., and Ng, A. Y. (2012). Convolutional-recursive deep learning for 3D object classification. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25. Curran Associates, Inc.