that our simple, well-designed model outperforms other models trained on the same datasets with the same loss functions. Our work generates high-quality depth maps that capture object boundaries and reveal finer parts such as the holes in the back.
We believe that the encoder-decoder model for depth estimation can be applied in areas such as scene depth estimation for monocular SLAM, and that the estimated depth can be utilized in further applications such as semantic segmentation and scene reconstruction.
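As a toy illustration of the encoder-decoder idea summarized above, the sketch below shows the two core operations: a strided convolution that downsamples the input (the encoder), and a nearest-neighbor upsampling that restores the resolution (the decoder). This is a minimal NumPy sketch under our own simplifying assumptions (single channel, single hand-written kernel), not the actual Mini V-Net architecture.

```python
import numpy as np

def strided_conv2d(x, k, stride=2):
    """Valid 2-D convolution with a stride; x: (H, W), k: (kh, kw).

    Using stride > 1 downsamples the feature map, replacing pooling
    in a strided-CNN encoder.
    """
    kh, kw = k.shape
    H, W = x.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k)
    return out

def upsample_nn(x, factor=2):
    """Nearest-neighbor upsampling, the simplest decoder-side step."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

# Encoder halves the spatial size; decoder restores it.
image = np.ones((8, 8))
kernel = np.ones((2, 2))
encoded = strided_conv2d(image, kernel, stride=2)   # shape (4, 4)
decoded = upsample_nn(encoded, factor=2)            # shape (8, 8)
```

In a real network these operations are stacked with learned kernels and non-linearities, and the decoder typically uses learned transposed convolutions rather than fixed nearest-neighbor upsampling.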
Mini V-Net: Depth Estimation from Single Indoor-Outdoor Images using Strided-CNN