and while there are undoubtedly improvements that could be made to the model architecture and training regime, the single largest opportunity for improvement lies in generating larger and more diverse training data sets. An important observation from the current work is that the virtual world allows fine control over the arrangement, density and distribution of objects at different depths. For example, we found that early data sets contained too much road and sky, which distorted the accuracy metrics; later data sets contained more objects of interest and produced more robust models. Future models will be trained on data generated in an interactive process, designed to produce examples for the classes and depths at which the network's errors are largest. We believe this will enable more efficient large-scale network training.
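As a minimal sketch of how such an interactive process might work, the snippet below biases scene generation toward the class and depth-bin cells with the highest validation error. The error grid and the commented-out render_scenes call are hypothetical placeholders for the model-evaluation and virtual-world rendering steps; they are not part of the present work.

import numpy as np

# Hypothetical error-driven generation loop (sketch only).
N_CLASSES, N_DEPTH_BINS = 10, 8

def sampling_weights(error: np.ndarray) -> np.ndarray:
    """Turn a (class, depth-bin) error grid into scene-sampling weights."""
    w = error - error.min()
    total = w.sum()
    return w / total if total > 0 else np.full_like(w, 1.0 / w.size)

# One round of the loop: measure where the network is weakest, then
# bias the next synthetic batch toward those class/depth cells.
error = np.random.rand(N_CLASSES, N_DEPTH_BINS)  # placeholder for per-cell validation error
weights = sampling_weights(error)
cells = np.random.choice(weights.size, size=1000, p=weights.ravel())
cls, depth_bin = np.unravel_index(cells, weights.shape)
# render_scenes(cls, depth_bin)  # hypothetical: request scenes emphasising these cells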
ACKNOWLEDGEMENTS
This research was funded by DSTL.