6 CONCLUSION
In this paper, we compared visual navigation methods based on
reinforcement learning and localization. We performed experiments
on differently sized discrete-state environments composed of both
virtual and real images. The results suggest that, despite the
availability of multi-target approaches, visual navigation methods
based on reinforcement learning struggle to generalize to targets
unseen during training. In contrast, a simple baseline that relies
on inaccurate localization achieves similar results on targets seen
during training and generalizes better to unseen targets. These
observations suggest that methods based on reinforcement learning
could benefit even from inaccurate localization. Future work can
investigate approaches that fuse visual navigation methods based on
reinforcement learning with localization.
ACKNOWLEDGEMENTS
This research is supported by OrangeDev s.r.l. and
Piano della Ricerca 2016-2018, Linea di Intervento 2
of DMI, University of Catania.
A Comparison of Visual Navigation Approaches based on Localization and Reinforcement Learning in Virtual and Real Environments