5 CONCLUSIONS AND FUTURE WORK
In this paper we examined cross-view image translation, generating a street view image from the corresponding aerial view using a cascade pipeline in which coarse street view image generation, semantic segmentation, and image refinement are combined and trained together. We tested the state-of-the-art generator models U-Net, ResNet, and ResU-Net++ and found that the best results were obtained with the configuration (Generator 1: U-Net, Generator 2: ResNet, Generator 3: ResU-Net++). This demonstrates the importance of skip connections for street view generation and of attention for image refinement. The role of each of the three subtasks in the pipeline was studied, and we concluded that each subtask improved overall performance both qualitatively and quantitatively. Future work includes investigating appropriate networks for further refinement of the output images to address artifacts related to perspective projection, and incorporating varying sources of input data (such as aerial imagery captured by drones at varying heights, or video input).
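At a high level, the cascade described above chains three generators: Generator 1 maps the aerial view to a coarse street view, Generator 2 predicts a semantic map from that coarse image, and Generator 3 refines the image conditioned on both. The sketch below illustrates only this data flow; the generator bodies are trivial NumPy placeholders, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def coarse_generator(aerial):
    # Placeholder for Generator 1 (U-Net): aerial view -> coarse street view.
    return np.clip(aerial.mean(axis=-1, keepdims=True).repeat(3, axis=-1), 0.0, 1.0)

def segmentation_generator(coarse):
    # Placeholder for Generator 2 (ResNet): coarse street view -> semantic map.
    return (coarse > 0.5).astype(np.float32)

def refinement_generator(coarse, seg):
    # Placeholder for Generator 3 (ResU-Net++): coarse image + semantics -> refined image.
    return np.clip(0.5 * coarse + 0.5 * seg * coarse, 0.0, 1.0)

def cascade(aerial):
    # The three subtasks are applied in sequence; in the paper the
    # generators are trained jointly rather than independently.
    coarse = coarse_generator(aerial)
    seg = segmentation_generator(coarse)
    refined = refinement_generator(coarse, seg)
    return coarse, seg, refined

aerial = np.random.rand(64, 64, 3).astype(np.float32)
coarse, seg, refined = cascade(aerial)
```

In an actual implementation each placeholder would be a trained convolutional network, and the intermediate semantic map would also feed a segmentation loss during joint training.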
Aerial to Street View Image Translation using Cascaded Conditional GANs