ACKNOWLEDGEMENTS
This research was supported by the MIUR PRIN project
“PREVUE: PRediction of activities and Events
by Vision in an Urban Environment”, grant ID
E94I19000650001.
REFERENCES
Afifi, A. J., Hellwich, O., and Soomro, T. A. (2018).
Simultaneous object classification and viewpoint estimation
using deep multi-task convolutional neural network.
In VISIGRAPP (5: VISAPP), pages 177–184. 1
Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B.
(2014). 2D human pose estimation: New benchmark
and state of the art analysis. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3686–3693. 6
Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017).
Realtime multi-person 2D pose estimation using part
affinity fields. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages
7291–7299. 6
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and
Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pages 248–255.
IEEE. 5
Grabner, A., Roth, P. M., and Lepetit, V. (2018). 3D pose
estimation and 3D model retrieval for objects in the wild.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3022–3031. 1
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778. 2, 6
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
K. Q. (2017). Densely connected convolutional
networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages
4700–4708. 6
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980. 5
Kortylewski, A., He, J., Liu, Q., and Yuille, A. L. (2020).
Compositional convolutional neural networks: A deep
architecture with innate robustness to partial
occlusion. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
8940–8949. 1
Long, J. L., Zhang, N., and Darrell, T. (2014). Do convnets
learn correspondence? In Advances in Neural Information
Processing Systems, pages 1601–1609. 6
Mottaghi, R., Xiang, Y., and Savarese, S. (2015). A
coarse-to-fine model for 3D pose estimation and sub-category
recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
418–426. 1
Newell, A., Yang, K., and Deng, J. (2016). Stacked
hourglass networks for human pose estimation. In
European Conference on Computer Vision, pages 483–499.
Springer. 3, 6
Palazzi, A., Borghi, G., Abati, D., Calderara, S., and
Cucchiara, R. (2017). Learning to map vehicles into bird’s
eye view. In International Conference on Image
Analysis and Processing, pages 233–243. Springer. 1
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in PyTorch. 5
Pavlakos, G., Zhou, X., Chan, A., Derpanis, K. G., and
Daniilidis, K. (2017). 6-DoF object pose from semantic
keypoints. In 2017 IEEE International Conference on
Robotics and Automation (ICRA), pages 2011–2018.
IEEE. 6
Simoni, A., Bergamini, L., Palazzi, A., Calderara, S., and
Cucchiara, R. (2020). Future urban scenes generation
through vehicles synthesis. In International
Conference on Pattern Recognition (ICPR). 1, 5
Simonyan, K. and Zisserman, A. (2014). Very deep
convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556. 6
Tulsiani, S. and Malik, J. (2015). Viewpoints and keypoints.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1510–1519. 6
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao,
Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020).
Deep high-resolution representation learning for
visual recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence. 6
Xiang, Y., Mottaghi, R., and Savarese, S. (2014). Beyond
PASCAL: A benchmark for 3D object detection in the
wild. In IEEE Winter Conference on Applications of
Computer Vision, pages 75–82. IEEE. 2, 4
Xiao, M., Kortylewski, A., Wu, R., Qiao, S., Shen, W.,
and Yuille, A. (2019). TDAPNet: Prototype network
with recurrent top-down attention for robust object
classification under partial occlusion. arXiv preprint
arXiv:1909.03879. 1
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017).
Aggregated residual transformations for deep neural
networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages
1492–1500. 2, 6
Zhou, X., Karpur, A., Luo, L., and Huang, Q. (2018).
StarMap for category-agnostic keypoint and viewpoint
estimation. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 318–334. 6
Improving Car Model Classification through Vehicle Keypoint Localization