
of orientations and directional discriminators. Both methods demonstrated state-of-the-art performance on the PASCAL3D+ dataset, with minimal performance differences under certain conditions, highlighting their practical applicability.
In terms of potential improvements, exploring a range of data augmentation techniques could enhance model robustness, particularly in real-world scenarios. Additionally, accuracy might be further refined by employing model ensembling to combine predictions from various models or iterations, thereby reducing the impact of outlier predictions.
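As an illustration of the ensembling idea, the following minimal sketch combines several orientation predictions while limiting the influence of outliers. It is not the method evaluated in this work: it assumes each model outputs a single azimuth angle in degrees, and the function names and the `max_dev_deg` threshold are hypothetical choices made for this example. Angles are averaged on the unit circle so that predictions near 0 and 360 degrees are treated as close to each other.

```python
import math


def circular_mean_deg(angles_deg):
    """Circular mean of angles given in degrees, returned in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


def angular_distance_deg(a, b):
    """Smallest absolute difference between two angles, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def ensemble_orientation_deg(angles_deg, max_dev_deg=45.0):
    """Combine predictions from several models (hypothetical helper).

    1. Pick the medoid: the prediction with the smallest total angular
       distance to all other predictions.
    2. Keep only predictions within max_dev_deg of the medoid
       (assumed threshold, to be tuned on validation data).
    3. Return the circular mean of the remaining predictions.
    """
    medoid = min(
        angles_deg,
        key=lambda a: sum(angular_distance_deg(a, b) for b in angles_deg),
    )
    inliers = [a for a in angles_deg
               if angular_distance_deg(a, medoid) <= max_dev_deg]
    return circular_mean_deg(inliers)


if __name__ == "__main__":
    # Three models roughly agree on a frontal view; one is flipped by 180 degrees.
    predictions = [350.0, 10.0, 5.0, 180.0]
    print(ensemble_orientation_deg(predictions))  # about 1.7 degrees
```

In this example the flipped prediction at 180 degrees is discarded before averaging, whereas a plain arithmetic mean of the four raw values (136.25 degrees) would be far from every plausible answer. The same scheme could be applied to predictions from different training checkpoints or from augmented copies of the same input image.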
ACKNOWLEDGMENTS
This work was partially supported by the MUR under the grant “Dipartimenti di Eccellenza 2023-2027” of the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy.
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications