
Li, H., Shi, B., Dai, W., Zheng, H., Wang, B., Sun, Y., Guo, M., Li, C., Zou, J., and Xiong, H. (2023). Pose-oriented transformer with uncertainty-guided refinement for 2D-to-3D human pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1296–1304.

Li, W., Liu, H., Ding, R., Liu, M., Wang, P., and Yang, W. (2022a). Exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Transactions on Multimedia, 25:1282–1293.
Li, W., Liu, H., Tang, H., Wang, P., and Van Gool, L. (2022b). MHFormer: Multi-hypothesis transformer for 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Martinez, J., Hossain, R., Romero, J., and Little, J. J. (2017). A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649.

Rhodin, H., Salzmann, M., and Fua, P. (2018). Unsupervised geometry-aware representation for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 750–767.

Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703.

Sun, X., Xiao, B., Wei, F., Liang, S., and Wei, Y. (2018). Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545.

Toshev, A. and Szegedy, C. (2014). DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660.

Yu, T., Zheng, Z., Guo, K., Zhao, J., Dai, Q., Li, H., Pons-Moll, G., and Liu, Y. (2018). DoubleFusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7287–7296.

Zhang, J., Tu, Z., Yang, J., Chen, Y., and Yuan, J. (2022). MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232–13242.
Zhao, Q., Zheng, C., Liu, M., Wang, P., and Chen, C. (2023). PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., and Shah, M. (2023). Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1):1–37.

Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., and Ding, Z. (2021). 3D human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

Zhi, T., Lassner, C., Tung, T., Stoll, C., Narasimhan, S. G., and Vo, M. (2020). TexMesh: Reconstructing detailed human texture and geometry from RGB-D video. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X, pages 492–509. Springer.
APPENDIX
The figures in this appendix provide additional context and detail for the comparison against PoseFormer estimations. Three actions from test set S9, SittingDown, Directions, and Photo, are presented. Figures 8 to 10 compare the MPJPE across all joints, averaged over all frames of these actions, showing the extent of the per-joint improvement when the method used prior body dimensions. Figures 11 to 13 present the same comparison when the method used estimated body dimensions.
Figures 6 and 7 illustrate the MPJPE over 400 frames of the actions SittingDown and Directions for test set S9 of the Human3.6M dataset. The figures show the extent to which the refinement process reduced the MPJPE, both with prior body dimensions and with estimated body dimensions.