
Malik, J., Abdelaziz, I., Elhayek, A., Shimada, S., Ali,
S. A., Golyanik, V., Theobalt, C., and Stricker, D.
(2020). Handvoxnet: Deep voxel-based network for
3d hand shape and pose estimation from a single depth
map. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 7113–
7122.
Malik, J., Shimada, S., Elhayek, A., Ali, S. A., Theobalt, C.,
Golyanik, V., and Stricker, D. (2021). Handvoxnet++:
3d hand shape and pose estimation using voxel-based
neural networks. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 44(12):8962–8974.
Moon, G. and Lee, K. M. (2020). I2l-meshnet: Image-to-
lixel prediction network for accurate 3d human pose
and mesh estimation from a single rgb image. In
Computer Vision–ECCV 2020: 16th European Con-
ference, Glasgow, UK, August 23–28, 2020, Proceed-
ings, Part VII 16, pages 752–768. Springer.
Park, J., Oh, Y., Moon, G., Choi, H., and Lee, K. M. (2022).
Handoccnet: Occlusion-robust 3d hand mesh estima-
tion network. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 1496–1505.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., et al. (2019). Pytorch: An imperative style,
high-performance deep learning library. Advances in
neural information processing systems, 32.
Prakash, A., Gupta, A., and Gupta, S. (2023). Mitigating
perspective distortion-induced shape ambiguity in im-
age crops. arXiv preprint arXiv:2312.06594.
Remelli, E., Han, S., Honari, S., Fua, P., and Wang, R.
(2020). Lightweight multi-view 3d pose estimation
through camera-disentangled representation. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 6040–6049.
Romero, J., Tzionas, D., and Black, M. J. (2022). Embod-
ied hands: Modeling and capturing hands and bodies
together. arXiv preprint arXiv:2201.02610.
Shuai, H., Wu, L., and Liu, Q. (2022). Adaptive multi-
view and temporal fusing transformer for 3d human
pose estimation. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 45(4):4122–4135.
Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). Deep high-
resolution representation learning for human pose es-
timation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition.
Sun, X., Xiao, B., Wei, F., Liang, S., and Wei, Y. (2018).
Integral human pose regression. In Proceedings of
the European conference on computer vision (ECCV),
pages 529–545.
Tu, H., Wang, C., and Zeng, W. (2020). Voxelpose:
Towards multi-camera 3d human pose estimation in
wild environment. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–
28, 2020, Proceedings, Part I 16, pages 197–212.
Springer.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Yang, L., Li, K., Zhan, X., Wu, F., Xu, A., Liu, L., and Lu,
C. (2022). Oakink: A large-scale knowledge repos-
itory for understanding hand-object interaction. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 20953–
20962.
Yang, L., Xu, J., Zhong, L., Zhan, X., Wang, Z., Wu, K.,
and Lu, C. (2023). Poem: Reconstructing hand in a
point embedded multi-view stereo. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 21108–21117.
Yu, Z., Yang, L., Chen, S., and Yao, A. (2021). Local and
global point cloud reconstruction for 3d hand pose es-
timation. arXiv preprint arXiv:2112.06389.
Zhang, F., Bazarevsky, V., Vakunov, A., Tkachenka, A.,
Sung, G., Chang, C.-L., and Grundmann, M. (2020).
Mediapipe hands: On-device real-time hand tracking.
arXiv preprint arXiv:2006.10214.
Zhang, J., Cai, Y., Yan, S., Feng, J., et al. (2021a). Di-
rect multi-view multi-person 3d pose estimation. Ad-
vances in Neural Information Processing Systems,
34:13153–13164.
Zhang, Z., Wang, C., Qiu, W., Qin, W., and Zeng, W.
(2021b). Adafuse: Adaptive multiview fusion for ac-
curate human pose estimation in the wild. Interna-
tional Journal of Computer Vision, 129:703–718.
Zhao, W., Wang, W., and Tian, Y. (2022). Graformer:
Graph-oriented transformer for 3d pose estimation. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 20438–
20447.
Zheng, X., Wen, C., Xue, Z., Ren, P., and Wang, J. (2023).
Hamuco: Hand pose estimation via multiview collab-
orative self-supervised learning. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 20763–20773.
Zhou, Y., Habermann, M., Xu, W., Habibie, I., Theobalt, C.,
and Xu, F. (2020). Monocular real-time hand shape
and motion capture using multi-modal data. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5346–5355.
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications