
ACKNOWLEDGEMENTS
Funded by the German Federal Ministry of Education and Research (BMBF) – Project-ID 01IS23047B – aiRobot.
REFERENCES
Baek, S., Kim, K. I., and Kim, T.-K. (2019). Pushing the
envelope for rgb-based dense 3d hand pose estimation
via neural rendering. In 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1067–1076.
Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.-J., Yuan, J., and
Thalmann, N. M. (2019). Exploiting spatial-temporal
relationships for 3d pose estimation via graph convo-
lutional networks. In Proceedings of the IEEE In-
ternational Conference on Computer Vision, pages
2272–2281.
Calli, B., Siu, A., Walsman, A., Matusik, W., and Allen, P.
(2017). The ycb object and model set: Towards com-
mon benchmarks for manipulation research. arXiv
preprint arXiv:1709.06965.
Cao, Z., Radosavovic, I., Kanazawa, A., and Malik, J.
(2021). Reconstructing hand-object interactions in
the wild. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), pages
12417–12426.
Chen, P., Chen, Y., Yang, D., Wu, F., Li, Q., Xia, Q., and
Tan, Y. B. (2021). I2uv-handnet: Image-to-uv pre-
diction network for accurate and high-fidelity 3d hand
mesh modeling. 2021 IEEE/CVF International Con-
ference on Computer Vision (ICCV), pages 12909–
12918.
Chen, X., Liu, Y., Dong, Y., Zhang, X., Ma, C., Xiong, Y.,
Zhang, Y., and Guo, X. (2022a). Mobrecon: Mobile-
friendly hand mesh reconstruction from monocular
image. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 12912–12921.
Chen, Z., Hampali, S., Schmid, C., and Laptev, I. (2023).
Gsdf: Geometry-driven signed distance functions for
3d hand-object reconstruction. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 12890–12900.
Chen, Z., Hasson, Y., Schmid, C., and Laptev, I. (2022b).
Alignsdf: Pose-aligned signed distance fields for
hand-object reconstruction. In Proceedings of the
European Conference on Computer Vision (ECCV),
pages 231–248, Cham, Switzerland. Springer.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Min-
derer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2020). An image is worth 16x16 words:
Transformers for image recognition at scale. ArXiv,
abs/2010.11929.
Ge, L., Cai, Y., Weng, J., and Yuan, J. (2018). Hand point-
net: 3d hand pose estimation using point sets. In 2018
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 8417–8426.
Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., and
Yuan, J. (2019). 3d hand shape and pose estimation
from a single rgb image. 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 10825–10834.
Hampali, S., Rad, M., Oberweger, M., and Lepetit, V.
(2020). Honnotate: A method for 3d annotation
of hand and object poses. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 3196–3206.
Hasson, Y., Tekin, B., Bogo, F., Laptev, I., Pollefeys, M.,
and Schmid, C. (2020). Leveraging photometric con-
sistency over time for sparsely supervised hand-object
reconstruction. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 571–580.
Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black,
M. J., Laptev, I., and Schmid, C. (2019). Learning
joint reconstruction of hands and manipulated objects.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
11807–11816.
Hoang, D.-C., Tan, P. X., Nguyen, A.-N., Vu, D.-Q., Vu,
V.-D., Nguyen, T.-U., Hoang, N.-A., Phan, K.-T.,
Tran, D.-T., Nguyen, V.-T., Duong, Q.-T., Ho, N.-
T., Tran, C.-T., Duong, V.-H., and Ngo, P.-Q. (2024).
Multi-modal hand-object pose estimation with adap-
tive fusion and interaction learning. IEEE Access,
12:54339–54351.
Jiang, T., Xie, X., and Li, Y. (2024). Rtmw: Real-time
multi-person 2d and 3d whole-body pose estimation.
arXiv preprint arXiv:2407.08634.
Jiang, Z., Rahmani, H., Black, S., and Williams, B. M.
(2023). A probabilistic attention model with
occlusion-aware texture regression for 3d hand recon-
struction from a single rgb image. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 6276–6286.
Lepetit, V., Moreno-Noguer, F., and Fua, P. (2009). Epnp:
An accurate o(n) solution to the pnp problem. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 1–8.
Lin, K., Wang, L., and Liu, Z. (2021). End-to-end human
pose and mesh reconstruction with transformers. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
10690–10699.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2017). Feature pyramid networks
for object detection. Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 2117–2125.
Liu, S., Jiang, H., Xu, J., Liu, S., and Wang, X.
(2021). Semi-supervised 3d hand-object poses esti-
mation with interactions in time. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 14687–14696.
Mishra, A., Fathi, A., Jain, M., and Handa, A. (2020). Dex-
ycb: A benchmark for dexterous manipulation of ob-
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications