
Doosti, B., Naha, S., Mirbagheri, M., and Crandall, D. J.
(2020). Hope-net: A graph-based model for hand-
object pose estimation. 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 6607–6616.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Min-
derer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2021). An image is worth 16x16 words:
Transformers for image recognition at scale. ArXiv,
abs/2010.11929.
Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.-
K. (2018). First-person hand action benchmark with
rgb-d videos and 3d hand pose annotations. 2018
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 409–419.
Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., and
Yuan, J. (2019). 3d hand shape and pose estimation
from a single rgb image. 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 10825–10834.
Hasson, Y., Varol, G., Schmid, C., and Laptev, I. (2021).
Towards unconstrained joint hand-object reconstruc-
tion from rgb videos. 2021 International Conference
on 3D Vision (3DV).
Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black,
M. J., Laptev, I., and Schmid, C. (2019). Learning
joint reconstruction of hands and manipulated objects.
In CVPR.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot,
X., Botvinick, M. M., Mohamed, S., and Lerchner,
A. (2017). beta-vae: Learning basic visual concepts
with a constrained variational framework. In ICLR.
Höll, M., Oberweger, M., Arth, C., and Lepetit, V. (2018).
Efficient physics-based implementation for realistic
hand-object interaction in virtual reality. In Proc. of
Conference on Virtual Reality and 3D User Interfaces.
Hu, Y., Hugonot, J., Fua, P. V., and Salzmann, M. (2019).
Segmentation-driven 6d object pose estimation. 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 3380–3389.
Ivanovic, B., Leung, K., Schmerling, E., and Pavone, M.
(2021). Multimodal deep generative models for tra-
jectory prediction: A conditional variational autoen-
coder approach. IEEE Robotics and Automation Let-
ters, 6(2):295–302.
Karunratanakul, K., Yang, J., Zhang, Y., Black, M. J.,
Muandet, K., and Tang, S. (2020). Grasping field:
Learning implicit representations for human grasps.
2020 International Conference on 3D Vision (3DV),
pages 333–344.
Kehl, W., Manhardt, F., Tombari, F., Ilic, S., and Navab, N.
(2017). Ssd-6d: Making rgb-based 3d detection and
6d pose estimation great again. 2017 IEEE Interna-
tional Conference on Computer Vision (ICCV), pages
1530–1538.
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. CoRR, abs/1412.6980.
Kingma, D. P. and Welling, M. (2014). Auto-encoding vari-
ational bayes. CoRR, abs/1312.6114.
Kulon, D., Güler, R. A., Kokkinos, I., Bronstein, M. M.,
and Zafeiriou, S. (2020). Weakly-supervised mesh-
convolutional hand reconstruction in the wild. 2020
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 4989–4999.
Labbé, Y., Carpentier, J., Aubry, M., and Sivic, J. (2020).
Cosypose: Consistent multi-view multi-object 6d
pose estimation. In ECCV.
Li, S., Wang, H., and Lee, D. (2020). Hand pose estimation
for hand-object interaction cases using augmented au-
toencoder. In 2020 IEEE International Conference on
Robotics and Automation (ICRA), pages 993–999.
Liu, S., Jiang, H., Xu, J., Liu, S., and Wang, X.
(2021). Semi-supervised 3d hand-object poses esti-
mation with interactions in time. In Proceedings of
the IEEE conference on computer vision and pattern
recognition.
Moon, G., Chang, J., and Lee, K. M. (2018). V2v-
posenet: Voxel-to-voxel prediction network for accu-
rate 3d hand and human pose estimation from a single
depth map. In The IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR).
Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Srid-
har, S., Casas, D., and Theobalt, C. (2018). Ganerated
hands for real-time 3d hand tracking from monocular
rgb. In Proceedings of Computer Vision and Pattern
Recognition (CVPR).
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas,
D., and Theobalt, C. (2017). Real-time hand tracking
under occlusion from an egocentric rgb-d sensor. 2017
IEEE International Conference on Computer Vision
Workshops (ICCVW), pages 1284–1293.
Ortenzi, V., Cosgun, A., Pardi, T., Chan, W. P., Croft, E. A.,
and Kulić, D. (2021). Object handovers: A review for
robotics. IEEE Transactions on Robotics, 37:1855–
1873.
Park, J., Oh, Y., Moon, G., Choi, H., and Lee, K. M. (2022).
Handoccnet: Occlusion-robust 3d hand mesh estima-
tion network. In Conference on Computer Vision and
Pattern Recognition (CVPR).
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z.,
Raison, M., Tejani, A., Chilamkurthy, S.,
Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019).
PyTorch: An Imperative Style, High-Performance
Deep Learning Library. Curran Associates Inc., Red
Hook, NY, USA.
Peng, S., Liu, Y., Huang, Q., Bao, H., and Zhou, X. (2019).
Pvnet: Pixel-wise voting network for 6dof pose es-
timation. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 4556–
4565.
Piumsomboon, T., Clark, A., Billinghurst, M., and Cock-
burn, A. (2013). User-defined gestures for augmented
reality. In CHI ’13 Extended Abstracts on Human
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications