ACKNOWLEDGEMENTS
This work is funded by the Deutsche Forschungsgemeinschaft
(DFG, German Research Foundation) -
Project-ID 416228727 - SFB 1410.
REFERENCES
Agarap, A. F. (2019). Deep learning using rectified linear
units (relu).
Bandi, C. and Thomas, U. (2020). Regression-based 3d
hand pose estimation using heatmaps. In Proceedings
of the 15th International Joint Conference on Com-
puter Vision, Imaging and Computer Graphics Theory
and Applications - Volume 5: VISAPP, pages 636–643.
INSTICC, SciTePress.
Bochkovskiy, A., Wang, C., and Liao, H. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion. CoRR, abs/2004.10934.
Brahmbhatt, S., Tang, C., Twigg, C. D., Kemp, C. C., and
Hays, J. (2020). ContactPose: A dataset of grasps
with object contact and hand pose. In The European
Conference on Computer Vision (ECCV).
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-fcn: Object de-
tection via region-based fully convolutional networks.
Doosti, B., Naha, S., Mirbagheri, M., and Crandall, D. J.
(2020). Hope-net: A graph-based model for hand-
object pose estimation. CoRR, abs/2004.00060.
Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.-
K. (2018). First-person hand action benchmark with
rgb-d videos and 3d hand pose annotations. In Pro-
ceedings of Computer Vision and Pattern Recognition
(CVPR).
Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., and Yuan,
J. (2019). 3d hand shape and pose estimation from a
single RGB image. CoRR, abs/1903.00812.
Hampali, S., Oberweger, M., Rad, M., and Lepetit, V.
(2019). HO-3D: A multi-user, multi-object dataset
for joint 3d hand-object pose estimation. CoRR,
abs/1907.01481.
Hasson, Y., Tekin, B., Bogo, F., Laptev, I., Pollefeys, M.,
and Schmid, C. (2020). Leveraging photometric con-
sistency over time for sparsely supervised hand-object
reconstruction. CoRR, abs/2004.13449.
Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black,
M. J., Laptev, I., and Schmid, C. (2019). Learning
joint reconstruction of hands and manipulated objects.
CoRR, abs/1904.05767.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep
residual learning for image recognition. CoRR,
abs/1512.03385.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski,
G., Konolige, K., and Navab, N. (2012). Model based
training, detection and pose estimation of texture-less
3d objects in heavily cluttered scenes. In Proceedings
of the 11th Asian Conference on Computer Vision -
Volume Part I, ACCV’12, pages 548–562, Berlin, Heidelberg.
Springer-Verlag.
Iqbal, U., Molchanov, P., Breuel, T., Gall, J., and Kautz, J.
(2018). Hand pose estimation via latent 2.5d heatmap
regression. CoRR, abs/1804.09534.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. CoRR,
abs/1609.02907.
Labbé, Y., Carpentier, J., Aubry, M., and Sivic, J. (2020).
Cosypose: Consistent multi-view multi-object 6d
pose estimation. CoRR, abs/2008.08465.
Lepetit, V., Moreno-Noguer, F., and Fua, P. (2009). Epnp:
An accurate o(n) solution to the pnp problem. Int. J.
Comput. Vision, 81(2):155–166.
Li, Y., Wang, G., Ji, X., Xiang, Y., and Fox, D. (2018).
Deepim: Deep iterative matching for 6d pose estima-
tion. CoRR, abs/1804.00175.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. Lecture Notes in Computer Science,
pages 21–37.
Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Srid-
har, S., Casas, D., and Theobalt, C. (2018). Ganerated
hands for real-time 3d hand tracking from monocular
rgb. In Proceedings of Computer Vision and Pattern
Recognition (CVPR).
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas,
D., and Theobalt, C. (2017). Real-time hand track-
ing under occlusion from an egocentric rgb-d sensor.
In Proceedings of International Conference on Com-
puter Vision (ICCV).
Newell, A., Yang, K., and Deng, J. (2016). Stacked hour-
glass networks for human pose estimation. CoRR,
abs/1603.06937.
Park, K., Patten, T., and Vincze, M. (2019). Pix2pose:
Pixel-wise coordinate regression of objects for 6d
pose estimation. CoRR, abs/1908.07433.
Peng, S., Liu, Y., Huang, Q., Bao, H., and Zhou, X. (2018).
Pvnet: Pixel-wise voting network for 6dof pose esti-
mation. CoRR, abs/1812.11788.
Romero, J., Tzionas, D., and Black, M. J. (2017). Embod-
ied hands: Modeling and capturing hands and bodies
together. ACM Transactions on Graphics (Proc. SIGGRAPH
Asia), 36(6):245:1–245:17.
Rosenberger, P., Cosgun, A., Newbury, R., Kwan, J.,
Ortenzi, V., Corke, P., and Grafinger, M. (2020).
Object-independent human-to-robot handovers using
real time robotic vision. CoRR, abs/2006.01797.
Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., and
Chen, L. (2018). Inverted residuals and linear bottle-
necks: Mobile networks for classification, detection
and segmentation. CoRR, abs/1801.04381.
Simon, T., Joo, H., Matthews, I. A., and Sheikh, Y. (2017).
Hand keypoint detection in single images using mul-
tiview bootstrapping. CoRR, abs/1704.07809.
Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D.,
Oulasvirta, A., and Theobalt, C. (2016). Real-time
joint tracking of a hand manipulating an object from
rgb-d input. In Proceedings of European Conference
on Computer Vision (ECCV).
3D Hand and Object Pose Estimation for Real-time Human-robot Interaction