the data is specifically captured in that workspace, then
the performance of the network can be improved
further.
5 CONCLUSIONS
We proposed a cascaded CNN pipeline for upper-body
pose and 3D hand pose estimation. Heatmap-based
detection and direct regression are the standard
approaches to pose estimation from RGB images. We
experimented with a stacked encoder-decoder architecture
for heatmap-based 2D detection and direct 3D regression.
Two large-scale RGB datasets and a new SSMH
custom dataset were considered for training and testing
the performance of the proposed network. We
observed that the network performs well under
occlusion on all datasets. The mean error is as low
as 20 mm for images with minimal or no occlusion
and exceeds 60 mm for highly occluded images from
the SSMH dataset. To apply the proposed pipeline in
real-time Human-Machine-Interaction applications, the
occlusion dataset must be extended and the network
retrained. Further improvements such as kinematic
fitting and tracking could help refine the fingertip
estimates.
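The mean errors quoted above are given in millimetres. As a minimal sketch, assuming the metric is the mean Euclidean distance between predicted and ground-truth 3D joint positions (this section does not define it), such an error could be computed as follows. The function name `mean_joint_error_mm` and the sample coordinates are hypothetical and are not data from the experiments.

```python
import math


def mean_joint_error_mm(predicted, ground_truth):
    """Mean Euclidean distance (in mm) between matching 3D joints.

    Assumption: both inputs are equal-length sequences of (x, y, z)
    coordinates in millimetres, listed in the same joint order.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical joint positions, for illustration only.
ground_truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 10.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0), (10.0, 10.0, 0.0)]

error = mean_joint_error_mm(predicted, ground_truth)
print(f"mean joint error: {error:.2f} mm")
```

Averaging this value separately over the images with little occlusion and over the heavily occluded ones would give per-subset figures comparable in form to the 20 mm and 60 mm reported above.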
ACKNOWLEDGEMENTS
This research is supported by Saechsische
AufbauBank (SAB, application no. 100378180).