Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks

Kyunghyun Cho, Xi Chen


Gesture recognition from motion capture data and depth sensors has recently drawn increasing attention in computer vision. Most current systems, however, classify datasets containing only a few dozen distinct actions, and feature extraction from the data is often computationally complex. In this paper, we propose a novel system that recognizes actions from skeleton data with simple, yet effective, features using deep neural networks. Features are extracted for each frame based on the relative positions of joints (PO), temporal differences (TD), and normalized trajectories of motion (NT). Given these features, a hybrid multi-layer perceptron is trained that simultaneously classifies and reconstructs the input data. We use a deep autoencoder to visualize the learnt features. The experiments show that deep neural networks can capture more discriminative information than, for instance, principal component analysis. We test our system on a public database with 65 classes and more than 2,000 motion sequences, and obtain an accuracy above 95%, which is, to our knowledge, the state-of-the-art result for such a large dataset.
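The three per-frame feature groups named in the abstract can be sketched as follows. This is a minimal illustration only, assuming skeleton data as a `(T, J, 3)` array of 3D joint positions; the function name, the choice of root joint, the zero-padding of the first temporal difference, and the trajectory normalization are assumptions for the sketch, not the paper's exact definitions.

```python
import numpy as np

def extract_features(frames, root_joint=0):
    """Per-frame PO/TD/NT-style features from a motion capture sequence.

    frames: array of shape (T, J, 3) -- T frames, J joints, 3D positions.
    """
    frames = np.asarray(frames, dtype=float)
    T, J, _ = frames.shape

    # PO: joint positions relative to a root joint in each frame.
    po = frames - frames[:, root_joint:root_joint + 1, :]

    # TD: temporal differences between consecutive frames
    # (first frame padded with zeros so all T frames keep a feature row).
    td = np.vstack([np.zeros((1, J, 3)), np.diff(frames, axis=0)])

    # NT: each joint's trajectory, centred over the sequence and scaled
    # to unit range (one plausible normalization; an assumption here).
    traj = frames - frames.mean(axis=0, keepdims=True)
    scale = np.abs(traj).max()
    nt = traj / (scale if scale > 0 else 1.0)

    # Concatenate the three groups and flatten each frame to a vector.
    return np.concatenate([po, td, nt], axis=1).reshape(T, -1)
```

Each frame then yields a `9 * J`-dimensional feature vector, which could be fed to a multi-layer perceptron for classification.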


[Table excerpt: HDM05 action classes 42–54 — sitDownTable; skier1RepsLstart, skier3RepsLstart; sneak2StepsLStart, sneak2StepsRStart, sneak4StepsLStart, sneak4StepsRStart; squat1Reps, squat3Reps; staircaseDown3Rstart; staircaseUp3Rstart; standUpKneelToStand; standUpLieFloor; standUpSitChair; standUpSitFloor; standUpSitTable; throwBasketball; throwFarR]

Paper Citation

in Harvard Style

Cho K. and Chen X. (2014). Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks. In Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014) ISBN 978-989-758-004-8, pages 122-130. DOI: 10.5220/0004718301220130

in Bibtex Style

@conference{cho2014classifying,
author={Kyunghyun Cho and Xi Chen},
title={Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks},
booktitle={Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014)},
year={2014},
pages={122-130},
doi={10.5220/0004718301220130},
isbn={978-989-758-004-8},
}

in EndNote Style

JO - Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014)
TI - Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks
SN - 978-989-758-004-8
AU - Cho K.
AU - Chen X.
PY - 2014
SP - 122
EP - 130
DO - 10.5220/0004718301220130