Combining Contextual and Modal Action Information into a Weighted Multikernel SVM for Human Action Recognition

Jordi Bautista-Ballester; Jaume Vergés-Llahí; Domenec Puig

doi:10.5220/0005669002990307

2016

Abstract

Understanding human activities is one of the most challenging current topics in robotics. Whether for imitation or anticipation, robots operating in human environments must recognize which action a person is performing. Action classification with a Bag of Words (BoW) representation is computationally simple and performs well. However, the growing number of categories, including actions that are easily confused, and the availability of significant contextual and multimodal information, especially in human-robot interaction, have led most authors to focus on combining image descriptors. In this setting, we propose the Contextual and Modal MultiKernel Learning Support Vector Machine (CMMKL-SVM). We introduce contextual information (objects directly related to the performed action, obtained by calculating the codebook from a set of points belonging to those objects) and multimodal information (features from depth and 3D images, which add two modalities to the RGB images). We encode the action videos using a BoW representation with both contextual and modal information and feed them to the SVM through an optimal kernel, built as a linear combination of single kernels whose weights are learned. Experiments were carried out on two action databases, CAD-120 and HMDB. Our approach matches comparable state-of-the-art methods on highly constrained databases, and its advantage grows as the database becomes more realistic, reaching a performance improvement of 14.27% on HMDB.
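The pipeline outlined in the abstract (BoW encoding per channel, then a weighted linear combination of per-channel kernels) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the exponential chi-square kernel, the toy histograms, and the fixed weights are all assumptions made for illustration; the abstract does not name the base kernel, and in the paper the weights are learned rather than set by hand.

```python
import math


def bow_histogram(descriptors, codebook):
    """Quantize descriptors to their nearest codeword; return an L1-normalized histogram."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel (an assumed choice of base kernel)."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * dist)


def gram_matrix(histograms, kernel=chi2_kernel):
    """Pairwise kernel values for one channel (e.g. RGB, depth, or object context)."""
    return [[kernel(a, b) for b in histograms] for a in histograms]


def combine_kernels(grams, weights):
    """Weighted sum K = sum_m beta_m * K_m with beta_m >= 0 rescaled to sum to 1."""
    if len(grams) != len(weights) or any(w < 0 for w in weights):
        raise ValueError("need one non-negative weight per kernel")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    betas = [w / total for w in weights]
    n = len(grams[0])
    return [
        [sum(b * g[i][j] for b, g in zip(betas, grams)) for j in range(n)]
        for i in range(n)
    ]


# Toy data: three videos, two hypothetical channels with hand-made histograms.
rgb_hists = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
depth_hists = [[0.5, 0.5], [0.2, 0.8], [0.3, 0.7]]

# Placeholder weights; a multiple kernel learning step would estimate these.
K = combine_kernels([gram_matrix(rgb_hists), gram_matrix(depth_hists)], [3.0, 1.0])
```

The combined matrix `K` could then be handed to any SVM implementation that accepts a precomputed Gram matrix, for example scikit-learn's `SVC(kernel="precomputed")`.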

Paper Citation

in Harvard Style

Bautista-Ballester J., Vergés-Llahí J. and Puig D. (2016). Combining Contextual and Modal Action Information into a Weighted Multikernel SVM for Human Action Recognition. In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, (VISIGRAPP 2016) ISBN 978-989-758-175-5, pages 299-307. DOI: 10.5220/0005669002990307

in Bibtex Style

@conference{visapp16,
author={Jordi Bautista-Ballester and Jaume Vergés-Llahí and Domenec Puig},
title={Combining Contextual and Modal Action Information into a Weighted Multikernel SVM for Human Action Recognition},
booktitle={Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, (VISIGRAPP 2016)},
year={2016},
pages={299-307},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005669002990307},
isbn={978-989-758-175-5},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, (VISIGRAPP 2016)
TI - Combining Contextual and Modal Action Information into a Weighted Multikernel SVM for Human Action Recognition
SN - 978-989-758-175-5
AU - Bautista-Ballester J.
AU - Vergés-Llahí J.
AU - Puig D.
PY - 2016
SP - 299
EP - 307
DO - 10.5220/0005669002990307