Using Action Objects Contextual Information for a Multichannel SVM in an Action Recognition Approach based on Bag of VisualWords

Jordi Bautista-Ballester, Jaume Vergés-Llahí, Domenec Puig

2015

Abstract

Classifying web videos using a Bag of Words (BoW) representation has received increased attention due to its computational simplicity and good performance. The growing number of categories, including actions with high confusion, and the addition of significant contextual information have led most authors to focus their efforts on the combination of descriptors. In this field, we propose to use the multikernel Support Vector Machine (SVM) with a contrasted selection of kernels. It is widely accepted that using descriptors that provide different kinds of information tends to increase performance. To this end, our approach introduces contextual information, i.e. objects directly related to the performed action, by pre-selecting a set of points belonging to objects to calculate the codebook. In order to know whether a point is part of an object, the objects are first tracked by matching consecutive frames, and the object bounding box is calculated and labeled. We code the action videos using the BoW representation with the object codewords and introduce them to the SVM as an additional kernel. Experiments have been carried out on two action databases, KTH and HMDB; the results show a significant improvement with respect to other similar approaches.
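The abstract describes combining BoW histograms from several channels (e.g. motion descriptors plus the object codewords) into a multikernel SVM. A common way to do this, following the chi-square kernel formulation of Zhang et al. (2006, reference 24 below), is to sum the per-channel chi-square distances inside a single exponential kernel. The sketch below is purely illustrative and is not the authors' implementation; the function names and the choice of a unit normalization factor per channel are assumptions.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    # Chi-square distance between two L1-normalized BoW histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(x_channels, y_channels):
    # Combine per-channel chi-square distances into one kernel value:
    #   K(x, y) = exp(-sum_c D_c(x_c, y_c) / A_c),
    # where A_c would normally be the mean chi-square distance of
    # channel c over the training set (taken as 1.0 here for brevity).
    total = sum(chi2_distance(xc, yc)
                for xc, yc in zip(x_channels, y_channels))
    return np.exp(-total)
```

In practice the resulting Gram matrix would be precomputed over all video pairs and passed to an SVM that accepts precomputed kernels, such as LIBSVM (reference 2 below).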

References

  1. Bilinski, P. and Corvee, E. (2013). Relative Dense Tracklets for Human Action Recognition. 10th IEEE International Conference on Automatic Face and Gesture Recognition.
  2. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  3. Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, CVPR '05, pages 886-893, Washington, DC, USA. IEEE Computer Society.
  4. Dalal, N., Triggs, B., and Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In Proceedings of the 9th European conference on Computer Vision - Volume Part II, ECCV'06, pages 428-441, Berlin, Heidelberg. Springer-Verlag.
  5. Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In Proceedings of the 14th International Conference on Computer Communications and Networks, ICCCN '05, pages 65-72, Washington, DC, USA. IEEE Computer Society.
  6. Hartley, R. I. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition.
  7. Ikizler-Cinbis, N. and Sclaroff, S. (2010). Object, scene and actions: Combining multiple features for human action recognition. In Proceedings of the 11th European Conference on Computer Vision: Part I, ECCV'10, pages 494-507, Berlin, Heidelberg. Springer-Verlag.
  8. Jiang, Y., Dai, Q., Xue, X., Liu, W., and Ngo, C. (2012). Trajectory-based modeling of human actions with motion reference points. In European Conference on Computer Vision (ECCV).
  9. Kläser, A., Marszalek, M., and Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In British Machine Vision Conference, pages 995-1004.
  10. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV).
  11. Laptev, I. (2005). On space-time interest points. Int. J. Comput. Vision, 64(2-3):107-123.
  12. Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th international joint conference on Artificial intelligence - Volume 2, IJCAI'81, pages 674-679, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  13. Poppe, R. (2010). A survey on vision-based human action recognition. Image Vision Comput., 28(6):976-990.
  14. Reddy, K. K. and Shah, M. (2013). Recognizing 50 human action categories of web videos. Mach. Vision Appl., 24(5):971-981.
  15. Schuldt, C., Laptev, I., and Caputo, B. (2004). Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04) - Volume 3, ICPR '04, pages 32-36, Washington, DC, USA. IEEE Computer Society.
  16. Scovanner, P., Ali, S., and Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th International Conference on Multimedia, MULTIMEDIA '07, pages 357-360, New York, NY, USA. ACM.
  17. Snoek, C. G. M., Worring, M., and Smeulders, A. W. M. (2005). Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA '05, pages 399-402, New York, NY, USA. ACM.
  18. Solmaz, B., Modiri, S. A., and Shah, M. (2012). Classifying web videos using a global video descriptor. Machine Vision and Applications.
  19. Wang, H., Kläser, A., Schmid, C., and Liu, C. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision.
  20. Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011). Action Recognition by Dense Trajectories. In IEEE Conf. on Computer Vision & Pattern Recognition, pages 3169-3176, Colorado Springs, United States.
  21. Wang, H. and Schmid, C. (2013). Action Recognition with Improved Trajectories. In ICCV 2013 - IEEE International Conference on Computer Vision, pages 3551-3558, Sydney, Australia. IEEE.
  22. Weinland, D., Ronfard, R., and Boyer, E. (2011). A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision Image Understanding, 115(2):224-241.
  23. Willems, G., Tuytelaars, T., and Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In Proceedings of the 10th European Conf. on Computer Vision: Part II, ECCV '08, pages 650-663, Berlin, Heidelberg. Springer-Verlag.
  24. Zhang, J., Marszalek, M., Lazebnik, S., and Schmid, C. (2006). Local features and kernels for classification of texture and object categories: A comprehensive study. In Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, CVPRW '06, pages 13-, Washington, DC, USA. IEEE Computer Society.


Paper Citation


in Harvard Style

Bautista-Ballester J., Vergés-Llahí J. and Puig D. (2015). Using Action Objects Contextual Information for a Multichannel SVM in an Action Recognition Approach based on Bag of VisualWords. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015) ISBN 978-989-758-090-1, pages 78-86. DOI: 10.5220/0005301000780086


in Bibtex Style

@conference{visapp15,
author={Jordi Bautista-Ballester and Jaume Vergés-Llahí and Domenec Puig},
title={Using Action Objects Contextual Information for a Multichannel SVM in an Action Recognition Approach based on Bag of VisualWords},
booktitle={Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015)},
year={2015},
pages={78-86},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005301000780086},
isbn={978-989-758-090-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015)
TI - Using Action Objects Contextual Information for a Multichannel SVM in an Action Recognition Approach based on Bag of VisualWords
SN - 978-989-758-090-1
AU - Bautista-Ballester J.
AU - Vergés-Llahí J.
AU - Puig D.
PY - 2015
SP - 78
EP - 86
DO - 10.5220/0005301000780086