A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos

Abhinaba Roy, Biplab Banerjee, Vittorio Murino

2017

Abstract

This paper addresses action recognition from unconstrained videos within the multiple instance learning (MIL) framework. The traditional MIL paradigm treats each data item as a bag of instances, with the constraint that positive bags contain at least some class-specific instances whereas negative bags contain instances only from the negative classes. A classifier is then built from the bag-level annotations and a distance metric between bags. However, such an approach is not robust to outliers and is time-consuming even for moderately large datasets. In contrast, we propose a dictionary learning based strategy for MIL that first identifies class-specific discriminative codewords and then projects the instances of each bag into a probabilistic embedding space defined with respect to the selected codewords. This yields a fixed-length vector representation of each bag that is dominated by the properties of its class-specific instances. To identify the class-specific codewords, we introduce a novel exhaustive search strategy based on a support vector machine classifier. A standard multiclass classification pipeline is then applied in the new embedded feature space to recognize actions. We validate the proposed framework on the challenging KTH and Weizmann datasets; the results are promising and comparable to those of representative techniques from the literature.
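The abstract does not specify how the probabilistic embedding is computed, so the following is only a minimal sketch of the general idea, under stated assumptions: each instance is softly assigned to the codewords with a softmax over negative squared Euclidean distances, and the assignments are averaged over the bag. The function name `embed_bag`, the `bandwidth` parameter, the softmax assignment, and the average pooling are all illustrative choices, not the authors' method. The point it illustrates is that bags of different sizes map to vectors of the same length.

```python
import math


def embed_bag(instances, codewords, bandwidth=1.0):
    """Map a bag of n instances to a fixed-length vector of k soft-assignment scores.

    Hypothetical helper: softmax assignment and mean pooling are assumptions,
    not details taken from the paper.
    """
    if not instances:
        raise ValueError("a bag must contain at least one instance")
    k = len(codewords)
    pooled = [0.0] * k
    for x in instances:
        # Squared Euclidean distance from this instance to every codeword.
        sq_dist = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codewords]
        # Softmax over codewords; subtracting the max keeps exp() stable.
        logits = [-d / (2.0 * bandwidth ** 2) for d in sq_dist]
        top = max(logits)
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        for j in range(k):
            pooled[j] += weights[j] / total
    # Average pooling over the instances of the bag.
    return [p / len(instances) for p in pooled]


# Two toy codewords and two bags of different sizes (made-up data).
codewords = [[0.0, 0.0], [10.0, 10.0]]
small_bag = [[0.1, -0.2], [0.3, 0.1]]
large_bag = [[9.8, 10.1], [10.2, 9.9], [0.0, 0.1], [10.0, 10.0]]

v_small = embed_bag(small_bag, codewords)
v_large = embed_bag(large_bag, codewords)
print(len(v_small), len(v_large))
```

Because every instance contributes a probability distribution over the codewords, each embedded vector sums to one regardless of the bag size, and the resulting vectors can be passed to any standard multiclass classifier.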



Paper Citation


in Harvard Style

Roy A., Banerjee B. and Murino V. (2017). A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-222-6, pages 519-526. DOI: 10.5220/0006200205190526


in Bibtex Style

@conference{icpram17,
author={Abhinaba Roy and Biplab Banerjee and Vittorio Murino},
title={A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos},
booktitle={Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2017},
pages={519-526},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006200205190526},
isbn={978-989-758-222-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos
SN - 978-989-758-222-6
AU - Roy A.
AU - Banerjee B.
AU - Murino V.
PY - 2017
SP - 519
EP - 526
DO - 10.5220/0006200205190526