Egocentric Activity Recognition using Histograms of Oriented Pairwise Relations

Ardhendu Behera, Matthew Chapman, Anthony G. Cohn, David C. Hogg


This paper presents an approach for recognising activities using video from an egocentric (first-person view) setup. Our approach infers activity from the interactions of objects and hands. In contrast to previous approaches to activity recognition, we do not require an intermediate step such as object detection or pose estimation. It has recently been shown that modelling the spatial distribution of visual words corresponding to local features further improves the performance of activity recognition using the bag-of-visual-words representation. Inspired by this philosophy, our method is based on global spatio-temporal relationships between visual words. We consider the interaction between visual words by encoding their spatial distances, orientations and alignments. These interactions are encoded using a histogram that we name the Histogram of Oriented Pairwise Relations (HOPR). The proposed approach is robust to occlusion and background variation and is evaluated on two challenging egocentric activity datasets consisting of manipulative tasks. We introduce a novel representation of activities based on interactions of local features and experimentally demonstrate its superior performance in comparison to standard activity representations such as bag-of-visual-words.
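
To make the idea of a histogram over pairwise relations between visual words more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each local feature in a single frame has a 2D position and a visual-word index, and it bins only the distance and orientation of every ordered pair; the alignment and temporal components mentioned in the abstract are omitted. The function name, the bin counts and the `max_dist` parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_relation_histogram(points, words, n_words,
                                n_dist_bins=4, n_angle_bins=8, max_dist=1.0):
    """Illustrative histogram over (word_a, word_b, distance bin, angle bin).

    points : (N, 2) array of feature positions (assumed input format)
    words  : (N,) array of visual-word indices in [0, n_words)
    Returns a flattened, L1-normalised histogram.
    """
    points = np.asarray(points, dtype=float)
    words = np.asarray(words, dtype=int)
    hist = np.zeros((n_words, n_words, n_dist_bins, n_angle_bins))
    n = len(points)
    if n < 2:
        return hist.ravel()

    # diff[i, j] is the displacement from feature i to feature j.
    diff = points[None, :, :] - points[:, None, :]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Quantise distance and orientation; values past the last edge are clipped.
    d_bin = np.minimum((dist / max_dist * n_dist_bins).astype(int),
                       n_dist_bins - 1)
    a_bin = np.minimum((angle / (2 * np.pi) * n_angle_bins).astype(int),
                       n_angle_bins - 1)

    # Accumulate every ordered pair (i, j) with i != j.
    i_idx, j_idx = np.nonzero(~np.eye(n, dtype=bool))
    np.add.at(hist,
              (words[i_idx], words[j_idx],
               d_bin[i_idx, j_idx], a_bin[i_idx, j_idx]),
              1.0)
    return (hist / hist.sum()).ravel()

# Two features with different visual words, one unit apart along the x-axis.
h = pairwise_relation_histogram([(0.0, 0.0), (1.0, 0.0)], [0, 1],
                                n_words=2, max_dist=2.0)
```

Unlike a plain bag-of-visual-words count, a descriptor of this kind changes when the same features are rearranged in the image, which is the property the abstract attributes to modelling spatial relations between visual words.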



Paper Citation

in Harvard Style

Chapman M., Behera A., Cohn A. G. and Hogg D. C. (2014). Egocentric Activity Recognition using Histograms of Oriented Pairwise Relations. In Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014), ISBN 978-989-758-004-8, pages 22-30. DOI: 10.5220/0004655100220030

in Bibtex Style

author={Matthew Chapman and Ardhendu Behera and Anthony G. Cohn and David C. Hogg},
title={Egocentric Activity Recognition using Histograms of Oriented Pairwise Relations},
booktitle={Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014)},

in EndNote Style

JO - Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014)
TI - Egocentric Activity Recognition using Histograms of Oriented Pairwise Relations
SN - 978-989-758-004-8
AU - Chapman M.
AU - Behera A.
AU - Cohn A. G.
AU - Hogg D. C.
PY - 2014
SP - 22
EP - 30
DO - 10.5220/0004655100220030