REFERENCES
Ba, J., Salakhutdinov, R. R., Grosse, R. B., and Frey, B. J.
(2015). Learning wake-sleep recurrent attention models.
In Advances in Neural Information Processing
Systems, pages 2593–2601.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2014). Return of the devil in the details: Delving
deep into convolutional nets. In British Machine
Vision Conference.
Chen, D., Hua, G., and Wen, F. (2016). Supervised
transformer network for efficient face detection. In
European Conference on Computer Vision, pages 122–138.
Springer.
Cherian, A., Fernando, B., Harandi, M., and Gould, S.
(2017a). Generalized rank pooling for activity
recognition. arXiv preprint arXiv:1704.02112.
Cherian, A. and Gould, S. (2017). Second-order
temporal pooling for action recognition. arXiv preprint
arXiv:1704.06925.
Cherian, A. and Gould, S. (2018). Second-order temporal
pooling for action recognition. International Journal
of Computer Vision.
Cherian, A., Koniusz, P., and Gould, S. (2017b).
Higher-order pooling of CNN features via kernel linearization
for action recognition. CoRR, abs/1701.05432.
Chéron, G., Laptev, I., and Schmid, C. (2015). P-CNN:
Pose-based CNN features for action recognition. In
Proceedings of the IEEE International Conference on
Computer Vision, pages 3218–3226.
Dalal, N., Triggs, B., and Schmid, C. (2006). Human
detection using oriented histograms of flow and
appearance. In European Conference on Computer Vision,
pages 428–441. Springer.
Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016).
Convolutional two-stream network fusion for video action
recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
1933–1941.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015).
Spatial transformer networks. In Advances in Neural
Information Processing Systems, pages 2017–2025.
Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011).
Learning hierarchical invariant spatio-temporal features
for action recognition with independent subspace
analysis. In Computer Vision and Pattern Recognition
(CVPR), 2011 IEEE Conference on, pages 3361–3368.
IEEE.
Li, Z., Gavrilyuk, K., Gavves, E., Jain, M., and Snoek, C. G.
(2018). VideoLSTM convolves, attends and flows for
action recognition. Computer Vision and Image
Understanding, 166:41–50.
Rohrbach, M., Amin, S., Andriluka, M., and Schiele, B.
(2012). A database for fine grained activity detection
of cooking activities. In Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on,
pages 1194–1201. IEEE.
Sharma, S., Kiros, R., and Salakhutdinov, R. (2015). Action
recognition using visual attention. arXiv preprint
arXiv:1511.04119.
Simonyan, K. and Zisserman, A. (2014a). Two-stream
convolutional networks for action recognition in
videos. In Advances in Neural Information Processing
Systems, pages 568–576.
Simonyan, K. and Zisserman, A. (2014b). Very deep
convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Singh, B., Marks, T. K., Jones, M., Tuzel, O., and Shao, M.
(2016). A multi-stream bi-directional recurrent neural
network for fine-grained action detection. In
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1961–1970.
Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101:
A dataset of 101 human actions classes from videos in
the wild. arXiv preprint arXiv:1212.0402.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011).
Action recognition by dense trajectories. In Computer
Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on, pages 3169–3176. IEEE.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2013).
Dense trajectories and motion boundary descriptors
for action recognition. International Journal of
Computer Vision, 103(1):60–79.
Wang, L., Qiao, Y., and Tang, X. (2015). Action
recognition with trajectory-pooled deep-convolutional
descriptors. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
4305–4314.
Wang, Y., Song, J., Wang, L., Van Gool, L., and Hilliges,
O. (2016). Two-stream SR-CNNs for action recognition
in videos. In BMVC.
VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications