Authors:
Dichao Liu 1; Yu Wang 2 and Jien Kato 3
Affiliations:
1 Graduate School of Informatics, Nagoya University, Nagoya, Japan
2 Graduate School of International Development, Nagoya University, Nagoya, Japan
3 College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Japan
Keyword(s):
Action Recognition, Video Understanding, Attention, Fine-grained, Deep Learning.
Related Ontology Subjects/Areas/Topics:
Computer Vision, Visualization and Computer Graphics; Image and Video Analysis; Visual Attention and Image Saliency
Abstract:
We aim to propose more effective attentional regions that can help develop better fine-grained action recognition algorithms. Building on the capability of spatial transformer networks to perform spatial manipulation inside the network, we propose an extension model, the Supervised Spatial Transformer Networks (SSTNs). The model first supervises the spatial transformers to capture the same regions as hard-coded attentional regions at certain scale levels. This supervision is then turned off, and the model adjusts the learned regions in both location and scale. Because the adjustment is conditioned on the classification loss, it is optimized directly for better recognition results. With this model, we can capture attentional regions of different levels within the networks. To evaluate SSTNs, we construct a six-stream SSTN model that exploits spatial and temporal information at three levels (general, middle and detail). The results show that the deep-learned attentional regions captured by SSTNs outperform hard-coded attentional regions. Moreover, the features learned by different streams of SSTNs are complementary to each other, and better results are obtained by fusing them.
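To make the two-phase training concrete, below is a minimal PyTorch sketch of one SSTN stream under stated assumptions: the 4-parameter scale/translation transform, the small localization and classification heads, and the loss weight lam are hypothetical stand-ins for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedSpatialTransformer(nn.Module):
    """One SSTN stream: a localization net predicts affine parameters
    (scale/translation) used to sample an attentional region from the input."""
    def __init__(self, feat_dim=128, num_classes=101):
        super().__init__()
        # Localization network (hypothetical small CNN head).
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 4),  # predicts [sx, sy, tx, ty]
        )
        # Classification network applied to the cropped region (hypothetical).
        self.classifier = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, x):
        p = self.loc(x)  # predicted region parameters
        # Build a 2x3 affine matrix restricted to scale and translation.
        theta = torch.zeros(x.size(0), 2, 3, device=x.device)
        theta[:, 0, 0] = p[:, 0]  # horizontal scale
        theta[:, 1, 1] = p[:, 1]  # vertical scale
        theta[:, 0, 2] = p[:, 2]  # horizontal translation
        theta[:, 1, 2] = p[:, 3]  # vertical translation
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        region = F.grid_sample(x, grid, align_corners=False)
        return self.classifier(region), p

def sstn_loss(logits, labels, pred_params, target_params,
              supervise=True, lam=1.0):
    """Phase 1 (supervise=True): a region loss pulls the transformer toward
    hard-coded attentional regions. Phase 2 (supervise=False): supervision
    is off and only the classification loss refines location and scale."""
    loss = F.cross_entropy(logits, labels)
    if supervise:
        loss = loss + lam * F.mse_loss(pred_params, target_params)
    return loss

# Usage: one phase-1 step with a hard-coded target region (hypothetical values).
model = SupervisedSpatialTransformer()
x = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 101, (8,))
target = torch.tensor([[0.5, 0.5, 0.0, 0.0]]).expand(8, -1)  # half-scale, centered
logits, params = model(x)
loss = sstn_loss(logits, labels, params, target, supervise=True)
loss.backward()
```

Once the predicted parameters match the hard-coded regions, switching to supervise=False leaves the classification loss as the only training signal, which is what lets the network freely adjust the regions' location and scale for recognition.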