Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition

Dichao Liu, Yu Wang, Jien Kato

Abstract

We aim to propose more effective attentional regions for fine-grained action recognition. Building on the ability of spatial transformer networks to perform spatial manipulation inside the network, we propose an extension, the Supervised Spatial Transformer Networks (SSTNs). The model first supervises the spatial transformers so that they capture the same regions as hard-coded attentional regions at certain scale levels. The supervision can then be turned off, and the model adjusts the learned regions in terms of location and scale. This adjustment is driven by the classification loss, so the regions are optimized for better recognition results. With this model, we are able to capture attentional regions of different levels within the networks. To evaluate SSTNs, we construct a six-stream SSTN model that exploits spatial and temporal information at three levels (general, middle, and detail). The results show that the deep-learned attentional regions captured by SSTNs outperform hard-coded attentional regions. In addition, the features learned by the different streams of SSTNs are complementary, and fusing them yields a better result.
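
The two-phase training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the supervision term or the fusion rule, so the mean squared error on affine parameters, the `weight` factor, and score averaging are all assumptions made here, and every name (`Region`, `localization_loss`, `total_loss`, `fuse_scores`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Region:
    """Square attentional region in normalized [-1, 1] image coordinates."""
    scale: float  # side length relative to the full frame (1.0 = whole frame)
    cx: float     # horizontal center of the region
    cy: float     # vertical center of the region

    def to_affine(self) -> List[List[float]]:
        # 2x3 attention-style affine matrix, as used by a spatial transformer
        # to sample a scaled and translated crop (no rotation or shear).
        return [[self.scale, 0.0, self.cx],
                [0.0, self.scale, self.cy]]


def localization_loss(pred: Region, target: Region) -> float:
    """ASSUMPTION: supervision is an MSE between predicted and hard-coded
    affine parameters. The abstract does not specify the actual loss."""
    p = [v for row in pred.to_affine() for v in row]
    t = [v for row in target.to_affine() for v in row]
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)


def total_loss(cls_loss: float, pred: Region, target: Region,
               supervised: bool, weight: float = 1.0) -> float:
    """Phase 1 (supervised=True): classification loss plus region supervision.
    Phase 2 (supervised=False): classification loss only, so the region is
    free to move and rescale if that improves recognition."""
    if supervised:
        return cls_loss + weight * localization_loss(pred, target)
    return cls_loss


def fuse_scores(stream_scores: Sequence[Sequence[float]]) -> List[float]:
    """ASSUMPTION: late fusion by averaging per-class scores across streams."""
    n = len(stream_scores)
    return [sum(col) / n for col in zip(*stream_scores)]


if __name__ == "__main__":
    hard_coded = Region(scale=0.5, cx=0.0, cy=0.0)  # hypothetical target region
    predicted = Region(scale=0.5, cx=0.1, cy=0.0)   # hypothetical network output
    print(total_loss(1.0, predicted, hard_coded, supervised=True))
    print(total_loss(1.0, predicted, hard_coded, supervised=False))
    print(fuse_scores([[0.2, 0.8], [0.6, 0.4]]))
```

In a real model the region parameters would come from a localization network and the gradient of `total_loss` would update it; the sketch only shows how switching `supervised` off removes the constraint that ties the learned region to the hard-coded one.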


Paper Citation


in Harvard Style

Liu D., Wang Y. and Kato J. (2019). Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, ISBN 978-989-758-354-4, pages 311-318. DOI: 10.5220/0007257803110318


in Bibtex Style

@conference{visapp19,
author={Dichao Liu and Yu Wang and Jien Kato},
title={Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition},
booktitle={Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP},
year={2019},
pages={311-318},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007257803110318},
isbn={978-989-758-354-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP
TI - Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition
SN - 978-989-758-354-4
AU - Liu D.
AU - Wang Y.
AU - Kato J.
PY - 2019
SP - 311
EP - 318
DO - 10.5220/0007257803110318