Authors:
Marc Moreaux 1; Natalia Lyubova 2; Isabelle Ferrané 3 and Frédéric Lerasle 3
Affiliations:
1 Softbank Robotics Europe and Univ. de Toulouse, France; 2 Softbank Robotics Europe, France; 3 Univ. de Toulouse, France
Keyword(s):
Semi-supervised Class Localization, Image Classification, Class Saliency, Global Average Pooling.
Related Ontology Subjects/Areas/Topics:
Computer Vision, Visualization and Computer Graphics; Image and Video Analysis; Visual Attention and Image Saliency
Abstract:
This work addresses the issue of image classification and localization of human actions based on visual data
acquired from RGB sensors. Our approach is inspired by the success of deep learning in image classification.
In this paper, we describe our method and how the concept of Global Average Pooling (GAP) applies in the
context of semi-supervised class localization. We benchmark it with respect to Class Activation Mapping
initiated in (Zhou et al., 2016), propose a regularization over the GAP maps to enhance the results, and study
whether a combination of these two ideas can result in a better classification accuracy. The models are trained
and tested on the Stanford 40 Action dataset (Yao et al., 2011) describing people performing 40 different actions
such as drinking, cooking or watching TV. Compared to the aforementioned baseline, our model improves
the classification accuracy by 5.3 percentage points, achieves a localization accuracy of 50.3%, and drastically
reduces the computation needed to retrieve the class saliency from the base convolutional model.
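The Class Activation Mapping baseline referenced above (Zhou et al., 2016) weights the last convolutional feature maps by the GAP-to-class weights to obtain a per-class saliency map. The following is a minimal sketch of that idea, not the authors' actual implementation; shapes, function names, and the NumPy formulation are illustrative assumptions.

```python
# Illustrative sketch of Class Activation Mapping (Zhou et al., 2016).
# Assumed shapes: feature_maps is (H, W, K) from the last conv layer,
# class_weights is (K,), the weights linking the GAP vector to one class logit.
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weight each of the K feature maps by its class weight and sum,
    yielding an (H, W) saliency map for that class."""
    return np.tensordot(feature_maps, class_weights, axes=([2], [0]))

def gap_logit(feature_maps, class_weights):
    """Under a GAP head, the class logit is the dot product of the
    globally averaged features with the class weights, which equals
    the spatial mean of the class activation map."""
    gap = feature_maps.mean(axis=(0, 1))  # (K,) global average pooling
    return float(gap @ class_weights)
```

A useful sanity check of the construction is that the spatial mean of the CAM recovers the class logit exactly, which is why the map localizes the evidence the classifier actually used.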