Authors:
David Dembinsky 1,2; Fatemeh Azimi 1,2; Federico Raue 2; Jörn Hees 2; Sebastian Palacio 2 and Andreas Dengel 1,2
Affiliations:
1 TU Kaiserslautern, Germany; 2 German Research Center for Artificial Intelligence (DFKI), Germany
Keyword(s):
Sequential Spatial Transformer Networks, Reinforcement Learning, Object Classification.
Abstract:
Standard classification architectures are designed and trained to perform well on dedicated image classification datasets, which usually contain images with a single object located at the image center. However, their accuracy drops when this assumption is violated, e.g., if the target object is cluttered with background noise or is not centered. In this paper, we study salient object classification: a more realistic scenario where there are multiple object instances in the scene, and we are interested in classifying the image based on the label corresponding to the most salient object. Inspired by previous works on Reinforcement Learning and Spatial Transformer Networks, we propose a model equipped with a trainable focus mechanism, which improves classification accuracy. Our experiments on the PASCAL VOC dataset show that the method is capable of increasing the intersection-over-union of the salient object, which improves the classification accuracy by 1.82 pp overall, and 3.63 pp for smaller objects. We provide an analysis of the failing cases, discussing how aspects such as dataset bias and the definition of saliency affect the classification output.