hence without the need for extensive localization annotation. This kind of approach is interesting because it is costly to have human annotators draw bounding boxes around objects in dense datasets.
Global Average Pooling (GAP), a mathematical operation that averages a matrix into a single value (described in Section 3), was first presented as a structural regularizer in NiN (Lin et al., 2013) and later used in GoogLeNet (Szegedy and Liu, 2015). More recently, it was used in ResNet (He et al., 2016) and GoogLeNet-GAP (Zhou et al., 2016) before a fully connected layer to perform object localization. In this latter approach, it was preferred to max-pooling in order to find all the discriminative parts of a class instead of only the most discriminative one.
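As a concrete illustration, GAP reduces each feature map of a convolutional output to a single scalar by averaging over its spatial dimensions (a minimal NumPy sketch with illustrative shapes, not the paper's code):

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each channel's spatial map to one scalar.

    feature_maps: array of shape (channels, height, width).
    Returns an array of shape (channels,).
    """
    return feature_maps.mean(axis=(1, 2))

# Example: 4 feature maps of size 6x6 are pooled to 4 values.
x = np.arange(4 * 6 * 6, dtype=float).reshape(4, 6, 6)
pooled = global_average_pooling(x)
print(pooled.shape)  # (4,)
```

Unlike a fully connected layer placed at the same position, GAP introduces no parameters, which is why it was proposed as a structural regularizer.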
In this work, we intend to extend the classification and localization research based on the GAP methods by proposing a modified architecture and some naive regularizations. Section 2 reviews former work published on this topic. Section 3 introduces both our architecture and a naive regularization term used for localization. Section 4 describes our evaluations of the proposed network. Finally, Section 5 concludes our work.
2 RELATED WORK
In the context of visual perception, many approaches based on CNNs have been used in recent years to perform real-time object localization. Most of the successful approaches used fully-supervised learning to tackle these problems. This section reviews the architectures that have been used first for supervised, and then for weakly or semi-supervised, localization in computer vision.
Fully-supervised Learning for Localization: In recent literature, many architectures propose to perform image classification and localization at the same time using fully-supervised learning. Models like AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2014) and GoogLeNet (Szegedy and Liu, 2015) use a stack of convolutional layers followed by fully connected layers to predict the class instance and its location in images, using, for instance, a regression on the bounding box (Sermanet
et al., 2013). Over time, these models competed in the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) localization contest (Simonyan and Zisserman, 2014), won by (Krizhevsky et al., 2012) and (Szegedy and Liu, 2015). Other models, like ResNet (He et al., 2016), introduced a similar approach but with a GAP layer at the last convolutional layer of their networks, and set a new record in the ILSVRC 2014 localization contest. It is clear that, in such contests, researchers use the maximum of available resources to train their approaches; we, however, would like our models to be less reliant on large amounts of annotated data. This is our motivation to move towards semi-supervised learning.
Weakly and Semi-supervised Learning for Localization: Some architectures are designed to perform weakly-supervised localization. For example, the model proposed by Oquab et al. (Oquab et al., 2015) is trained in two steps. First, a traditional CNN model, ending with a softmax layer, is trained on cropped images to learn to recognize a class based on a fixed receptive field. The weights learned at this step are frozen, and the second training step consists in convolving this model over a larger image in order to produce a matrix of softmax predictions. From this matrix, a new network is learned to predict the class localization. This network includes a global max-pooling operation made to retrieve the maximum probability of a class being present in the image. We took inspiration from this work as (a1) the first part of the model is trained on images which do not include any contextual information (the background is removed at the cropping step) and (a2) the resulting model produces a saliency map for every class present in an image, based on a given receptive field. However, we consider that (b1) the two-step learning can be reduced to one step, (b2) the global max-pooling is a bottleneck operation for obtaining a one-shot learning model, and (b3) the model should be able to learn from a smaller amount of pre-processed inputs.
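Our reading of this pipeline can be sketched as follows (a hedged NumPy illustration, not Oquab et al.'s code; names and shapes are ours): sliding the frozen classifier over a larger image yields a spatial grid of per-class softmax scores, and global max-pooling keeps, for each class, the highest score found anywhere in the grid as the image-level prediction.

```python
import numpy as np

def global_max_pooling(score_map):
    """Keep each class's highest score over all spatial positions.

    score_map: array of shape (classes, height, width) holding
    softmax scores produced by sliding the classifier.
    Returns an array of shape (classes,).
    """
    return score_map.max(axis=(1, 2))

# Toy grid: 3 classes scored over a 5x4 grid of positions.
rng = np.random.default_rng(0)
scores = rng.random((3, 5, 4))
image_level = global_max_pooling(scores)
```

This makes the bottleneck of point (b2) visible: only the single most discriminative position per class contributes to the image-level score.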
Points b1, b2, and b3 have been taken into account in (Zhou et al., 2016), where the authors propose a one-shot semi-supervised method to perform image classification and localization without any localization annotation. Their method, called "GoogLeNet-GAP", is a stack of CNNs ending with a large number of convolutional units, where each output map is averaged to a single value with Global Average Pooling (GAP). The resulting values are then fully connected to a softmax layer. We believe that, because the GAP layer is fully connected to the prediction layer, the last convolutional layer, which is used for localization, shares too much information with all the predictions, resulting in an attention field broader than needed.
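By linearity, the class score in this scheme is a weighted sum of the GAP values, so the same fully connected weights can be projected back onto the feature maps to obtain a per-class localization map (a sketch of our understanding of the GoogLeNet-GAP mechanism; the function and variable names are ours, not from that paper's code):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of feature maps using one class's FC weights.

    feature_maps: (channels, height, width) conv output.
    fc_weights:   (classes, channels) weights after the GAP layer.
    Returns a (height, width) activation map for class_idx.
    """
    w = fc_weights[class_idx]                      # (channels,)
    return np.tensordot(w, feature_maps, axes=1)   # (height, width)

channels, height, width, classes = 8, 7, 7, 5
rng = np.random.default_rng(1)
maps = rng.random((channels, height, width))
weights = rng.random((classes, channels))
cam = class_activation_map(maps, weights, class_idx=2)
```

Averaging this map over space recovers exactly the class score computed from the GAP values; every spatial location of every feature map is thus tied to every class prediction, which is the information sharing we argue broadens the attention field.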
In our approach, we aim, first, at developing one-shot semi-supervised training for class localization as in (Zhou et al., 2016). Second, we want to reduce the attention field in our localization mechanism by removing the Dense layer following the GAP layer, in order to have an attention model similar to (Oquab et al.,
VISAPP 2018 - International Conference on Computer Vision Theory and Applications