As for the gist descriptor, it was introduced by Oliva et al. (Oliva and Torralba, 2006). In this work, the version used consists of: (1) obtaining $m_2$ different resolution images, (2) applying Gabor filters over the $m_2$ images with $m_1$ different orientations, (3) grouping the pixels of each image into $k_2$ horizontal blocks and (4) arranging the obtained orientation information into one row to create a vector $\vec{d} \in \mathbb{R}^{m_1 \cdot m_2 \cdot k_2 \times 1}$.
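To make this pipeline concrete, the following sketch (in Python with scikit-image, which is our assumption; the filter frequency, the use of block-wise averaging and the default values $m_1 = 4$, $m_2 = 2$ and $k_2 = 4$ are illustrative choices, not values taken from this work) computes a gist-like descriptor:

import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_descriptor(img, m1=4, m2=2, k2=4):
    """Gist-like descriptor of a grayscale image, d in R^(m1*m2*k2 x 1)."""
    features = []
    for s in range(m2):                       # (1) m2 different resolutions
        scaled = resize(img, (img.shape[0] // 2**s, img.shape[1] // 2**s))
        for o in range(m1):                   # (2) m1 Gabor orientations
            real, _ = gabor(scaled, frequency=0.25, theta=o * np.pi / m1)
            # (3) group the pixels into k2 horizontal blocks, one value each
            blocks = np.array_split(np.abs(real), k2, axis=0)
            features.extend(block.mean() for block in blocks)
    return np.asarray(features)               # (4) one row with all the data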
3.2 Methods based on Deep Learning
In recent years, the use of deep learning methods to solve computer vision problems has grown extensively. Regarding the localization task through the use of visual information, this work studies the use of Convolutional Neural Networks (CNNs) and the use of auto-encoders. The idea is to obtain vectors which characterize the images through some deep learning technique. On the one hand, these methods can be very interesting since their use can be focused on a specific kind of image (such as indoor environments in our case) and, hence, they can provide more efficient descriptors. On the other hand, these methods require prior training, which normally implies processing a huge amount of data and a noteworthy amount of time.
Regarding the use of CNNs, these networks have commonly been designed for classification. In this sense, (1) a set of correctly labeled images is collected and introduced into the network to tackle the learning process and, after that, (2) the network is ready to perform classification (a test image is used as input and the CNN outputs the most likely label). CNNs are composed of several hidden layers whose parameters and weights are tuned throughout the training iterations. In this work, the outputs of some hidden layers are used to obtain global appearance descriptors. This idea has already been proposed by some authors, such as Mancini et al. (Mancini et al., 2017), who used them to carry out place categorization with the Naïve Bayes classifier, or Payá et al. (Payá et al., 2018), who proposed CNN-based descriptors to create hierarchical visual models for mobile robot localization. The CNN architecture that has been used
in this work is places (Zhou et al., 2014), which was
trained with around 2.5 million images to categorize
205 possible kinds of scenes (no re-training is carried
out in this work). Fig. 1 shows the architecture of
the places CNN, which is based on the caffe CNN.
The net basically consists of (1) an input layer, (2) several intermediate hidden layers and (3) an output layer. Within the intermediate layers, the first stage comprises (2.1) layers for feature learning (these layers incorporate several filters, and the output they generate is used as the input of the next layer) and the second stage comprises (2.2) layers for classification (these layers are fully connected and they generate vectors which provide information for the classification).
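As a hedged sketch of this structure (written in PyTorch for illustration; the paper uses the caffe implementation, and the layer hyperparameters below follow the standard AlexNet-like caffe reference network, which is our assumption):

import torch.nn as nn

# (2.1) feature-learning layers followed by (2.2) fully connected layers
places_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),   # 'conv3'
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),   # 'conv4'
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),   # 'conv5'
    nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # 'fc6', output 4096 x 1
    nn.Linear(4096, 4096), nn.ReLU(),         # 'fc7', output 4096 x 1
    nn.Linear(4096, 205),                     # 'fc8', one score per scene class
)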
In this work, we have evaluated the output information from 5 layers: three fully connected layers ('fc6', 'fc7' and 'fc8'), whose output sizes are 4096 × 1, 4096 × 1 and 205 × 1 respectively. Moreover, we have obtained two descriptors from the output of 2D convolution layers ('conv4' and 'conv5'). These layers apply several sliding convolutional filters to the input image with the aim of activating certain characteristics of the image. Hence, the output of these layers is a set of images, which are versions of the input image after being filtered. Finally, a descriptor is obtained from these layers by selecting an image from this output set and arranging the data (matrix) in a single row (vector). Since the size of the output images is 13 × 13, the size of the descriptor is 169 × 1.
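A minimal sketch of this extraction, built on the places_like model above (the layer indices and the use of forward hooks are our assumptions; the paper works directly with the trained caffe network):

import torch

activations = {}

def save_output(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

places_like[8].register_forward_hook(save_output('conv4'))   # 13 x 13 maps
places_like[14].register_forward_hook(save_output('fc6'))    # 4096 x 1

with torch.no_grad():
    places_like(torch.randn(1, 3, 227, 227))   # placeholder test image

# select one 13 x 13 filtered image and arrange it in a single row (vector)
conv4_maps = activations['conv4'][0]           # shape (384, 13, 13)
descriptor = conv4_maps[0].reshape(-1)         # 169 x 1 descriptor
fc6_descriptor = activations['fc6'][0]         # 4096 x 1 descriptor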
As for the use of auto-encoders, the aim of these neural networks is to reconstruct their input at the output after compressing it into a latent-space representation (Hubens, 2018). Fig. 2 shows the architecture of the auto-encoders. These networks first compress the input (encoding) and then reconstruct it starting from the latent-space representation (decoding). The idea consists of building a latent representation that provides useful features with a small dimension, i.e., training the auto-encoder to extract the most salient features. For example, Gao and Zhang (Gao and Zhang, 2017) used auto-encoders to detect loops for visual Simultaneous Localization And Mapping (SLAM).
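To illustrate the encoding-decoding idea, the following is a minimal sparse auto-encoder sketch (PyTorch and the layer sizes are our assumptions, not the implementation used in the experiments); it reuses the training parameters detailed in the next paragraph: $L_2$ weight regularization 0.004, sparsity proportion 0.15, sparsity weight 4 and a logistic sigmoid encoder:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # latent-space representation (descriptor)
        return self.decoder(z), z    # reconstruction of the input + code

model = AutoEncoder(input_dim=64 * 64, latent_dim=100)   # sizes are assumptions
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.004)  # L2 reg.

def loss_fn(x, x_rec, z, rho=0.15, beta=4.0):
    """Reconstruction error plus a penalty pushing mean activations to rho."""
    mse = nn.functional.mse_loss(x_rec, x)
    rho_hat = z.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return mse + beta * kl

Once trained, only the encoder would be kept: the latent vector z would act as the global-appearance descriptor of each image.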
For this experiment, two types of auto-encoder are proposed. Both have been trained using the same parameters (coefficient for the $L_2$ weight regularizer, 0.004; coefficient that controls the impact of the sparsity regularizer, 4; desired proportion of training examples a neuron reacts to, 0.15; encoder transfer function, "logistic sigmoid function"; and maximum number of training epochs, 1000) and both have also been trained using a GPU (NVIDIA GeForce GTX 1080 Ti). However, whereas the first option (auto-enc-Frib) is trained with the images obtained from the dataset used to evaluate the localization (explained in sec. 4), the second alternative (auto-enc-SUN) is trained with images obtained from a dataset (SUN 360 DB (Xiao et al., 2012)) which contains generic panoramic images. The aim of this second option is to create a generic auto-encoder, based on indoor panoramic images, which provides a good-enough solution to obtain descriptors for panoramic images independently of the environment. This solution would solve the handicap introduced by the descrip-