COMPARISON OF GLOBAL-APPEARANCE TECHNIQUES
APPLIED TO VISUAL MAP BUILDING AND LOCALIZATION
Extracting the Most Relevant Information from Panoramic Images
Francisco Amorós, Luis Payá, Oscar Reinoso and Luis M. Jiménez
Departamento de Ingeniería de Sistemas y Automática,
Miguel Hernández University, Avda. de la Universidad s/n, 03202, Elche (Alicante), Spain
Keywords:
Robot Mapping, Appearance-based Methods, Omnidirectional Vision, Spatial Localization.
Abstract:
Techniques based on the global appearance of visual information have proved to be a robust alternative in the
field of robotic mapping and localization. However, they present some critical issues that must be studied
when trying to build an application that works in real time. In this paper, we review and compare several methods to build a global descriptor of panoramic scenes, and we study the critical parameters that determine whether they are applicable in real mapping and localization tasks, such as invariance against rotations, computational cost and accuracy in robot localization. All the experiments have been carried out with omnidirectional images captured in a real environment under realistic lighting conditions.
1 INTRODUCTION
When a robot or a team of robots has to carry out a
task that implies autonomous navigation through an
environment, an internal representation of this envi-
ronment is needed. This representation has to allow
the robot to estimate its position and orientation using
the information provided by the sensors it is equipped
with. Omnidirectional visual systems are commonly
used with this goal due to the richness of the informa-
tion they provide and the relatively low cost they have.
Classical research on mobile robots equipped with vision systems has focused on local feature descriptors, extracting natural or artificial landmarks from the image to build the map and carry out the localization of the robot (Thrun, 2003).
Recent approaches propose processing the image
as a whole without local feature extraction. These
appearance-based techniques are interesting when
dealing with unstructured environments where it may
be hard to find patterns to recognize the scene. However, they require working with a large amount of information, which entails a high computational cost. That is the reason why compression techniques must be studied.
The localization task requires techniques that
present rotational invariance in order to recognise the
most similar image regardless of the robot’s orienta-
tion in the ground plane. However, some orientation information is also necessary to estimate the pose of the robot. Incremental methods are also advisable, since some navigation tasks require adding or modifying elements of the map as the robot moves through the environment.
Several approaches to compress the visual infor-
mation can be found in the literature. For example,
PCA (Principal Components Analysis) has demonstrated robustness when applied to image processing, as (Krose et al., 2007) shows. Other authors use the
Fourier Transform to extract the most relevant infor-
mation of an image (Menegatti et al., 2004).
(Paya et al., 2009) present a comparative study of appearance-based techniques. We complement that study, taking into account three methods: Fourier Signature, Rotational PCA and Gist-Gabor. The last technique has shown promising results in previous works, although, to our knowledge, it has not been previously used in localization and mapping tasks.
2 REVIEW OF COMPRESSION
TECHNIQUES
In this section we summarize some techniques to ex-
tract the most relevant information from a database
made up of panoramic images.
395
Amorós F., Payá L., Reinoso O. and Jiménez L. M.
COMPARISON OF GLOBAL-APPEARANCE TECHNIQUES APPLIED TO VISUAL MAP BUILDING AND LOCALIZATION.
DOI: 10.5220/0003864703950398
In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 395-398
ISBN: 978-989-8565-04-4
Copyright © 2012 SCITEPRESS (Science and Technology Publications, Lda.)
2.1 Fourier-based Techniques
As shown in (Menegatti et al., 2004) it is possible to
represent an image using the Discrete Fourier Trans-
form of each row. Taking profit of the Fourier Trans-
form properties, since the most relevant information
concentrates in the low frequency components of the
sequence, we keep the first coefficients to represent
each row. Moreover, as we work with omnidirectional
images, when the Fourier Transform of each row is
computed, another interesting property appears: rota-
tional invariance. Comparing the transform of a row
and the transform of the same sequence rotated, the
modulus are the same and just the phase changes.
The modulus let the position estimation, and with the
phase coefficientswe can find out the relative rotation.
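This rotational-invariance property is easy to verify numerically. The following sketch (an illustration, not the authors' implementation; it assumes NumPy and a synthetic random panorama) computes a row-wise Fourier Signature and checks that a rotation changes only the phases:

```python
import numpy as np

def fourier_signature(img, k):
    # Row-wise DFT, keeping the k lowest-frequency coefficients per row.
    return np.fft.fft(img, axis=1)[:, :k]

# A rotation of a panoramic image is a circular shift of its columns.
rng = np.random.default_rng(0)
img = rng.random((4, 64))               # 4 rows, 64 columns
rot = np.roll(img, 16, axis=1)          # quarter-turn rotation

sig_a = fourier_signature(img, 8)
sig_b = fourier_signature(rot, 8)

# Moduli are rotation-invariant; the phases encode the rotation.
assert np.allclose(np.abs(sig_a), np.abs(sig_b))
```

With W columns, a shift of s columns multiplies the k-th coefficient of each row by e^(-2*pi*i*k*s/W), which is the property exploited later to recover the robot's orientation.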
2.2 PCA-based Techniques
PCA-based techniques have proved to be a very useful compression method (Krose et al., 2007). They make it possible, given a set of N images with M pixels each, I_j ∈ R^(M×1), j = 1…N, to transform each image into a feature vector (also named the projection of the image) p_j ∈ R^(k×1), j = 1…N, where the k PCA features contain the most relevant information of the image, with k ≪ N. However, if we apply PCA directly
over the matrix that contains the images, we obtain a
database with information of just one orientation of
each scene. To solve this problem, in (Leonardis and
Jogan, 2000) the use of the Eigenspace of Spining-
Images is proposed. This technique creates a set of
spinning images from every image included in the
map. After that, the database is compressed by means
of PCA analysis. The robustness in localization and
angular resolution of the map depends on the number
of rotated siblings of each image we include.
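A minimal sketch of the spinning-images idea follows (illustrative only, assuming NumPy and synthetic images; the rotation count r and feature count k are arbitrary choices, not values from the paper):

```python
import numpy as np

def spinning_database(images, r):
    # Expand each panorama into r rotated siblings (circular column
    # shifts) and stack all of them as row vectors.
    rows = []
    for img in images:
        w = img.shape[1]
        for i in range(r):
            rows.append(np.roll(img, i * w // r, axis=1).ravel())
    return np.array(rows)

def pca_compress(X, k):
    # Project the mean-centred rows of X onto the first k principal axes.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:k].T                        # M x k eigenvector matrix
    return (X - mean) @ V, V, mean

rng = np.random.default_rng(1)
imgs = [rng.random((8, 32)) for _ in range(5)]    # 5 synthetic panoramas
X = spinning_database(imgs, r=4)                  # 5 x 4 = 20 rows
proj, V, mean = pca_compress(X, k=6)              # 20 feature vectors
```

Increasing r improves the angular resolution of the map at the cost of a larger matrix to decompose, which is the trade-off noted above.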
2.3 Gist-based Techniques
Gist is another concept that can be used to compress visual information, as (Friedman, 1979) details. It can be defined as an abstract representation that activates the memory of scene categories. Gist descriptors try to obtain the essential information of the image by simulating the human perception system, i.e., identifying a scene through its colour or remarkable structures while avoiding the representation of specific objects. In (Oliva and
Torralba, 2001) this idea is developed under the name
of holistic representation of the spatial envelope to
create a descriptor. In (Torralba, 2003) this model is
computed using global scene features, such as spa-
tial frequencies and different scales based on Gabor
filtering. Although it has demonstrated its capacity
for scene recognition and classification, we have not
found any reference of applications in robotic map-
ping and localization tests. The descriptor we propose
is named Gist-Gabor since it uses Gabor filtering in
order to obtain frequency and orientation information
using the global image.
The first step consists in creating a bank of the
Gabor masks with different resolutions and orienta-
tions. Then, the image is filtered with the set of fil-
ters. The results encode different structural informa-
tion. To create the descriptor, we calculate the average pixel value within horizontal cells spanning the full width of the omnidirectional image, obtaining an array of rotationally invariant characteristics. To find the relative orientation between two rotated images, vertical windows with the image's height are used, making up a vector with the mean value of the pixels each window contains. By circularly shifting the order of its components and comparing with the database, we estimate the orientation.
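The descriptor construction can be sketched as follows (an illustration under simplifying assumptions: a hand-rolled real Gabor kernel, FFT-based circular filtering that matches the wrap-around of panoramic images, and arbitrary mask and cell counts):

```python
import numpy as np

def gabor_kernel(size, freq, theta):
    # Real Gabor kernel: Gaussian envelope times an oriented cosine wave.
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2 * (size / 4) ** 2))
    return env * np.cos(2 * np.pi * freq * xr)

def gist_gabor(img, kernels, n_cells):
    # Filter the panorama with each mask (circular convolution via FFT),
    # then average each response over horizontal cells that span the
    # full image width.
    feats = []
    for ker in kernels:
        resp = np.real(np.fft.ifft2(np.fft.fft2(img) *
                                    np.fft.fft2(ker, s=img.shape)))
        feats.extend(band.mean()
                     for band in np.array_split(resp, n_cells, axis=0))
    return np.array(feats)

rng = np.random.default_rng(2)
pano = rng.random((32, 128))
kernels = [gabor_kernel(9, 0.2, t) for t in (0, np.pi / 4, np.pi / 2)]
d = gist_gabor(pano, kernels, n_cells=4)                # 3 masks x 4 cells
d_rot = gist_gabor(np.roll(pano, 32, axis=1), kernels, n_cells=4)
assert np.allclose(d, d_rot)   # full-width cells give rotational invariance
```

Because the cells span the full image width and the filtering is circular, the averaged responses do not change when the panorama is rotated, matching the rotational invariance claimed above.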
3 LOCALIZATION AND
ORIENTATION RECOVERING
In this section we assess each algorithm by calculating the pose of the robot within a previously created map, together with the time each one takes. The image database used to carry out the experiments belongs to the Faculty of Technology of Bielefeld University (Moeller et al., 2007). It was collected in three different living spaces under realistic illumination conditions. The images are structured in a 10x10 cm rectangular grid. In the experiments, we have varied the distance between the images of the database when building the map to simulate different conditions. Table 1 shows the information of the different grids.
Table 1: Grid size and number of images selected.

            Grid A   Grid B   Grid C   Grid D
Distance     10 cm    20 cm    30 cm    40 cm
Images         746      204       92       54
The test set is made up of all the available images in the database plus 15 artificial rotations of each one (every 22.5°), excluding the images included in the map. The simulations have been run using Matlab R2009b under Mac OS X. The position retrieval accuracy is assessed as a binary result, considering whether or not we obtain the best possible match, and the information is shown with recall and precision measurements (Gil et al., 2009), indicating whether a correct location is the Nearest Neighbour (N.N.), i.e., the first result selected, or among the Second or Third Nearest Neighbours (S.N.N. or T.N.N.).
Figure 1: Elapsed time for (a) location and (b) pose estimation using Fourier-based algorithm. Elapsed time for (c) location
and (d) pose estimation using PCA-based algorithm. Elapsed time for (e) location and (f) pose estimation using Gist-Gabor.
Recall-Precision charts using grid D for (g) Fourier-based algorithm, (h) PCA-based algorithm and (i) Gist-Gabor. Phase error
over correct locations using Grid D for (j) Fourier-based algorithm, (k) PCA-based algorithm and (l) Gist-Gabor.
Regarding the rotation, we represent the accuracy of the results in bar graphs depending on how much they differ from the correct values, as a percentage over the correct locations. In order to avoid redundant information, we include only the pose estimation experiments in the most critical case, i.e., using Grid D.
3.1 Fourier Signature Technique
The map obtained with the Fourier Signature is represented by two matrices: the moduli and the phases of the selected Fourier coefficients. The location is estimated by calculating the minimum Euclidean distance between the power spectrum of the image and the spectra of the map. The phase vector associated with the most similar image retrieved from the map is used to compute the orientation. In fig.1(a) we can
see that, to find the position, the elapsed time rises
in accordance with the number of images the map
stores, i.e. the grid, and the number of Fourier com-
ponents. But the pose depends almost only on the
number of coefficients per row (fig.1(b)). This is due
to the orientation estimation, since it is the computa-
tionally heaviest part of the algorithm and it depends
only on the number of components we use. Regarding
the position recovery, fig.1(g) shows that the algorithm is able to find the best match using a relatively low number of Fourier components. The phase error appears in fig.1(j): the algorithm is able to recover the orientation using 10 components with an error less than or equal to 5 degrees in 92% of the correct locations.
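This localization step can be sketched as follows (illustrative, assuming NumPy, synthetic map images and an exact artificial rotation): the position follows from the moduli, and the orientation from the phase of the first harmonic.

```python
import numpy as np

def localize(test_sig, map_sigs):
    # Nearest map image by Euclidean distance between magnitude spectra.
    dists = [np.linalg.norm(np.abs(test_sig) - np.abs(m)) for m in map_sigs]
    return int(np.argmin(dists))

def orientation(test_sig, map_sig, width):
    # Relative rotation (in columns) from the phase difference of the
    # first harmonic, averaged over the image rows.
    dphi = np.angle(test_sig[:, 1] * np.conj(map_sig[:, 1]))
    return (-np.mean(dphi) * width / (2 * np.pi)) % width

rng = np.random.default_rng(3)
map_imgs = [rng.random((4, 64)) for _ in range(3)]
map_sigs = [np.fft.fft(m, axis=1)[:, :10] for m in map_imgs]

test = np.roll(map_imgs[1], 16, axis=1)            # image 1 rotated 90 deg
test_sig = np.fft.fft(test, axis=1)[:, :10]

best = localize(test_sig, map_sigs)                # -> 1
shift = orientation(test_sig, map_sigs[best], 64)  # ≈ 16 columns
```

Averaging the per-row phase differences makes the estimate less sensitive to noise in any single row; note the wrap-around ambiguity of np.angle limits unambiguous shifts to half the image width when only one harmonic is used.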
3.2 PCA-based Techniques
When a new image I ∈ R^(M×1) arrives, it is projected onto the eigenspace: p = V^T · I, with p ∈ R^(k×1). The location is estimated by computing the modulus of p and comparing it with the moduli of the projections of the map. The criterion is the minimum Euclidean distance. Once the position is known, we use the phase vector p_ph to simulate the projections of the rotated siblings of the image and so determine the orientation.
Fig. 1(c) and (d) show the time spent on location and
pose estimation. Comparing both charts we can see
that, except in Grid A, the measurements are similar,
demonstrating that the phase recovering is quite fast.
Even so, this algorithm is the slowest in the majority
of the experiments. Fig.1(h) shows that with 16 rotations and 100 eigenvectors the position estimation presents good accuracy. Fig.1(k) shows that, with 16 rotations, the percentage of experiments with an error of one degree or less is 86%, although in the rest of the experiments the error is greater than 10 degrees.
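The projection-and-matching step can be sketched like this (illustrative only; the map is a synthetic matrix, and the comparison shown uses the Euclidean distance between projection vectors):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, k = 256, 20, 6                      # pixels, map images, PCA features
X = rng.random((N, M))                    # map images as row vectors
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
V = Vt[:k].T                              # M x k eigenvector matrix
map_proj = (X - mean) @ V                 # compressed map

# A new image arrives: project it and pick the closest map projection.
new_img = X[7]                            # exact copy of map image 7
p = V.T @ (new_img - mean)                # p = V^T · I (mean-centred)
best = int(np.argmin(np.linalg.norm(map_proj - p, axis=1)))
assert best == 7
```

Note that the SVD over the whole (possibly spinning-image) database is the expensive, non-incremental step; projecting and matching a single new image is cheap by comparison.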
3.3 Gist-based Techniques
To extract the information from a test image, we filter it with the same Gabor masks used to build the map. The maximum number of spatial scales used is two. After that, we compute the descriptor using the same horizontal and vertical cells as in the map. The elapsed time in the position recovery (fig. 1(e)) depends on the number of Gabor masks used to filter the image. Fig. 1(f) shows the relationship between the elapsed time in pose estimation and the orientation parameters; the number of vertical cells influences the results more than their width does. The
position estimation presents good accuracy with few
masks (fig. 1(i)). The phase retrieval results appear in
fig.1(l). The descriptor is able to estimate the orienta-
tion of almost all the experiments without error using
16 vertical cells. However, these results are quantized, since the angle is discretized according to the number of vertical cells applied to the image.
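The orientation step reduces to a circular shift search over the vertical-cell vector. A sketch (illustrative; 16 cells match the 22.5° angular resolution of the test-set rotations):

```python
import numpy as np

def vertical_cells(img, n_cells):
    # Mean pixel value inside vertical windows of full image height.
    return np.array([c.mean() for c in np.array_split(img, n_cells, axis=1)])

def estimate_rotation(test_vec, map_vec):
    # Best circular shift of the cell vector, in cells.
    n = len(map_vec)
    errs = [np.linalg.norm(np.roll(test_vec, -s) - map_vec) for s in range(n)]
    return int(np.argmin(errs))

rng = np.random.default_rng(4)
img = rng.random((32, 128))
v_map = vertical_cells(img, 16)                        # 16 cells of 8 px
v_test = vertical_cells(np.roll(img, 24, axis=1), 16)  # rotate by 3 cells
cells = estimate_rotation(v_test, v_map)               # -> 3
angle = cells * 360 / 16                               # -> 67.5 degrees
```

This makes the angle quantization explicit: with 16 cells the estimate can only take multiples of 22.5°, and finer resolution requires more cells and thus more time and memory, as discussed in the conclusions.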
4 CONCLUSIONS
In this paper we have presented a comparison of different appearance-based algorithms applied to the creation of a descriptor using panoramic images. We have studied the elapsed time and the accuracy of the pose estimation with respect to a previously created map.
All of them have proved to be valid for estimating the pose of a robot within the map. However, when the number of images included in the map grows, the computational cost of the PCA descriptor can make its application unfeasible. Moreover, it is a non-incremental method.
Regarding the elapsed time, rotational PCA exceeds the other methods. Gist-Gabor takes longer than the Fourier Signature, and it is more dependent on the quantity of information it stores, i.e. the number of masks used to filter the image. The three algorithms present a high rate of retrieved positions, with the Fourier Signature standing out.
In the orientation estimation task, the PCA technique has the lowest accuracy. Although Gist-Gabor outperforms the Fourier Signature, its angle estimation is quantized according to the number of cells used, so time and memory consumption may grow when higher accuracy is needed.
Finally, this paper demonstrates once more the possibilities that appearance-based techniques offer. The results achieved encourage us to keep studying new possibilities and deepening their development, looking for new techniques and improving their robustness to illumination changes, noise and occlusions.
ACKNOWLEDGEMENTS
This work has been supported by the Spanish govern-
ment through the project DPI2010-15308.
REFERENCES
Friedman, A. (1979). Framing pictures: The role of knowl-
edge in automatized encoding and memory for gist.
In Journal of Experimental Psychology: General,
108:316-355.
Gil, A., Martinez, O., Ballesta, M., and Reinoso, O. (2009).
A comparative evaluation of interest point detectors
and local descriptors for visual slam. SPRINGER Ma-
chine Vision and Applications.
Krose, B., Bunschoten, R., Hagen, S., Terwijn, B., and
Vlassis, N. (2007). Visual homing in environments with
anisotropic landmark distribution. In Autonomous
Robots, 23(3), 2007, pp. 231-245.
Leonardis, A. and Jogan, M. (2000). Robust localization
using eigenspace of spinning-images. In IEEE Work-
shop Omnidirectional Vision. Proceedings of the IEEE
Workshop on Omnidirectional Vision, IEEE Com-
puter Society,pp. 37-44.
Menegatti, E., Maeda, T., and Ishiguro, H. (2004). Image-
based memory for robot navigation using properties of
omnidirectional images. In Robotics and Autonomous
Systems. Vol. 47, No. 4, pp. 251-276.
Moeller, R., Vardy, A., Kreft, S., and Ruwisch, S. (2007).
Visual homing in environments with anisotropic land-
mark distribution. In Autonomous Robots, 23(3),
2007, pp. 231-245.
Oliva, A. and Torralba, A. (2001). Modeling the shape of
the scene: a holistic representation of the spatial en-
velope. In International Journal of Computer Vision,
Vol. 42(3): 145-175.
Paya, L., Fernandez, L., Reinoso, O., Gil, A., and Ubeda,
D. (2009). Appearance-based dense maps cre-
ation: Comparison of compression techniques with
panoramic images. In 6th Int Conf on Informatics
in Control, Automation and Robotics. Ed. INSTICC
PRESS ISBN: 978-989-674-000-9 - pp.250-255.
Thrun, S. (2003). Robotic mapping: A survey. In Exploring
Artificial Intelligence in the New Millennium, pp. 1-35.
Morgan Kaufmann Publishers, San Francisco, USA.
Torralba, A. (2003). Contextual priming for object detec-
tion. In International Journal of Computer Vision, Vol.
53(2), 169-191.