descriptor computed from its visual appearance (Gil
et al., 2010b). In this approach, a Rao-Blackwellized particle filter is used to estimate the map and the paths followed by the robots. Other approaches propose the
use of a single camera to estimate a map of 3D vi-
sual landmarks and the 6 DOF pose of the camera
(Davison and Murray, 2002) with an EKF-SLAM al-
gorithm. Each visual landmark is detected with the
Harris corner detector (Harris and Stephens, 1988)
and described with a grey level patch. Since the dis-
tance to the visual landmarks cannot be measured di-
rectly with a single camera, the initialization of the
3D coordinates of a landmark poses a problem. This
fact inspired the inverse depth parametrization presented in (Civera et al., 2008). A variation of the In-
formation Filter is used in (Joly and Rives, 2010) to
estimate a visual map using a single omnidirectional
camera and an inverse depth parametrization of the
landmarks. Also, in (Jae-Hean and Myung Jin, 2003) two omnidirectional cameras are combined to obtain a wide field-of-view stereo vision sensor. In (Scaramuzza, 2011), the computation of the essential matrix between two views allows the relative motion between the two camera poses to be extracted, yielding a visual odometry. According to (Andrew J. Davison et al., 2004), the performance of single-camera SLAM improves when a wide field-of-view lens is used, which suggests the use of an omnidirectional camera in visual SLAM, since in that case the horizontal field of view is maximal.
The approach presented in this paper assumes that the mobile robot is equipped with a single omnidirectional camera. As shown in Figure 1(a), the optical axis of the camera points upwards; thus, a rotation of the robot moving on a plane is equivalent to a shift along the columns of the captured panoramic image.
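As a minimal illustration of this property (the helper below is ours, not part of the paper's implementation), a yaw rotation of the robot can be simulated by a circular shift of the columns of the panoramic image, assuming the panorama spans the full 360 degrees horizontally:

    import numpy as np

    def shift_panorama_for_yaw(panorama: np.ndarray, yaw_rad: float) -> np.ndarray:
        # Simulate a pure yaw rotation of the robot as a circular column shift.
        # Assumes the panorama covers 360 degrees horizontally, so a yaw of
        # 2*pi maps to a shift of exactly `width` columns.
        width = panorama.shape[1]
        shift = int(round(yaw_rad / (2.0 * np.pi) * width))
        return np.roll(panorama, shift, axis=1)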
In this paper a different representation of the environment is proposed. To date, most work in visual SLAM has dealt with estimating the position of a set of visual landmarks expressed in a global reference frame (Davison and Murray, 2002; Gil et al., 2010b; Andrew J. Davison et al., 2004; Civera et al., 2008). In this work, instead, the map is formed by the position and orientation of a set of views of the environment. Each view is composed of an omnidirectional image captured at a particular position, its orientation in the environment and a set of interest points extracted from the image. Each interest point is accompanied by a visual descriptor that encodes the visual appearance of the point. With this information stored in the map, we show how the robot is able to build the map and localize itself within it. When the robot moves in the neighbourhood of any view stored in the map and captures an image with the camera, a set of interest points is matched between the current image and the view, which yields a set of correspondences between the images. This information allows us to compute the relative movement between the images: the rotation between the images can be computed unambiguously, and the translation up to a scale factor. To obtain these measurements we rely on a modification of the Seven Point Algorithm (Scaramuzza et al., 2009; Scaramuzza, 2011). This idea is represented in Figure 1(b), which shows two omnidirectional images with some of their correspondences indicated, together with the transformation between both reference systems.
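To illustrate how such a measurement can be obtained, the sketch below estimates the relative rotation and the translation direction from matched unit bearing vectors of two omnidirectional views. It is a generic linear essential-matrix estimate followed by the standard decomposition, not the modified Seven Point Algorithm used here, and the cheirality check needed to select among the four candidate decompositions is omitted:

    import numpy as np

    def relative_pose_from_bearings(b1: np.ndarray, b2: np.ndarray):
        # b1, b2: N x 3 arrays of matched unit bearing vectors (N >= 8).
        # Linear system b2_i^T E b1_i = 0 for all correspondences.
        A = np.stack([np.kron(b2[i], b1[i]) for i in range(len(b1))])
        _, _, Vt = np.linalg.svd(A)
        E = Vt[-1].reshape(3, 3)
        # Project onto the manifold of valid essential matrices.
        U, _, Vt = np.linalg.svd(E)
        E = U @ np.diag([1.0, 1.0, 0.0]) @ Vt
        # One of the four candidate decompositions (cheirality check omitted).
        W = np.array([[0.0, -1.0, 0.0],
                      [1.0,  0.0, 0.0],
                      [0.0,  0.0, 1.0]])
        R = U @ W @ Vt
        if np.linalg.det(R) < 0:   # keep a proper rotation
            R = -R
        t = U[:, 2]                # translation direction, up to sign and scale
        return R, t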
The computation of the transformation relies on the existence of a set of corresponding points; thus, when the robot moves away from the position of the stored view, the appearance of the scene will vary and it may become difficult to find any corresponding points. In this case, a new view is created at the current position of the robot. The newly initialized view then allows the robot to localize itself in its neighbourhood. It is worth noting that a visual landmark corresponds to a physical point, such as a corner on a wall, whereas a view represents the visual information obtained from a particular pose in the environment. In this sense, a view is an image captured from a pose in the environment, associated with a set of 2D points extracted from it. In the experiments we rely on SURF features (Bay et al., 2006) for the detection and description of the points.
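To make the structure of the map concrete, the sketch below (field names and the Hessian threshold are illustrative choices of ours, not taken from the paper) stores one view as its planar pose together with the SURF keypoints and descriptors detected in its omnidirectional image; it assumes opencv-contrib-python built with the non-free modules:

    from dataclasses import dataclass
    import numpy as np
    import cv2

    @dataclass
    class View:
        position: np.ndarray     # (x, y) of the robot when the image was captured
        orientation: float       # heading (rad) of the robot at capture time
        keypoints: np.ndarray    # N x 2 image coordinates of the interest points
        descriptors: np.ndarray  # N x 64 SURF descriptors

    def create_view(image, position, orientation, hessian_threshold=400):
        # SURF detection and description (requires the OpenCV non-free modules).
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
        keypoints, descriptors = surf.detectAndCompute(image, None)
        points = np.array([kp.pt for kp in keypoints], dtype=np.float32)
        return View(np.asarray(position, dtype=float), float(orientation),
                    points, descriptors)

A View of this kind would be initialized whenever too few correspondences with the already stored views can be found, as described above.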
We consider that the map representation introduced here presents some advantages over previous visual SLAM approaches. The most important is the compactness of the representation of the environment. For example, in (Civera et al., 2008) an Extended Kalman Filter (EKF) is used to estimate the position of the visual landmarks as well as the position and orientation of the camera. With this representation, each visual landmark is described by six variables, so the state vector of the EKF grows rapidly as the number of visual landmarks increases. This fact poses a challenge for most existing SLAM approaches. In contrast, the algorithm presented here estimates only the pose of a reduced set of views. Each view thus encapsulates the information of a particular area of the environment in the form of several interest points detected in the image. Typically, as will be shown in the experiments, a single view retains a sufficient number of interest points for localization to be performed in its neighbourhood.
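As a rough, purely illustrative comparison of the two representations (the figures below are our own and assume a planar (x, y, theta) pose per view), the state dimension of a landmark-based EKF grows with six variables per landmark, whereas the view-based state grows only with the number of views:

    def ekf_landmark_state_size(num_landmarks, per_landmark=6, camera_pose=6):
        # e.g. inverse-depth parametrization: six variables per landmark
        return camera_pose + per_landmark * num_landmarks

    def view_based_state_size(num_views, per_view=3, robot_pose=3):
        # only the planar pose (x, y, theta) of each view enters the filter
        return robot_pose + per_view * num_views

    print(ekf_landmark_state_size(300))   # 1806 variables
    print(view_based_state_size(10))      # 33 variables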
Storing only a set of views of the environment has,
however, some drawbacks. We have to face the prob-