image to describe a place (for example a room).
When a new panorama of features is acquired, it is
compared to all the stored panoramas of the map, and
the most similar one is selected as the location of the
robot. Using different types of feature detectors and
descriptors simultaneously increases the probability
of finding good correspondences, but at the same time
can cause other problems, such as more processing
time and more false correspondences. As a means
to improve the results of their approach, this
article compares several of these covariant region
detectors and descriptors. Our objective is to
evaluate the performance of different combinations of
these methods in order to find the best one for visual
navigation of an autonomous robot. The results of
this comparison will reflect the performance of these
detectors and descriptors under severe changes in
the point of view in a real office environment. With
these results, we intend to find the combination of
detectors and descriptors that gives the best results
with widely separated views.
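The localization step described at the start of this paragraph (comparing a new panorama with every stored one and keeping the most similar) can be sketched as follows. This is only an illustrative sketch: the similarity measure (number of descriptor matches passing a nearest-neighbour distance-ratio test), the `ratio` value, the function names and the toy 2-D descriptors are assumptions made here, not details taken from the approach under discussion.

```python
from math import dist

def count_matches(query, stored, ratio=0.8):
    """Count query descriptors whose nearest stored descriptor is clearly
    closer than the second nearest (distance-ratio test, assumed here)."""
    matches = 0
    for q in query:
        d = sorted(dist(q, s) for s in stored)
        if len(d) >= 2 and d[0] < ratio * d[1]:
            matches += 1
    return matches

def localize(query, panorama_map):
    """Return the name of the stored panorama with the most matches."""
    return max(panorama_map,
               key=lambda name: count_matches(query, panorama_map[name]))

# Hypothetical map with toy 2-D descriptors; real descriptors are
# high-dimensional vectors.
panorama_map = {
    "office": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "corridor": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)],
}
new_panorama = [(0.1, 0.0), (0.9, 0.1), (5.0, 5.1)]
print(localize(new_panorama, panorama_map))  # office
```

Because every stored panorama is visited, the cost of this step grows with both the map size and the number of features per panorama, which is one reason the choice of detectors matters.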
The remainder of the paper is organized as follows.
Section 2 provides some background information on
affine covariant region detectors and descriptors.
Section 3 explains the experimental setup used in the
comparison and Section 4 presents the results
obtained. Finally, Section 5 closes the paper with
the conclusions.
2 DETECTORS AND DESCRIPTORS
Affine covariant regions can be defined as sets of
pixels with high information content, which usually
correspond to local extrema of a function over the
image. A requirement for this type of region is that
it be covariant with the transformations introduced
by changes in the point of view, which makes such
regions well suited for tasks where corresponding
points between different views of a scene have to be
found. In addition, their local nature makes them
resistant to partial occlusion and background clutter.
Various affine covariant region detectors have
been developed recently. Different methods detect
different types of features: for example,
Harris-Affine detects corners while Hessian-Affine
detects blobs. Consequently, multiple region
detectors can be used simultaneously to increase the
number of detected features and thus of potential
matches.
However, using several region detectors can also
introduce new problems. In applications such as
VSLAM, storing an arbitrary number of different
affine covariant region types can considerably
increase the size of the map and the computational
time needed to manage it. Another problem may arise
if one of the region detectors or descriptors
produces a large number of false matches, as the
mismatches can confuse the model fitting method and
lead to a worse estimate.
Recently, Mikolajczyk et al. (Mikolajczyk et al.,
2005) reviewed the state of the art of affine
covariant region detectors individually. Based on
their work, we have chosen three types of affine
covariant region detectors for our evaluation of
combinations: Harris-Affine, Hessian-Affine and MSER
(Maximally Stable Extremal Regions). These three
region detectors have a good repeatability rate and a
reasonable computational cost.
Harris-Affine first detects Harris corners in
scale-space using the approach proposed by Lindeberg
(Lindeberg, 1998). Then the parameters of an
elliptical region are estimated by minimizing the
difference between the eigenvalues of the
second-order moment matrix of the selected region.
This iterative procedure finds an isotropic region,
which is covariant under affine transformations.
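To make the isotropy criterion concrete, the sketch below computes an unweighted second-order moment matrix for a patch and the ratio of its eigenvalues. It is a simplified illustration only: the actual detector uses Gaussian-weighted derivatives in scale-space and iteratively reshapes the region, which is not reproduced here. The eigenvalue ratio as the isotropy measure, the function names and the synthetic patches are assumptions of this sketch.

```python
def second_moment_matrix(patch):
    """Sum of gradient outer products over the interior of a patch
    (list of rows). Returns (a, b, c) for the matrix [[a, b], [b, c]]."""
    a = b = c = 0.0
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0  # d/dx
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0  # d/dy
            a += ix * ix
            b += ix * iy
            c += iy * iy
    return a, b, c

def isotropy(a, b, c):
    """Ratio lambda_min / lambda_max of [[a, b], [b, c]].
    A value of 1.0 means both eigenvalues are equal (isotropic)."""
    mean = (a + c) / 2.0
    radius = (((a - c) / 2.0) ** 2 + b * b) ** 0.5
    lam_max, lam_min = mean + radius, mean - radius
    return lam_min / lam_max if lam_max > 0 else 0.0

# Synthetic 7x7 patches: one rotationally symmetric, one stretched in y.
round_patch = [[x * x + y * y for x in range(-3, 4)] for y in range(-3, 4)]
stretched_patch = [[x * x + 4 * y * y for x in range(-3, 4)]
                   for y in range(-3, 4)]

print(isotropy(*second_moment_matrix(round_patch)))      # 1.0
print(isotropy(*second_moment_matrix(stretched_patch)))  # 0.0625
```

In the iterative procedure, a low ratio such as the second one would indicate that the elliptical region still needs to be adapted before it can be considered isotropic.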
Hessian-Affine is similar to Harris-Affine,
but the detected regions are blobs instead of corners.
Local maxima of the determinant of the Hessian
matrix are used as base points, and the remainder of
the procedure is the same as in Harris-Affine.
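The base-point selection can be illustrated at a single scale with finite differences. This is a minimal sketch under stated assumptions: the real detector evaluates Gaussian derivatives across scale-space, while the fixed scale, the 8-neighbour maximum test, the threshold and the synthetic Gaussian blob below are choices made here for illustration.

```python
import math

def hessian_determinant(img):
    """Determinant of the finite-difference Hessian at each interior
    pixel of img (list of rows). Border values are left at 0."""
    h, w = len(img), len(img[0])
    det = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ixx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
            iyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
            ixy = (img[y + 1][x + 1] - img[y + 1][x - 1]
                   - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
            det[y][x] = ixx * iyy - ixy * ixy
    return det

def local_maxima(det, threshold=0.0):
    """Interior pixels (x, y) whose value exceeds the threshold and
    all eight neighbours."""
    peaks = []
    for y in range(1, len(det) - 1):
        for x in range(1, len(det[0]) - 1):
            v = det[y][x]
            neighbours = [det[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            if v > threshold and all(v > n for n in neighbours):
                peaks.append((x, y))
    return peaks

# Synthetic 7x7 image containing one Gaussian blob centred at (3, 3).
blob = [[math.exp(-((x - 3) ** 2 + (y - 3) ** 2) / 2.0) for x in range(7)]
        for y in range(7)]
print(local_maxima(hessian_determinant(blob)))  # [(3, 3)]
```

The determinant is positive for both bright and dark blobs, so the same test picks up either kind of structure.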
The Maximally Stable Extremal Regions detector
proposed by Matas et al. (Matas et al., 2002) detects
connected components where the intensity of the
pixels is several levels higher or lower than that of
all the neighboring pixels of the region.
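The idea of a region that stands out from its surroundings by several intensity levels can be illustrated by thresholding an image at increasing levels and tracking the area of one connected component. The sketch below is a toy illustration rather than the algorithm of Matas et al.: the flood fill, 4-connectivity, seed pixel, threshold values and synthetic image are all assumptions made here.

```python
def component_area(img, seed, threshold):
    """Area of the 4-connected component of pixels >= threshold that
    contains seed (x, y); 0 if the seed itself is below the threshold."""
    sx, sy = seed
    if img[sy][sx] < threshold:
        return 0
    seen = {seed}
    stack = [seed]
    while stack:
        x, y = stack.pop()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= ny < len(img) and 0 <= nx < len(img[0])
            if inside and (nx, ny) not in seen and img[ny][nx] >= threshold:
                seen.add((nx, ny))
                stack.append((nx, ny))
    return len(seen)

# Synthetic 5x5 image: a bright 2x2 block (200) on a dark background (50).
img = [[50] * 5 for _ in range(5)]
for y in (1, 2):
    for x in (1, 2):
        img[y][x] = 200

areas = [component_area(img, (1, 1), t) for t in (50, 100, 150, 200)]
print(areas)  # [25, 4, 4, 4]
```

The component keeps the same area over a wide range of thresholds, which is the sense in which such a region is considered stable.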
Matching local features between different views
implicitly involves the use of local descriptors. Many
descriptors with widely varying degrees of complexity
exist in the literature. The simplest descriptor
is the raw pixels of the region, but it is very
sensitive to noise and illumination changes. More
sophisticated descriptors make use of image
derivatives, gradient histograms, or information from
the frequency domain to increase robustness.
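The sensitivity of the raw-pixel descriptor to illumination can be seen in a small example. The sketch below compares the raw descriptor with a zero-mean, unit-length version of it under a change of gain and offset. The normalized variant is introduced here purely for illustration and is not one of the descriptors evaluated in this paper; the patch values and function names are likewise hypothetical.

```python
def raw_descriptor(patch):
    """Flatten the patch: the raw pixel values are the descriptor."""
    return [float(p) for row in patch for p in row]

def normalized_descriptor(patch):
    """Subtract the mean and scale to unit length, which cancels a
    gain and offset change of the intensities."""
    v = raw_descriptor(patch)
    mean = sum(v) / len(v)
    centered = [p - mean for p in v]
    norm = sum(c * c for c in centered) ** 0.5
    return [c / norm for c in centered] if norm > 0 else centered

def distance(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

patch = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
# Same scene content under different lighting: gain 2, offset 15.
brighter = [[2 * p + 15 for p in row] for row in patch]

raw_gap = distance(raw_descriptor(patch), raw_descriptor(brighter))
norm_gap = distance(normalized_descriptor(patch),
                    normalized_descriptor(brighter))
print(raw_gap, norm_gap)
```

The raw descriptors end up far apart even though the underlying patch is the same, while the normalized ones coincide up to rounding error.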
Recently, Mikolajczyk and Schmid published a
performance evaluation of various local descriptors
(Mikolajczyk and Schmid, 2005). In this review, more
than ten different descriptors are compared under
affine transformations, rotation, scale changes, JPEG
compression, illumination changes, and blur. Their
analysis showed a performance advantage for the
Scale Invariant Feature Transform (SIFT) introduced
by Lowe (Lowe, 2004) and one of its variants:
Gradient Location Orientation Histogram