keypoint with significantly higher response, and se-
lects the keypoints with the largest suppression ra-
dius for further processing. ANMS reports promis-
ing results but suffers from quadratic time complex-
ity in the number of considered keypoints. (Gauglitz
et al., 2011) suggest a number of heuristics to reduce
the time complexity of ANMS, and also offer an im-
proved version of ANMS, called suppression via disk
covering, which has time complexity O(m log m).
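The quadratic cost of plain ANMS can be illustrated with a minimal sketch. It assumes that the suppression radius of a keypoint is its distance to the nearest keypoint with significantly higher response, and it models "significantly" with a hypothetical factor `robust`. The function name, the tuple layout and the default value are illustrative assumptions, not taken from the cited work.

```python
import math

def anms(keypoints, n, robust=0.9):
    """Quadratic-time ANMS sketch.

    keypoints: list of (x, y, response) tuples (assumed layout).
    n: number of keypoints to keep.
    robust: hypothetical factor; keypoint j suppresses keypoint i only
            if response_i < robust * response_j.
    """
    radii = []
    for i, (xi, yi, ri) in enumerate(keypoints):
        radius = math.inf  # a keypoint with no stronger neighbour is never suppressed
        for j, (xj, yj, rj) in enumerate(keypoints):
            if j != i and ri < robust * rj:
                radius = min(radius, math.hypot(xi - xj, yi - yj))
        radii.append(radius)
    # Keep the n keypoints with the largest suppression radius.
    order = sorted(range(len(keypoints)), key=lambda i: radii[i], reverse=True)
    return [keypoints[i] for i in order[:n]]
```

The nested loop over all pairs of keypoints is what makes this variant quadratic in the number of considered keypoints.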
By contrast, (Cheng et al., 2007) achieve an even
spatial distribution by distributing keypoints over a
given number c of cells by means of a k-d tree. Starting
with a cell that covers the entire image and contains
all keypoints, the algorithm recursively divides each
cell into two smaller cells, each containing half the
keypoints of the parent cell. For each division, the
spatial variance along the x- and y-axes is computed,
and the cell is divided along the median value of the
dimension that has the larger variance. From the key-
points of each cell the n/c keypoints with the highest
response value are then selected for further processing,
again achieving a time complexity of O(m log m).
(Cheng et al., 2007) recommend c = 64 cells if n =
100 keypoints are to be extracted.
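The k-d tree selection described above can be sketched as follows. This is a rough illustration rather than the implementation of Cheng et al.: it assumes that c is a power of two, that keypoints are (x, y, response) tuples, and that n/c is rounded down with a minimum of one keypoint per cell. The function name `kd_select` is hypothetical.

```python
from statistics import pvariance

def kd_select(keypoints, n, c):
    """k-d tree selection sketch; c is assumed to be a power of two."""
    cells = [list(keypoints)]
    while len(cells) < c:  # one pass per division level
        split = []
        for cell in cells:
            if len(cell) < 2:  # nothing left to divide
                split += [cell, []]
                continue
            var_x = pvariance([k[0] for k in cell])
            var_y = pvariance([k[1] for k in cell])
            axis = 0 if var_x >= var_y else 1  # dimension with the larger variance
            cell = sorted(cell, key=lambda k: k[axis])  # sort to find the median
            mid = len(cell) // 2
            split += [cell[:mid], cell[mid:]]
        cells = split
    per_cell = max(1, n // c)  # assumption: n/c rounded down, at least one
    selected = []
    for cell in cells:
        selected += sorted(cell, key=lambda k: k[2], reverse=True)[:per_cell]
    return selected
```

For example, `kd_select(keypoints, 100, 64)` corresponds to the recommended setting of c = 64 cells for n = 100 keypoints. Under the rounding assumption above it returns at most one keypoint per cell.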
Here we study and evaluate a selection method
that is simpler even than k-d trees. We propose
to partition an image into a regular grid of c cells, to
assign all detected keypoints to their corresponding
cells, and to select from each cell the n/c keypoints
with the highest response value. As sorting by re-
sponse value cannot be avoided, the time complexity
is again O(m log m), yet the constant factor is reduced
to a minimum. That is, compared with k-d trees, we
eliminate the need to calculate the variance in the x and y
dimensions and the median of the winning dimension
for m keypoints for every division level. Calculation
of the median requires an extra sort at every division
level. A k-d tree with 64 cells requires log_2 64 = 6
division levels and 7 sorts, while grid-based selection
requires only one sort.
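A minimal sketch of the grid-based selection proposed above is given below. It assumes keypoints are (x, y, response) tuples with 0 <= x < width and 0 <= y < height, and that n/c is rounded down with a minimum of one keypoint per cell. The function and parameter names are illustrative only.

```python
def grid_select(keypoints, n, width, height, cols, rows):
    """Grid-based selection sketch with a single sort by response."""
    per_cell = max(1, n // (cols * rows))  # assumption: n/c rounded down, at least one
    ranked = sorted(keypoints, key=lambda k: k[2], reverse=True)  # the only sort
    counts = [0] * (cols * rows)
    selected = []
    for x, y, response in ranked:
        # Map the pixel position to a cell index; clamp to guard the image border.
        col = min(int(x * cols / width), cols - 1)
        row = min(int(y * rows / height), rows - 1)
        cell = row * cols + col
        if counts[cell] < per_cell:  # cell still has room for a keypoint
            counts[cell] += 1
            selected.append((x, y, response))
    return selected
```

Because the keypoints are visited in order of decreasing response, each cell automatically keeps its strongest keypoints, and no per-cell sort is needed.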
Real-world trials on underwater images show that
even a 2x2 grid can lead to a dramatic improvement
over selection by response only. A 2x2 grid requires
only two greater-than comparison operations per key-
point, yet in our trials it extended the average vehicle
distance over which the visual odometer can correctly
estimate motion by an order of magnitude, from several
meters to several dozen meters. The image sequences
of these trials were recorded by a remote vehicle off the
coast of the Balearic island of Mallorca and contain
thousands of images each. The vehicle operator
tried to steer the vehicle at an average distance
of 1 meter from the sea floor. Due to the rocky na-
ture of the underwater terrain, with frequent boulders,
cliffs, and crevices, actual distance to image objects is
highly variable, as is the clarity of image features be-
tween images. It is important to note that in this case,
doubling the number of extracted keypoints per im-
age did not significantly increase the average length
of correct motion estimates.
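The two greater-than comparisons per keypoint that a 2x2 grid requires can be sketched as follows. The function name and the row-major cell numbering are assumptions made for illustration.

```python
def cell_2x2(x, y, width, height):
    """Return the 2x2 grid cell (0..3, row-major) of a keypoint at (x, y)."""
    half_w, half_h = width / 2, height / 2  # could be precomputed once per image size
    # One comparison per axis; booleans count as 0 or 1 in arithmetic.
    return 2 * (y > half_h) + (x > half_w)
```

With the image size fixed, the two half-size values are constants, so the cell assignment reduces to exactly one comparison per axis.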
Visual inspection of image pairs for which a 2x2
grid gave such a dramatic improvement shows that
they are typically of low quality—bad lighting and
much blur and fade—over most of the image, but
with a small image region of medium quality, such
that almost all extracted keypoints cover that small
region. This small region is then either not present
in both images of a pair, or has changed significantly
between images due to changes in lighting, blur or
fade. Response-based selection continues to be con-
centrated on this small region even when more key-
points are extracted, while grid-based selection forces
the low-quality region to be considered as well.
With this promising result at hand, the question
arises whether grid-based selection is also beneficial
or at least not detrimental when motion can already
be estimated well with response-based selection only.
If it is detrimental, a real-time system might require
additional logic to decide when to use grid-based se-
lection and when not. In the remainder of this article
we report on systematic experiments in a highly con-
trolled environment that evaluate the effect of grid-
based selection on motion estimates from image se-
quences of higher quality than in the real-world trials
described above.
2 EXPERIMENTAL SETUP
We base our evaluation on over 2,000 images that
were collected by a downward-looking camera aboard
the Girona500 autonomous submarine (Prats et al.,
2012) during two controlled test dives in the labora-
tory pool of the VICOROB research group in Girona.
Image resolution is 384 by 288 pixels. Lighting con-
ditions were good and the water was clean. Motion
blur and light scatter are negligible. The floor of the
pool was covered by a life-size high-resolution color
poster that shows a coral reef off the coast of Florida.
The poster was spread out flat, resulting in a planar
image scene. The digital version of the poster allows
us to calculate the ground truth per image with an accuracy
of 0.5 pixels. For a complete description of the
image acquisition and the calculation of the ground
truth see (Nannen and Oliver, 2012).
With a planar scene, motion can be estimated by
computing the homography that projects keypoints
from one image coordinate frame to the coordinate