An early example is the work of Schmid and Mohr (Schmid et al., 1997), in which geometric invariance was still limited to image rotations; scaling was handled by using circular regions of several sizes. Lowe (Lowe, 1999) extended these ideas to true scale invariance. More general affine invariance has been achieved in the work of Baumberg (Baumberg, 2000), which uses an iterative scheme and a combination of multiple scales, and in the more direct, constructive methods of Tuytelaars & Van Gool (Tuytelaars et al., 1999; Tuytelaars and Gool, 2000), Matas et al. (Matas et al., 2002), and Mikolajczyk & Schmid (Mikolajczyk and Schmid, 2002). Although these methods are capable of finding high-quality correspondences, most of them are too slow for use in a real-time application such as the one we envision here. Moreover, none of these methods is especially suited for implementation on an embedded computing system, where both memory and computing power must be kept as low as possible to ensure reliable operation at the lowest possible cost.
The classic recognition scheme with local features, presented in (Lowe, 1999; Tuytelaars and Gool, 2000) and used in many applications such as our previous work on robot navigation (Goedemé et al., 2005; Goedemé et al., 2006), is based on finding one-to-one matches: between the query image and a model image of the object to be recognised, bijective matches are established. For each local feature in one image, the most similar feature in the other image is selected.
This scheme has a fundamental drawback, namely its inability to detect matches when multiple identical features are present in an image. In that case, there is no guarantee that the most similar feature is the correct correspondence. Such pattern repetitions are quite common in the real world, though, especially in man-made environments. To reduce the number of incorrect matches due to this phenomenon, classic matching techniques use a criterion such as comparing the distance to the most similar and the second most similar feature (Lowe, 1999). Of course, in the presence of pattern repetitions this practice throws away many good matches.
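For concreteness, the following is a minimal sketch of such a nearest-neighbour matcher with the distance-ratio criterion; the descriptor arrays and the 0.8 threshold are illustrative assumptions, not values from our system.

import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    # For each feature of one image, find the most similar feature in the
    # other image, and accept it only if it is clearly closer than the
    # second most similar one (Lowe, 1999).
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every feature
        j1, j2 = np.argsort(dists)[:2]              # two nearest neighbours
        if dists[j1] < ratio * dists[j2]:           # distance-ratio criterion
            matches.append((i, j1))                 # keep as a candidate match
    return matches

When a pattern is repeated, the first and second nearest neighbours have nearly equal distances, so the criterion rejects the match even when one of the candidates is correct.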
In this paper, we present a possible solution to this problem by making use of the visual word concept. Visual words were introduced (Sivic and Zisserman, 2003; Li and Perona, 2005; Zhang and Schmid, 2005) in the context of object classification. Local features are grouped into a large number of clusters, with features having similar descriptors assigned to the same cluster. By treating each cluster as a visual word that represents the specific local pattern shared by the keypoints in that cluster, we obtain a visual word vocabulary describing all kinds of such local image patterns. With its local features mapped to visual words, an image can be represented as a bag of visual words, i.e. as a vector containing the (weighted) count of each visual word in that image, which is used as the feature vector in the classification task.
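As a minimal sketch of this idea, assuming k-means clustering (as in (Sivic and Zisserman, 2003)) and the availability of scikit-learn; the vocabulary size of 1000 words and the function names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=1000):
    # Cluster the descriptors of many training features; each cluster
    # centre acts as one visual word of the vocabulary.
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bag_of_words(vocabulary, descriptors):
    # Map an image's descriptors to their nearest visual word and count
    # the occurrences of each word.
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / hist.sum()  # normalised (weighted) word counts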
In contrast to the bag-of-words concept often used in categorisation, in this paper we present the constellation-of-words model. The main difference is that not only the presence of a number of visual words is tested, but also their relative positions.
3 ALGORITHM
Figure 1 gives an overview of the algorithm. It consists of two phases, namely the model construction phase (upper row) and the matching phase (bottom row).
First, in a model photograph (a), local features are extracted (b). Then, a vocabulary of visual words is formed by clustering these features based on their descriptors. The corresponding visual words in the image (c) are used to form the model description. The relative location of the image centre (the anchor) is stored for each visual word instance (d).
The bottom row depicts the matching procedure. In a query image, local features are extracted (e). Matching with the vocabulary yields a set of visual words (f). For each visual word in the model description, a vote is cast at the relative location of the anchor (g). The location of the object can then be found from these votes as local maxima in a voting Hough space (h). Each of the following subsections describes one step of this algorithm in detail.
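A minimal sketch of this model construction and voting scheme follows, under simplifying assumptions: votes are cast for translation only, on a coarse accumulator grid with a cell size of 8 pixels; positions and the anchor are 2D numpy arrays in (y, x) order, and rotation and scale are ignored.

import numpy as np
from collections import defaultdict

def build_model(words, positions, anchor):
    # Model phase: for every visual word instance, store the offset from
    # its position to the image centre (the anchor).
    word_to_offsets = defaultdict(list)
    for w, pos in zip(words, positions):
        word_to_offsets[w].append(anchor - pos)
    return word_to_offsets

def vote(words, positions, word_to_offsets, shape, cell=8):
    # Matching phase: every visual word instance votes for the anchor
    # position it predicts; object locations show up as local maxima in
    # the accumulator (the Hough space).
    acc = np.zeros((shape[0] // cell + 1, shape[1] // cell + 1))
    for w, pos in zip(words, positions):
        for off in word_to_offsets.get(w, ()):
            y, x = pos + off
            if 0 <= y < shape[0] and 0 <= x < shape[1]:
                acc[int(y) // cell, int(x) // cell] += 1
    return acc

Because every instance of a repeated visual word casts its own vote, repeated patterns reinforce the correct object location instead of being discarded, which is precisely the advantage over the one-to-one matching scheme discussed above.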
Local Feature Extraction. We chose SURF as the local feature detector instead of the more commonly used SIFT detector. SURF (Bay et al., 2006; Fasel and Gool, 2007) was developed to be substantially faster than SIFT, while performing at least as well.
Interest Point Detector. In contrast to SIFT (Lowe, 1999), which approximates the Laplacian of Gaussian (LoG) with a Difference of Gaussians (DoG), SURF approximates second-order Gaussian derivatives with box filters, see figure 2. Image convolutions with these box filters can be computed rapidly using integral images as defined in (Viola and Jones, 2001).
Interest points are localised in scale and image space by applying non-maximum suppression in a 3 × 3 × 3 neighbourhood. Finally, the detected maxima of the determinant of the approximated Hessian matrix are interpolated in scale and image space.
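The following is a minimal sketch of the integral-image trick of (Viola and Jones, 2001): after a single pass over the image, the sum of pixel values inside any upright rectangle, and hence any box-filter response, can be evaluated with four array look-ups regardless of the filter size. The function names are illustrative.

import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all pixels above and to the left of
    # (y, x); a zero row and column are prepended to simplify box_sum.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    # Sum of the image over rows y0..y1-1 and columns x0..x1-1,
    # computed from four corner values of the integral image.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

A box filter response at any scale thus costs the same constant number of operations, which is what makes the SURF detector fast compared to explicit Gaussian convolutions.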