the image and quantized into visual words with respect
to a vocabulary. The vocabulary is learned beforehand
by clustering all feature vectors from a set of train-
ing data (3000 random Flickr images) using k-means
clustering and contains 10000 visual words. Each
keyframe selected by the SLAM method is registered
in the image database and is represented as a vector of
visual words. We take advantage of the inverted index
to find the most likely past keyframe that matches the
current keyframe. Each time a word is found, we update
the similarity scores of the past images retrieved
from the index by adding the term frequency-inverse
document frequency (tf-idf) weighting term as
in (Sivic and Zisserman, 2003). Thus, we measure the
similarity between a pair of images and assume
that two images with a high similarity score are taken
from the same location. However, the current obser-
vation may come from a previously unknown place.
A geometric post-verification stage, which tests the
geometric consistency of the matched images, is re-
quired.
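The scoring scheme above can be sketched as follows. This is a toy illustration only, with a hypothetical handful of keyframes and small word ids; the actual vocabulary contains 10000 visual words and the database holds every keyframe selected by the SLAM method:

```python
import math
from collections import defaultdict

# Toy database: each keyframe is a bag of visual-word ids (hypothetical).
keyframes = {
    "kf0": [3, 7, 7, 42],
    "kf1": [7, 42, 42],
    "kf2": [1, 2, 3],
}

# Inverted index: word id -> {keyframe id: term count}.
inverted = defaultdict(dict)
for kf, words in keyframes.items():
    for w in words:
        inverted[w][kf] = inverted[w].get(kf, 0) + 1

n_docs = len(keyframes)

def tfidf_scores(query_words):
    """Accumulate tf-idf similarity scores over the indexed keyframes."""
    scores = defaultdict(float)
    for w in set(query_words):
        postings = inverted.get(w, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))   # rarer words weigh more
        for kf, count in postings.items():
            tf = count / len(keyframes[kf])      # normalized term frequency
            scores[kf] += tf * idf
    return dict(scores)

scores = tfidf_scores([7, 42, 42])
best = max(scores, key=scores.get)               # most similar past keyframe
```

Only keyframes sharing at least one word with the query are touched, which is the point of the inverted index: the cost scales with the postings visited, not with the size of the database.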
3.3 Merging Both Approaches with a
3D/2D Geometric Validation
We confirm the place recognition hypothesis with a
3D validation. Features extracted in the database im-
age are matched to the projection of 3D points seen in
the current image. We estimate the relative pose be-
tween these two similar images. We retain the match
if there are enough points verifying the geometric
constraint. This rejects errors due to perceptual
aliasing. Moreover, this method makes it possible
to determine the static structure of the scene and to
identify a set of inconsistent points. When a scene
is composed of multiple rigid objects moving
relative to each other, we can detect candidate objects
as nearby sets of points that share a similar motion.
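The validation step can be sketched as follows. This is a minimal numpy illustration under assumed inputs (a calibration matrix K, a candidate relative pose (R, t), and already-matched 2D/3D correspondences); the pixel threshold and acceptance ratio are illustrative values, not the paper's:

```python
import numpy as np

def validate_match(K, R, t, points_3d, points_2d, thresh_px=3.0):
    """Sketch of the geometric check: project the 3D map points with the
    candidate relative pose (R, t) and compare against the matched 2D
    features.  Points with a large reprojection error are flagged as
    inconsistent (candidate moved-object points)."""
    P = K @ np.hstack([R, t.reshape(3, 1)])              # 3x4 projection
    X = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    proj = (P @ X.T).T
    proj = proj[:, :2] / proj[:, 2:3]                    # dehomogenize
    err = np.linalg.norm(proj - points_2d, axis=1)
    inliers = err < thresh_px
    # Accept the hypothesis only if enough points verify the constraint.
    accepted = inliers.sum() >= 0.5 * len(points_3d)
    return accepted, ~inliers

# Toy check: assumed calibration, identity relative pose, exact projections.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
pts3d = np.array([[0.0, 0.0, 4.0], [1.0, 1.0, 5.0], [-1.0, 0.5, 6.0]])
proj = (K @ pts3d.T).T
pts2d = proj[:, :2] / proj[:, 2:3]
accepted, inconsistent = validate_match(K, np.eye(3), np.zeros(3),
                                        pts3d, pts2d)
```

The boolean mask returned alongside the accept/reject decision is what feeds the object detection stage: points that fail the reprojection test are the inconsistent points mentioned above.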
We present an overview of the state of the art for two-
view multiple motion estimation in Section 4.1 and
our method for object detection in Section 4.2.
4 AUTOMATIC OBJECTS
DETECTION
Comparing two views of the same 3D scene taken at
different times highlights 3D points inconsistent with
the static structure. We want to infer the presence
of moved objects by clustering points according to
their motion. The setting is as follows: given the
set of corresponding points in two similar images, we
have to estimate the movement of the camera and the
movement of an unknown number of moving objects.
In this section we first review alternative methods for
two-view multibody estimation and then describe our
approach.
4.1 Two-view Multiple Structures
Estimation
To simplify the problem, we consider only planar ob-
jects. We need to detect multiple planar homogra-
phies in image pairs. Zuliani et al. (Zuliani et al.,
2005) describe the multiRANSAC algorithm, but this
method requires prior specification of the number of
models. Toldo and Fusiello (Toldo and Fusiello, 2008)
present a simple method for the robust detection of
multiple structures in pairs of images. They generate
multiple model instances from random sample
sets of correspondences and then group points
belonging to the same model using an agglomerative
clustering method called J-Linkage. Our method is
based in part on this algorithm. We combine planar
detection with 3D reconstruction to detect only mov-
ing objects.
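The clustering idea behind J-Linkage can be sketched as follows. This is a simplified illustration on a toy boolean preference matrix; the real algorithm first builds each point's preference set from many randomly sampled homography hypotheses:

```python
import numpy as np

def jaccard_dist(a, b):
    """Distance between two preference sets (boolean vectors)."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return 1.0 - np.logical_and(a, b).sum() / union

def j_linkage(pref):
    """Greedy agglomerative clustering of points by preference sets.
    pref[i, m] is True if point i fits model hypothesis m.  Clusters
    keep merging while their shared preference set is non-empty
    (Jaccard distance < 1), starting with the closest pair."""
    clusters = [([i], pref[i].copy()) for i in range(len(pref))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = jaccard_dist(clusters[a][1], clusters[b][1])
                if d < 1.0 and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return [sorted(c[0]) for c in clusters]
        _, a, b = best
        merged = (clusters[a][0] + clusters[b][0],
                  np.logical_and(clusters[a][1], clusters[b][1]))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)

# Toy preference matrix: points 0-2 fit hypotheses 0/1, points 3-4 fit 2.
pref = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [1, 0, 0],
                 [0, 0, 1],
                 [0, 0, 1]], dtype=bool)
clusters = j_linkage(pref)
```

Because a merged cluster keeps only the intersection of its members' preference sets, two groups of points that fit disjoint sets of hypotheses can never be merged, so the number of motions need not be specified in advance.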
4.2 Identification of the Moving Objects
in the 3D Scene
Our metrical SLAM algorithm constructs a sparse
map of the environment. Inconsistent points retrieved
at the 3D validation step are not sufficient to estimate a
model and define an object. To tackle this problem,
we extract a large number of features in each image,
match them and generate many local hypotheses of
homographies. We then merge sets of points belong-
ing to the same motion using a technique explained
below, and finally keep those whose points are associated
with the inconsistent 3D points. We use SURF features
to describe interest points. Each feature is matched
with its nearest neighbor in the similar image. Figure
3 illustrates our method.
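The nearest-neighbor matching step can be sketched as follows, with tiny 3-D vectors standing in for 64-D SURF descriptors (the descriptor values here are made up for illustration):

```python
import numpy as np

def match_nearest(desc_a, desc_b):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b
    (Euclidean distance).  Returns, for each row of desc_a, the index of
    the closest row of desc_b."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy 3-D "descriptors" standing in for 64-D SURF vectors.
desc_a = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])
desc_b = np.array([[1.0, 0.1, 0.0],
                   [0.0, 0.1, 0.9],
                   [0.5, 0.5, 0.5]])
matches = match_nearest(desc_a, desc_b)
```

The resulting correspondences are the input to the homography hypothesis generation described above.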
4.2.1 Preliminaries and Notation
Points in the 2D image plane of a camera are represented
by homogeneous vectors p. Let p_1 and p_2 be two
corresponding points detected in a pair of similar
images. These points are the projections of the same
3D point in different camera views. We have to detect
perspective transformations (homographies) that map
planar surfaces from one image to the other. To do so,
we find the set of correspondences fitting the same
homography H:

p_2 ∼ H p_1 . (1)
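Equation (1) holds only up to scale, so homogeneous coordinates are normalized before comparing points. A small numpy illustration with a hypothetical homography H:

```python
import numpy as np

H = np.array([[1.2, 0.1,  5.0],
              [0.0, 0.9, -3.0],
              [0.0, 0.0,  1.0]])      # hypothetical homography

p1 = np.array([10.0, 20.0, 1.0])      # homogeneous point in the first image
p2 = H @ p1                           # defined only up to scale
p2 = p2 / p2[2]                       # normalize the last coordinate to 1
```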
AUTOMATIC OBJECTS DETECTION FOR MODELING INDOOR ENVIRONMENTS