similar primitives are denoted as ‘groups’ in the following. These are matched across two stereo views, and each pair of corresponding 2D-primitives affords the reconstruction of a 3-dimensional equivalent, called a 3D-primitive, encoded by the vector:

Π = (X, Θ, Φ, (C_l, C_m, C_r))    (2)
Moreover, if two primitives are collinear and similar in one image, and their correspondences in the second image are as well, then the two reconstructed 3D-primitives are grouped. Extracted 2D and 3D primitives for a sample stereo image pair are illustrated in Figure 3.
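The 3D-primitive vector and the collinearity-based grouping can be sketched as follows. This is an illustrative data structure, not the authors' implementation; the spherical-angle convention for (Θ, Φ) and the tolerance values are assumptions.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical container for the vector Pi = (X, Theta, Phi, (C_l, C_m, C_r));
# field names are illustrative only.
@dataclass
class Primitive3D:
    X: np.ndarray   # 3D position
    theta: float    # orientation angles of the local contour (assumed spherical)
    phi: float
    colors: tuple   # (C_l, C_m, C_r): colours left/middle/right of the contour

def collinear(p: Primitive3D, q: Primitive3D, tol: float = 0.1) -> bool:
    """Rough collinearity test: the two primitives share an orientation and
    the displacement between them is aligned with that orientation."""
    d = q.X - p.X
    n = np.linalg.norm(d)
    if n < 1e-9:
        return True
    # direction vector from the (theta, phi) angles -- a convention assumption
    u = np.array([np.cos(p.theta) * np.cos(p.phi),
                  np.sin(p.theta) * np.cos(p.phi),
                  np.sin(p.phi)])
    aligned = abs(abs(np.dot(d / n, u)) - 1.0) < tol
    return aligned and abs(p.theta - q.theta) < tol and abs(p.phi - q.phi) < tol
```

Two reconstructed primitives passing this test (and a similarity test on their colour triplets) would be grouped.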
3 MOTION ESTIMATION
In the driving context, rigid body motion is prevalent.
In our work, the motion is estimated using a combination of SIFT (Lowe, 2004) and primitive features (see Section 2) in a RANSAC (Fischler and Bolles, 1981) scheme. SIFT features are scale-invariant features that have been applied very successfully to a variety of problems in recent years. They can be matched very robustly across different views, and thus provide a robust basis for the motion estimation scheme. On the other hand, their localisation is imprecise. The primitives provide very accurate (sub-pixel) localisation, but are less readily matched.
Moreover, because the primitives are local contour descriptors, they only provide motion information in the direction normal to the contour's local orientation (the aperture problem). This is an issue in a driving scenario, where the main features, the road lines, are mostly distributed radially from the car's heading direction. This means that most of the extracted primitives yield very little information about translation along the z-axis.
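This effect can be illustrated numerically. The following toy sketch (not from the paper) uses the fact that, under pure forward translation, image flow is radial about the focus of expansion:

```python
import numpy as np

def normal_component(p, n, tz):
    """Under pure forward translation tz, the flow at an image point p
    (expressed relative to the focus of expansion) is radial, i.e.
    proportional to p itself, up to depth/focal scaling. A contour
    primitive measures only the component of that flow along its
    normal n."""
    flow = tz * np.asarray(p, dtype=float)
    return abs(float(np.dot(flow, np.asarray(n, dtype=float))))
```

For a road line through the focus of expansion at p = (1, 0), the edge normal is (0, 1) and the measured component is zero, regardless of tz; a contour perpendicular to the radial direction would measure the full tz-induced flow.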
This limitation can be remedied by using a mixture of features for motion estimation. In this work, we integrated both features in a RANSAC scheme, where SIFT features are used for the initial estimation, and the consensus set is formed using a combination of SIFT features and primitives. For the motion estimation algorithm we chose that of (Rosenhahn et al., 2001), since, in addition to being able to deal with different visual entities, it performs optimisation using a twist formulation that acts directly on the parameters of rigid-body motions.
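The mixed scheme can be sketched as below. This is a minimal illustration: hypotheses are drawn from SIFT correspondences only, while the consensus is counted over both feature types. A plain Kabsch/SVD rigid fit stands in for the twist-based optimisation of (Rosenhahn et al., 2001), and the sample size and threshold are placeholder values.

```python
import numpy as np

def fit_rigid(pairs):
    """Least-squares rigid transform (R, t) mapping the first points of
    `pairs` onto the second (Kabsch/SVD); a stand-in for the actual
    twist-based optimisation."""
    P = np.array([a for a, b in pairs]); Q = np.array([b for a, b in pairs])
    cp, cq = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:     # guard against reflections
        Vt[-1] *= -1
        R = (U @ Vt).T
    return R, cq - R @ cp

def residual(model, pair):
    R, t = model
    a, b = pair
    return np.linalg.norm(R @ a + t - b)

def estimate_motion(sift_pairs, prim_pairs, n_iter=200, thresh=0.05, seed=0):
    """Hypotheses from SIFT only (robustly matched); consensus over SIFT
    and primitives together, so the well-localised primitives shape the
    final fit."""
    rng = np.random.default_rng(seed)
    all_pairs = list(sift_pairs) + list(prim_pairs)
    best, best_support = None, -1
    for _ in range(n_iter):
        idx = rng.choice(len(sift_pairs), size=3, replace=False)
        model = fit_rigid([sift_pairs[i] for i in idx])
        support = sum(residual(model, pr) < thresh for pr in all_pairs)
        if support > best_support:
            best, best_support = model, support
    inliers = [pr for pr in all_pairs if residual(best, pr) < thresh]
    return fit_rigid(inliers)    # refit on the full consensus set
```

The final refit over the joint inlier set is where the sub-pixel localisation of the primitives pays off.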
This section presents motion estimation results for feature sets consisting of primitives, SIFT features, and a combination of both. Results are shown in Figure 4, where each column depicts the translational and rotational motion components for one type of feature set.
The first row in Figure 4 depicts the translation, where the z-axis corresponds to the car's forward motion. Here, the results show that SIFT and the combination of SIFT and primitives provide much more stable estimates. However, outliers still remain, caused by speed bumps, potholes, etc. These result in blurred images, as in frames 720-730, where a bump on the bridge makes matching and motion estimation a difficult task.
The second row in Figure 4 depicts the rotational component, where the y-axis corresponds to rotation caused by steering input. The rotation results for the y-axis using SIFT and the combination of SIFT and primitives correspond nicely to the satellite image presented in Figure 2. This correspondence between sub-parts of the road and sub-parts of the motion estimation plot is shown in Figure 10 (b). Figure 5 shows the translation and rotation obtained after applying a Bootstrap Filter (Gordon et al., 1993) (using 1000 particles). This eliminates the largest variations in the estimated motions and leads to more stable results.
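A minimal bootstrap filter for one motion component can be sketched as below. The random-walk process model, the Gaussian likelihood, and the noise parameters are assumptions made for illustration; the paper only states that 1000 particles were used.

```python
import numpy as np

def bootstrap_filter(obs, n_particles=1000, proc_std=0.05, obs_std=0.2, seed=0):
    """Bootstrap filter (Gordon et al., 1993) smoothing a scalar sequence,
    e.g. the per-frame z-translation: predict by a random walk, weight by a
    Gaussian likelihood, then resample in proportion to the weights."""
    rng = np.random.default_rng(seed)
    particles = np.full(n_particles, obs[0], dtype=float)
    estimates = []
    for z in obs:
        particles = particles + rng.normal(0.0, proc_std, n_particles)    # predict
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2) + 1e-300      # weight
        w /= w.sum()
        particles = particles[rng.choice(n_particles, n_particles, p=w)]  # resample
        estimates.append(particles.mean())
    return np.array(estimates)
```

Because the particle cloud cannot jump arbitrarily far in one step, isolated outliers (such as those caused by speed bumps) are damped rather than tracked.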
4 DISAMBIGUATION
Since 3D-primitives are reconstructed from stereo,
they suffer from noise and ambiguity. Noise is due to
the relatively large distance to the objects observed,
and the relatively small baseline of the stereo rig (33
cm). The ambiguity arises from the matching problem: despite their multi-modal nature, the primitives
only describe a very small area of the image, and
similar primitives abound in an image. The epipo-
lar constraint limits the matching problem, yet it is
unavoidable that some ambiguous stereo matches oc-
cur (Faugeras, 1993). We introduce two means of
disambiguation making use of temporal (Section 4.1)
and spatial (Section 4.2) regularities employed by the
early cognitive vision system.
4.1 Temporal Disambiguation
A first means of disambiguation is to track primitives over time. We perform this tracking in 3D space, to reduce the likelihood of false positives. This involves resolving three problems: first, estimating the motion; second, matching predicted and observed primitives; and third, accumulating the representation over time.
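The prediction and matching steps can be sketched as follows, under assumed conventions: (R, t) maps coordinates in the camera frame at time t to the frame at t + δt, and matching here uses only 3D position, whereas the actual system compares full primitive descriptors.

```python
import numpy as np

def predict(X, R, t):
    """Predict where a static 3D-primitive at position X will be observed
    at time t + dt, given the estimated frame-to-frame motion (R, t)."""
    return R @ X + t

def match(predicted, observed, max_dist=0.1):
    """Greedy nearest-neighbour association of predicted and observed
    primitive positions; predictions left unmatched count as evidence
    against their originating stereo hypothesis."""
    matches, used = [], set()
    for i, p in enumerate(predicted):
        d = [np.linalg.norm(p - o) for o in observed]
        j = int(np.argmin(d))
        if d[j] < max_dist and j not in used:
            matches.append((i, j))
            used.add(j)
    return matches
```

Conflicting stereo hypotheses generate distinct predictions, and only the correct one should keep finding observed counterparts frame after frame.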
Using the 3D-primitives extracted at time t and the computed ego-motion of the car between times t and t + δt, we can generate predictions for the visual representation at time t + δt. Moreover, conflicting hypotheses (reconstructed from ambiguous stereo matches) will generate distinct predictions. In most
VISAPP 2009 - International Conference on Computer Vision Theory and Applications