the contrast transition across the contour (e.g., bright to dark edge), c encodes colour information on both sides of the contour (sampled locally), and f encodes the local optical flow.

Figure 3: Illustration of the feature extraction level: a) one image; b) the 2D-primitives extracted; c) a detail of b); d) symbolic representation of a 2D-primitive, with 1) orientation, 2) phase, 3) colour, and 4) optical flow.
Such 2D-primitives are matched across stereo pairs of images, allowing for the reconstruction of a 3D equivalent, called a 3D-primitive, described by the following feature vector:
$$\mathbf{\Pi} = (X, T, \Phi, C) \qquad (5)$$
where X is the 3D-primitive's position, T the 3D tangent to the contour, Φ the phase, and C the colour information of the 3D contour. This process is
illustrated in Fig. 3, where a) shows a detail of the im-
age, b) the extracted 2D-primitives and c) a magnified
detail. In d), the symbolic representation of primitives
is illustrated, with 1) indicating orientation, 2) phase,
3) colour, and 4) optical flow. Moreover, SIFT fea-
tures (Lowe, 2004) are also extracted from the images
to allow for more robust matching.
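As an illustration only, the two descriptor types can be represented as simple data structures; the following is a minimal Python sketch, in which the field names are assumptions rather than the notation of the cited implementation:

from dataclasses import dataclass
import numpy as np

@dataclass
class Primitive2D:
    # Local 2D edge descriptor: image position, orientation, phase (contrast
    # transition), colour on both sides of the contour, and optical flow.
    x: np.ndarray
    orientation: float
    phase: float
    colour: np.ndarray
    flow: np.ndarray

@dataclass
class Primitive3D:
    # 3D-primitive of Eq. (5): 3D position X, 3D tangent T, phase Phi, colour C.
    X: np.ndarray
    T: np.ndarray
    Phi: float
    C: np.ndarray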
2.2.2 Rigid Body Motion (RBM) Estimation
The motion of the camera and of the IMOs is eval-
uated using correspondences of 3D-primitives and
SIFT features across time. In this case, because we
consider only vehicles, we restrict ourselves to Rigid
Body Motions (RBMs). The mathematical formu-
lation of the RBM that we use is from (Rosenhahn
et al., 2001), and has three advantages: first, the motion is optimised in 3D space; second, it allows the motion to be solved jointly from different kinds of constraint equations that stem from different types of image features (in this case, local edge descriptors and SIFT); third, it minimises the error directly in SE(3), and therefore does not require additional measures to handle degenerate cases. As shown in (Pilz et al., 2009), a combination of heterogeneous features (edges and SIFT features) improves the robustness and accuracy of the RBM estimate. Outliers
are discarded using RANSAC (Fischler and Bolles,
1981).
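For concreteness, the Python sketch below shows a simplified stand-in for this step: instead of the joint SE(3) optimisation of (Rosenhahn et al., 2001) over heterogeneous constraints, it estimates an RBM from 3D point correspondences alone by least-squares rigid alignment (Kabsch) inside a RANSAC loop. The function names, iteration count, and inlier threshold are assumptions, and at least three correspondences are assumed to be available.

import numpy as np

def fit_rbm(P, Q):
    # Least-squares rigid alignment (Kabsch): returns R, t such that
    # Q ~ P @ R.T + t (row-wise q_i ~ R p_i + t).
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def ransac_rbm(P, Q, iters=500, thresh=0.05):
    # RANSAC over minimal 3-point samples; keeps the RBM with the most inliers.
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = np.random.choice(len(P), 3, replace=False)
        R, t = fit_rbm(P[idx], Q[idx])
        residuals = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the inliers of the best sample.
    R, t = fit_rbm(P[best_inliers], Q[best_inliers])
    return R, t, best_inliers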
2.2.3 Tracking and Filtering
All 3D-primitives are tracked using independent
Kalman Filters (Kalman, 1960). The prediction stage
is provided by the estimated motion. The position uncertainty of each 3D-primitive is re-projected into the image domain as a 2 × 2 covariance matrix. Using this covariance matrix, we estimate the likelihood for the 3D-primitive to find a match at each location by a normal distribution combined with a uniform distribution (which expresses the chance for a correct 3D-primitive not to be matched). We write the event that a primitive $\mathbf{\Pi}_i$, which predicts a primitive $\hat{\mathbf{\Pi}}_{i,t}$ at time $t$, is matched (as described above) as $\mu_{i,t}$, and evaluate its likelihood as:

$$p[\mu_{i,t}] = \frac{\exp\!\left(-\tfrac{1}{2}(\Delta x)_{i,t}^{\top}\,\Sigma_{\Delta,i,t}^{-1}\,(\Delta x)_{i,t}\right)}{2\pi\sqrt{|\Sigma_{\Delta,i,t}|}} + \beta \qquad (6)$$
The matrix $\Sigma_{\Delta,i,t} = \hat{\Sigma}_{x,i,t} + \tilde{\Sigma}_{x,i,t}$ is the sum of the re-projected position uncertainties of the predicted ($\hat{\Sigma}_{x,i,t}$) and the observed ($\tilde{\Sigma}_{x,i,t}$) primitives in this image. In this equation, $\beta = p[\bar{\mu}|\mathbf{\Pi}]$. Also, $(\Delta x)_{i,t} = \hat{x}_{t|t-1} - \tilde{x}_{t}$ is the difference between the positions of the two re-projected primitives, where $\hat{x}_{t|t-1}$ is the predicted position and $\tilde{x}_{t}$ is the position of the potential match. If the confidence $p[\mu_{i,t}]$ is larger than the chance value $\gamma = p[\mu|\mathbf{\Pi}]$, the match is considered valid. Furthermore, the similarity between the primitives (in orientation, phase, and colour) is also considered, and matches with too low a similarity (lower than $\tau = 0.9$) are disregarded.
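A minimal sketch of this matching test, assuming the two re-projected covariances, the chance level, and a precomputed similarity score are given (the helper names are hypothetical):

import numpy as np

def match_likelihood(dx, Sigma_pred, Sigma_obs, beta):
    # Eq. (6): Gaussian likelihood of the re-projected prediction-observation
    # offset dx (2-vector), plus the uniform term beta.
    Sigma = Sigma_pred + Sigma_obs
    maha = float(dx @ np.linalg.solve(Sigma, dx))
    gauss = np.exp(-0.5 * maha) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    return gauss + beta

def accept_match(p_match, gamma, similarity, tau=0.9):
    # A match is kept if its likelihood exceeds the chance level gamma and the
    # primitives are similar enough in orientation, phase, and colour.
    return p_match > gamma and similarity >= tau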
Moreover, the confidence in the existence of the accumulated primitive is updated depending on how consistently concordant evidence has been found in the image. The probability is evaluated from

$$p[\mathbf{\Pi}_{i,t}|\bar{\mu}_{i,t}] = (1 + \kappa_{i,t})^{-1}, \qquad (7)$$

where $\kappa$ is evaluated recursively as

$$\kappa_{i,t} = \frac{\gamma}{p[\mu_{i,t}]}\,\kappa_{i,t-1}, \qquad (8)$$

with $\kappa_{1} = p[\mathbf{\Pi}]$, the prior probability that a 3D-primitive is correct.
If a hypothesis's confidence $p[\mathbf{\Pi}_{i,t}|\bar{\mu}_{i,t}]$ falls below a threshold $\tau_{\min}$, it is deemed erroneous and discarded; if it rises above a threshold $\tau_{\max}$, it is deemed verified up to certainty, and its confidence is not updated any more. This allows features to be preserved during occlusion, and is effectively a soft version of the classical n-scan strategy in tracking (Reid, 1979).
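The confidence update of Eqs. (7)-(8), together with the two thresholds, can be summarised in a few lines. This is a sketch under the assumption that the caller stops updating a primitive once it is verified; the function name and return convention are illustrative:

def update_confidence(kappa_prev, p_match, gamma, tau_min, tau_max):
    # Eq. (8): recursive update of kappa, then Eq. (7): confidence = 1/(1 + kappa).
    kappa = (gamma / p_match) * kappa_prev
    confidence = 1.0 / (1.0 + kappa)
    if confidence < tau_min:
        return kappa, confidence, "discard"    # deemed erroneous
    if confidence > tau_max:
        return kappa, confidence, "verified"   # frozen; survives occlusion
    return kappa, confidence, "active"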
Based on this filtering process, at the third level a 3D model of the object is accumulated and a final decision on the acceptance or removal of the IMO is made.