called 'no news is good news'. The constraint argues
that if two image points have no contrast difference
in-between, then they can be assumed to lie on the
same 3D surface (see (Kalkan et al., 2006) for a
quantification of this assumption). (Grimson, 1982)
assumes that 3D orientation is available and that the
input 3D points are dense enough for second-order
differentiation.
In (Guy and Medioni, 1994), 3D points with surface
orientation are interpolated using a perceptual
constraint called co-surfacity, which produces a 3D
association field (called the Diabolo field by the
authors) similar to the association field used in 2D
perceptual contour grouping studies. If the points do
not have 3D orientation, the 3D orientation is first
estimated (by fitting a surface model locally) before
the surface interpolation step is applied.
The studies most relevant to our paper are (Hoff
and Ahuja, 1989; Lee et al., 2002). Both argued
that stereo matching and surface interpolation should
not be sequential but rather simultaneous. (Hoff and
Ahuja, 1989) fits local planes to disparity estimates
from zero-crossings of a stereo pair to obtain rough
surface estimates, which are then interpolated taking
occlusions into account, whereas in this paper, we
are concerned with predictions (1) of higher-level
features, (2) using long-range relations and (3) voting
mechanisms. Moreover, as the authors tested
their approach only on highly textured scenes, its
applicability to homogeneous image areas remains
unclear. (Lee et al., 2002) employs the following
steps: a dense disparity map is computed, and the
disparities corresponding to inliers, surfaces and
surface discontinuities are marked and combined using
tensor voting. The surfaces are then extracted from
the dense disparities using the marching cubes approach.
Our work differs from the above-mentioned
works as follows. First, our approach does not assume
that the input stereo points are dense enough to compute
their 3D orientation. Instead, our method relies on the
3D line orientations of the edge segments, which are
extracted using a feature-based stereo algorithm
(proposed in (Pugeault and Krüger, 2003)). Second,
we employ a voting method that differs from
tensor voting ((Lee and Medioni, 1998; Lee et al.,
2002)): it allows long-range interactions in empty image
areas, and only in certain directions, at a much lower
computational cost than tensor voting, in order to
predict both the depth and the surface orientation.
We would like to distinguish depth prediction
from surface interpolation: surface interpolation
assumes that a dense depth map of the scene is already
available in order to estimate the 3D orientation
at points (see, e.g., (Grimson, 1982; Guy and
Medioni, 1994; Lee and Medioni, 1998; Lee et al.,
2002; Terzopoulos, 1988)), whereas our understanding
of depth prediction makes use of only the 3D line
orientations at edge segments, which are computed
using the feature-based stereo proposed in (Pugeault
and Krüger, 2003).
1.2 Contributions and Outline
Our contributions can be listed as follows:
• A novel voting-based method for predicting depth
at homogeneous image areas using just the 3D
line orientation at 3D local edge features.
• Votes that carry reliability measures based
on the coplanarity statistics of 3D local surface
patches provided in (Kalkan et al., 2007).
• A comparison with dense stereo on real and artificial
scenes where we control the amount and
type of texture to see its effect on the performance
of the different approaches. We show that different
approaches are suited to different kinds of image
settings (i.e., textured/weakly textured), and
the results suggest that a combination of different
approaches is needed for a model that can perform
well on all kinds of images.
The paper is organized as follows: Section 2
introduces how the images are represented in terms
of local image features. Section 3 describes the 2D
and 3D relations between the local image features that
are utilized in the depth prediction process. Section 4
explains how the depth prediction is performed.
Section 5 presents and discusses the results. Finally,
Section 6 concludes the paper with a summary
and outlook.
2 VISUAL FEATURES
The visual features that we utilize (called primitives in
the rest of the paper) are local, multi-modal features
that were introduced in (Krüger et al., 2004).
An edge-like primitive can be formulated as:

π^e = (x, θ, ω, (c_l, c_m, c_r)),   (1)

where x is the image position of the primitive; θ is
the 2D orientation; ω represents the contrast transition;
and (c_l, c_m, c_r) is the representation of the color,
corresponding to the left (c_l), the middle (c_m) and the
right side (c_r) of the primitive.
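As a rough illustration, the multi-modal primitive of Eq. (1) can be thought of as a small record type. The field names and example values below are our own, hypothetical choices for the sketch, not from (Krüger et al., 2004):

```python
from dataclasses import dataclass
from typing import Tuple

Color = Tuple[float, float, float]  # e.g., RGB values in [0, 1]

@dataclass
class EdgePrimitive:
    """Edge-like primitive pi^e = (x, theta, omega, (c_l, c_m, c_r))."""
    x: Tuple[float, float]  # 2D image position
    theta: float            # 2D orientation (radians)
    omega: float            # contrast transition across the edge
    c_left: Color           # color on the left side of the edge
    c_mid: Color            # color on the edge itself
    c_right: Color          # color on the right side of the edge

# Example: a roughly vertical edge between a dark and a bright region
p = EdgePrimitive(
    x=(120.0, 64.0),
    theta=1.5708,
    omega=0.8,
    c_left=(0.1, 0.1, 0.1),
    c_mid=(0.5, 0.5, 0.5),
    c_right=(0.9, 0.9, 0.9),
)
```

Grouping the three side colors with the orientation in one structure reflects that all modalities are attached to a single local image patch.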
As the underlying structure of a homogeneous im-
age structure is different from that of an edge-like