is described. They utilise a combination of 3D point correspondences and a smoothness-of-motion constraint to deduce vehicle motion. In (Ess et al., 2007), the ground plane and pedestrians are simultaneously extracted using stereo cameras mounted on a trolley. Although they show impressive results, appearance-based detection alone is not sufficient for the purposes of our application. Other work that deals with the problem of segmenting independent motion under camera egomotion includes (Rabe et al., 2007), (Yuan et al., 2007) and (Yu et al., 2005).
Depth estimation or disparity computation is of-
ten carried out based on the assumption that depth
discontinuity boundaries collocate with intensity or
colour discontinuity boundaries. The search for this collocation relies on intensity similarity matching from one image to the other, typically comprising stages such as matching cost computation, cost aggregation, disparity computation and disparity refinement. Optimisation plays an important role in disparity estimation (Gong and Yang, 2007). Recent comparative
studies, such as (Scharstein and Szeliski, 2002), have
shown that graph cut (Veksler, 2003) and belief prop-
agation (Felzenszwalb and Huttenlocher, 2006) are
two powerful techniques for achieving accurate disparity maps. However, both are computationally expensive and require hardware solutions to reach near real-time performance, e.g. (Yang et al., 2006).
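For concreteness, the standard local pipeline can be sketched as follows using OpenCV's semi-global block matcher, a cheaper alternative to graph cut or belief propagation; this is an illustrative sketch, not the method of this paper, and all parameter values are assumptions.

```python
# Minimal dense disparity sketch (illustrative parameters, not the
# paper's implementation) using OpenCV's semi-global block matching.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,   # search range; must be divisible by 16
    blockSize=5,         # matching window size
    P1=8 * 5 * 5,        # penalty for small disparity changes
    P2=32 * 5 * 5,       # penalty for large disparity changes
)

# compute() returns fixed-point disparities scaled by 16
disparity = matcher.compute(left, right).astype(np.float32) / 16.0
```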
In section 2, we first provide an overview of the proposed method and then elaborate on each of its stages in subsections. Experimental results are then reported in section 3, followed by conclusions in section 5.
2 PROPOSED APPROACH
The primary aim here with respect to generic object
detection is to identify objects moving independently
in the scene. We are not concerned with the specific class of objects, but only with extracting sufficient information for a later cognitive module to interpret whether the motion of an object poses a danger to the visually-impaired user. This is achieved by tracking a sparse
set of feature points, which implicitly label moving
objects, and segmenting features which exhibit mo-
tion that is not consistent with that generated by the
movement of the stereo cameras. Sparse point track-
ing has previously been applied successfully to motion-based segmentation in (Hannuna, 2007). Dense
depth maps are simultaneously extracted, yielding lo-
cations and 3D trajectories for each feature point.
These depth maps are also required for input into a
later stage of the CASBliP project (the sonification
process to generate a sound map), so they do not incur
extra computational burden compared to using sparse
depth maps.
The Kanade-Lucas-Tomasi (KLT) (Shi and
Tomasi, 1994) tracker is used to generate this sparse
set of points in tandem with the depth estimation
process. This tracker preferentially annotates high-entropy regions. As well as facilitating tracking, this ensures that values taken from the depth maps in the vicinity of the KLT points are likely to be relatively reliable, since it is probable that good correspondences have been achieved in these regions.
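A minimal sketch of such sparse KLT tracking with OpenCV is given below; the corner-detection and tracking parameters, as well as the frame names, are illustrative assumptions rather than details of our implementation.

```python
# Sketch of sparse KLT tracking: Shi-Tomasi corners tracked frame to
# frame with pyramidal Lucas-Kanade optical flow. Parameter values
# here are assumed, not those of the system described in the paper.
import cv2

prev_gray = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Select corners in textured (high-entropy) regions
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                                 qualityLevel=0.01, minDistance=7)

# Track the corners into the next frame
new_points, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, next_gray, points, None, winSize=(15, 15), maxLevel=3)

# Keep only points that were tracked successfully
tracked = new_points[status.ravel() == 1]
```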
Points corresponding to independently moving
objects are segmented using MLESAC (Torr and Zis-
serman, 2000) (Maximum Likelihood Sample Con-
sensus), based on the assumption that apparent move-
ment generated by camera egomotion is dominant.
Outliers to this dominant motion then generally cor-
respond to independently moving objects. Bounding
boxes are fitted iteratively to the segmented points un-
der the assumption that independently moving objects
are of fixed size and at different depths in the scene.
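The segmentation step can be illustrated as follows; for brevity this sketch substitutes OpenCV's RANSAC-based homography estimation for MLESAC (MLESAC scores hypotheses by likelihood rather than inlier count), and both the homography motion model and the reprojection threshold are assumptions.

```python
# Sketch of dominant-motion outlier segmentation: fit a single motion
# model to all tracked points; points inconsistent with it are likely
# to belong to independently moving objects. RANSAC stands in for
# MLESAC here, and the threshold value is an assumed choice.
import cv2
import numpy as np

def segment_independent_motion(prev_pts, curr_pts, threshold=3.0):
    """Return boolean masks of points consistent (inliers) and
    inconsistent (outliers) with the dominant image motion."""
    H, inlier_mask = cv2.findHomography(prev_pts, curr_pts,
                                        cv2.RANSAC, threshold)
    inliers = inlier_mask.ravel().astype(bool)
    return inliers, ~inliers   # outliers: candidate moving objects
```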
Segmented points are aligned with depth maps to
ascertain depths for moving object annotation. The
mode depth of these points is used to scale a bounding box, which is robustly fitted to the segmented points such that the number of inliers is maximised. To segment
more than one object, the bounding box algorithm is
reapplied to the ‘bounding box outliers’ produced in
the previous iteration. This needs to be done judi-
ciously, as these outliers may be misclassified back-
ground points that are distributed disparately in the
image. However, if a bimodal (or indeed multimodal)
distribution of foreground depths is present, objects
will be segmented in descending order of the number of points annotated at a consistent depth. In the current real-
time implementation of the navigation system, bound-
ing boxes are only fitted to the object which is most
numerously annotated at a consistent depth. This is,
almost without exception, the nearest object. Only
processing the nearest object simplifies the sound map
provided to the user, conveying only the most relevant information and making the audio feed easier
to interpret. For example, in a scene where there are
two objects present at different depths, the one with
the greater number of tracking points will be robustly
identified first, assuming both sets of tracking points
demonstrate similar variance in their depth values.
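The iterative, depth-gated bounding box fitting can be sketched as below; the histogram-based mode-depth estimate, the depth tolerance and the point-count thresholds are all assumed values for illustration.

```python
# Illustrative sketch of iterative bounding-box fitting: take the mode
# depth of the motion outliers, keep points near that depth as box
# inliers, box them, then reapply the procedure to the remaining
# 'bounding box outliers'. All tolerances are assumed values.
import numpy as np

def fit_boxes(points, depths, depth_tol=0.5, min_points=8, max_objects=3):
    """points: (N, 2) image coordinates; depths: (N,) metres."""
    boxes = []
    pts, dps = points, depths
    for _ in range(max_objects):
        if len(pts) < min_points:
            break
        # Mode depth via a coarse histogram: the most numerously
        # annotated consistent depth is segmented first
        hist, edges = np.histogram(dps, bins=20)
        k = np.argmax(hist)
        mode_depth = 0.5 * (edges[k] + edges[k + 1])
        # Inliers: points at a consistent depth
        inl = np.abs(dps - mode_depth) < depth_tol
        if inl.sum() < min_points:
            break
        sel = pts[inl]
        boxes.append((sel[:, 0].min(), sel[:, 1].min(),
                      sel[:, 0].max(), sel[:, 1].max(), mode_depth))
        # Recurse on the bounding box outliers
        pts, dps = pts[~inl], dps[~inl]
    return boxes
```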
2.1 Depth Estimation
In order to estimate the distance of objects from the
user in an efficient manner, a stereo rig with two