tracked object. The most popular method in this class
is that of Comaniciu et al. (Comaniciu et al.,
2000; Comaniciu et al., 2003), where approximate
“mean shift” iterations are used to conduct the iter-
ative search. Graph cuts have also been used for illumination-invariant kernel tracking (Freedman and Turek, 2005).
These three types of tracking techniques have dif-
ferent advantages and limitations, and can serve dif-
ferent purposes. The "detect-before-track" methods can handle the entry of new objects and the exit of existing ones. They use external observations that, if of good quality, may allow robust tracking. However, this kind of tracking usually outputs only bounding boxes. By contrast, silhouette tracking has
the advantage of directly providing the segmentation
of the tracked object. With recent graph-cut techniques, convergence to the global minimum is obtained at modest computational cost. Finally, kernel tracking methods, by capturing the global color distribution of a tracked object, allow robust tracking at low cost in a wide range of color videos.
In this paper, we address the problem of multiple-object tracking and segmentation by combining the advantages of these three classes of approaches.
We suppose that, at each instant, the moving objects
are approximately known from a preprocessing al-
gorithm. Here, we use a simple background subtraction, but more complex alternatives could be applied. An important novelty of our method is that
the use of external observations does not require the
addition of a preliminary association step. The as-
sociation between the tracked objects and the obser-
vations is jointly conducted with the segmentation
and the tracking within the proposed minimization
method. The connected components of the detected
foreground mask serve as high-level observations. At
each time instant, tracked object masks are propa-
gated using their associated optical flow, which pro-
vides predictions. Color and motion distributions are
computed on the objects segmented in the previous frame and used to evaluate individual pixel likelihoods in
the current frame. We introduce for each object a
binary labeling objective function that combines all
these ingredients (low-level pixel-wise features, high-
level observations obtained via an independent detec-
tion module and motion predictions) with a contrast-
sensitive contextual regularization. The minimiza-
tion of each of these energy functions with min-
cut/max-flow provides the segmentation of one of the
tracked objects in the new frame. Our algorithm
also handles the introduction of new objects and their associated trackers. When multiple objects trigger a single detection due to their spatial vicinity, the proposed method, like most detect-before-track approaches, can get confused. To circumvent this prob-
lem, we propose to minimize a secondary multi-label
energy function which allows the individual segmen-
tation of concerned objects.
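As a concrete illustration of the preprocessing stage, the sketch below performs naive background subtraction and extracts the connected components of the resulting foreground mask, which play the role of the high-level observations. The difference threshold, the 4-connectivity choice and all function names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative foreground detection: a fixed background model, a
# per-pixel color-difference threshold, and BFS labelling of the
# 4-connected components of the foreground mask.
from collections import deque
import numpy as np

def foreground_mask(frame, background, thresh=30.0):
    """Mark pixels whose color differs enough from the background model."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff.max(axis=-1) > thresh  # max difference over RGB channels

def connected_components(mask):
    """Return the 4-connected foreground regions; each is one observation."""
    labels = -np.ones(mask.shape, dtype=int)
    comps = []
    for seed in zip(*np.nonzero(mask)):
        if labels[seed] >= 0:          # pixel already labelled
            continue
        comp, queue = [], deque([seed])
        labels[seed] = len(comps)
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and labels[ny, nx] < 0):
                    labels[ny, nx] = len(comps)
                    queue.append((ny, nx))
        comps.append(comp)
    return comps
```

Each returned component then serves as one object-level observation in the association and segmentation stage.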
In section 2, notations are introduced and an
overview of the method is given. The primary en-
ergy function associated with each tracked object is introduced in section 3. The introduction of new objects
and the handling of complete occlusions are also ex-
plained in this section. The secondary energy function
permitting the separation of objects wrongly merged
in the first stage is introduced in section 4. Exper-
imental results are reported in section 5, where we
demonstrate the ability of the method to detect, track
and precisely segment persons and groups, possibly
with partial or complete occlusions and missing ob-
servations. The experiments also demonstrate that the
second stage of minimization allows the segmentation
of individual persons when spatial proximity makes
them merge at the foreground detection level.
2 PRINCIPLE AND NOTATIONS
Throughout this paper, P denotes the set of N pixels of
a frame from an input image sequence. To each pixel
s of the image at time t is associated a feature vector
$z_{s,t} = (z_{s,t}^{(C)}, z_{s,t}^{(M)})$, where $z_{s,t}^{(C)}$ is a 3-dimensional vector in RGB color space and $z_{s,t}^{(M)}$ is a 2-dimensional vector
of optical flow values. Using an incremental multi-scale implementation of the Lucas-Kanade algorithm (Lucas and Kanade, 1981), the optical flow is in fact
only computed at pixels with sufficiently contrasted
surroundings. For the other pixels, color constitutes
the only low-level feature. However, for notational
convenience, we shall assume in the following that
optical flow is available at each pixel.
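This reliability criterion can be illustrated with the structure tensor underlying Lucas-Kanade: flow is trustworthy only where the smallest eigenvalue of the local gradient tensor is large. The window size and eigenvalue threshold below are hypothetical values, not those of the paper's implementation.

```python
# Illustrative test of "sufficiently contrasted surroundings": mark
# pixels where the 2x2 structure tensor, summed over a local window,
# has a large enough smallest eigenvalue.
import numpy as np

def flow_reliability_mask(gray, win=2, min_eig=1e-2):
    """True where the smallest structure-tensor eigenvalue exceeds min_eig."""
    gy, gx = np.gradient(gray.astype(float))
    Ixx, Iyy, Ixy = gx * gx, gy * gy, gx * gy
    H, W = gray.shape
    mask = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            ys = slice(max(0, y - win), y + win + 1)
            xs = slice(max(0, x - win), x + win + 1)
            a, b, c = Ixx[ys, xs].sum(), Iyy[ys, xs].sum(), Ixy[ys, xs].sum()
            # smallest eigenvalue of the 2x2 tensor [[a, c], [c, b]]
            lam_min = 0.5 * (a + b - np.hypot(a - b, 2 * c))
            mask[y, x] = lam_min > min_eig
    return mask
```

On a flat region the tensor is zero and the pixel is rejected; on textured regions both eigenvalues are large and flow can be estimated.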
We assume that, at time $t$, $k_t$ objects are tracked. The $i$-th object at time $t$ is denoted as $O_t^{(i)}$ and is defined as a mask of pixels, $O_t^{(i)} \subset P$.
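As a minimal sketch of how such a pixel mask can be propagated with optical flow to predict the object's position in the next frame (a mean-translation model, which is an assumption of this illustration, not necessarily the paper's exact scheme):

```python
# Illustrative prediction step: shift a boolean object mask by the
# mean optical-flow vector computed over the pixels of the mask.
import numpy as np

def propagate_mask(mask, flow):
    """Shift a boolean mask by the mean (dy, dx) flow over its pixels."""
    ys, xs = np.nonzero(mask)
    dy, dx = flow[ys, xs].mean(axis=0)   # mean flow vector over the object
    ny = np.clip(np.round(ys + dy).astype(int), 0, mask.shape[0] - 1)
    nx = np.clip(np.round(xs + dx).astype(int), 0, mask.shape[1] - 1)
    pred = np.zeros_like(mask)
    pred[ny, nx] = True
    return pred
```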
The goal of this paper is to perform both segmentation and tracking to get the object $O_t^{(i)}$ corresponding to the object $O_{t-1}^{(i)}$ of the previous frame. In contrast to sequential segmentation techniques (Juan and
Boykov, 2006; Kohli and Torr, 2005; Paragios and
Deriche, 1999), we bring in object-level “observa-
tions”. They may be of various kinds (e.g., obtained
by a class-specific object detector, or motion/color de-
tectors). Here we consider that these observations
come from a preprocessing step of background sub-
traction. Each observation amounts to a connected
component of the foreground map after background
VISAPP 2008 - International Conference on Computer Vision Theory and Applications