a graph whose nodes are the pixels of the video,
i.e. {(x,y,t) : (x,y) ∈ Ω, t ∈ {0,...,N}}. There are
two types of edges in the graph: spatial and tem-
poral ones. Spatial edges connect a pixel (x,y,t)
to its 8-neighborhood in frame t. Temporal edges
are defined using the pre-computed forward optical
flow. If the flow vector for pixel (x,y,t) is (u, v),
then we add to the graph an edge joining pixel (x,y,t)
to pixel (x′, y′, t′) = (x + [u], y + [v], t + 1), where the
square brackets denote the nearest integer. This graph
gives us the 3D neighborhood of each point (x,y,t).
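As an illustration, the graph construction just described could be sketched as follows. This is a minimal Python sketch, not the paper's implementation; the array layout of the flow field and the function name are assumptions.

```python
# Sketch of the spatio-temporal pixel graph: spatial edges link each pixel to
# its 8-neighborhood within a frame; temporal edges follow the pre-computed
# forward optical flow, rounded to the nearest integer pixel.
# "flow" is assumed to have shape (n_frames, height, width, 2) holding (u, v).
import numpy as np

def build_edges(flow):
    """Return the spatial and temporal edge lists over pixels (x, y, t)."""
    n_frames, height, width, _ = flow.shape
    spatial, temporal = [], []
    for t in range(n_frames):
        for y in range(height):
            for x in range(width):
                # Spatial edges: 8-neighborhood inside frame t.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (dx, dy) != (0, 0) and 0 <= nx < width and 0 <= ny < height:
                            spatial.append(((x, y, t), (nx, ny, t)))
                # Temporal edge: follow the flow vector into frame t + 1,
                # with [.] realized as rounding to the nearest integer.
                if t + 1 < n_frames:
                    u, v = flow[t, y, x]
                    xp, yp = int(round(x + u)), int(round(y + v))
                    if 0 <= xp < width and 0 <= yp < height:
                        temporal.append(((x, y, t), (xp, yp, t + 1)))
    return spatial, temporal
```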
This permits an easy adaptation of the algorithm in
(Koepfler et al., 1994).
For a given λ and following (Koepfler et al., 1994),
the energy is optimized with a region merging strategy
that computes a 2-normal segmentation. A 2-normal
segmentation is defined by the property that merging
any pair of its regions increases the energy of the seg-
mentation. Notice that 2-normal segmentations are
typically not local minima of the functional; however,
they are fast to compute and useful enough for our
purposes. The region merging strategy consists in it-
eratively coarsening a given pre-segmentation, which
is stored as a region-adjacency graph. Each edge of
this graph is marked by the energy gain that would
be obtained by merging the corresponding pair of re-
gions into one. Then, at each step of the algorithm
the optimal merge – the one that leads to the best im-
provement of the energy – is performed, thus reducing
the region adjacency graph by one region and one or
more edges. The energy gain is recomputed for the
neighbouring regions, and the algorithm continues as
long as some merge still improves the energy func-
tional. Note that the parameter λ of
the energy functional controls the number of regions
of the resulting segmentation. When finding 2-normal
segmentations by region merging, this parameter can
be automatically set by specifying directly the desired
number of regions.
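The merging loop above can be sketched as follows, assuming a piecewise-constant energy of Mumford–Shah type as in (Koepfler et al., 1994), where each region keeps the sum, squared sum and count of its pixel values. All names and the data layout are hypothetical.

```python
# Hypothetical sketch of greedy region merging on a region-adjacency graph.
# Energy model (assumed): sum of squared fitting errors per region plus
# lambda times the total boundary length.

def merge_gain(a, b, shared_boundary, lam):
    """Energy decrease obtained by merging regions a and b (positive = good)."""
    def sse(r):  # sum of squared errors of a region around its mean
        return r["sq"] - r["sum"] ** 2 / r["n"]
    merged = {"sum": a["sum"] + b["sum"], "sq": a["sq"] + b["sq"], "n": a["n"] + b["n"]}
    fit_increase = sse(merged) - sse(a) - sse(b)
    return lam * shared_boundary - fit_increase

def greedy_merge(regions, adjacency, lam):
    """Merge the best pair while some merge still decreases the energy.

    regions:   {id: {"sum": ..., "sq": ..., "n": ...}} per-region statistics
    adjacency: {frozenset({i, j}): shared_boundary_length}
    """
    while adjacency:
        gain, pair = max(
            (merge_gain(regions[min(p)], regions[max(p)], l, lam), p)
            for p, l in adjacency.items()
        )
        if gain <= 0:
            break  # 2-normal: merging any pair would increase the energy
        i, j = sorted(pair)
        for k in ("sum", "sq", "n"):  # absorb region j into region i
            regions[i][k] += regions[j][k]
        del regions[j]
        new_adj = {}
        for p, l in adjacency.items():  # redirect j's edges to i
            if p == pair:
                continue
            q = frozenset(i if m == j else m for m in p)
            if len(q) == 2:
                new_adj[q] = new_adj.get(q, 0) + l
        adjacency = new_adj
    return regions
```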
In practice, we do not know which value of λ
will produce a good segmentation. For that reason
we proceed as follows: we create a set of partitions
that are obtained by successively increasing λ (e.g.
dyadically). Each partition is computed by taking as
input the previously obtained partition and merging
regions as described in the previous paragraph. The
algorithm starts with a low value of λ using the time-
connected graph described above, and stops when the
trivial partition is obtained. The history of all the
mergings is stored in a (binary) tree, whose nodes rep-
resent each region of the segmentation at some itera-
tion. The leafs of this tree are the pixels of the input
video. The internal nodes of this tree are regions ap-
pearing at some iteration, and the root of the tree is
the whole video. While the tree is being built, each
node is marked with the value of λ at which the cor-
responding region has been created.
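The tree-and-cut mechanism can be sketched as follows; the node structure is hypothetical, and the real implementation may store regions differently.

```python
# Hypothetical sketch of the merge-history tree: each merge creates an
# internal node marked with the lambda value at which the region was created;
# cutting the tree at a value "lam" returns the segmentation at that scale.

class Node:
    def __init__(self, lam, left=None, right=None):
        self.lam = lam      # lambda at which this region was created
        self.left = left    # the two regions that merged into this one
        self.right = right  # (both None for a leaf, i.e. a single pixel)

def cut(node, lam):
    """Return the regions of the segmentation at scale lam."""
    if node.lam <= lam or node.left is None:
        return [node]       # this region already exists at scale lam
    return cut(node.left, lam) + cut(node.right, lam)
```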
Figure 1: This figure illustrates the concept of tubes as de-
scribed in the paper. We see a tube that ends at t, a tube that
starts at t + 1 and a tube that continues through frame t.
Once this tree is built, it can be cut at any desired
value of λ in real-time, to produce segmentations at
different scales. We call tubes the spatio-temporal re-
gions of the resulting partition (see Figure 1). The
tubes encode a temporally coherent segmentation of
the objects in the video, which can be used for sev-
eral purposes (e.g. tracking). We use them here in
order to determine potential occlusions by analyzing
their temporal boundaries. Any connected tube O has
starting and ending times, denoted by T_O^s and
T_O^e, respectively. The section of O at time t is de-
noted by O(t) = {(x, y) ∈ Ω : (x,y,t) ∈ O}. Thus O
starts (resp. ends) with the spatial region O(T_O^s) (resp.
O(T_O^e)).
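In code, these quantities follow directly from the definitions above when a tube is represented as a set of (x, y, t) triples (an assumed representation):

```python
# A tube is taken to be a set of (x, y, t) triples.

def tube_times(tube):
    """Starting time T_O^s and ending time T_O^e of a tube O."""
    times = [t for (_, _, t) in tube]
    return min(times), max(times)

def section(tube, t):
    """O(t): the spatial region occupied by tube O at time t."""
    return {(x, y) for (x, y, tt) in tube if tt == t}
```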
2.3 Intermediate Frame Interpolation
Given two video frames I_0 and I_1
at times t = 0 and
t = 1 respectively, our purpose is to interpolate an in-
termediate frame at time t = δ ∈ (0, 1). To this end,
the forward and backward motion fields are first esti-
mated. In order to handle occlusion effects
we use the information of the temporal boundaries
obtained with our time coherent segmentation algo-
rithm. Intuitively, a pixel (x,y,t) at t = 0 is forward
projected if it does not belong to a spatio-temporal
region that dies at t = 0. Similarly, a pixel (x,y,t) at
t = 1 is backward projected if it does not belong to a
spatio-temporal region that is born at t = 1. Let us now go into the
details of the algorithm.
The frame interpolation algorithm starts by mark-
ing all pixels to be interpolated as holes. Then two
stages are performed: in the first stage a forward pro-
jection is done; in the second a backward projection
is performed to (partially) fill in the holes that the first
stage may have left.
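The two stages could be sketched as follows. This is a heavily simplified sketch under assumed conventions: pixels are splatted to the nearest integer position at time δ, and the occlusion masks derived from the tubes are passed in as boolean arrays; none of these names come from the paper.

```python
# Hedged sketch of the two-stage interpolation: pixels not in dying tubes are
# forward-projected along the forward flow; remaining holes are filled by
# backward projection from t = 1, skipping tubes born there.
import numpy as np

HOLE = np.nan  # holes are marked as NaN

def interpolate(I0, I1, fwd_flow, bwd_flow, delta, dies_at_0, born_at_1):
    """Interpolate the frame at time t = delta in (0, 1).

    dies_at_0[y, x]: True when pixel (x, y, 0) lies in a tube ending at 0.
    born_at_1[y, x]: True when pixel (x, y, 1) lies in a tube starting at 1.
    """
    h, w = I0.shape
    out = np.full((h, w), HOLE)
    # Stage 1: forward projection from t = 0 (skip dying regions).
    for y in range(h):
        for x in range(w):
            if dies_at_0[y, x]:
                continue
            u, v = fwd_flow[y, x]
            xp, yp = int(round(x + delta * u)), int(round(y + delta * v))
            if 0 <= xp < w and 0 <= yp < h:
                out[yp, xp] = I0[y, x]
    # Stage 2: backward projection from t = 1 fills the remaining holes
    # (skip regions that are born at t = 1).
    for y in range(h):
        for x in range(w):
            if born_at_1[y, x]:
                continue
            u, v = bwd_flow[y, x]
            xp, yp = int(round(x + (1 - delta) * u)), int(round(y + (1 - delta) * v))
            if 0 <= xp < w and 0 <= yp < h and np.isnan(out[yp, xp]):
                out[yp, xp] = I1[y, x]
    return out
```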
For the first stage we use the forward optical flow
(u,v) from t = 0 to t = 1. Let F(t = 0) be the set
of pixels of frame t = 0 which do not belong to tubes
that end at time t = 0. This information is contained
FRAME INTERPOLATION WITH OCCLUSION DETECTION USING A TIME COHERENT SEGMENTATION