the optical flow to predict the new shape and position of the object in adjacent frames, in order to shrink the space of objects to explore. The objects are inserted into a layered directed acyclic graph, where the longest paths are the ones exhibiting the most "objectness". To our knowledge, their proposal achieves the best score on SegTrack2011.
The aforementioned methods do not need to align the frames: they assume that the frame rate is high enough to consider that the object moves smoothly, i.e., its position and shape are very similar in consecutive frames. Other approaches rely on frame alignment: once the frames are aligned, their subtraction highlights the areas of motion, and these areas are potential foreground objects. The main advantage of such an approach is to reduce the number of candidate areas for "objectness". (Sole et al., 2007) and (Ghanem et al.,
2012) address this problem in sports videos, where camera motions are quite smooth and typical. (Kong et al., 2010) align two video sequences in order to detect suspicious objects in streets, in videos shot from a moving car. Closer to our work, (Granados et al., 2012) propose background in-painting, i.e., removing a foreground object from a shot, based on multiple-frame alignment. Each frame is decomposed into several planes, in order to compute multiple homographies with the next and previous frames. The quality of their results is very impressive. The main drawback of the frame-alignment approach is that, if the alignment is not accurate, false positives and strong noise occur and disrupt the algorithms.
3 VabCut: A VIDEO MOTION EXTENSION OF GrabCut
The idea of VabCut is simple: whereas GrabCut works in the RGB space for segmenting still images, we propose a solution that jointly works on the motion information for segmenting video objects. State-of-the-art methods mostly separate image and motion; VabCut sets them at the same level for segmentation: the motion is considered as an extra colour.
The approach consists of two main steps: 1. compute the motion M-layer of the frame F_t using temporally close frames, 2. run the VabCut algorithm with the RGB layer and the M-layer of F_t as inputs.
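To make the 4-channel input of step 2 concrete, here is a minimal sketch (our own illustration, not the paper's implementation): the M-layer is approximated by the absolute intensity change between two consecutive frames that are assumed to be already aligned, and is stacked onto the RGB layer of F_t. In the real pipeline the M-layer would be computed after the robust alignment of Section 3.1; all function names below are ours.

```python
import numpy as np

def motion_layer(frame_prev, frame_t):
    # crude stand-in for the M-layer: per-pixel intensity change
    # between two (assumed already aligned) consecutive frames
    diff = np.abs(frame_t.astype(np.float32) - frame_prev.astype(np.float32))
    return diff.mean(axis=2)                    # average over R, G, B

def rgbm_input(frame_prev, frame_t):
    # step 2 input: the RGB layer of F_t stacked with its M-layer,
    # so that motion is handled exactly like an extra colour channel
    m = motion_layer(frame_prev, frame_t)
    return np.dstack([frame_t.astype(np.float32), m])

# toy frames: a bright 3x3 square shifts one pixel to the right
f0 = np.zeros((8, 8, 3), dtype=np.uint8); f0[2:5, 2:5] = 255
f1 = np.zeros((8, 8, 3), dtype=np.uint8); f1[2:5, 3:6] = 255
x = rgbm_input(f0, f1)
print(x.shape)  # (8, 8, 4)
```

The fourth channel is zero over the static background and high where the square entered or left a pixel, which is exactly the cue the segmentation can exploit.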
For computing the M-layer, different approaches can be used, basically direct registration or point-based registration methods. We propose here an original point-based frame-alignment solution that is fast yet robust.
3.1 Robust Frame Alignment
The frame alignment is based on an estimation of the camera motion between two frames. Two main approaches can be distinguished: direct registration and point-based registration. The first gives more precise correspondences but is usually computationally very heavy (Bergen et al., 1992); the second is faster but usually more approximate. We propose here an original point-based approach which aims to keep the computational burden low while tackling some issues of the simple point-matching approach.
A simple point-matching process consists of three steps: 1. detection of points of interest and computation of their descriptors, 2. descriptor matching, 3. outlier removal and homography computation.
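Step 2 can be illustrated with a toy, self-contained sketch (our own, not the descriptors actually used in the pipeline): nearest-neighbour matching between two synthetic descriptor sets, filtered by Lowe's ratio test to discard ambiguous correspondences. All names and parameter values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def match_descriptors(desc_a, desc_b, ratio=0.75):
    # step 2: nearest-neighbour matching with Lowe's ratio test,
    # which keeps a match only when the best candidate is clearly
    # closer than the second best
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :2]          # two closest candidates
    matches = []
    for i, (j1, j2) in enumerate(nn):
        if d[i, j1] < ratio * d[i, j2]:        # unambiguous best match
            matches.append((i, j1))
    return matches

# toy 8-D descriptors: desc_b is desc_a, slightly perturbed and shuffled
desc_a = rng.normal(size=(20, 8))
perm = rng.permutation(20)
desc_b = desc_a[perm] + rng.normal(scale=0.01, size=(20, 8))

matches = match_descriptors(desc_a, desc_b)
print(len(matches))
```

On this clean synthetic data every descriptor finds its shuffled copy; on real frames the ratio test trades recall for precision, which is why step 3 is still needed.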
Concerning the points of interest and descriptors, over the last ten years the computer vision community has been strongly influenced by David Lowe's work, especially the well-known SIFT descriptor (Lowe, 2004; Brown and Lowe, 2007). This descriptor has shown great capabilities for image matching, even under strong transformations. Since then, the family of points of interest (PoI) and descriptors has been flourishing (SURF, FAST, BRIEF, RootSIFT, etc.). For temporally close image matching, we have two concerns: first, any PoI detector and any descriptor has weaknesses (not representative, not discriminative, not robust, etc.) depending on the visual content of the images; second, the objects of interest in a video may attract the PoI detector while the background is not well described.
The aforementioned third step of a matching process is typically performed by algorithms of the RANSAC family, which are known to be robust to noise. The result of a RANSAC run is a homography matrix that aligns the two frames. However, if the number of outliers is too high compared to the number of inliers, the process fails. For example, it may mistakenly align the foreground objects, suggesting that the background has moved and the foreground objects have not.
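This sensitivity to the inlier ratio can be quantified with the standard RANSAC trial-count bound N >= log(1 - p) / log(1 - w^s), where p is the desired probability of drawing at least one outlier-free sample, w the inlier ratio, and s the minimal sample size (4 point pairs for a homography). A short sketch of the computation:

```python
import math

def ransac_trials(p_success, inlier_ratio, sample_size):
    # number of random samples needed so that, with probability
    # p_success, at least one sample contains no outlier
    w = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - w))

# a homography needs 4 point correspondences per minimal sample
for w in (0.8, 0.5, 0.2):
    print(f"inlier ratio {w}: {ransac_trials(0.99, w, 4)} trials")
```

With only 20% inliers, plausible when a large foreground object dominates the matches, thousands of trials are required, and even then the largest consensus set may correspond to the foreground motion rather than the camera motion.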
To prevent this failure, we propose an iterative and collaborative process that identifies and removes the outliers, even when they are over-represented, in order to refine the frame alignment. In a nutshell, the process is as follows: 1. compute and match multiple sets of PoI and local descriptors, 2. for each set, run the RANSAC algorithm to remove some outliers and compute a homography between the two frames, 3. evaluate the quality of each resulting homography, 4. take the worst homography as a marker of outliers, remove the potential outliers from each set, and iterate from the RANSAC step. Each step is developed in
VISAPP 2014 - International Conference on Computer Vision Theory and Applications
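The iterative steps 1-4 above can be sketched numerically. The sketch below is a schematic illustration with simplifications that are ours, not the paper's: the homography is replaced by a 2-DoF translation, the quality score of step 3 is a simple inlier ratio, and the correspondences are synthetic and noiseless. One descriptor set is dominated by background matches, the other by matches on a moving foreground object; purging the matches that agree with the worst model makes both sets converge to the camera motion.

```python
import numpy as np

rng = np.random.default_rng(1)
T_BG = np.array([5.0, -3.0])   # camera (background) motion between frames
T_FG = np.array([0.0, 12.0])   # motion of a large foreground object

def make_set(n_bg, n_fg):
    # step 1 stand-in: exact correspondences from one PoI/descriptor set
    src = rng.uniform(0, 100, (n_bg + n_fg, 2))
    dst = np.vstack([src[:n_bg] + T_BG, src[n_bg:] + T_FG])
    return src, dst

def ransac_translation(src, dst, iters=200, tol=1.0):
    # step 2, with a 2-DoF translation standing in for the homography
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                    # minimal sample: one pair
        inl = np.linalg.norm(dst - (src + t), axis=1) < tol
        if inl.sum() > best.sum():
            best = inl
    return (dst[best] - src[best]).mean(axis=0)

def inlier_ratio(src, dst, t, tol=1.0):
    # step 3: one possible quality score for an estimated model
    return float((np.linalg.norm(dst - (src + t), axis=1) < tol).mean())

# set A is background-dominated, set B is polluted by the foreground
sets = [make_set(35, 5), make_set(10, 30)]

for _ in range(3):
    models = [ransac_translation(s, d) for s, d in sets]
    scores = [inlier_ratio(s, d, t) for (s, d), t in zip(sets, models)]
    if min(scores) > 0.9:                      # all models fit well: stop
        break
    t_worst = models[int(np.argmin(scores))]
    # step 4: matches consistent with the WORST model are likely outliers
    purged = []
    for s, d in sets:
        keep = np.linalg.norm(d - (s + t_worst), axis=1) >= 1.0
        purged.append((s[keep], d[keep]))
    sets = purged

final = [ransac_translation(s, d) for s, d in sets]
print([t.round().tolist() for t in final])  # both sets now agree on T_BG
```

Here the worst-scoring model is the one fitted on the foreground-dominated set; using it as a marker removes the foreground matches from every set, so the second round of RANSAC recovers the background motion in both.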