Figure 1: Video browsing procedure.
plate to be updated from the current frame.
Recently, some authors have sought to bypass an exhaustive off-line learning stage. Purely on-line learning was proposed by Ellis et al. (Ellis et al., 2008), where a bank of local linear predictors (LLiPs), spatially distributed over the object, is learned on-line, and the appearance model of the object is learned on-the-fly by clustering sub-sampled image templates. The templates are clustered using the medoidshift algorithm. The clusters of appearance templates make it possible to identify different views or aspects of the target and to choose the bank of LLiPs best suited to the current appearance. The algorithm also evaluates the performance of individual LLiPs: when a predictor performs too poorly, it is discarded and a new predictor is learned on-line as a replacement. In contrast, our method does not discard predictors along the sequence; we incrementally train them with new object appearances in order to improve their performance.
Our learnable and adaptive tracking method, coupled with a sparsely applied SIFT (Lowe, 2004) or SURF (Bay et al., 2006) based detector, is applied to faster-than-real-time linear video browsing. The goal is to find all occurrences of an object in a movie. One possible solution to the video browsing task would be to run a general object detector in every frame. However, as observed in (Yilmaz et al., 2006; Murphy-Chutorian and Trivedi, 2009), it is preferable to combine an object detector with a tracker, both to speed up the browsing algorithm and to increase the number of true positive detections. We indeed aim at processing rates higher than real time, which allows almost interactive processing of lengthy videos. Our preliminary Matlab implementation can search through videos up to eight times faster than the real video frame rate.
2 LEARNING, TRACKING,
VALIDATION AND
INCREMENTAL LEARNING
Figure 2: A typical video scan process. Vertical red lines depict frames where the object detection was run. A red cross marks a negative detection or a tracking failure. Green lines show backward and forward object tracking. A green circle marks a positive object detection and a yellow circle a successful validation.

The user initiates the whole process by selecting a rectangular patch containing the object of interest in one image. This sample patch is artificially perturbed and
a sequential predictor is learned (Zimmermann et al.,
Computation of a few SIFT or SURF object descriptors completes the initial phase of the algorithm; see Figure 1. The scanning phase of the algorithm combines predictor-based tracking, its validation, and sparse object detection. The predictor is incrementally re-trained for new object appearances. Examples for the incremental learning are selected automatically, with no user interaction.
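The initialization step can be sketched as follows. The dense grid of integer translations, the translation-only motion model, and the single-stage least-squares regressor used here are simplifying assumptions for illustration; the method of (Zimmermann et al., 2009) learns a sequence of such predictors from randomly perturbed samples.

```python
import numpy as np

def learn_linear_predictor(image, rect, max_shift=3):
    """Learn a single-stage linear predictor from one selected patch.

    The patch is artificially perturbed by all integer translations
    within +/- max_shift; a linear map from intensity differences to
    the (known) displacement is estimated by least squares.
    """
    x0, y0, w, h = rect
    ref = image[y0:y0 + h, x0:x0 + w].astype(float).ravel()
    shifts = [(dx, dy)
              for dx in range(-max_shift, max_shift + 1)
              for dy in range(-max_shift, max_shift + 1)]
    # Each training column pairs an intensity difference with its known shift.
    H = np.stack([image[y0 + dy:y0 + dy + h, x0 + dx:x0 + dx + w]
                  .astype(float).ravel() - ref
                  for dx, dy in shifts], axis=1)
    D = np.array(shifts, dtype=float).T          # 2 x n_shifts target motions
    # Least-squares linear map: displacement ~ P @ intensity_difference
    P = D @ np.linalg.pinv(H)
    return P, ref
```

At run time, tracking one frame then amounts to sampling the intensities at the previous position, subtracting the reference vector, and multiplying by `P` to obtain the predicted displacement.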
The scanning phase starts by running the object detection every n-th frame (typically with a step of 20 frames) until the first object location is found. The tracker starts from this frame at the detected position, in both the backward and forward directions. Backward tracking scans the frames that were skipped during the detection phase and runs until a loss of track or until it reaches the frame with the last found occurrence of the object. Forward tracking runs until a loss of track or the end of the sequence. The detector starts again once the track is lost. Tracking itself is validated every m-th frame (typically every 10 frames). The scanning procedure is depicted in Figure 2.
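Assuming abstract `detect`, `track`, and `validate` callbacks (placeholders for illustration, not the paper's actual components), the control flow of the scanning phase might be sketched as:

```python
def scan_video(frames, detect, track, validate, n=20, m=10):
    """Sketch of the scanning phase: sparse detection every n-th frame,
    then backward/forward tracking validated every m-th tracked frame."""
    occurrences = {}          # frame index -> object position
    last_found = -1           # last frame with a known occurrence
    i = 0
    while i < len(frames):
        pos = detect(frames[i])
        if pos is None:
            i += n            # keep detecting sparsely
            continue
        occurrences[i] = pos
        # Backward tracking over the frames skipped by the sparse detector.
        p = pos
        for j in range(i - 1, last_found, -1):
            p = track(frames[j], p)
            if p is None or ((i - j) % m == 0 and not validate(frames[j], p)):
                break
            occurrences[j] = p
        # Forward tracking until loss of track or end of sequence.
        p = pos
        j = i + 1
        while j < len(frames):
            p = track(frames[j], p)
            if p is None or ((j - i) % m == 0 and not validate(frames[j], p)):
                break
            occurrences[j] = p
            j += 1
        last_found = max(occurrences)
        i = j + n             # resume sparse detection after the loss of track
    return occurrences
```

The dictionary of occurrences returned here corresponds to the green segments in Figure 2; the exact policies for resuming detection and validating are assumptions of this sketch.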
One object sample represents only one object appearance. The predictor is incrementally re-trained as more examples become available from the scanning procedure. Each subsequent iteration naturally scans only the images in which the object was not tracked in the preceding iterations.
Training examples for incremental learning are selected automatically. The most problematic examples are actually the most useful for incremental training of the predictor. In order to evaluate the usefulness of a particular example, we suggest a stability measure. The measure is based on a few extra predictions of the predictor on a single frame: we let the sequential predictor track the object in a single static image and observe the predictor's behavior. See Section 2.3 for more details.
The sequential linear predictor validates itself. Naturally, an object detector may also be used to validate the tracking; for example, a well-trained face detector would do the same or a better job when used to validate human face tracking. The motivation for using the
sequential predictor for validation is its extreme ef-
VISAPP 2010 - International Conference on Computer Vision Theory and Applications
468