and mobile size are compared with the target and mobile size at the previous frame, respectively. If the size change (∆s) in both cases is greater than 40%, we neither adapt the target size nor modify the parameters of the neural network. In this way we avoid abrupt changes produced by the object classification, which occur in the presence of shadows, crowds or other disturbances in the motion information. Then, the fusion rules are used to decide whether it is necessary to correct the dimension and the localization of the target. We check the difference between the target and the mobile size (∆s_i) in the current frame. If it is greater than 10%, we compare the target size with the history of mobiles (at most 5 previous mobiles). If the history suggests that the mobile size is invariant, we accept the new size and the new center. Once the target dimension has been established, the neural network is trained using samples from the target and its local background region. The new probability map is computed and the new dimension of the target is determined. If the difference between the new and the previous size is greater than 10%, we assume that the neural network has failed: all neurons are removed and the network is retrained.
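A minimal sketch of these size-validation rules is given below. The object and method names (adapt_target, network.reset(), target.samples(), etc.) are illustrative assumptions rather than our actual implementation; only the thresholds (40%, 10%) and the history length (5 previous mobiles) follow the description above, and the invariance test on the mobile history is one possible interpretation.

    # Illustrative sketch of the size-adaptation rules described above.
    # All names (target, mobile, network methods) are assumptions; only the
    # thresholds (40%, 10%) and the history length (5) come from the text.

    def relative_change(new, old):
        """Relative size change |new - old| / old."""
        return abs(new - old) / old

    def adapt_target(target, mobile, mobile_history, network, max_history=5):
        # Reject abrupt changes: if either the target or the mobile size
        # changed by more than 40% since the previous frame, neither the
        # target size nor the network parameters are updated.
        if (relative_change(target.size, target.prev_size) > 0.40 or
                relative_change(mobile.size, mobile.prev_size) > 0.40):
            return

        # Fusion rule: correct the dimension and the centre only if target
        # and mobile size differ by more than 10% in the current frame.
        if relative_change(mobile.size, target.size) > 0.10:
            recent = mobile_history[-max_history:]
            # Accept the new size/centre only if the recent mobile sizes
            # suggest that the mobile size is invariant (assumed: within 10%).
            if recent and all(relative_change(m.size, mobile.size) <= 0.10
                              for m in recent):
                target.size, target.center = mobile.size, mobile.center

        # Retrain on samples from the target and its local background,
        # recompute the probability map and the resulting target dimension.
        network.train(target.samples(), target.background_samples())
        new_size = network.estimate_size(network.probability_map(target))

        # If the estimated size drifts by more than 10%, the network is
        # assumed to have failed: all neurons are removed and it is retrained.
        if relative_change(new_size, target.size) > 0.10:
            network.reset()
            network.train(target.samples(), target.background_samples())
        else:
            target.size = new_size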
After the adaptation step, the tracker checks whether there are any occlusions in the scene. Occlusion events are presented in Listing 2. We distinguish two types of occlusion. Static occlusion means that the moving target is occluded by a static item; this decision is made only if the density value of the probability map decreases by more than 10%. Dynamic occlusion is detected when the target crosses another target and the common area is greater than 60% of the area of one of them. We assume that the first target is occluded if its density value is lower than the density of the second target.
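The two occlusion tests can be sketched as follows; the data structures are assumed for illustration, while the 10% density drop and the 60% overlap threshold are those given above.

    # Illustrative sketch of the two occlusion tests. The structures are
    # assumptions; the 10% density drop and the 60% overlap come from the text.

    def static_occlusion(target):
        """Target occluded by a static item: probability-map density drops > 10%."""
        drop = (target.prev_density - target.density) / target.prev_density
        return drop > 0.10

    def dynamic_occlusion(a, b):
        """Return the occluded target when two targets cross, else None."""
        common = a.box.intersection(b.box).area
        if common > 0.60 * min(a.box.area, b.box.area):
            # The target with the lower probability-map density is assumed
            # to be the occluded one.
            return a if a.density < b.density else b
        return None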
3 EXPERIMENTAL RESULTS
We have tested many challenging video sequences to illustrate the advantages and limitations of our tracker. The online learning phase enhances the ability to track under changing background and illumination conditions, changes in appearance and scale, and improper initialization. First, the experiments were performed on data obtained from the Gerhome laboratory, which promotes research in the domain of activity monitoring and assisted living (Zouba et al., 2008). Next, we tested our approach on TREC Video Retrieval Evaluation (organized by NIST, TRECVID 2008) data obtained from the Gatwick Airport surveillance system. Below we present two example sequences. More evaluations can be found in (Bąk et al., 2008).
Gerhome Video Sequence is presented in Fig. 2. In the first frame, a target is initialized as a bounding box with a label obtained from the object classification module. During initialization, a neural network is created and an identity is assigned for each new target. Next, this neural network is trained using features computed from the target and the local background region. At frame 67 we can observe important issues. The motion information used for object initialization is not always correct, which leads to a noise track (‘2-PERSON’) caused by an illumination change. Nevertheless, if in subsequent frames the target is not confirmed by the motion information coming from the object classification, it is assumed to be noise and is removed (see the sketch below). A more important issue is that the tracked person is split into two targets (‘0-PERSON’ and ‘3-PERSON’), which is also caused by noise in the motion information. We do not apply any merging algorithm for tracked targets because it is very difficult to decide whether several targets in fact represent one real object or several different objects. The neural network is not helpful in this case either, because parts of a real object can have completely different appearance models, which prevents merging such targets. At frame 93, ‘0-PERSON’ is marked as an occluded target because the object moves behind a cupboard door. At frame 133 we show that the neural tracker is able to capture the true dimension of the object in the following frames.
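The noise-removal rule mentioned above can be sketched as follows; the names and the number of tolerated unconfirmed frames are assumptions, since the text does not give an exact value.

    # Illustrative sketch (assumed names) of the noise-removal rule: a track
    # not confirmed by the motion information from the object classification
    # in subsequent frames is treated as noise and removed.

    UNCONFIRMED_LIMIT = 3   # assumption; the text does not give the exact value

    def prune_noise_tracks(tracks, moving_regions):
        kept = []
        for track in tracks:
            if any(track.box.overlaps(region) for region in moving_regions):
                track.unconfirmed = 0            # confirmed by motion information
                kept.append(track)
            elif track.unconfirmed + 1 <= UNCONFIRMED_LIMIT:
                track.unconfirmed += 1           # keep for a few more frames
                kept.append(track)
            # otherwise: the track is assumed to be noise and is dropped
        return kept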
We also tested our approach on a long-term sequence. An elderly woman in her apartment (a real-world scene with occlusions and illumination changes) was tracked for 1 hour and 23 minutes (50,000 frames). The woman left the observed room (and came back) 7 times for short periods. The tracker was confused (identity switched) only 22 times.
TRECVID Video Sequence is presented in Fig. 3. During this complex sequence many objects cross each other. It is shown that the neural tracker is able to handle dynamic and static occlusions. The neural tracker is used to decide which targets are occluded. For each target the probability map of the overlapping area is computed and the most probable one is chosen, where the most probable target is the one with the largest density value of the probability map. During dynamic occlusion, the localization stage does not use the probability map to localize the target but relies on its history, which is based on the target displacement. The adaptation process is also suspended. The history contains the informa-