Coherent Selection of Independent Trackers for Real-time Object
Tracking
Salma Moujtahid, Stefan Duffner and Atilla Baskurt
Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621, Villeurbanne, France
Keywords:
Visual Object Tracking, Classifier Fusion, Tracker Selection, Online Update.
Abstract:
This paper presents a new method for combining several independent and heterogeneous tracking algorithms
for the task of online single-object tracking. The proposed algorithm runs several trackers in parallel, where
each of them relies on a different set of complementary low-level features. Only one tracker is selected at
a given frame, and the choice is based on a spatio-temporal coherence criterion and normalised confidence
estimates. The key idea is that the individual trackers are kept completely independent, which reduces the
risk of drift in situations where for example a tracker with an inaccurate or inappropriate appearance model
negatively impacts the performance of the others. Moreover, the proposed approach is able to switch between
different tracking methods when the scene conditions or the object appearance rapidly change. We experimen-
tally show with a set of Online Adaboost-based trackers that this formulation of multiple trackers improves the
tracking results in comparison to more classical combinations of trackers. We further improve the overall performance and computational efficiency by introducing a selective update step into the tracking framework.
1 INTRODUCTION
Visual tracking of arbitrary objects in complex scenes
is a challenging task that has gained increasing inter-
est in computer vision research during the past years.
We consider here the problem of online tracking of a
single object in a video stream, i.e. the object’s posi-
tion at a given time is only estimated from its previous
position(s) and previous and current video frames but
not from future observations. Also, the camera is not
required to be fixed and no prior information about
background nor the object’s appearance or motion is
used. In this setting, the object’s shape and appear-
ance and the background can change considerably due
to different lighting conditions, viewing angles, defor-
mations, and partial occlusions. To successfully track
an object, the algorithm needs to dynamically adapt to
the visual changes of the object’s appearance as well
as its environment. This adaptation however bears
the risk of gradually including background informa-
tion into the object model, which leads the tracker to
drift and eventually lose the object.
1.1 Related Work
One way of increasing the robustness to changing ap-
pearance and external conditions is by using several
different appearance (or motion) models which are
(dynamically) selected or merged in a way that makes
the tracker less sensitive to these changes at a given
time. Some existing methods (Yilmaz et al., 2004;
Hua et al., 2006; Stalder et al., 2009) follow this ap-
proach by (online) training a foreground/background-
classifier based on several visual features character-
ising different visual aspects like colour, texture or
shape, e.g. Histogram of Oriented Gradients (HOG),
colour histograms, filter responses, or local descrip-
tors like SIFT, etc. However, this low-level fusion
of features leads to problems, when some of the vi-
sual attributes are suddenly altered or hidden due to
changes in lighting, the object’s view point, deforma-
tions or partial occlusions, for example. Other meth-
ods fuse different modalities in a more explicit way
and at a higher level. For example, Collins et al.
(Collins and Liu, 2005) compute likelihood (or con-
fidence) maps with different linear combinations of
RGB channels for each video frame. Then they per-
form mean shift to track a given object in each likeli-
hood map and fuse the result of each “tracker” using
the median. Yin et al. (Yin et al., 2008) extend this
approach with different features (e.g. HOG, saliency)
and using adaptive weights depending on the fore-
ground/background separability of each feature. Tri-
esch (Triesch and v. d. Malsburg, 2001) also used dif-
ferent visual cues to produce saliency maps and their
reliabilities to integrate them into a tracking result.
Some tracking methods relying on Particle Fil-
ters (Perez et al., 2004; Maggio et al., 2007; Badri-
narayanan et al., 2007; Moreno-Noguer et al., 2008;
Nickel and Stiefelhagen, 2008; Duffner et al., 2009)
combine the modalities by using different observation
likelihood functions or different state spaces and then
integrate these multiple cues in a probabilistic way.
Leichter et al. (Leichter et al., 2006) proposed a more
general probabilistic framework to fuse the output of
different independent trackers. More recently, Kwon
et al. (Kwon and Lee, 2010) proposed in their VTD
method to combine different motion and appearance
models using Interactive MCMC, and in VTS (Kwon
and Lee, 2011) to sample from a space of trackers
with different properties in order to increase the over-
all tracking robustness. However, the interaction be-
tween trackers can become relatively complex and
difficult to control. Also, many appearance models
may need to be evaluated for each video frame, which
makes these approaches computationally expensive.
When treating visual tracking as a detection or
classification problem, one can make use of so-called
ensemble classifiers for combining several classifiers
to increase the discriminative power (Kittler et al.,
1998). Avidan (Avidan, 2007), for example, proposed to use Adaboost, where several “weak” classifiers are combined into a “strong” one and updated
dynamically in order to classify each pixel as back-
ground or object. Similarly, Grabner et al. (Grabner
and Bischof, 2006) use a large dynamic pool of weak
classifiers operating on the whole object image patch
or parts of it, and only a small part of them is selected
at each frame to form the final classifier.
Recently, Bailer et al. (Bailer et al., 2014) pro-
posed a method to fuse trackers only by considering
their estimated bounding boxes of the object, using
a specific energy minimisation framework that takes
into account the global performance of each tracker
as well as trajectories over time. They show that by
combining many existing recent tracking algorithms,
the state-of-the-art can be improved.
Finally, the most similar work to ours is the one
from Stenger et al. (Stenger et al., 2009), where sev-
eral different tracking methods working with different
visual features are combined in a parallel or sequen-
tial way based on their normalised confidence values.
They test their approach on a set of trackers and all
possible pairs and triplets of them. However, the eval-
uation is performed only on a small set of videos for
the tasks of hand and face tracking, and they use an
additional hand/face detector, trained off-line. Moreover, in our work, using the same architecture (Online Adaboost) for the individual trackers as well as for the baseline methods allows for a more rigorous evaluation of the contribution of the individual trackers and of the proposed scheme for combining them.
1.2 Motivation
The problem with approaches that use likeli-
hood/confidence maps and also methods based on
tracker sampling (e.g. VTS) is that the likelihood
needs to be computed on many image positions (or
at least a search window) and possibly at different
scales, which is computationally expensive. Further-
more, existing ensemble classifiers (e.g. using Ad-
aboost) often use highly correlated weak classifiers
or features, and can only gradually adapt to changes
in object appearance or scene variations or otherwise
suffer from drift. This motivates our choice of combining several independent trackers at a higher level,
each of them using features based on different visual
aspects like colour, texture and shape. By recurrently
selecting the most suitable tracker, the overall sys-
tem can switch rapidly between appearance models
depending on the changes of the scene and the object.
The remainder of this paper is organised as fol-
lows. In Section 2, we outline the proposed approach.
In Section 3, we briefly explain the individual track-
ers used in our approach, and in Section 4, the pro-
posed algorithm for tracker selection and update is
presented. Experimental results illustrating the supe-
riority of the proposed method compared to standard
fusion methods are shown in Section 5.
2 PRINCIPLE OF THE APPROACH
The framework of the approach is illustrated in
Fig. 1. The different competing trackers work in par-
allel, each of them estimating the object’s state, i.e.
a bounding box, and the confidence in this result. In
principle, any kind of tracker can be used as long as it
provides a confidence measure, score, or probability.
Here, we choose trackers relying on the Online Ad-
aboost (OAB) algorithm (Grabner and Bischof, 2006)
enhanced with different types of features. We will
briefly describe this method and the visual features in
the following section. The confidence value is nor-
malised based on parameters that have been trained
before on a separate training data set. This normal-
isation is an important step as the different trackers
can be heterogeneous and their confidence values may
have different dynamics.
Figure 1: Procedure of the competing tracker approach. One out of L trackers is selected based on their confidence and the spatio-temporal coherence of the solution.

Then, in a given frame, a process of tracker elimination is applied based on the overall temporal and spatial coherence of the result. This filters out the
trackers outside a given limit resulting in a smoother
and spatially more coherent tracking result. In the fi-
nal step, the best tracker for the current frame is se-
lected from the remaining individual trackers based
on their normalised confidence values. All the track-
ers keep their individual states (i.e. bounding boxes)
and are re-considered for selection at each point in
time. The advantage of our tracker selection ap-
proach is that it avoids unnatural jumps of the result-
ing bounding boxes while still being able to quickly
change from a less confident tracker to a more confident one, for example when the object’s appearance
rapidly changes.
Note that we do not perform any fusion of the individual tracking results that would combine the outputs of all trackers or of a subset of them. Also there
is no direct interaction between the different track-
ers. Each tracker remains independent, and only one
tracker is selected based on the individual confidence
values and the global spatio-temporal coherence.
3 INDIVIDUAL TRACKERS
The concept of this paper is to use relatively simple and fast trackers that are based on low-level, independent and complementary features. Here, we propose
to use different OAB trackers, each of them using
a different type of visual feature which will be de-
scribed in Section 3.2. The choice of using Adaboost-
based trackers is motivated by their computational ef-
ficiency and the simplicity of using different indepen-
dent features with the same architecture. Of course,
a different tracking method (e.g. structured output
SVMs) can be used as well.
3.1 Online Adaboost
Online Adaboost (OAB) (Grabner and Bischof, 2006)
is an extension of the well-known Adaboost algo-
rithm (Freund and Schapire, 1997) for on-line learn-
ing that has been effectively applied to the on-line
tracking problem (Grabner et al., 2006) by training
a binary classifier with foreground and background
image patches. In this approach, a “strong” classifier $H$ uses a set of “selectors” $h^{sel}_n$, $n \in \{1,\dots,N\}$, each of them iteratively choosing the most discriminative features from a global pool of “weak” classifiers $h_m$, $m \in \{1,\dots,M\}$. Each $h_m(x) \in \{-1, +1\}$ performs a simple binary classification using a single low-level feature extracted from a candidate image region (explained in the following section). Then, the final strong classifier output for a given example image patch $x$ is:

$$H(x) = \Phi\!\left(\sum_{n=1}^{N} \alpha_n\, h^{sel}_n(x)\right), \qquad (1)$$

where the $\alpha_n \in (0, 1)$ are the weak classifier weights:

$$\alpha_n = \begin{cases} 0 & \text{if } e_n \geq 0.5, \\ \log\dfrac{1 - e_n}{e_n} & \text{otherwise,} \end{cases} \qquad (2)$$

with $e_n$ being the classification error of $h^{sel}_n$. The function $\Phi(\cdot)$ is a decision function (e.g. the Heaviside step function), which is of minor importance in our approach as we only use the confidence value (see Section 4.1). At each frame, the object is tracked by applying the classifier several times on image patches in a region around the last tracking position (the search window) and choosing the patch $x$ with the highest confidence (see Section 4.1). Then the feature distributions in the weak classifiers corresponding to object and background, as well as the classifiers' weights $\alpha_n$, are incrementally updated based on
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
586
the new image information and the respective classi-
fication errors. For more details we refer to (Grabner
et al., 2006).
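As a rough illustration of Eq. (1) and Eq. (2), the sketch below shows how the strong classifier's weighted vote and the selector weights could be computed; the Selector interface is a hypothetical placeholder, and the full OAB implementation of (Grabner and Bischof, 2006) additionally replaces poorly performing selectors and updates the feature distributions online.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical selector interface: each selector wraps the weak classifier it
// currently considers most discriminative and returns a label in {-1, +1}.
struct Selector {
    virtual int classify(const unsigned char* patch, int w, int h) const = 0;
    virtual ~Selector() = default;
};

// Weighted vote of the N selectors, i.e. the sum inside Eq. (1); this raw sum
// is also the confidence value used later in Eq. (3).
double strongClassifierScore(const std::vector<Selector*>& selectors,
                             const std::vector<double>& alpha,  // weights (Eq. 2)
                             const unsigned char* patch, int w, int h) {
    double score = 0.0;
    for (std::size_t n = 0; n < selectors.size(); ++n)
        score += alpha[n] * selectors[n]->classify(patch, w, h);
    return score;
}

// Selector weight from its current classification error e_n (Eq. 2): a
// selector performing no better than chance gets zero weight.
double selectorWeight(double e) {
    return (e >= 0.5) ? 0.0 : std::log((1.0 - e) / e);
}
```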
3.2 Low Level Features
As mentioned before, each individual tracker relies
on different types of features. In this paper, we pro-
pose to use three types of complementary features
(and three individual trackers) explained in the fol-
lowing. Clearly, the proposed method is not limited
to these three, and more trackers and/or other feature
types can be added. A feature is computed on a sub-
region of the image patch that corresponds to the ob-
ject’s bounding box.
Haar-like Features: as originally proposed in
(Grabner et al., 2006; Viola and Jones, 2001).
They correspond to differences of the mean pixel
intensities of adjacent rectangular regions.
Histogram of Oriented Gradient (HOG) Fea-
tures: a simplification of the original HOG (Dalal
and Triggs, 2005) where the gradients computed
on a region are quantised into a set of 8 orienta-
tions and 3 magnitudes, and the value of a single
bin is used as a feature.
Histogram Of Colour (HOC) Features: HSV
histograms of a region are computed with a 2 ×
3×8 quantisation on the Hue and Saturation chan-
nels. As for the HOG features, the value of a sin-
gle bin is used as a feature.
The feature selection (sub-regions, bins) is performed
by the standard Adaboost algorithm on a subset of
random samples.
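As an illustration of the first feature type above, the following sketch computes a two-rectangle Haar-like feature (difference of mean intensities of adjacent rectangles) from an integral image; the particular rectangle layout and the helper names are illustrative assumptions, not the exact configuration used by our trackers.

```cpp
#include <vector>

// Integral image: entry (x, y) holds the sum of all pixels above and to the
// left, so any rectangle sum costs four look-ups.
struct IntegralImage {
    int w, h;
    std::vector<double> ii;  // (w + 1) x (h + 1), first row and column are zero

    IntegralImage(const unsigned char* img, int width, int height)
        : w(width), h(height), ii((width + 1) * (height + 1), 0.0) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                ii[(y + 1) * (w + 1) + (x + 1)] =
                    img[y * w + x] + ii[y * (w + 1) + (x + 1)]
                    + ii[(y + 1) * (w + 1) + x] - ii[y * (w + 1) + x];
    }

    // Sum of pixel values inside the rectangle [x, x + rw) x [y, y + rh).
    double rectSum(int x, int y, int rw, int rh) const {
        return ii[(y + rh) * (w + 1) + (x + rw)] - ii[y * (w + 1) + (x + rw)]
             - ii[(y + rh) * (w + 1) + x] + ii[y * (w + 1) + x];
    }
};

// Two-rectangle Haar-like feature: difference of the mean intensities of the
// left and right halves of a sub-region of the object patch (rw >= 2 assumed).
double haarTwoRect(const IntegralImage& im, int x, int y, int rw, int rh) {
    const int lw = rw / 2, rwidth = rw - lw;
    const double left  = im.rectSum(x, y, lw, rh) / (lw * rh);
    const double right = im.rectSum(x + lw, y, rwidth, rh) / (rwidth * rh);
    return left - right;
}
```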
4 TRACKER COMBINATION
Each of the individual trackers alone does not per-
form very well on average, measured on a challeng-
ing tracking benchmark data set. However, as we
will show experimentally, our proposed combination
scheme of multiple trackers using different types of
features outperforms each of these individual trackers
as well as a single tracker that combines all types of
features in the classical way.
4.1 Confidence Measure and
Normalisation
A variety of confidence measures exist depending on
the tracking or classification algorithm. Here, we use
the one originally proposed for OAB. Given a set of trackers $T_k$, $k \in \{1,\dots,L\}$, the confidence $c_k$ for $T_k$ is obtained by weighting the results of each one of its selectors $h^{sel}_n$. For an example $x$, we have:

$$c_k(x) = \sum_{n=1}^{N} \alpha_n\, h^{sel}_n(x). \qquad (3)$$

This confidence measure expresses, in a way, the proportion of selectors $h^{sel}_n$ having correctly classified example $x$. Note that it is directly related to the output of the strong classifier (Eq. 1).

Having different trackers, the confidence values for each one will have different dynamics. In order to be able to compare the confidence values, we normalise them for each tracker $T_k$ using the mean $\mu_k$ and standard deviation $\sigma_k$ computed on a separate data set, ALOV300++ (Smeulders et al., 2014), similarly to (Stenger et al., 2009):

$$\tilde{c}_k = \frac{c_k - \mu_k}{\sigma_k}.$$

These videos cover various changes in illumination, transparency, clutter, occlusion, zoom, appearance, motion patterns and contrast, which makes them suitable for estimating $\mu_k$ and $\sigma_k$ in typical real-world scenarios.
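The normalisation itself is a simple standardisation; the sketch below shows how $\mu_k$ and $\sigma_k$ could be estimated offline from a tracker's raw confidence values (e.g. collected on ALOV300++) and applied at test time. The function names are hypothetical.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Offline step: estimate the mean and standard deviation of one tracker's raw
// confidence values collected over a separate training set (e.g. ALOV300++).
void estimateNormalisation(const std::vector<double>& trainConf,
                           double& mu, double& sigma) {
    const double n = static_cast<double>(trainConf.size());
    mu = std::accumulate(trainConf.begin(), trainConf.end(), 0.0) / n;
    double var = 0.0;
    for (double c : trainConf) var += (c - mu) * (c - mu);
    sigma = std::sqrt(var / n);
}

// Online step: map a raw confidence c_k to its normalised value so that the
// heterogeneous trackers become directly comparable.
double normaliseConfidence(double c, double mu, double sigma) {
    return (c - mu) / sigma;
}
```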
4.2 Spatial and Temporal Coherence
Using the normalised confidence to choose the best
tracker at a given frame t is not sufficient, as we will
see in the experiments section. When the bounding boxes of the individual trackers are far apart from each other (because one or several of them have drifted away),
jumps can occur in the overall tracking process since
no continuity is present in the choice of best confi-
dence. In order to avoid these jumps and to make the
output smoother, a spatial and temporal coherence cri-
terion is introduced, which ignores the result of those
trackers that are too far away from the previous ob-
ject’s position.
At each frame $t$, we have a set of trackers $T_k$, $k \in \{1,\dots,L\}$, that, given an example, return a confidence $c_k$ and a bounding box $B^k_t$ surrounding the object in the image. This bounding box is defined by the position of its centre $(x^k_t, y^k_t)$ and its dimensions $(w^k_t, h^k_t)$. The bounding box $B_{t-1}$ of the previous frame $t-1$ coming from the selected tracker is saved and used to compute its distance to each of the current trackers' bounding boxes $B^k_t$. Then, at frame $t$, tracker $T_k$ is (temporarily) eliminated if:

$$\max\left( \left| x^k_t - x_{t-1} \right| - \Theta_x,\; \left| y^k_t - y_{t-1} \right| - \Theta_y \right) > 0. \qquad (4)$$

$\Theta = (\Theta_x, \Theta_y)$ is a two-dimensional distance threshold proportional to the size of $B_{t-1}$, that is $\Theta_x = \beta \frac{w_{t-1}}{2}$ and $\Theta_y = \beta \frac{h_{t-1}}{2}$. The optimal coefficient $\beta$ is computed empirically by running the algorithm on the ALOV300++ data set and left constant for all of our experiments.
CoherentSelectionofIndependentTrackersforReal-timeObjectTracking
587
After this step, only a subset of trackers $T_i$, $i \in \{1,\dots,L'\}$, with $L' \leq L$, is left. The final decision on the best tracker $T_s$ with bounding box $B_t$ is given by selecting the tracker with maximal normalised confidence $\tilde{c}_s$:

$$s = \arg\max_{i \in \{1,\dots,L'\}} \tilde{c}_i. \qquad (5)$$

In the case where this subset is empty, only the maximal confidence criterion is applied, over all $L$ trackers.
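Combining Eq. (4) and Eq. (5), the selection step for one frame could be implemented as in the following sketch; the data structures are illustrative, and the fallback branch handles the case where the coherence test has eliminated all trackers.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Box { double x, y, w, h; };  // centre coordinates and size

// Returns the index s of the tracker selected at frame t, given the boxes and
// normalised confidences of the L trackers and the box B_{t-1} selected at the
// previous frame.
int selectTracker(const std::vector<Box>& boxes,
                  const std::vector<double>& normConf,
                  const Box& previous, double beta) {
    const double thetaX = beta * previous.w / 2.0;  // Theta_x
    const double thetaY = beta * previous.h / 2.0;  // Theta_y

    int best = -1, bestAny = -1;
    double bestC = 0.0, bestAnyC = 0.0;
    for (std::size_t k = 0; k < boxes.size(); ++k) {
        const double c = normConf[k];
        if (bestAny < 0 || c > bestAnyC) { bestAny = static_cast<int>(k); bestAnyC = c; }

        // Eq. (4): eliminate a tracker whose centre moved further away from the
        // previously selected box than the size-dependent threshold.
        const bool eliminated = std::fabs(boxes[k].x - previous.x) > thetaX ||
                                std::fabs(boxes[k].y - previous.y) > thetaY;
        if (!eliminated && (best < 0 || c > bestC)) { best = static_cast<int>(k); bestC = c; }
    }
    // Eq. (5) over the remaining subset; if it is empty, fall back to the
    // maximal confidence criterion over all trackers.
    return (best >= 0) ? best : bestAny;
}
```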
4.3 Selective Update
In the original OAB architecture, the trackers are
completely independent in the sense that each tracker
$T_k$, $k \in \{1,\dots,L\}$, updates its discriminative model using its resulting bounding box $B^k_t$ at frame $t$.
One update scheme would be to use the selected
tracker’s bounding box to update the remaining track-
ers. This idea proved to be ineffective due to the rapid
drift of the now co-dependent trackers. In fact, the
trackers start drifting when the background informa-
tion is falsely included into the foreground model.
Instead, we introduce a selective update strategy with a slower update rate (and still independent trackers), giving a slower learning rate of the models and a reduced computational time, which helps prevent this drifting phenomenon. To this end, we choose to update only the previously selected tracker with its resulting bounding box, since the non-selected trackers may have drifted. Thus, we perform L − 1 fewer updates per frame than in the original learning architecture.
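In terms of implementation, the selective update only changes which tracker's model is refreshed at the end of a frame; the sketch below (with hypothetical Tracker and Result types) contrasts it with the original scheme where every tracker updates on its own result.

```cpp
#include <cstddef>
#include <vector>

struct Box { double x, y, w, h; };
struct Result { Box box; double confidence; };

// Hypothetical tracker interface with an incremental (OAB) model update.
struct Tracker {
    virtual Result track(const unsigned char* frame, int w, int h) = 0;
    virtual void update(const Box& box) = 0;
    virtual ~Tracker() = default;
};

// Original scheme: every tracker updates independently on its own result
// (L updates per frame).
void updateAll(std::vector<Tracker*>& trackers, const std::vector<Result>& results) {
    for (std::size_t k = 0; k < trackers.size(); ++k)
        trackers[k]->update(results[k].box);
}

// Selective update: only the tracker selected for the current frame is updated,
// on its own bounding box (one update per frame, i.e. L - 1 fewer).
void updateSelected(std::vector<Tracker*>& trackers,
                    const std::vector<Result>& results, int selected) {
    trackers[selected]->update(results[selected].box);
}
```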
5 EXPERIMENTS
The experiments for the proposed approach are con-
ducted on the VOT2013 data set (Kristan et al., 2013).
VOT2013 is a visual object tracking challenge held
in 2013 in order to benchmark on-line tracking al-
gorithms. The data set contains 23 videos and 8416
frames. We did not use the provided VOT2013 eval-
uation framework because the capability of our pro-
posed fusion method to return to the object after los-
ing it is not taken into account in the VOT2013 frame-
work but, on the contrary, is penalised as the trackers are
stopped and reinitialised from the point of loss. Fur-
ther, the aim here is not necessarily to show that the
proposed method outperforms existing state-of-the-
art tracking algorithms. We rather want to evaluate
and show the benefit of our tracker selection approach
compared to classical tracker fusion methods.
5.1 Compared Methods
To be able to correctly evaluate the performance of
the proposed approach and to comprehend the impact
of the different components, we compared it to a cer-
tain number of baseline methods. In the following, we
describe the different methods that have been tested.
Proposed Method (PM): As described above,
the proposed approach uses L = 3 trackers de-
noted HAAR, HOG and HOC with 50 selectors
each. The spatio-temporal coherence and normalised confidence are used for tracker selection.
All the trackers are updated every frame.
Proposed Method with Selective Update
(PM+). This method is the same as PM, but
with the selective update strategy described in
Section 4.3.
Best Confidence (BC). Using the same L = 3
trackers as the proposed method, the selection of
the best tracker is only based on the normalised
confidence values.
Fusion of Features (FoF). The OAB method al-
lows different sets of low-level features to be used
together in the global pool of weak classifiers.
In the FoF baseline, one strong classifier is con-
structed based on all the different feature types.
This type of feature fusion is a classical way of
combining different types of low-level observa-
tions. In order to provide a fair comparison, we used the same number of features as in the proposed approach, i.e. 3 × 50 selectors.
Fusion of Features with Minimal Update
(FoF+). To be able to correctly compare to the
proposed method with selective update, we de-
creased the number of updates of FoF by L − 1.
To this end, the tracker is only updated one frame
out of L.
Centroid of Trackers (CoT). This is a high-level
fusion approach, where the resulting bounding
box is computed from the mean of the bounding
boxes of the individual trackers weighted by their
respective confidence values.
5.2 Evaluation Measures
To evaluate the methods mentioned above, we used
the common F-score which combines the measures of
precision and recall. Based on the areas of the result-
ing bounding box $B_t$, the ground-truth box $B^{gr}_t$ and their intersection, the F-score $f_t$ at frame $t$ is defined as:

$$f_t = 2 \times \frac{\mathrm{precision}_t \times \mathrm{recall}_t}{\mathrm{precision}_t + \mathrm{recall}_t} = \frac{2\,|B_t \cap B^{gr}_t|}{|B_t| + |B^{gr}_t|}, \qquad (6)$$

with $\mathrm{precision}_t = \frac{|B_t \cap B^{gr}_t|}{|B_t|}$ and $\mathrm{recall}_t = \frac{|B_t \cap B^{gr}_t|}{|B^{gr}_t|}$.

Table 1: Success rates for the different methods on the VOT2013 data set.

method                                     success rate
Best theoretical selection                 (93.43%)
Best Confidence (BC)                       78.32%
Fusion of Features (FoF)                   80.13%
Centroid of Trackers (CoT)                 61.21%
Proposed Method (PM)                       81.10%
Best theor. sel. + selective update        (93.86%)
Fusion of Features + min. update (FoF+)    77.45%
Proposed Method + selec. update (PM+)      83.93%

Table 2: Prediction rates for Best Confidence and the proposed method on the VOT2013 data set.

method                  prediction rate
Best Confidence (BC)    76.22%
Proposed Method (PM)    77.99%
For each video, the trackers are initialised with the
ground truth in the first frame and run until the end of
the video. At each frame, the different methods return
the F-score associated with their result. The tracking
method is considered lost when $f_t = 0$.
Using the F-score, two performance measures are introduced: success rate and prediction rate. The success rate represents the proportion of frames having $f_t > 0.1$. The relatively low threshold captures the overall tracking robustness of the method, not so much its bounding box accuracy. The prediction rate, on the other hand, represents the proportion of frames where the method correctly predicted the best tracker. In fact, knowing the ground-truth position of the object, the F-score of each individual tracker is computed; the best theoretical tracker (and F-score) at each frame is thus known. Based on this, we can compute the number of frames where a selection method has correctly predicted the best individual tracker. We allow for a 20% margin on the best F-score, i.e. the two best trackers are considered equivalent if the difference of their F-scores is lower than 20%.
To provide an upper bound, we also computed the
success rate of a best theoretical selection, i.e. the suc-
cess rate of a method that always predicts the tracker
with the highest F-score. This measure is important
as it expresses the general performance of the indi-
vidual trackers. All presented numbers are averaged
measures over 10 tracker runs.
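For illustration, the sketch below computes the per-frame F-score of Eq. (6) from axis-aligned boxes and derives the success rate; the corner-based Rect layout is an assumption made only for the overlap computation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Axis-aligned box given by its top-left corner and size (assumed layout).
struct Rect { double x, y, w, h; };

double area(const Rect& r) { return r.w * r.h; }

double intersectionArea(const Rect& a, const Rect& b) {
    const double iw = std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x);
    const double ih = std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y);
    return (iw > 0.0 && ih > 0.0) ? iw * ih : 0.0;
}

// Eq. (6): the harmonic mean of precision and recall reduces to
// 2 * |B ∩ B_gr| / (|B| + |B_gr|) for a single pair of boxes.
double fScore(const Rect& result, const Rect& groundTruth) {
    return 2.0 * intersectionArea(result, groundTruth)
               / (area(result) + area(groundTruth));
}

// Success rate: percentage of frames whose F-score exceeds the 0.1 threshold.
double successRate(const std::vector<Rect>& results,
                   const std::vector<Rect>& groundTruth) {
    int ok = 0;
    for (std::size_t t = 0; t < results.size(); ++t)
        if (fScore(results[t], groundTruth[t]) > 0.1) ++ok;
    return 100.0 * ok / static_cast<double>(results.size());
}
```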
5.3 Results
As shown in Table 1 and Table 2, the success and
prediction rate of our proposed method (PM) outper-
forms Best Confidence (BC). In Fig. 2 and Fig. 3, we
also present some results from the video ’bolt’. It is
a particularly challenging sequence due to the number
of similar objects and colours, complex background,
and change in appearance. We can see in Fig. 3 (1)
that the use of our spatio-temporal coherence criterion
eliminates the jumps that are introduced when only
using the normalised confidence (BC). Although sur-
passing BC, PM cannot avoid all the jumps. However,
thanks to the simultaneous use of the spatio-temporal
coherence and the confidence, the proposed method
has the ability to relock on the object without any re-
initialisation of the trackers. This is illustrated in Fig. 2 (a), where we can observe that the HOC tracker (green) is the only one able to correctly track the object. BC (b) produces a very discontinuous track, while PM (d) succeeds in locking onto the HOC tracker.
The proposed method also outperforms the Fusion
of Features (FoF) approach. In this approach, the fea-
tures are not independent, since they are all updated
with the same data. As a consequence, as soon as the tracker begins to drift, wrong data or noise is introduced, and it is impossible for the tracker to recover, as illustrated in Fig. 4 and Fig. 2 (c): the FoF method starts drifting at frame 5 and completely loses the object afterwards. Unlike this approach,
PM (d) uses independent trackers. Thus, the selection
scheme makes it possible to switch from one tracker
to another if it is lost. PM succeeds in tracking the ob-
ject at almost the performance of the best theoretical
results (Fig. 3).
As for the Centroid of Trackers, the low success
rate (Table 1) shows that it is not a viable method,
as the individual trackers do not always perform well
simultaneously, and a lost tracker may considerably
disturb the overall result.
Finally, the bottom of Table 1 shows the success
rates for the proposed method with selective update
(PM+) and the corresponding baseline FoF+. The se-
lective update slightly increases the performance and
produces the best results among the compared meth-
ods. In Fig. 3 (2), we can see that PM+ successfully
tracks the object whereas the FoF+ only tracks cor-
rectly for a certain time and then starts drifting and
loses the object. The selective update also improves
the speed by around 50%. The proposed method runs
at around 6 fps (with non-optimised and single-core
C++ code), where the majority of computation time
is spent on feature computation and update of the
Adaboost trackers. The combination method using
temporal and spatial coherence with confidence (PM)
adds very little computational overhead.
CoherentSelectionofIndependentTrackersforReal-timeObjectTracking
589
Figure 2: Comparison of tracking results on the “bolt” video for (a) the individual trackers (white: ground-truth, pink: Haar
tracker, blue: HOG tracker, green: HOC Tracker), (b) Best Confidence (BC), (c) Fusion of Features (FoF), and (d) the
proposed method (PM).
!"
!#
!$
!%
!&
!'
!(
!)
!*
"
" &" " " "&" # " #&" $ "
!"#$%&
%'(&
+,-./0123
45
46.7/7869-:.6;:
!"
!#
!$
!%
!&
!'
!(
!)
!*
"
" &" " " "&" # " #&" $ "
!"#$%&
%'(&
+,-./01234
5653
Figure 3: F-score measures of the “bolt” video, (1) proposed method (PM), Best confidence (BC), best theoretical results and
(2) proposed method with selective update (PM+), Fusion of Features with minimal update (FoF+).
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
590
!"
!#
!$
!%
&
& '& & & &'& " & "'& ( & ('& # & #'&
!"#$%&
%'(&
)*+,-./01
232
Figure 4: Comparison of results on the “dh” video for (a)
Fusion of Features (FoF), (b) the proposed method (PM)
where (pink: Haar, blue: HOG, green: HOC Tracker).
6 CONCLUSION
In this paper, we presented a simple method to select
from a pool of trackers the most suitable one. It in-
tegrates a spatial and temporal coherence criterion, a
consistent confidence evaluation for tracker selection
and a selective update strategy. We used OAB-based
trackers with simple low-level features. Experimen-
tal results demonstrate that, even in very challenging
sequences, the proposed method improves the overall
robustness and outperforms classical tracker combi-
nation strategies.
Future work will concentrate on introducing new and better-performing trackers as well as more low-level features (e.g. motion) to complete the set of trackers and achieve a best theoretical selection rate of 100% (cf. Table 1).
REFERENCES
Avidan, S. (2007). Ensemble tracking. IEEE Trans. on
PAMI, 29(2):261–271.
Badrinarayanan, V., Perez, P., Le Clerc, F., and Oisel, L.
(2007). Probabilistic color and adaptive multi-feature
tracking with dynamically switched priority between
cues. In Proc. of ICCV.
Bailer, C., Pagani, A., and Stricker, D. (2014). A superior
tracking approach: Building a strong tracker through
fusion. In Proc. of ECCV, pages 170–185.
Collins, R. T. and Liu, Y. (2005). On-line selection of dis-
criminative tracking features. IEEE Trans. on PAMI,
27(10):1631–1643.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In Proc. of CVPR.
Duffner, S., Odobez, J.-M., and Ricci, E. (2009). Dynamic
partitioned sampling for tracking with discriminative
features. In Proc. of BMVC, London, UK.
Freund, Y. and Schapire, R. (1997). A decision-theoretic
generalization of on-line learning and an application
to boosting. Journal of Computer and System Sci-
ences, 55(1):119–139.
Grabner, H. and Bischof, H. (2006). On-line boosting and
vision. In Proc. of CVPR, pages 260–267.
Grabner, H., Grabner, M., and Bischof, H. (2006). Real-
time tracking via on-line boosting. In Proc. of BMVC,
pages 47–56.
Hua, C., Wu, H., Chen, Q., Wada, T., and City, W. (2006).
A pixel-wise object tracking algorithm with target and
background sample. In Proc. of ICPR.
Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J. (1998).
On combining classifiers. IEEE Trans. on PAMI,
20(3):226–239.
Kristan, M., Cehovin, L., Pflugfelder, R., Nebehay, G., Fer-
nandez, G., Matas, J., and et al. (2013). The Visual
Object Tracking VOT2013 challenge results. In Proc.
of ICCV (Workshops).
Kwon, J. and Lee, K. (2010). Visual tracking decomposi-
tion. In Proc. of CVPR, pages 1269–1276.
Kwon, J. and Lee, K. (2011). Tracking by sampling track-
ers. In Proc. of ICCV.
Leichter, I., Lindenbaum, M., and Rivlin, E. (2006). A gen-
eral framework for combining visual trackers – “black
boxes” approach. IJCV, 67(3):343–363.
Maggio, E., Smeraldi, F., and Cavallaro, A. (2007). Adaptive multifeature tracking in a particle filtering framework. IEEE Trans. on Circuits and Systems for Video Technology, 17(10):1348–1359.
Moreno-Noguer, F., Sanfeliu, A., and Dimitris, S. (2008).
Dependent multiple cue integration for robust track-
ing. IEEE Trans. on PAMI, 30(4):670–685.
Nickel, K. and Stiefelhagen, R. (2008). Dynamic integra-
tion of generalized cues for person tracking. In Proc.
of ECCV, pages 514–526.
Perez, P., Vermaak, J., and Blake, A. (2004). Data fusion for visual tracking with particles. Proceedings of the IEEE, 92(3):495–513.
Smeulders, A. W. M., Chu, D. M., Cucchiara, R., Calderara, S., Dehghan, A., and Shah, M. (2014). Visual tracking: An experimental survey. IEEE Trans. on PAMI.
Stalder, S., Grabner, H., and Van Gool, L. (2009). Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition. In ICCV (WS on On-line Comp. Vis.), pages 1409–1416.
Stenger, B., Woodley, T., and Cipolla, R. (2009). Learning
to track with multiple observers. In Proc. of CVPR.
Triesch, J. and v. d. Malsburg, C. (2001). Democratic in-
tegration: Self-organized integration of adaptive cues.
Neural Computation, 13(9):2049–2074.
CoherentSelectionofIndependentTrackersforReal-timeObjectTracking
591
Viola, P. and Jones, M. (2001). Rapid object detection us-
ing a boosted cascade of simple features. In Proc. of
CVPR, volume 1.
Yilmaz, A., Li, X., and Shah, M. (2004). Object contour
tracking using level sets. In Proc. of ACCV.
Yin, Z., Porikli, F., and Collins, R. T. (2008). Likelihood
map fusion for visual object tracking. In IEEE Work-
shop on Applications of Computer Vision, pages 1–7.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
592