AUTOMATIC PROCESS TO BUILD A CONTEXTUALIZED DETECTOR

Thierry Chesnais, Nicolas Allezard, Yoann Dhome and Thierry Chateau

CEA, LIST, Vision and Content Engineering Laboratory, Point Courrier 94, F-91191 Gif-sur-Yvette, France

Lasmea, UMR6602, CNRS, Blaise Pascal University, Clermont-Ferrand, France

Keywords:

Video Surveillance, Object Detection, Pedestrian Detection, Semi-supervised Learning, Oracle.

Abstract:

This article tackles the problem of real-time pedestrian detection with a stationary, uncalibrated camera. More precisely, we specialize a classifier by taking the context of the scene into account. To achieve this goal, we introduce an offline semi-supervised approach that relies on an oracle. The oracle must automatically label a video in order to produce contextualized training data. The proposed oracle is composed of several detectors, each trained on a different signal: appearance, background subtraction and optical flow. We then merge their responses and keep the most confident detections. A specialized detector is then trained on the resulting dataset. Designed to simplify the installation of camera networks, the presented method is completely automatic and does not need any knowledge about the scene.

1 INTRODUCTION

In computer vision, real-time (at least 10 frames per second) and robust object detection, in particular pedestrian detection, is still an active research topic. These algorithms are useful in video surveillance and in Advanced Driver Assistance Systems. Some of the latest advances in this field have been published in (Enzweiler and Gavrila, 2009) (Gerónimo et al., 2010) (Dollár et al., 2011). The appearance of the pedestrian class varies widely (size, posture, lighting), so it is important to consider the context of the scene when building a specialized detector.

Classical approaches to object detection are based on machine learning. Support vector machines (Vapnik, 1995) (Dalal and Triggs, 2005) and boosting algorithms are the most widely used methods. These processes consist in extracting, from a labeled training dataset, the features that best discriminate between pedestrian and background. The resulting detector then compares the selected features of a new image with those of the training database to predict the presence of a pedestrian.

However, to reach the best detection performance with these methods, the training data must be as rich as possible and must exhibit features similar to those computed during the detection step. Consequently, it is essential for the training dataset to be contextualized.

Figure 1: A detector trained on a generic dataset (left) reaches lower performance than a contextualized classifier (right) when the points of view of the training and detection datasets are too different.

Building such a dataset, with thousands of well-annotated and aligned examples, is an expensive manual task. That is why it is not realistic to collect examples for each camera during the deployment of CCTV equipment. In this case it is often necessary to use a classifier trained on a generic training dataset, hoping that this will not degrade performance too much (see figure 1).

In recent years, several approaches have been proposed to tackle the problem of automatically building a training dataset, in order to exploit the large amount of images recorded by cameras. Semi-supervised methods are often part of the solutions proposed in the literature, because they are designed to use both labeled and, above all, unlabeled data directly in the training set.

Chesnais T., Allezard N., Dhome Y. and Chateau T.
AUTOMATIC PROCESS TO BUILD A CONTEXTUALIZED DETECTOR.
DOI: 10.5220/0003822105130520
In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 513-520
ISBN: 978-989-8565-03-7
2012 SCITEPRESS (Science and Technology Publications, Lda.)

The training dataset is called contextualized when it contains a lot of specific information coming from the scene. The data can be integrated into the specialized classifier in several ways:

• offline methods collect a large database and train the classifier in one shot;

• online methods update the classifier as soon as new samples are available. The latter have been popularized in computer vision by (Grabner and Bischof, 2006).

Our goal is to propose a new semi-supervised method. Using an oracle makes it possible to automatically build a classifier adapted to the particular context of the scene. We choose to train our detector with an offline method for two reasons. Firstly, our procedure takes place when a camera network is installed. At that point we have all the time needed to collect and process a lot of examples, and we prefer to avoid training an online classifier during operation so as to keep all computing resources for detection. Secondly, even if some robust online methods exist (Leistner et al., 2009), there is still a risk of drift that does not seem compatible with long-term use.

In this study, we focus on how to build an oracle. After reviewing the most common semi-supervised methods in section 2, we describe our strategy for creating the oracle in section 3. Section 4 presents an evaluation of the proposed process, consisting of an analysis of the behaviour of the oracle and a comparison with a state-of-the-art classifier.

2 STATE OF THE ART

There are many families of semi-supervised methods. The most common approaches are self-learning, co-training and methods based on an oracle.

The self-learning approach (Rosenberg et al., 2005) consists in using the output of a classifier to annotate a new example. If the classifier is very confident about a sample, the sample is added to the training set. This method lacks robustness because it suffers from a drift problem. Mislabeled examples disrupt the classifier and change its behaviour on the next samples, which in turn makes the phenomenon worse. Moreover, if the confidence threshold used to separate the classes is too low, a lot of false positives are incorporated into the training set. On the contrary, if the threshold is too high, only perfectly identified samples, which carry little information, are kept.

Co-training, introduced by (Blum and Mitchell, 1998), is a formalism in which two classifiers are trained in parallel. Each of them uses a different and independent part of the data. For example, (Levin et al., 2003) train two classifiers, one on the appearance signal and the other on the background subtraction signal. The co-training algorithm relies on the fact that an example must receive the same label from both classifiers even if they are not trained on the same data. If one of the detectors labels a sample with confidence while the other one is unsure, the sample is added to the training set of the second classifier. During the training phase, each classifier improves its performance thanks to the confidence of the other one. In the end we obtain two well-trained detectors. Even if the detectors are independent, the problem here is that, as with self-learning, the outputs of the classifiers are still used directly to label samples. Drift problems are not completely excluded because the parts of the data are seldom independent.
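The label exchange at the heart of co-training can be sketched as follows. This is a minimal illustration, not the implementation of (Blum and Mitchell, 1998) or (Levin et al., 2003): the function name `exchange_labels`, the confidence threshold `tau` and the convention that the sign of a score gives the label are assumptions made for the example.

```python
def exchange_labels(scores_a, scores_b, tau=1.0):
    """One co-training round on unlabeled samples (illustrative sketch).

    scores_a[i] and scores_b[i] are the signed confidences of classifiers
    A and B for sample i (assumed convention: sign = label, magnitude =
    confidence). Returns the (index, label) pairs to add to the training
    set of A and to the training set of B.
    """
    new_for_a, new_for_b = [], []
    for i, (sa, sb) in enumerate(zip(scores_a, scores_b)):
        a_confident = abs(sa) >= tau
        b_confident = abs(sb) >= tau
        if a_confident and not b_confident:
            # A is sure, B is not: A provides the label for B.
            new_for_b.append((i, 1 if sa > 0 else -1))
        elif b_confident and not a_confident:
            # B is sure, A is not: B provides the label for A.
            new_for_a.append((i, 1 if sb > 0 else -1))
    return new_for_a, new_for_b
```

Because each classifier is retrained on labels produced by the other, a confident but wrong label is propagated, which is the residual drift risk of this family of methods.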

Methods based on an oracle use an external entity to build a dataset. This entity annotates all examples before adding them to the training data. The final detector does not affect the outputs of the oracle, which reduces the drift problem. The capacity of an oracle to find good samples without error determines the performance of the final classifier. If the oracle does not work well on a video, the whole system is useless. Many different oracles have already been proposed. (Wu, 2008) uses a part-based classifier applied to the appearance signal. If the oracle finds some pedestrian parts, the sample is added to the training data. One drawback of the method is that the oracle is composed of only one classifier dealing with only one signal. Another problem is the difficulty of detecting pedestrian parts and merging them. To add robustness, (Stalder et al., 2009) use an oracle with several stages. The first stage consists in detecting people in the picture. In a second stage, trackers are initialized on these detections. The authors' goal is to obtain some spatio-temporal continuity between oracle detections in order to incorporate samples that have not been detected. Contrary to Wu's approach, this makes it possible to find some hard examples. A last stage uses 3D information. The main drawback of this scheme is its sequential structure. If a stage fails, errors are inevitably passed to the next one without any possibility of correcting them.

We propose an oracle that works in a non-sequential way in order to improve robustness.

VISAPP 2012 - International Conference on Computer Vision Theory and Applications


Figure 2: Oracle diagram. The oracle is formed by three independent classiﬁers based on appearance, background subtraction

and optical ﬂow signals. Each one gives a set of detections which will be merged (see ﬁgure 3) to build a contextualized

training dataset. The resulting output is used to create a contextualized powerful detector.

3 CONTEXTUALIZATION OF A DETECTOR

In this section we describe the different steps of our method to build an oracle. As in the co-training approach, the oracle is formed by several classifiers working on different and independent signals. Unlike the approach of (Stalder et al., 2009), which uses a sequential oracle, our method is able, after a merging phase, to suppress bad detections given by each of the signals. This ability improves the quality of the training set.

3.1 Oracle

3.1.1 Speciﬁcations

The oracle must automatically annotate a video, which means finding relevant observations and their associated labels.

The oracle is a pedestrian detector whose characteristics differ from those of the contextualized detector. In addition to being real-time, the contextualized detector must detect as many pedestrians as possible with a minimal false positive rate. In other words, it must have both a high recall, $\frac{\text{number of good detections}}{\text{number of pedestrians}}$, and a high precision, $\frac{\text{number of good detections}}{\text{number of detections}}$. For the oracle, some of these constraints can be relaxed. It does not need to be real-time, since our method has two steps and the first one can take a long time. Moreover, our purpose is to build a training dataset: missing some pedestrians is not penalizing, since the video is long enough to offer a lot of positive examples. To sum up, the oracle can have a lower recall than the final detector but, in order to minimize the label noise in the contextualized dataset, it must be as precise as possible.

3.1.2 Constitution

To satisfy these specifications we use a combination of elementary classifiers (figure 2). Each of them is trained on a different and independent signal, as in the co-training method. A merging phase (figure 3) is therefore able to correct some errors by cross-validating the responses given by each classifier.

In this article we exploit three signals: appearance (gradient-based descriptor), background subtraction (Stauffer and Grimson, 1999) and optical flow (Black, 1996). Each classifier is based on a different descriptor, which implies that each one uses a different generic training dataset. The three classifiers run in parallel.

3.1.3 Building a Contextualized Training Dataset

The three previously trained classifiers build a dataset by scanning context images. For every position and scale in an image, each classifier gives a detection score. The next step of the process is to merge the confidence maps. Unfortunately the classifiers are a priori independent and their outputs are not comparable. There are two solutions: working directly on the confidence maps after normalizing them so that they are comparable, or working on the detections given by each classifier after clustering. We choose the latter. In a similar manner to (Dalal and Triggs, 2005), we use mean shift to group all the boxes that have a positive score. Each resulting fused detection has a score equal to the sum of the scores of the boxes in its group. For each image we thus get a set of detections (box and score) per classifier.
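The grouping step can be illustrated with a small mean shift sketch. It is a simplification under stated assumptions: boxes are given as (cx, cy, w, h), only the box centres are clustered (scale is ignored), the kernel is Gaussian and weighted by the scores, and the names `mean_shift_group`, `bandwidth` and `merge_tol` are hypothetical. It is not the exact procedure of the paper or of (Dalal and Triggs, 2005).

```python
import numpy as np

def mean_shift_group(boxes, scores, bandwidth=16.0, n_iter=50, merge_tol=1.0):
    """Fuse boxes with a positive score; each fused score is the sum of its group.

    boxes:  iterable of (cx, cy, w, h); scores: iterable of floats.
    Returns a list of ((cx, cy, w, h), fused_score).
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    keep = scores > 0                      # only positive scores are grouped
    boxes, scores = boxes[keep], scores[keep]
    centers = boxes[:, :2]
    modes = centers.copy()
    for _ in range(n_iter):
        # Squared distance between every current mode and every box centre.
        d2 = ((modes[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        weights = np.exp(-0.5 * d2 / bandwidth ** 2) * scores[None, :]
        modes = (weights @ centers) / weights.sum(axis=1, keepdims=True)
    # Boxes whose modes coincide (up to merge_tol) form one group.
    groups = []
    for i, mode in enumerate(modes):
        for g_mode, members in groups:
            if np.linalg.norm(g_mode - mode) < merge_tol:
                members.append(i)
                break
        else:
            groups.append((mode, [i]))
    fused = []
    for mode, members in groups:
        w, h = boxes[members, 2:].mean(axis=0)   # average size of the group
        box = (float(mode[0]), float(mode[1]), float(w), float(h))
        fused.append((box, float(scores[members].sum())))
    return fused
```

Summing the scores inside a group means that a location supported by many overlapping windows ends up with a higher fused score than an isolated response.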

The positive examples correspond to observations about which the generic classifiers are confident. The merging phase between the detectors is a delicate step. If this process is too restrictive, some hard and thus interesting examples can be missed. On the contrary, if this step is too permissive, the training data is polluted by many false positive samples.

A detection is incorporated into the dataset only if it appears in the output of several classifiers. A majority vote is performed as explained in figure 3. First, a greedy association is performed between the detections coming from the appearance detector and those coming from the background subtraction detector. Only the associated detections coming from appearance are added to the contextualized dataset. The association uses an overlap criterion between detections. Like (Everingham et al., 2009), we use the criterion $sim(B_1, B_2) = \frac{Area(B_1 \cap B_2)}{Area(B_1 \cup B_2)}$, where $B_1$ and $B_2$ are the two detection boxes being compared. If two boxes have a similarity under 0.5, they are considered not linked. A second association is then performed between the remaining appearance detections and those coming from the optical flow detector. As before, the associated appearance detections are added to the dataset, whereas the unassociated boxes are discarded. In effect, this vote checks the detections from the appearance signal, using the detections from background subtraction and optical flow as validation. These last two detectors are characterized by their lack of precision, whereas the one based on the appearance signal gives a more accurate response. That is why the merging step favours the detections coming from appearance and validates them with the two other classifiers.
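The vote described above can be sketched in a few lines. This is only an illustration under assumptions: boxes are (x1, y1, x2, y2) tuples, the greedy association picks pairs in decreasing order of overlap, and the function names are hypothetical.

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def sim(a, b):
    """Overlap criterion: area of the intersection over area of the union."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def greedy_associate(appearance, others, min_sim=0.5):
    """Indices of appearance boxes linked one-to-one to a box of `others`."""
    pairs = sorted(((sim(a, o), i, j)
                    for i, a in enumerate(appearance)
                    for j, o in enumerate(others)), reverse=True)
    used_i, used_j = set(), set()
    for s, i, j in pairs:
        if s < min_sim:          # boxes under 0.5 are considered not linked
            break
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
    return sorted(used_i)

def oracle_positives(appearance, background, flow):
    """Appearance detections validated by background subtraction or optical flow."""
    first = greedy_associate(appearance, background)
    remaining = [i for i in range(len(appearance)) if i not in first]
    second = greedy_associate([appearance[i] for i in remaining], flow)
    return sorted(first + [remaining[k] for k in second])
```

Only appearance boxes are ever returned, so the localization of the positives comes from the appearance detector while the two other signals act purely as validation.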

A last filter on the box scores is applied to the training dataset. As all detections come from the appearance signal, their scores are comparable. During this final step the least confident examples, about 50% of the total, are removed.
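A minimal sketch of this last filter, assuming each detection is a (box, score) pair and that exactly the lower half of the ranking is dropped (the text only says "about 50%"); `keep_most_confident` and `drop_ratio` are names chosen for the example.

```python
def keep_most_confident(detections, drop_ratio=0.5):
    """Rank detections by score and drop the least confident fraction."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    n_drop = int(len(ranked) * drop_ratio)
    return ranked[:len(ranked) - n_drop]
```

Ranking is meaningful here only because all remaining scores come from the same appearance classifier.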

After selecting the positive examples, we need to compute the negative samples. Our strategy consists in choosing random boxes anywhere in the image except in areas where there is at least one positive detection. As the oracle has a low recall, it does not detect all pedestrians, so some of them may end up in the negative set. This is rather unlikely, however, because an image contains far more negative examples than positive ones.

Figure 3: Workﬂow, illustrating the three classiﬁers merge

process, designed to provide positive samples.

As we are working on a static scene, a lot of observations are similar. The risk here is to create a dataset that is not rich enough, with too few hard examples. To deal with this problem, we add to the negative set some examples that contain pedestrian parts. They correspond to an image window intersecting an oracle detection. However, the two boxes must not overlap too much and must satisfy the previously defined criterion:

$sim(B_{\text{pedestrian detection}}, B_{\text{negative example}}) < 0.5$
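The two kinds of negative samples can be drawn with a simple rejection loop. This sketch assumes fixed-size boxes given as (x1, y1, x2, y2), uniform random positions, and hypothetical names (`sample_negatives`, `with_parts`, `max_tries`); the paper does not specify how the random boxes are generated.

```python
import random

def sim(a, b):
    """Intersection area over union area of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def sample_negatives(image_size, box_size, positives, n, with_parts, rng,
                     max_tries=10000):
    """Draw up to n negative boxes.

    with_parts=False: boxes that do not intersect any oracle detection.
    with_parts=True:  boxes that intersect a detection but keep sim < 0.5
                      with every detection (pedestrian parts).
    """
    img_w, img_h = image_size
    bw, bh = box_size
    out = []
    for _ in range(max_tries):
        if len(out) == n:
            break
        x = rng.randint(0, img_w - bw)
        y = rng.randint(0, img_h - bh)
        cand = (x, y, x + bw, y + bh)
        overlaps = [sim(cand, p) for p in positives]
        touches = any(o > 0 for o in overlaps)
        if with_parts:
            ok = touches and all(o < 0.5 for o in overlaps)
        else:
            ok = not touches
        if ok:
            out.append(cand)
    return out
```

For instance, `sample_negatives((200, 200), (20, 40), detections, 20, False, random.Random(0))` draws background boxes, and the same call with `with_parts=True` draws the harder examples that contain pedestrian parts.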

3.2 Building a Contextualized Detector

To create a contextualized detector, we need to train a new classifier on the contextualized training data. One possibility is to perform offline training on a dataset containing the three signals and let the boosting algorithm choose a good combination of them. This approach exploits a maximum of information from the training set. In our experiments, however, we observed that the final detector obtained with this method does not perform significantly better than a detector trained only on the appearance signal, whereas the computing time increases considerably. Consequently, we choose a simpler classifier based only on the appearance signal as the final detector.

4 EVALUATIONS

In this section we assess the method in two parts. First, we study the characteristics of the oracle presented in section 3. Second, we compare the performance of the final detector with a state-of-the-art one and show its competitiveness.

The algorithm has been tested on the freely available datasets PETS 2006 and PETS 2007. We evaluate our system with the method described in (Agarwal et al., 2004) and illustrate the results with precision-recall curves. Precision is $Pr = \frac{TP}{TP + FP}$ and recall is $R = \frac{TP}{P}$, where $TP$ is the number of true positives, $FP$ the number of false positives and $P$ the number of pedestrians. We plot $R$ as a function of $(1 - Pr)$; the optimal point is located at $(0, 1)$. The F-measure is defined as $FM = 2 \cdot \frac{Pr \cdot R}{Pr + R}$.
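These three measures translate directly into code. The sketch below only restates the definitions above; the function names and the convention of returning 0 when a denominator is empty are our own choices.

```python
def precision(tp, fp):
    """Pr = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, n_pedestrians):
    """R = TP / P, with P the number of pedestrians in the ground truth."""
    return tp / n_pedestrians if n_pedestrians > 0 else 0.0

def f_measure(pr, r):
    """FM = 2 * Pr * R / (Pr + R)."""
    return 2.0 * pr * r / (pr + r) if (pr + r) > 0 else 0.0
```

With the values reported for the contextualized detector in tables 1 and 2 (Pr = 0.90, R = 0.85 and Pr = 0.98, R = 0.90), `f_measure` gives about 0.87 and 0.94, which matches the F-measure column.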

The similarity criterion used between the ground truth (GT) and a test box (B) before the clustering is:

$sim(GT, B) = \frac{(GT_{cx} - B_{cx})^2}{(0.5 \times w(GT))^2} + \frac{(GT_{cy} - B_{cy})^2}{(0.5 \times h(GT))^2}$

with:

• $cx$ and $cy$ corresponding respectively to the abscissa and to the ordinate of the centre of a box,

• $w(GT)$ and $h(GT)$, respectively the width and the height of the ground truth box.

Two boxes are similar if $sim(GT, B) \leq 1$. This criterion can be seen as the definition of an ellipse around the centre of a ground truth box. If the centre of a detection falls inside this ellipse, the detection is considered positive. If two detections are linked to the same ground truth box, only one true positive is counted. The other boxes correspond to false detections.
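The counting rule can be sketched as follows. Two assumptions are made: boxes are (cx, cy, w, h) tuples, and the criterion is the squared (elliptic) form of the centre distance, which is how we read the description of an ellipse around the ground truth centre. The function names are hypothetical.

```python
def ellipse_sim(gt, box):
    """Normalized squared centre distance; <= 1 means inside the ellipse."""
    dx = (gt[0] - box[0]) / (0.5 * gt[2])
    dy = (gt[1] - box[1]) / (0.5 * gt[3])
    return dx * dx + dy * dy

def count_detections(ground_truth, detections):
    """Return (true positives, false positives) for one image."""
    matched = set()
    tp = fp = 0
    for det in detections:
        hits = [i for i, gt in enumerate(ground_truth)
                if ellipse_sim(gt, det) <= 1.0]
        free = [i for i in hits if i not in matched]
        if free:
            matched.add(free[0])   # first detection linked to this box
            tp += 1
        else:
            fp += 1                # duplicate or unmatched detection
    return tp, fp
```

A second detection falling in the ellipse of an already matched ground truth box is therefore counted as a false detection, as stated above.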

4.1 Characteristics of the Oracle

In this section, we study the characteristics of the oracle by checking whether it meets the specifications. As previously explained, we train three generic classifiers on the appearance, background subtraction and optical flow signals. Each classifier is trained with 400 rounds of boosting (Real AdaBoost using decision stumps (Friedman et al., 1998) (Schapire and Singer, 1999)) without a cascade. For the same detection threshold, not using a cascade increases the recall of the detector (more detections) but decreases its precision (more false positives). The precision is then improved by the classifier merging phase.
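For reference, the confidence-rated boosting of (Schapire and Singer, 1999) with decision stumps can be summarized as follows. The notation ($W_+^j$, $W_-^j$, $h_t$, $H$) is ours, and the text does not say whether a smoothing term is used, so this is a sketch of the standard formulation rather than a description of the exact training code.

```latex
% A decision stump splits the feature axis into two bins j \in \{1, 2\}.
% W_+^j, W_-^j: sums of the current weights of the positive / negative
% training examples falling in bin j; j(x) is the bin of sample x.
h_t(x) = \frac{1}{2} \ln \frac{W_+^{j(x)}}{W_-^{j(x)}}, \qquad
Z_t = 2 \sum_{j} \sqrt{W_+^j \, W_-^j}

% At each round the stump minimizing Z_t is selected, then the weights
% of the examples (x_i, y_i), y_i \in \{-1, +1\}, are updated:
w_i \leftarrow \frac{w_i \exp\left(-y_i \, h_t(x_i)\right)}{Z_t}

% Strong classifier after T = 400 rounds; a window is a detection when
% its score is positive (detection threshold set to 0):
H(x) = \sum_{t=1}^{T} h_t(x), \qquad \text{detection if } H(x) > 0
```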

The appearance classifier uses a gradient-based descriptor. We use the same descriptor for the optical flow: the horizontal and vertical components of the optical flow play the role of the horizontal and vertical components of the appearance gradient. For the background subtraction descriptor, however, we use Haar wavelets. Unfortunately, with these features alone it is impossible to know whether a homogeneous area is part of an object or just background. To solve this problem we add the mean of the current wavelet window to our descriptor.

PETS 2006: http://www.cvg.rdg.ac.uk/PETS2006/
PETS 2007: http://www.cvg.rdg.ac.uk/PETS2007/
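The role of the window mean in the background subtraction descriptor can be illustrated with a toy example. The sketch assumes a binary foreground mask, a single two-rectangle (left minus right) Haar-like feature computed with an integral image, and hypothetical function names; the actual wavelet set used in the paper is not specified.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_with_mean(ii, x, y, w, h):
    """Return (left-minus-right Haar response, mean of the window)."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    mean = rect_sum(ii, x, y, 2 * half, h) / (2 * half * h)
    return left - right, mean
```

On a window that is entirely foreground and on a window that is entirely background the Haar response is 0 in both cases, while the mean is 1 and 0 respectively, which is the ambiguity the added mean removes.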

From the INRIA person dataset, we collect 2417 positive and 25742 negative examples to train our appearance detector. This dataset has no temporal information. Consequently, we build two new independent datasets for the other classifiers. We train the background subtraction detector with 787 positive and 6466 negative samples and the optical flow detector with 776 positive and 8000 negative examples. The detection thresholds are set to 0 for all classifiers.

4.1.1 PETS 2006

We evaluate our method on view 4 of the PETS 2006 dataset. The training examples come from S2-T3-C and we test the final detector on about 1000 frames from S7-T6-B.

(a) Positive examples. (b) Negative examples possibly containing pedestrian parts.

Figure 4: Contextualized training data extracted from the

PETS 2006 dataset.

Figure 4 shows some samples of the training data after the merging of the classifiers. A large majority of the positive thumbnails actually contain a pedestrian. However, there are still two main issues:

• when several pedestrians are close, the clustering cannot always separate them correctly, which tends to misalign the resulting example;

• the size of the thumbnail is not always adapted to the object.

As we hoped, almost all the negative examples correspond to areas without pedestrians or include only limited pedestrian parts.

INRIA person dataset: http://pascal.inrialpes.fr/data/human/


On the training video (S2-T3-C, view 4), the oracle obtains a recall of 0.16 and a precision of 0.99. Note that, because of the filtering step after the merging phase, the recall value depends on the number of frames we use to build the training dataset. As expected, the oracle has a very high precision. These results are obtained without any knowledge of the scene (such as the ground plane or a 3D model of the scene) and without any threshold tuning, since all detectors share the same detection threshold (fixed to 0).

Figure 5: Precision-Recall curves illustrating detection per-

formances on PETS 2006 for the three classiﬁers forming

the oracle and the contextualized detector.

Figure 5 shows the precision-recall curves of each detector involved in the oracle and the curve of the contextualized detector. The latter only uses appearance information and is trained in the same way as the appearance detector of the oracle. The training data is the only difference between them. 1800 positive and 8000 negative examples are kept after the filtering step of the classifier merging process and are used for the training.

Table 1: Results of the different classiﬁers applied on PETS

2006 - S7-T6-B - 4.

Classifier                 Recall   Precision   F-Measure
Appearance                 0.71     0.66        0.69
Background                 0.47     0.38        0.42
Optical flow               0.30     0.26        0.28
Oracle                     0.49     0.99        0.65
Contextualized detector    0.85     0.90        0.87

Table 1 gives the recall and precision of each classifier at the operating point where its F-measure is maximized. The precision of the background subtraction and optical flow classifiers is weak. This can be explained by the fact that they are not very discriminative: their detections are spread around a target and, often, two close pedestrians are merged by the clustering. As explained when we presented the merging step, the background subtraction and optical flow detectors can be seen as presence sensors, reliable for predicting pedestrian presence but inaccurate in location, whereas the appearance detector is less robust but its detections are well localized.

4.1.2 PETS 2007

We test our algorithm on the third view of PETS 2007. Note that in this video the pedestrians are filmed from a high angle and generally appear tilted. This point of view is a good illustration of the interest of our approach because it differs from the INRIA dataset (where pedestrians are seen from the front and are upright), used to train our initial appearance-based classifier.

The contextualized training data has been built on the sequence called S03. Each detector is evaluated on the first 1000 images of the fifth video.

(a) Positive examples. (b) Negative examples possibly containing pedestrian parts.

Figure 6: Contextualized training data extracted from the

PETS 2007 dataset.

Figure 6 shows samples from the training data collected after the fusion of the classifiers forming the oracle.

Figure 7 presents the recall-precision curves for each classifier of the oracle and for the contextualized detector.

As for PETS 2006, table 2 contains the precision and recall of each classifier at the operating point where its F-measure is maximized. In the oracle case, results are only given on positive examples.

Figure 7: Precision-recall curves on PETS 2007 for the three classifiers in the oracle and the final detector.

Table 2: Results of the different classifiers applied on PETS 2007 - S05 - 3.

Classifier                 Recall   Precision   F-Measure
Appearance                 0.82     0.34        0.48
Background                 0.47     0.82        0.60
Optical flow               0.54     0.72        0.62
Oracle                     0.40     0.99        0.57
Contextualized detector    0.90     0.98        0.94

Contrary to the previous video, the classifiers based on the background subtraction and optical flow signals perform better than the one based on the appearance signal. This is easily explained:

• The examples collected on this sequence and those coming from the generic training data have very different appearances. Consequently, a classifier using only this signal performs poorly.

• There are several groups of people in this video. Classifiers that are not very discriminative are not penalized too much because, even if a detection is far from a pedestrian, another one may be caught in the detection window.

As the curves show on these two datasets, a contextualized detector reaches better recall and precision than the generic appearance-based classifier of the oracle.

4.2 Performances of the Contextualized Detector

In this section we compare the performance of the contextualized detector with a state-of-the-art one. We choose the Dalal and Triggs detector (Dalal and Triggs, 2005) available in OpenCV. We apply the same evaluation criterion.

The curves in figure 8 show the results on the two sequences used in this paper. On the PETS 2006 video both classifiers have similar results: the Dalal and Triggs detector already reaches a high level of performance. Therefore, although our approach improves the detection rate, both classifiers could be used on this sequence. On the contrary, when the video and the training dataset have very different points of view, the Dalal and Triggs detector fails to detect most pedestrians. In this case our method, using an oracle formed by basic classifiers, achieves good performance because the detector is contextualized.

(a) PETS 2006. (b) PETS 2007.

Figure 8: Recall-precision curves of our final detector (green) and the one of Dalal and Triggs (red) on PETS 2006 and 2007.

When the training data and the scene are too different, a contextualized detector improves results significantly.

5 CONCLUSIONS

We proposed a semi-supervised method that aims at automatically training a contextualized detector. To achieve this goal we create an oracle composed of several classifiers, each working on a distinct signal. The different responses are then merged to build a specialized training dataset. This set is then used to train a final detector incorporating contextualized information.

Even if our approach gives good results, several improvements are possible.

• As previously noted, the classifiers based on background subtraction and optical flow are not very precise. They are less discriminative and they tend to merge nearby detections. To mitigate this phenomenon, calibrated cameras could be used to remove aberrant detections. Unfortunately this requires a manual step during the installation of the camera network, which is why we do not use it.

• In this study, we choose to build an oracle with three signals. However, other signals can be used. For example, with a stereo camera, a classifier can be learned directly on the disparity maps and added to the oracle.

• We choose to use an offline algorithm to train our final detector. However, it would be interesting to study the behaviour of our system with online training. This has the advantage of allowing a regular update of the classifier, in the hope of coping with changes in the scene (lighting, background, etc.).

REFERENCES

Agarwal, S., Awan, A., and Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. Pattern Analysis and Machine Intelligence.

Black, M. (1996). The Robust Estimation of Multiple Motions: Parametric and Piecewise-Smooth Flow Fields. Computer Vision and Image Understanding.

Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Int. Conf. on Computer Vision and Pattern Recognition.

Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. Pattern Analysis and Machine Intelligence.

Enzweiler, M. and Gavrila, D. M. (2009). Monocular pedestrian detection: Survey and experiments. Pattern Analysis and Machine Intelligence.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2009). The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results.

Friedman, J., Hastie, T., and Tibshirani, R. (1998). Additive logistic regression: a statistical view of boosting. Annals of Statistics.

Gerónimo, D., López, A. M., Sappa, A. D., and Graf, T. (2010). Survey of pedestrian detection for advanced driver assistance systems. Pattern Analysis and Machine Intelligence.

Grabner, H. and Bischof, H. (2006). On-line boosting and vision. In Int. Conf. on Computer Vision and Pattern Recognition.

Leistner, C., Saffari, A., Roth, P. M., and Bischof, H. (2009). On robustness of on-line boosting - a competitive study. In Int. Conf. on Computer Vision - Workshop on On-line Learning for Computer Vision.

Levin, A., Viola, P., and Freund, Y. (2003). Unsupervised improvement of visual detectors using co-training. Int. Conf. on Computer Vision.

Rosenberg, C., Hebert, M., and Schneiderman, H. (2005). Semi-supervised self-training of object detection models. IEEE Workshop on Applications of Computer Vision.

Schapire, R. E. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning.

Stalder, S., Grabner, H., and Gool, L. V. (2009). Exploring context to learn scene specific object detectors. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance.

Stauffer, C. and Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. In Int. Conf. on Computer Vision and Pattern Recognition.

Vapnik, V. N. (1995). The nature of statistical learning theory. Springer.

Wu, B. (2008). Part based object detection, segmentation, and tracking by boosting simple feature based weak classifiers. http://sites.google.com/site/bowuhomepage/curriculum-vitae.
