PRINCIPLED DETECTION-BY-CLASSIFICATION FROM MULTIPLE VIEWS

Jérôme Berclaz∗, François Fleuret∗† and Pascal Fua∗

∗ Computer Vision Laboratory, EPFL, Lausanne, Switzerland

† IDIAP Research Institute, Martigny, Switzerland

Keywords:

People detection, Classiﬁcation, Bayesian framework.

Abstract:

Machine-learning based classiﬁcation techniques have been shown to be effective at detecting objects in com-

plex scenes. However, the ﬁnal results are often obtained from the alarms produced by the classiﬁers through a

post-processing which typically relies on ad hoc heuristics. Spatially close alarms are assumed to be triggered

by the same target and grouped together.

Here we replace those heuristics by a principled Bayesian approach, which uses knowledge about both the

classiﬁer response model and the scene geometry to combine multiple classiﬁcation answers. We demonstrate

its effectiveness for multi-view pedestrian detection.

We estimate the marginal probabilities of presence of people at any location in a scene, given the responses

of classiﬁers evaluated in each view. Our approach naturally takes into account both the occlusions and the

very low metric accuracy of the classiﬁers due to their invariance to translation and scale. Results show our

method produces one order of magnitude fewer false positives than a method that is representative of typical

state-of-the-art approaches. Moreover, the framework we propose is generic and could be applied to any

detection-by-classiﬁcation task.

1 INTRODUCTION

Detection in images is often treated as a repeated clas-

siﬁcation problem. Given a two-class classiﬁer which

predicts “target present” or “target not present” from

an input signal and a candidate pose (such as location

or scale), detection is achieved by applying it for any

possible pose and collecting the ones associated with

positive responses. Such schemes often yield multiple

responses for every single true positive and therefore

require post-processing to reﬁne the outcome.

This step is usually ad hoc and involves grouping

and averaging similar poses corresponding to positive

classiﬁcations. Such a procedure is standard for de-

tecting faces (Viola and Jones, 2001; Fleuret and Ge-

man, 2002), cars (Zhao and Nevatia, 2001) and pedes-

trians (Viola et al., 2003; Leibe et al., 2005). Some

people tracking approaches also introduce temporal

consistency to combine the classiﬁer responses in a

stochastic manner (Okuma et al., 2004).

In this paper, we propose a statistically consistent

Bayesian approach for processing answers from re-

peated classiﬁcation algorithms. As opposed to sim-

ple grouping-and-averaging or non-maximum sup-

pression schemes that are usually applied for this step,

our method takes into account knowledge about both

the classiﬁer response model and the scene geometry,

which yields a more accurate detection with fewer false

positives.

We demonstrate our approach on the problem of

multi-people detection using several widely spaced

cameras, as illustrated by Fig. 1. In this applica-

tion, a classiﬁer is repeatedly applied to every possi-

ble 3D pose in different camera views, which results

in one map of classiﬁer answers per camera view.

The several maps of classiﬁer answers are then post-

processed and combined by our algorithm to yield the

ﬁnal detection.

At the heart of our approach is a sophisticated ap-

plication of Bayes’ law. Using a model of the re-

sponses of a classiﬁer given the true occupancy, we

infer a posterior probability on the occupancy given

the classiﬁer responses. We will show that this lets us

combine the multiple and noisy classiﬁer responses in

separate camera views and infer accurate world coor-

dinates for our detections.

Our main contribution is thus a principled ap-

proach for processing detection-by-classiﬁcation re-

sults and generating a ﬁnal accurate detection out of

it. When applied to the problem of multi-people de-

tection using several cameras, our approach produces

one order of magnitude fewer false positives than a


Berclaz J., Fleuret F. and Fua P. (2008).

PRINCIPLED DETECTION-BY-CLASSIFICATION FROM MULTIPLE VIEWS.

In Proceedings of the Third International Conference on Computer Vision Theory and Applications, pages 375-382

DOI: 10.5220/0001081003750382

 SciTePress

Figure 1: Overview of the detection process. Video sequences are acquired by widely separated and calibrated cameras. The

ground plane of the tracked area is discretized into a ﬁnite number of locations, depicted by the black dots in the leftmost

column. (a) We ﬁrst extract from each image the rectangular sub-images that correspond to the average height of a person at

each of these locations. (b) We apply a classiﬁer trained to recognize pedestrians to each sub-image to estimate probabilities

of occupancy in the ground plane from each view independently. (c) We use the algorithm that is at the core of this paper to

combine the individual classiﬁcation score maps into a single detection score map. (d) We reproject into the original images

a person-sized rectangle located at local maxima of the probability estimate.

baseline method, that is representative of what is typ-

ically done by state-of-the-art methods. Moreover,

the framework we propose is generic and could be

used with any detection-by-classiﬁcation application,

whether single or multi view, for which a model of the

classiﬁer response is available.

2 RELATED WORK

We address a problem usually solved by simple ad

hoc solutions. Therefore, even though our frame-

work for processing detection-by-classiﬁcation re-

sults is generic, we compare it here to pedestrian de-

tection algorithms, which is the application we chose

to demonstrate our method in this paper. Some of the

multi-view pedestrian detection works we reference

below are close in spirit to our framework.

Until recently, most approaches to locating people

in video relied on recursive frame-to-frame pose es-

timation. While effective in some cases, these tech-

niques usually require manual initialization and re-

initialization if the tracking fails. As a result, there

is now increasing interest for techniques that can de-

tect people in individual frames.

A popular approach (Viola et al., 2003; Okuma

et al., 2004; Dalal and Triggs, 2005) is to use

classiﬁcation-based techniques to decide whether or

not image windows depict a person. Such global

approaches tend to be very occlusion sensitive and

bag-of-features approaches have proved more effec-

tive at detecting pedestrians monocularly in crowded

scenes (Leibe et al., 2005).

However, with the exceptions of (Khan and Shah,

Figure 2: Correspondence between camera views (left and

center pictures) and top view (right picture) is made through

rectangles computed with ground plane homographies. We

call $I_c(i)$ the rectangle on camera view c that has the average

shape and position of a pedestrian standing at location i of

the ground plane.

2006; Mittal and Davis, 2003), we are not aware of

many attempts at combining the output of detectors

across views to overcome the problems created by

occlusions in a principled way. In (Khan and Shah,

2006), the algorithm classiﬁes individual pixels as

background or part of a moving object and combines

these results across views by assuming independence

given the presence of a pedestrian at a certain ground

location. Hence, this scheme does not use a generic

pedestrian detector based on a high-level model of silhouettes

and textures. Neither does it explicitly model

the fact that a detection in one view is inﬂuenced

by the presence of distant pedestrians creating occlu-

sions, which, as we will see, can trigger many false

alarms. By contrast, the M2Tracker (Mittal and Davis,

2003) explicitly models the relation between multiple

pedestrians and the image at the pixel level, thus nat-

urally taking occlusions into account. However, this

approach relies on temporal consistency, and since it

is based on a tight integration between the handling

of occlusions and a color-based appearance model, it

cannot be generalized to use a generic pedestrian vs.

background classiﬁer.

VISAPP 2008 - International Conference on Computer Vision Theory and Applications


Table 1: Notation.

$C$               number of cameras.
$G$               number of locations in the ground plane ($\simeq 1000$).
$X_k$             boolean random variable standing for the occupancy of location $k$ on the ground plane.
$I_c$             input image from camera $c$.
$I_c(i)$          rectangular human size sub-window cropped from camera view $c$ at ground location $i$.
$\delta_c(i,j)$   horizontal distance between the centers of $I_c(i)$ and $I_c(j)$ on camera view $c$.
$n_c(i)$          neighborhood of $i$ on camera $c$, $\{ j \neq i,\ I_c(j) \cap I_c(i) \neq \emptyset \}$.
$T_c(i)$          sum of the responses of the binary decision trees at ground location $i$ in camera view $c$, thus an integer value in $\{0, \dots, N\}$ where $N$ is the number of decision trees.
$T$               vector of all the $T_c(i)$.
$Q$               the product law with the same marginals as the real posterior distribution $P(\,\cdot \mid T)$. $Q(X) = \prod_{i=1}^{G} Q(X_i)$.
$E_Q$             expectation under $X \sim Q$. $E_Q(x) = \int x\, Q(x)\, dx$.
$q_i$             the marginal probability of $Q$, i.e. $Q(X_i = 1)$.
$\|\cdot\|$       area of a sub-image.

In contrast to the approaches described above,

our method relies on classiﬁers applied on separate

views independently. We explicitly integrate occlu-

sion effects between alarms and quantitative knowl-

edge about the classiﬁer insensitivity to pose change

into a sound Bayesian framework to combine the mul-

tiple classiﬁer answers and yield the ﬁnal detection.

3 ALGORITHM

We start by giving an overview of our algorithm, be-

fore going into more details in the following subsec-

tions. We use notations summarized in Table 1.

In our setup, an area of interest is ﬁlmed by C

widely separated and calibrated cameras. We dis-

cretize the ground plane into a regular grid of G lo-

cations separated by 25cm (Elfes, 1989), and com-

pute homographies that relate the ground plane to its

projections in the camera views. This way, we can de-

termine, for every camera view c and every location

i, the sub-image $I_c(i)$, which roughly corresponds to

the average size of a person that would be standing

at location i of the ground plane, as shown in Fig. 2.
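The mapping from a ground location to its sub-image can be illustrated with a minimal sketch. It assumes two homographies per camera, one for the ground plane and one for a parallel plane at average head height; the second homography, the `aspect` ratio and all function names are hypothetical choices made for this illustration and are not taken from the paper.

```python
import numpy as np

def project(H, xy):
    """Apply a 3x3 homography H to a 2D point xy (ground coordinates -> pixels)."""
    p = H @ np.array([xy[0], xy[1], 1.0])
    return p[:2] / p[2]

def person_rectangle(H_foot, H_head, xy, aspect=0.4):
    """Return (left, top, right, bottom) of the sub-image for ground location xy.

    H_foot maps ground-plane points to the image. H_head (an assumption of
    this sketch) maps the same points lifted to average head height.
    aspect is a hypothetical width/height ratio of a standing person.
    """
    foot = project(H_foot, xy)
    head = project(H_head, xy)
    height = abs(foot[1] - head[1])
    half_w = 0.5 * aspect * height
    cx = 0.5 * (foot[0] + head[0])
    return (cx - half_w, min(head[1], foot[1]), cx + half_w, max(head[1], foot[1]))

def ground_grid(nx, ny, step=0.25):
    """Regular grid of ground locations, 25 cm apart as in the text."""
    xs, ys = np.meshgrid(np.arange(nx) * step, np.arange(ny) * step, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)
```

Calling `person_rectangle` for every row of `ground_grid(nx, ny)` and every camera would give the full set of rectangles used in the rest of the algorithm.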

Our algorithm involves two main steps:

1. For each camera c and ground plane location i, the algorithm extracts sub-image $I_c(i)$. Classifiers based on decision trees are then applied to every sub-image $I_c(i)$, as shown in Fig. 3. These classifiers have been trained to recognize pedestrians, and their answer on sub-image $I_c(i)$ can be interpreted as a rough probability of occupancy of ground plane location i, given the sub-image. This first step thus produces as many classification score maps (see third column of Fig. 1) as there are cameras and is described in §3.1.

2. The several classification score maps, generated during step 1, are now combined into a final probability of occupancy map (called hereafter detection score map), such as the one in the fourth column of Fig. 1. This represents an estimate of $P(X_k = 1 \mid I_1, \dots, I_C)$, the true marginal of the probabilities of presence at every location, given the full signal.

Figure 3: Generation of the classification score maps. Images (a), (b) and (c) show sub-windows extracted from the camera view at 3 random locations of the ground plane. Classifiers are applied to sub-images $I_c(i)$ corresponding to every ground plane location i. Images depicting background (a) produce a low classification score for the corresponding location. Images showing a badly centered pedestrian (b) produce a slightly higher score and images featuring a well centered pedestrian (c) receive a high score.

We compare two approaches for the second step.

Section 3.2 describes the first one, which is representative

of what is usually done by state-of-the-art methods.

We refer to it as the baseline because it combines

the individual classification score maps without taking

into account the interactions between the presences of

pedestrians due to occlusion. By contrast, the second

approach takes into account potential occlusions and

knowledge about the classiﬁer behavior and yields a

substantial increase in performance. It is at the core

of our contribution and is discussed in §3.3.

3.1 Classiﬁcation Score Maps

We introduce the classiﬁer we use for single-view

pedestrian detection and to compute our classiﬁcation

score maps.

3.1.1 Classifier as a Pedestrian Detector

During a learning step, we create a set of decision

trees dedicated to the classiﬁcation of rectangular im-

ages into two classes: “person” or “background”. The

binary decision trees we use as classiﬁers are based

on thresholded Haar wavelets operating on grayscale

images (Viola and Jones, 2001). They are trained using

a few thousand images of different sizes, each

of which represents either a pedestrian correctly cen-


Figure 4: The 3 images on the left show the classiﬁcation

score maps of a scene viewed under three different angles.

The right image represents the corresponding ground truth.

tered in the rectangular frame, or background, which

could be anything else.

More specifically, for every tree, several hundred

features of different scales, orientations and aspect

ratios are generated randomly and applied to our

training set. The one that best separates the two populations

according to Shannon’s entropy is kept as the

root node and the training set is split and then dropped

into two similarly-constructed sub-nodes (Breiman

et al., 1984). This process is repeated until either the

person and background sets are completely separated

or the tree reaches its maximum depth d = 5. Our classifier

consists of a forest (Breiman, 1996) of $N = 21$

decision trees built in this manner.
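The node selection step can be sketched as follows. The sketch assumes that "best separates" means minimising the size-weighted Shannon entropy of the two children, which is a common criterion but is not spelled out in the text; the function names and the array layout are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a binary label array; 0 for an empty or pure set."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def best_split(responses, labels):
    """Pick the thresholded feature whose split minimises the weighted entropy.

    responses: (n_features, n_samples) boolean array, True where the
               thresholded feature fires on a training image.
    labels:    (n_samples,) array with 1 for "person" and 0 for "background".
    Returns the index of the selected feature and its weighted entropy.
    """
    best_idx, best_h = -1, np.inf
    n = len(labels)
    for k, fires in enumerate(responses):
        left, right = labels[fires], labels[~fires]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if h < best_h:
            best_idx, best_h = k, h
    return best_idx, best_h
```

Applying `best_split` recursively to the two resulting subsets, down to depth 5 or until a subset is pure, would give one tree of the forest.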

3.1.2 Computing Classiﬁcation Score Maps

The algorithm iterates through every camera and

ground location, extracts a sub-image corresponding

to the rectangular shape of human size, and takes its

score to be the number of trees classifying the sub-

image as “person” (Fig. 3).

If we see the individual tree responses as many

i.i.d. samples of the response of an ideal classiﬁer,

the classiﬁcation score in location i is an estimate of

the probability for such a classiﬁer to respond that i is

actually occupied given the subimage at that location.

Hence, it is a good indicator of the actual occupancy.
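As a small illustration of this scoring rule, the sketch below sums binary tree votes into the integer scores and normalises them into a rough occupancy estimate. The `(C, G, N)` array layout and the function names are assumptions made for the example.

```python
import numpy as np

def classification_score_map(tree_votes):
    """Sum binary tree votes into integer scores, one per camera and location.

    tree_votes: (C, G, N) array of 0/1 votes, one per camera, ground
                location and tree (layout chosen for this sketch).
    Returns a (C, G) integer array with values in {0, ..., N}.
    """
    return np.asarray(tree_votes).sum(axis=2)

def rough_occupancy(scores, n_trees):
    """Fraction of trees voting "person": a rough probability of occupancy."""
    return scores / float(n_trees)
```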

This produces, for each camera, a map such as the

ones depicted by the third column of Fig. 1 or by the

three left pictures in Fig. 4, which assigns a voting

score to every ground location. As shown on those

ﬁgures, detected pedestrians appear as “cone shapes”

in the axis of the camera, on the classiﬁcation score

maps. This is due to the high tolerance in scale and

limited tolerance in translation of the classiﬁers, and

hinders precise localization of people. Hence the need for an

extra step, which combines classiﬁcation score maps

from different camera views into one accurate detec-

tion score map. Sections §3.2 and §3.3 present two

possible methods for this operation.

3.2 Baseline Approach

The baseline approach consists of multiplying the re-

sponses of the trees from different viewpoints. This

is essentially what the product rule used in (Khan

and Shah, 2006) does. It is more sophisticated than

a crude clustering and averaging in separated views,

since it assumes the conditional independence be-

tween the different views, given the true occupancy.

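Recall that $T_c(i)$ is an integer standing for the sum of the trees’ answers at location i on camera view c, and T is the vector of all $T_c(i)$. Formally, we have

\begin{align}
P(X_i = \alpha \mid T) &= P(X_i = \alpha \mid T_1(i), \dots, T_C(i)) \tag{1} \\
&= \frac{P(X_i = \alpha)}{P(T_1(i), \dots, T_C(i))}\, P(T_1(i), \dots, T_C(i) \mid X_i = \alpha) \tag{2} \\
&= \frac{P(X_i = \alpha)}{P(T_1(i), \dots, T_C(i))} \prod_{c} P(T_c(i) \mid X_i = \alpha). \tag{3}
\end{align}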

Equality (1) is true under the assumption that only

the responses of the trees at location i bring informa-

tion about the occupancy at that location, equality (2)

is directly Bayes’ law, and equality (3) is true under

the assumption that given the occupancy of location

i, the trees’ responses at that location from different

camera views are independent.

We then model the probability of the trees’ re-

sponse at a certain point given that it is occupied

(α = 1) by a density proportional to the number of

trees responding at that point, and the probability of

response when the location is empty (α = 0) by a con-

stant response. This leads to a ﬁnal rule that multiplies

the responses of the trees from the different view-

points to estimate a score increasing with the prob-

ability of occupancy at that point.
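A minimal sketch of this final rule, assuming the per-view scores are stored as a `(C, G)` array (a layout chosen for the example, as is the function name):

```python
import numpy as np

def baseline_detection_score(scores):
    """Multiply the per-view tree-vote sums across cameras.

    scores: (C, G) array, one row of scores per camera.
    Returns a (G,) array that increases with the estimated occupancy.
    """
    return np.prod(np.asarray(scores, dtype=float), axis=0)
```

Because the views are multiplied, a location that receives no vote in a single view gets a score of zero regardless of the other views.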

3.3 Principled Approach

The baseline method of the previous section assumes

that, given the true occupancy at a certain location,

the responses of the trees at that point for different

viewpoints are independent from each other, and are

not inﬂuenced by occupancy at other locations. As

shown in Section §4, it usually triggers many false

alarms. By contrast, our principled approach relies on

an assumption of conditional independence of the tree

responses at any location i, given the occupancy of

the full grid $(X_1, \dots, X_G)$, and no longer on $X_i$ alone.

Such an assumption is far more realistic, and leads to

an algorithm which takes into account the long-range

inﬂuence of both the occlusions between pedestrians

and the presence of an individual on the classiﬁcation

score maps, due to the invariance of the classiﬁers.

3.3.1 Conditional Marginals

We want to compute numerically, at every location

i of the ground plane, $P(X_i \mid T)$, the conditional marginal

probability of presence given the response of the

classiﬁers at all locations. We will show that comput-

ing this quantity requires P(T|X), the tree response

model given the ground occupancy. It is learnt by ap-

plying the classiﬁer on sequences for which we have a

ground truth, and is described in §3.3.2. As explained

below, there is no analytical way to obtain

$P(X_i \mid T)$ given our underlying assumptions, hence the

need to evaluate it numerically through an iterative

process. At each new iteration, the marginal probabilities

of presence $P(X_i \mid T)$ for all ground locations

i are reevaluated using their previous estimate, until

convergence.

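Let $X_{j \neq i}$ denote the vector $(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_G)$, $Q$ the product law with the same marginals as the posterior, $\forall i,\ Q(X_i = 1) = P(X_i = 1 \mid T)$, and $E_Q$ the expectation under $X \sim Q$, as summarized in Table 1. To obtain a tractable form for $P(X_i = \alpha \mid T)$ (which is the marginal $q_i$ when $\alpha = 1$), we first marginalize $X_{j \neq i}$:

\begin{align}
P(X_i = \alpha \mid T) &= \sum_{X_{j \neq i}} P(X_i = \alpha \mid T, X_{j \neq i})\, P(X_{j \neq i} \mid T) \nonumber \\
&= E\left[ P(X_i = \alpha \mid T, X_{j \neq i}) \mid T \right], \tag{4}
\end{align}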

where T is equal to the observed trees’ answers and

the only random quantity in the expectation is X.

We then apply Bayes’ law to make the model of the

trees’ answers given the true occupancy state appear

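$$P(X_i = \alpha \mid T) = E\left[ \frac{P(T \mid X_i = \alpha, X_{j \neq i})\, P(X_i = \alpha, X_{j \neq i})}{P(X_{j \neq i} \mid T)\, P(T)} \,\middle|\, T \right]. \tag{5}$$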

However, there is no analytical expression for (5), and we thus have to estimate the expectation numerically by sampling the $X_{j \neq i}$ and averaging the corresponding probability. To this end, we substitute the expectation under the true posterior law by a re-weighted expectation under a product law $Q$ whose marginals are the conditional marginals:

\begin{align}
P(X_i = \alpha \mid T) &= E_Q\left[ \frac{P(T \mid X_i = \alpha, X_{j \neq i})\, P(X_i = \alpha, X_{j \neq i})}{P(X_{j \neq i} \mid T)\, P(T)} \, \frac{P(X_{j \neq i} \mid T)}{Q(X_{j \neq i})} \right] \nonumber \\
&= E_Q\left[ \frac{P(T \mid X_i = \alpha, X_{j \neq i})}{P(T)} \, \frac{P(X_i = \alpha, X_{j \neq i})}{Q(X_{j \neq i})} \right]. \tag{6}
\end{align}

Such a formulation ensures that, when we estimate the expectation numerically, the sampling of $X_{j \neq i}$ will accumulate on the occupancy configurations consistent with the tree responses, thus leading to a far better estimate of the average with a reasonable number of samples. Finally we simplify the expression by assuming that the prior distribution is a product law (i.e. $P(X) = \prod_{i=1}^{G} P(X_i)$):

$$P(X_i = \alpha \mid T) = \frac{P(X_i = \alpha)}{P(T)}\, E_Q\left[ P(T \mid X_i = \alpha, X_{j \neq i}) \prod_{j \neq i} \frac{P(X_j)}{Q(X_j)} \right]. \tag{7}$$

We end up with an expression of each marginal as a

function of the other marginals, thus a large system of

equations to solve.

This result is intuitive: the conditional marginal probability of presence at location i given the trees’ answers can be computed by fixing $X_i$, sampling all the other $X_j$ according to the current estimate of $Q$, and averaging the corresponding probability that the trees respond what they actually respond. The more likely the value associated with $X_i$ makes the actual tree responses, the higher its probability.

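We get rid of the unknown $P(T)$ quantity by computing

$$P(X_i = 1 \mid T) = \frac{P(T)\, P(X_i = 1 \mid T)}{P(T)\, P(X_i = 0 \mid T) + P(T)\, P(X_i = 1 \mid T)},$$

where each product $P(T)\, P(X_i = \alpha \mid T)$ is obtained from (7) without knowing $P(T)$.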

In the end, we obtain a large number of equations

relating the $P(X_i = 1 \mid T)$. We can iterate these equations

to estimate the conditional marginals. After initialization

of all $q_i$ to a prior value, each of these

equations can be evaluated numerically by sampling

according to a product law Q with the current esti-

mates as marginals. Experimental results show that

with such a choice, since the sampling accumulates on

the conﬁgurations consistent with the observations, a

few tens of iterations are sufﬁcient to provide good

numerical precision. Fig. 5 shows four iterations of

the detection score map convergence process.
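The iteration can be sketched as follows. This is only an illustration under stated assumptions: the tree response model is passed in as a hypothetical callable `log_lik` returning $\log P(T \mid X)$ for the observed responses, the prior is a product law with probabilities strictly between 0 and 1, and the clipping constant, sample count and function name are arbitrary choices of the sketch.

```python
import numpy as np

def update_marginals(q, prior, log_lik, n_samples=200, rng=None):
    """One sweep of the fixed-point iteration on the marginals q_i.

    q:       (G,) current estimates of P(X_i = 1 | T).
    prior:   (G,) prior probabilities P(X_i = 1), strictly inside (0, 1).
    log_lik: callable taking a (G,) 0/1 occupancy vector and returning
             log P(T | X) for the observed responses (hypothetical interface).
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.clip(np.asarray(q, dtype=float), 1e-6, 1 - 1e-6)  # finite weights
    prior = np.asarray(prior, dtype=float)
    G = len(q)
    new_q = np.empty(G)
    for i in range(G):
        # Sample occupancy vectors from the product law Q.
        x = (rng.random((n_samples, G)) < q).astype(int)
        # log of prod_j P(X_j) / Q(X_j), with the j = i term removed.
        log_ratio = x * np.log(prior / q) + (1 - x) * np.log((1 - prior) / (1 - q))
        log_w = log_ratio.sum(axis=1) - log_ratio[:, i]
        log_terms = []
        for a in (0, 1):
            x[:, i] = a  # fix X_i = alpha, keep the sampled X_{j != i}
            ll = np.array([log_lik(row) for row in x])
            log_prior_a = np.log(prior[i] if a == 1 else 1 - prior[i])
            log_terms.append(log_prior_a + ll + log_w)
        # A common shift keeps the exponentials in range and cancels in the ratio.
        shift = max(t.max() for t in log_terms)
        r0, r1 = (np.mean(np.exp(t - shift)) for t in log_terms)
        new_q[i] = r1 / (r0 + r1)  # the unknown P(T) cancels here
    return new_q
```

Starting from `q = prior.copy()`, calling `q = update_marginals(q, prior, log_lik)` a few tens of times would mirror the iterative process described above.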

Figure 5: Example of convergence of a detection score map during the iterative estimation (panels, from left to right: iterations #2, #5, #8 and #10).

3.3.2 Tree Response Model

At the core of Equation (7) above is P(T|X), the re-

sponses of the trees given the true occupancy state,

where $X = (X_1, \dots, X_G)$. It must account for effects

such as occlusion and classiﬁer invariance. Assum-

ing that the trees’ responses are independent given the

true state, we write

$$P(T \mid X) = \prod_{c,i} P(T_c(i) \mid X). \tag{8}$$

As shown in Fig. 6, the trees’ response at position i can only be influenced by ground locations j whose corresponding sub-image $I_c(j)$ intersects $I_c(i)$. We call such locations the neighborhood $n_c(i)$ of i on camera view c. Thus, Equation (8) becomes

$$P(T \mid X) = \prod_{c,i} P(T_c(i) \mid X_i, X_{n_c(i)}), \tag{9}$$


where we simply ignore positions outside $n_c(i)$. The

classiﬁer response at location i can thus be interpreted

as a function of the presence of individuals in the

neighborhood of i, as opposed to the whole scene.
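The neighborhood of a location on one camera view follows directly from rectangle intersections, as in this minimal sketch. Rectangles are assumed to be axis-aligned tuples `(left, top, right, bottom)` in pixels; the representation and function names are choices made for the example.

```python
def intersection_area(a, b):
    """Area of the overlap of two rectangles given as (left, top, right, bottom)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def neighborhoods(rects):
    """For one camera, list for each location i the j != i whose rectangle overlaps that of i."""
    n = len(rects)
    return [
        [j for j in range(n) if j != i and intersection_area(rects[i], rects[j]) > 0]
        for i in range(n)
    ]
```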

In the rest of the section, we show how to express

(9) numerically in some simple particular cases, and

we then extend it to the general case, thus deriving a

model for the classiﬁer response.

Empty Neighborhood. If the neighborhood of i is empty (Fig. 8, (a) and (b)), the trees’ answer in i depends only on the occupancy of i. Precisely, $\forall \alpha \in \{0, 1\}$:

$$P(T_c(i) = t \mid X_i = \alpha,\ \forall j \in n_c(i),\ X_j = 0) = \mu_\alpha(t). \tag{10}$$

The functionals $\mu_0$ and $\mu_1$ are modeled as histograms estimated on training samples, and shown in Fig. 7 (a).
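Estimating such a histogram from labeled training responses takes a few lines. The sketch below assumes the training responses are available as a flat, non-empty list of integer tree-vote sums; the function name is hypothetical.

```python
import numpy as np

def response_histogram(samples, n_trees):
    """Normalised histogram over {0, ..., n_trees} of observed tree-vote sums."""
    counts = np.bincount(np.asarray(samples, dtype=int), minlength=n_trees + 1)
    return counts / counts.sum()
```

One histogram would be built from responses at empty locations and one from responses at occupied locations, both with an empty neighborhood.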

Figure 6: The left image shows the neighborhood $n_c(i)$ in the camera view and the right image shows it in top view.

One Individual in the Neighborhood. We now consider the case where only one person is present in the neighborhood of i, at location j. If location i is empty, sub-image $I_c(i)$ will contain some body parts of the person present at location j, in addition to background. This influences the classifier answer in i, in a way that depends on the “distance” between $I_c(i)$ and $I_c(j)$ in the image.

To characterize this pseudo-distance between sub-images, we define functions $\alpha(i,j) = \|I_c(i)\| / \|I_c(j)\|$ and $\beta(i,j) = \delta_c(i,j) / \sqrt{\|I_c(i)\|}$, where $\alpha(i,j)$ quantifies the size ratio between $I_c(i)$ and $I_c(j)$, and $\beta(i,j)$ their misalignment. $\delta_c(i,j)$ is described in Table 1.

With this, we obtain the tree response model $\mu'_0(t, \alpha(i,j), \beta(i,j))$, which is computed as histograms from the training samples. It is plotted in Fig. 7 (c).

We finally model the case where location i is occupied, with one person present in its neighborhood at location j. For this purpose, we have to distinguish positions from the neighborhood located “behind” i, that is, further from the camera than i, and those located closer to it. We denote the former set by $n_c^-(i)$ and the latter by $n_c^+(i)$ and illustrate them geometrically in Fig. 8.

When i is occupied, positions from $n_c^-(i)$ do not influence the classifier answer on $I_c(i)$, but positions from $n_c^+(i)$ do. As for the previous case, we define a pseudo-distance function

$$\gamma(i,j) = \frac{\|I_c(i) \cap I_c(j)\|}{\|I_c(i)\|} \cdot \left( 1 - \frac{\|I_c(i) \cap I_c(j)\|}{\|I_c(j)\|} \right)$$

with respect to the camera view, to characterize the relationship between the relative position of i and j, and the trees’ answer.
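The two area-based pseudo-distances, the size ratio and the overlap term defined for an occupied location, can be computed directly from the rectangles. The sketch assumes axis-aligned rectangles `(left, top, right, bottom)` with non-zero area; the function names are hypothetical.

```python
def area(r):
    """Area of a rectangle (left, top, right, bottom); 0 if degenerate."""
    return max(r[2] - r[0], 0.0) * max(r[3] - r[1], 0.0)

def overlap(a, b):
    """Area of the intersection of rectangles a and b."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)

def size_ratio(ri, rj):
    """alpha(i, j): area of the rectangle at i divided by the area of the one at j."""
    return area(ri) / area(rj)

def occlusion_distance(ri, rj):
    """gamma(i, j): overlap share of rectangle i times the uncovered share of rectangle j."""
    inter = overlap(ri, rj)
    return inter / area(ri) * (1.0 - inter / area(rj))
```

For example, with `ri = (0, 0, 10, 20)` and `rj = (5, 0, 25, 40)` the overlap is 100 pixels, so the size ratio is 0.25 and the overlap term is 0.5 × 0.875 = 0.4375.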

We then derive the tree response model for this last case as function $\mu'_1(t, \gamma(i,j))$, which is depicted in Fig. 7 (b). It is also computed empirically as histograms from the training samples.

Multiple Individuals in the Neighborhood. It is not trivial to extend the simplified model with at most one person in the neighborhood to the general case, because the number of neighbor locations is of the order of 100, which implies a huge number of occupancy configurations. We therefore simplify our model by assuming that only the occupied location whose sub-window overlaps $I_c(i)$ the most will influence the classifier answer in i, on camera view c. We denote by $J_c^*(i)$ the occupied location from the neighborhood of i whose corresponding sub-window covers the most of $I_c(i)$:

$$J_c^*(i) = \operatorname*{argmax}_{j \in n_c(i),\ X_j = 1} \|I_c(i) \cap I_c(j)\|. \tag{11}$$

This assumption makes the model tractable and has been found to hold empirically. Finally, we obtain as response model when the neighborhood is not empty, whether there is a single individual or several of them:

\begin{align}
P(T_c(i) = t \mid X_i = 0,\ \exists j \in n_c(i),\ X_j = 1) &= \mu'_0\big(t, \alpha(i, J_c^*(i)), \beta(i, J_c^*(i))\big) \tag{12} \\
P(T_c(i) = t \mid X_i = 1,\ \exists j \in n_c^+(i),\ X_j = 1) &= \mu'_1\big(t, \gamma(i, J_c^*(i))\big) \tag{13}
\end{align}
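The case analysis of the response model can be sketched as a small dispatch function. The sketch assumes that, for an occupied location, only neighbors closer to the camera are considered, as in the four cases of Fig. 8; the containers `neighbors` and `front`, the `overlap_area` callable and the returned tuples are hypothetical conventions chosen for the example.

```python
def dominant_neighbor(i, occupied, pool, overlap_area):
    """Occupied location in pool[i] whose rectangle overlaps that of i the most, or None."""
    cands = [j for j in pool[i] if occupied[j]]
    if not cands:
        return None
    return max(cands, key=lambda j: overlap_area(i, j))

def response_case(i, occupied, neighbors, front, overlap_area):
    """Select which histogram applies at location i on one camera view.

    neighbors[i]: all locations whose rectangle intersects that of i.
    front[i]:     the subset of neighbors[i] closer to the camera than i
                  (an assumption of this sketch for the occupied case).
    Returns ("mu", x_i, None) when no relevant neighbor is occupied,
    otherwise ("mu_prime", x_i, j_star).
    """
    pool = front if occupied[i] else neighbors
    j_star = dominant_neighbor(i, occupied, pool, overlap_area)
    if j_star is None:
        return ("mu", int(occupied[i]), None)
    return ("mu_prime", int(occupied[i]), j_star)
```

The returned tuple tells which histogram to evaluate: the empty-neighborhood model, or the one-neighbor model parameterized by the pseudo-distances to the dominant neighbor.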

4 RESULTS

To test our approach, we acquired 30 minutes of video

sequences using three outdoor cameras with overlapping

fields of view. We used a 2-minute sequence to

train the system and learn the tree response model of

§ 3.3.2 and the remainder to test it. To demonstrate

the generality of the model, we also show results in

indoor sequences that were not used for training pur-

poses. Finally, we show that our method yields mean-

ingful results even from single views.

Baseline vs. Principled Approaches. To compare

the approaches of § 3.2 and § 3.3, we randomly se-

lected 100 frames of the outdoor sequences, manually

labeled the true pedestrian locations, and compared

them to the outputs of both approaches.


Figure 7: Tree response model. (a) shows the classifier answer distributions (Person and Background, as a function of the trees’ answer) for a forest of 31 trees, (b) plots the distribution of the classifier answer as a function of $\gamma(i, j)$ and (c) displays the average trees’ answer as a function of $\alpha(i, j)$ and $\beta(i, j)$.

Figure 8: Panels (a) to (d) illustrate the four cases used by the tree response model for the grid position i, colored in white. Grid positions highlighted in gray represent the neighborhood $n_c(i)$ of position i (see also Fig. 6 right, for a top view). The visible neighborhood $n_c^+(i)$ is shown in light gray, whereas the neighborhood $n_c^-(i)$ located beyond position i is painted in dark gray. In case (a), neither location i nor its neighborhood is occupied. In case (b), location i is occupied, but its visible neighborhood $n_c^+(i)$ is empty. Note that there might or might not be people in $n_c^-(i)$. In case (c), location i is empty, but there is at least one person in its neighborhood $n_c(i)$. Finally in case (d), location i is occupied, as well as at least one of the locations in $n_c^+(i)$. As in case (b), it does not matter whether $n_c^-(i)$ is occupied.

The result depicted in Fig. 9.a shows that the

principled approach yields much better estimates of

the number of people than the baseline approach,

which triggers many false positives. When setting

the post-processing threshold so that both approaches

have about 10% of false negatives, our approach out-

performs the baseline one, by producing only about

0.06% of false positives instead of 0.81%. This re-

sult is depicted by the ROC curves of Fig. 9.b. Since

our method relies on a strong model and produces

very peaked occupancy probabilities, detection failure

cases produce incorrect occupancy maps. This

explains the crossing of the ROC curves at very high

detection rates.
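The post-processing that turns a detection score map into detections, namely thresholding and keeping local maxima, can be sketched as below. The 3x3 window, the 2D array layout and the function name are assumptions made for the illustration; the paper does not specify the neighborhood used for the maxima.

```python
import numpy as np

def detect_local_maxima(score_map, threshold):
    """Return (row, col) cells that exceed threshold and are maximal in their 3x3 window."""
    s = np.asarray(score_map, dtype=float)
    # Pad with -inf so that border cells are compared only with real neighbors.
    padded = np.pad(s, 1, mode="constant", constant_values=-np.inf)
    detections = []
    for r in range(s.shape[0]):
        for c in range(s.shape[1]):
            window = padded[r:r + 3, c:c + 3]
            if s[r, c] > threshold and s[r, c] >= window.max():
                detections.append((r, c))
    return detections
```

Sweeping `threshold` over a range of values and counting true and false positives at each setting is one way to trace a ROC curve from such maps.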

Figure 9: Comparing the performance of the baseline and principled approaches. (a) Error distribution in the estimate of the number of people present in the scene (horizontal axis: error in number of people detections). (b) ROC curves for the two methods (true positives vs. false positives, the latter on a logarithmic axis from 1e-04 to 1). These graphs demonstrate that the principled approach truly provides a better estimate of the number of people present in the scene, and a better false positives vs. false negatives ratio.

Indoor and Outdoor Sequences. Fig. 10 depicts

our results in the outdoor and indoor sequences. In

both cases, people are correctly detected in spite of

very real difﬁculties: In the outdoor images, there

are strong shadows, which could create problems for

methods based on background subtraction but do not

affect our results. The occlusions in the indoor im-

ages are very signiﬁcant but are nevertheless handled

correctly, especially when one recalls that we do not

enforce any form of temporal consistency and treat

every time frame independently.

Thanks to the tree response model of Sec-

tion 3.3.2, we can retrieve occupancy maps from

the noisy classiﬁer answers, even when using single

views, as shown in the last row of Fig. 10. The procedure

used is the same as in the multi-view case, except that

we no longer multiply the trees’ answers from multiple

cameras in Equation (8). Occlusions are no longer

handled, as evidenced by the fact that a half-hidden

person in the second image is missed. Nevertheless,

the results remain meaningful.

5 CONCLUSIONS

We have shown that explicitly computing marginal

probabilities of target presence given classiﬁer re-

sponses is more reliable and accurate than simply

averaging the responses across views for multi-view


Figure 10: Results of our algorithm on real video sequences. Each one of the ﬁrst three rows shows several views taken at

the same time instant from different angles. Boxes are located on local maxima of the estimated probabilities of occupancy.

The last column depicts the corresponding detection score map before thresholding. The last row shows two detection results

obtained from single images.

people detection purposes. This is especially true

in challenging situations with complex interactions

between true alarms due to occlusion and very low

metric accuracy in the classiﬁer responses. Exper-

iments show that this method allows for a reduc-

tion of one order of magnitude of false positives.

As a result, we have been able to demonstrate reli-

able people detection at single time frames and with-

out having to impose any temporal consistency con-

straints. Finally, our approach to post-processing

multiple classiﬁer outputs is generic and could be

applied to other detection-by-classiﬁcation problems,

for which a model of the classiﬁer response given

the true detection state is available, either directly or

through learning.

REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learn-

ing, 24(2):123–140.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J.

(1984). Classiﬁcation and Regression Trees. Chap-

man & Hall, New York.

Dalal, N. and Triggs, B. (2005). Histograms of Oriented

Gradients for Human Detection. In CVPR.

Elfes, A. (1989). Occupancy Grids: A Probabilistic Frame-

work for Robot Perception and Navigation. PhD the-

sis, Carnegie Mellon University.

Fleuret, F. and Geman, D. (2002). Fast Face Detection with

Precise Pose Estimation. In CVPR.

Khan, S. and Shah, M. (2006). A multiview approach to

tracking people in crowded scenes using a planar ho-

mography constraint. In ECCV.

Leibe, B., Seemann, E., and Schiele, B. (2005). Pedestrian

detection in crowded scenes. In CVPR.

Mittal, A. and Davis, L. (2003). M2Tracker: A multi-view

approach to segmenting and tracking people in a clut-

tered scene. IJCV.

Okuma, K., Taleghani, A., de Freitas, N., Little, J., and

Lowe, D. (2004). A boosted particle ﬁlter: multitarget

detection and tracking. In ECCV.

Viola, P. and Jones, M. (2001). Rapid Object Detection us-

ing a Boosted Cascade of Simple Features. In CVPR.

Viola, P., Jones, M., and Snow, D. (2003). Detecting pedestrians

using patterns of motion and appearance. In

ICCV.

Zhao, T. and Nevatia, R. (2001). Car detection in low reso-

lution aerial image. In ICCV.
