Figure 4: The 3 images on the left show the classification
score maps of a scene viewed under three different angles.
The right image represents the corresponding ground truth.
tered in the rectangular frame, or background, which
could be anything else.
More specifically, for every tree, several hundreds
of features of different scales, orientations and as-
pect ratios are generated randomly and applied to our
training set. The one that best separates the two populations
according to Shannon’s entropy is kept as the
root node and the training set is split and then dropped
into two similarly-constructed sub-nodes (Breiman
et al., 1984). This process is repeated until either the
person and background sets are completely separated
or it reaches the maximum tree depth $d = 5$. Our classifier
consists of a forest (Breiman, 1996) of $N_T = 21$
decision trees built in this manner.
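
The node-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each random feature yields one scalar response per training sample that is compared with a threshold, and the names `shannon_entropy`, `split_entropy` and `best_feature` are hypothetical.

```python
import math


def shannon_entropy(labels):
    """Shannon entropy in bits of binary labels (1 = person, 0 = background)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)


def split_entropy(responses, labels, threshold):
    """Size-weighted entropy of the two subsets produced by thresholding a feature."""
    left = [y for r, y in zip(responses, labels) if r < threshold]
    right = [y for r, y in zip(responses, labels) if r >= threshold]
    n = len(labels)
    return (len(left) * shannon_entropy(left)
            + len(right) * shannon_entropy(right)) / n


def best_feature(candidates, labels):
    """Index of the (responses, threshold) candidate with the lowest split entropy.

    Assumption: a candidate is a pair holding the feature's response on every
    training sample and the threshold used to split on it.
    """
    scores = [split_entropy(resp, labels, thr) for resp, thr in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)
```

A full tree would apply `best_feature` recursively to the two resulting subsets until they are pure or the maximum depth is reached.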
3.1.2 Computing Classification Score Maps
The algorithm iterates through every camera and
ground location, extracts a sub-image corresponding
to the rectangular shape of human size, and takes its
score to be the number of trees classifying the sub-
image as “person” (Fig. 3).
If we see the individual tree responses as many
i.i.d. samples of the response of an ideal classifier,
the classification score in location i is an estimate of
the probability for such a classifier to respond that i is
actually occupied given the subimage at that location.
Hence, it is a good indicator of the actual occupancy.
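
As a small sketch of this scoring step, assuming each tree returns a binary vote (1 for “person”, 0 for background) for the sub-image at a given location; the function names are hypothetical and sub-image extraction is left out:

```python
def classification_score(tree_votes):
    """Number of trees that classify the sub-image as "person" (votes are 0 or 1)."""
    return sum(tree_votes)


def occupancy_estimate(tree_votes):
    """Fraction of positive votes, read as an empirical occupancy probability."""
    return classification_score(tree_votes) / len(tree_votes)


def score_map(votes_per_location):
    """Classification score at every ground location for one camera view.

    votes_per_location: one list of tree votes per ground location.
    """
    return [classification_score(votes) for votes in votes_per_location]
```

With 21 trees, a location where 14 trees answer “person” gets a score of 14, which corresponds to an estimated occupancy probability of 14/21.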
This produces, for each camera, a map such as the
ones depicted by the third column of Fig. 1 or by the
three left pictures in Fig. 4, which assigns a voting
score to every ground location. As shown in those
figures, detected pedestrians appear as “cone shapes”
along the camera axis in the classification score
maps. This is due to the high tolerance in scale and
limited tolerance in translation of the classifiers, and
it hinders precise localization of people. Hence the need for an
extra step, which combines classification score maps
from different camera views into one accurate detection
score map. Sections §3.2 and §3.3 present two
possible methods for this operation.
3.2 Baseline Approach
The baseline approach consists of multiplying the re-
sponses of the trees from different viewpoints. This
is essentially what the product rule used by Khan
and Shah (2006) does. It is more sophisticated than
crude clustering and averaging in separate views,
since it assumes conditional independence between
the different views, given the true occupancy.
Recall that $T_c(i)$ is an integer standing for the sum
of the trees’ answers at location i on camera view c,
and $T$ is the vector of all $T_c(i)$. Formally, we have

\begin{align}
P(X_i = \alpha \mid T) &= P(X_i = \alpha \mid T_1(i), \dots, T_C(i)) \tag{1} \\
&= \frac{P(X_i = \alpha)}{P(T_1(i), \dots, T_C(i))} \, P(T_1(i), \dots, T_C(i) \mid X_i = \alpha) \tag{2} \\
&= \frac{P(X_i = \alpha)}{P(T_1(i), \dots, T_C(i))} \prod_c P(T_c(i) \mid X_i = \alpha). \tag{3}
\end{align}

Equality (1) is true under the assumption that only
the responses of the trees at location i bring informa-
tion about the occupancy at that location, equality (2)
is directly Bayes’ law, and equality (3) is true under
the assumption that given the occupancy of location
i, the trees’ responses at that location from different
camera views are independent.
We then model the probability of the trees’ re-
sponse at a certain point given that it is occupied
(α = 1) by a density proportional to the number of
trees responding at that point, and the probability of
response when the location is empty (α = 0) by a con-
stant response. This leads to a final rule that multiplies
the responses of the trees from the different view-
points to estimate a score increasing with the prob-
ability of occupancy at that point.
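
A minimal sketch of this final rule, assuming the per-camera classification score maps are already available as lists aligned on the same ground locations; the function name `baseline_detection_map` is hypothetical:

```python
def baseline_detection_map(score_maps):
    """Multiply per-camera classification scores location by location.

    score_maps: list of C lists, one per camera view, each holding the score
    T_c(i) for every ground location i (all lists share the same length).
    """
    n_locations = len(score_maps[0])
    combined = [1] * n_locations
    for camera_map in score_maps:
        for i, score in enumerate(camera_map):
            combined[i] *= score
    return combined
```

Because the scores are multiplied, a location that receives a zero score in any single view gets a zero combined score, regardless of how strongly the other views respond.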
3.3 Principled Approach
The baseline method of the previous section assumes
that, given the true occupancy at a certain location,
the responses of the trees at that point for different
viewpoints are independent from each other, and are
not influenced by occupancy at other locations. As
shown in Section §4, it usually triggers many false
alarms. By contrast, our principled approach relies on
an assumption of conditional independence of the tree
responses at any location i, given the occupancy of
the full grid $(X_1, \dots, X_G)$, and no longer $X_i$ alone.
Such an assumption is far more realistic, and leads to
an algorithm which takes into account the long-range
influence of both the occlusions between pedestrians
and the presence of an individual on the classification
score maps, due to the invariance of the classifiers.
3.3.1 Conditional Marginals
We want to compute numerically, at every location
i of the ground plane, $P(X_i \mid T)$ the conditional mar-
VISAPP 2008 - International Conference on Computer Vision Theory and Applications