Figure 4: Example of Haar-like features of two, three, and four rectangles. The value of a feature is the difference between the sums of pixel values in the differently colored regions; in this case, the sum over the black regions minus the sum over the white regions.
same dimensions as the original image, but in which each point holds the sum of all pixels above and to the left of the current pixel position:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y').    (1)
We can calculate the integral image ii in a single pass over the image pixels. With this integral image, we can calculate the sum of pixel values inside any rectangle of a feature with only four accesses to the integral image (one per corner).
The authors propose to view the Haar-like features as weak classifiers. To do so, they use the AdaBoost algorithm to select the best features among the more than 180,000 possible ones. For each weak classifier, the AdaBoost training algorithm determines the threshold that yields the lowest classification error between the object and non-object classes. A weak classifier h_j(x) is given by a feature f_j, a threshold \theta_j, and a polarity p_j \in \{+1, -1\}:
h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise,} \end{cases}    (2)
where x is a 24 × 24-pixel image region. At each weak classifier selection round, the most appropriate feature is chosen, and the weights associated with the evaluated training samples are updated.
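A sketch of this decision rule and of a brute-force threshold search is given below; feature_values stands for the responses f_j(x) over the training samples, and the weighting scheme is a simplified stand-in for the full AdaBoost update:

    def weak_classify(f_value, theta, p):
        # Eq. (2): output 1 (object) when p * f(x) < p * theta,
        # with polarity p in {+1, -1}; otherwise output 0.
        return 1 if p * f_value < p * theta else 0

    def best_threshold(feature_values, labels, weights):
        # Scan candidate thresholds and both polarities, keeping the
        # pair with the lowest weighted classification error.
        best_theta, best_p, best_err = None, None, float("inf")
        for theta in sorted(set(feature_values)):
            for p in (+1, -1):
                err = sum(w for f, y, w in zip(feature_values, labels, weights)
                          if weak_classify(f, theta, p) != y)
                if err < best_err:
                    best_theta, best_p, best_err = theta, p, err
        return best_theta, best_p, best_err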
With several weak classifiers at hand, it is possible to combine them to build a strong classification procedure. The authors propose this combination using a cascade setup. In a cascade classifier scheme, the outcome of one classifier is the input to the next one, giving rise to a strong classifier, as Figure 5 depicts.
Figure 5: Viola and Jones (Viola and Jones, 2001) cascade of classifiers. Each classifier examines a sub-image to decide whether or not it contains the object of interest. If a sub-image passes all the classifiers, it is tagged as containing the object of interest.
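The early-rejection behavior of such a cascade can be sketched as follows (our illustration), where each stage is assumed to be a boosted classifier returning True (pass) or False (reject):

    def cascade_detect(sub_image, stages):
        # A sub-image must pass every stage to be accepted; most
        # negatives are discarded by the first, cheapest stages.
        for stage in stages:
            if not stage(sub_image):
                return False  # rejected early: no object of interest
        return True           # passed all stages: object of interest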
3.2 Stage #2 – Observation Projection
onto the World Plane
The result of each detector represents object observations in the image plane of each camera. However, we are interested in the soccer player localization on the world plane that represents the actual soccer court on which the players move. The world plane is represented in 3D coordinates. As we mentioned before, we have some control points on the soccer court whose locations we know a priori (e.g., penalty marks and corner marks). With such points, we can use a video frame from each camera to establish the correspondences manually.
The homography maps the coordinates between the planes. In our case, the objects of interest move on the soccer court and, therefore, are always on a plane in the 3D world. We can apply the homography to specific points of the object detections (e.g., the feet of a player) to find their localization in world coordinates.
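A minimal OpenCV sketch of this step follows; the control-point coordinates below are hypothetical placeholders for illustration, not the paper's actual calibration data:

    import numpy as np
    import cv2

    # Image coordinates (pixels) of court marks clicked manually in a
    # video frame, and the corresponding world-plane coordinates (meters).
    # These values are placeholders.
    image_pts = np.array([[412, 310], [1250, 298], [880, 640], [105, 655]],
                         dtype=np.float32)
    world_pts = np.array([[0, 0], [68, 0], [34, 16.5], [0, 40]],
                         dtype=np.float32)

    H, _ = cv2.findHomography(image_pts, world_pts)  # 3x3 plane-to-plane map

    def project_to_world(pt, H):
        # Map an image point onto the world (court) plane.
        src = np.array([[pt]], dtype=np.float32)  # shape (1, 1, 2)
        return cv2.perspectiveTransform(src, H)[0, 0]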
Each player found by a detector is represented by a rectangle in the image plane of a given camera. In our work, we consider the midpoint of the base of such a rectangle as a good representation of a player's feet in the image plane. As expected, the estimation of the player's feet position is not perfect and, consequently, its projection onto world coordinates does not represent the exact point at which the player is. In addition to the detector error, the homography also contains intrinsic errors.
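Concretely, for a detection rectangle (x, y, w, h) with top-left origin, the feet estimate and its projection can be sketched as follows, reusing project_to_world from the previous sketch:

    def foot_point(rect):
        # Midpoint of the base of the detection rectangle, taken as
        # the player's feet position in the image plane.
        x, y, w, h = rect
        return (x + w / 2.0, y + h)

    # world_xy = project_to_world(foot_point(detection), H)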
After the projection, we can have more than one point associated with the same player, and we need a fusion approach to better estimate the player positions, taking advantage of the multiple camera detections.
3.3 Stage #3 – Multiple Camera Fusion
After the detection of the players from multiple cameras, we have a set of observations in the image plane of each camera (each rectangle represents the detection of a player in a given camera). Assuming that the midpoint of the base of each rectangle is a good choice for the localization of the players' feet, we project such midpoints onto the world plane (representing the soccer court) using the homography matrix related to the camera under consideration.
Due to detection errors as well as errors in the projections, these points do not correspond to the exact localization of the players. However, the projected points are a good estimation of each player's localization within a region.
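Purely as an illustration of what such a fusion could look like (a naive greedy grouping, not necessarily the strategy adopted in this paper), projected points lying close together on the court can be merged into a single estimate:

    import numpy as np

    def fuse_points(points, radius=1.0):
        # Greedily group world-plane points closer than `radius` meters
        # to an existing group's centroid; each centroid is then one
        # fused estimate of a player's position.
        clusters = []
        for p in map(np.asarray, points):
            for c in clusters:
                if np.linalg.norm(p - np.mean(c, axis=0)) < radius:
                    c.append(p)
                    break
            else:
                clusters.append([p])
        return [np.mean(c, axis=0) for c in clusters]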
With possibly more than one detection per player
as well as with possible projection errors, the question