where d1 is the closest distance, d2 the second-closest distance, and r the distance ratio (typically
0.8). The disparity of the point on the segment
is computed from the image coordinate difference of
the two matching pixels.
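As a sketch, the distance-ratio test described above can be written as follows. The function name and the assumption that all candidate ED values are collected in a list are illustrative, not from the paper:

```python
def ratio_test_match(ed_values, ratio=0.8):
    """Return the index of the best-matching candidate if the closest
    ED is sufficiently smaller than the second-closest, else None."""
    if len(ed_values) < 2:
        return 0 if ed_values else None
    order = sorted(range(len(ed_values)), key=ed_values.__getitem__)
    d1, d2 = ed_values[order[0]], ed_values[order[1]]
    # accept the match only when the best distance clearly beats the runner-up
    if d1 < ratio * d2:
        return order[0]
    return None
```

The disparity then follows from the column difference between the matched pixels in the two rectified images.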
To obtain a more accurate 3D position, we estimate
the sub-pixel disparity by considering the minimum
ED that satisfies (3) together with the two neighbouring ED
values, instead of simply taking the point of
minimum ED as the matching point. We fit a
parabola to the three values and analytically solve
for its minimum to obtain the sub-pixel correction.
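A minimal sketch of this parabola fit, assuming the three ED values are sampled at the integer disparities d-1, d, and d+1:

```python
def subpixel_disparity(d, ed_left, ed_min, ed_right):
    """Fit a parabola through (d-1, ed_left), (d, ed_min), (d+1, ed_right)
    and return the disparity at its analytic minimum."""
    denom = ed_left - 2.0 * ed_min + ed_right
    if denom <= 0:  # degenerate (flat or inverted) fit: keep the integer disparity
        return float(d)
    # vertex of the parabola, offset lies in (-0.5, 0.5) when ed_min is the minimum
    return d + 0.5 * (ed_left - ed_right) / denom
```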
The head point is determined as the centre of the
pixels with the largest rounded disparity on the
potential head top segment, rather than only the
single pixel with the largest disparity. Thus both the disparity
and the position of the head point have sub-pixel
resolution, making the localization of the 3D head
point more robust and accurate. With the head point
in the image and its (unrounded) disparity, the 3D
head position can be computed by triangulation.
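For a rectified stereo pair, this triangulation reduces to the standard depth-from-disparity relations. In the sketch below, the focal length f, baseline b, and principal point (cx, cy) stand in for the calibrated camera parameters:

```python
def triangulate(u, v, disparity, f, b, cx, cy):
    """3D point (in the left camera frame) of pixel (u, v) with sub-pixel
    disparity, for a rectified pair with focal length f (pixels) and baseline b."""
    Z = f * b / disparity      # depth from disparity
    X = (u - cx) * Z / f       # back-project the pixel at that depth
    Y = (v - cy) * Z / f
    return X, Y, Z
```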
3 PEDESTRIAN TRACKING
Once the 3D positions of pedestrians, represented by their
3D head points, are obtained in each frame, they are
tracked by assuming constant moving direction and
speed over two consecutive frames. The position
of each person is predicted at the next time step, and
a search is performed in a neighbourhood around
the predicted point. The person's position is
then updated with the estimated 3D head point
nearest to the predicted point. If no head point is
found in the search area, the person's location is
updated with the prediction. The person is
deleted if not found for a certain extended period of
time. Conversely, if a detected head point is not associated with
any tracked object over some number of frame
intervals, it is regarded as a new target.
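The predict/associate/update cycle above can be sketched as follows. The data layout, gate radius, and miss threshold are illustrative assumptions, not values from the paper:

```python
import math

def track_step(tracks, detections, gate=20.0, max_misses=15):
    """One frame of the constant-velocity tracker.

    tracks: dict id -> {'pos': (x,y,z), 'vel': (vx,vy,vz), 'misses': int}
    detections: list of 3D head points in the current frame.
    Returns the unmatched detections (candidate new targets).
    """
    unused = list(detections)
    dead = []
    for tid, t in tracks.items():
        # predict assuming constant direction and speed
        pred = tuple(p + v for p, v in zip(t['pos'], t['vel']))
        # nearest detection inside the search neighbourhood
        best, best_d = None, gate
        for d in unused:
            dist = math.dist(pred, d)
            if dist < best_d:
                best, best_d = d, dist
        if best is not None:
            unused.remove(best)
            t['vel'] = tuple(b - p for b, p in zip(best, t['pos']))
            t['pos'] = best
            t['misses'] = 0
        else:
            t['pos'] = pred          # coast on the prediction
            t['misses'] += 1
            if t['misses'] > max_misses:
                dead.append(tid)     # not found for too long: delete
    for tid in dead:
        del tracks[tid]
    return unused
```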
4 EXPERIMENTS
We test our approach using a publicly available
visual surveillance simulation test bed, ObjectVideo
Virtual Video. Two virtual scenes of a train station
concourse are created: one with flat ground and the
other with a small bump, whose cross-section is a
trapezoid, added to the flat ground. Seven people
walk in an area of about 180*160 inches, a
moderately crowded scene: the blobs of individual people do not
merge in the overhead view.
The ceiling is 348 inches above the flat part
of the ground. Two identical synchronized cameras
are installed on the ceiling with perpendicular views.
The baseline is 40 inches. The frame rate is 15
frames per second and the frame size is 640*480
pixels. In both scenes a PTZ camera is installed on the wall,
with a resolution of 320*240 pixels, at a
height of 160 inches. We let a group of people walk on
the planar ground and then let the same group walk
the same paths on the non-planar ground.
The images in figure 2 were captured by the left
camera while people walked on the planar and non-planar
ground, respectively. The foreground
centroids and the detected head points are marked
in red and white. The detected head points are very
close to the head top centres in both scenes. The
dashed square marks the bump area.
(a) Planar ground (b) Non-planar ground
Figure 2: The frames captured by the left camera with
people walking on the planar and non-planar ground.
The estimated 3D tracks are projected onto the X-Y
plane and the Z axis (height) separately. The X-Y
plane tracking results are shown in figure 3, where
the solid lines are the ground truth and the dashed
lines the estimated trajectories. The square is the
FOV centre. The number at one end of each
trajectory denotes its object ID. The trajectories are
very close to the ground truth. Since the bump
changes people's speeds, the tracks in the two scenes
differ slightly even though the same paths are set.
The Z plane tracking results are not shown
because of limited space. The errors of the Z and X-Y
plane values in the two scenes are tabulated in table
1. The 3D head position errors can result from
two causes: a) the estimated potential head top
segment is slightly off the head top centre because
the pedestrians' movement makes the foreground
blob not perfectly symmetrical about the line from the
image centre to the blob centre; b) robust corresponding
points (see section 2.2) are not found on the head top
part of the segment, or are established incorrectly.
Cause b) can produce relatively large errors in both the X-Y and Z
planes, yet rarely occurs. The ellipse in figure 3(b)
highlights the relatively large errors due to cause b).
The main errors are instead caused by a) and
are usually smaller than the head radius (see table 1).
VISAPP 2013 - International Conference on Computer Vision Theory and Applications