
 
$$\frac{ED_1}{ED_2} < r \qquad (3)$$

where $ED_1$ is the closest distance, $ED_2$ the second-closest distance and $r$ the distance ratio (typically 0.8). The disparity of a point on the segment is computed from the image-coordinate difference of the two matching pixels.
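For concreteness, the sketch below (Python, with hypothetical names; the paper gives no implementation) applies the distance-ratio test of (3) to select a match among candidate descriptors on the corresponding epipolar segment.

```python
import numpy as np

def ratio_test_match(desc, candidates, r=0.8):
    """Distance-ratio test of Eq. (3): accept the closest candidate
    only if ED_1 / ED_2 < r.  Names and structure are illustrative.

    desc       -- descriptor of a pixel on one segment (1-D array)
    candidates -- descriptors of candidate pixels on the other
                  camera's segment (2-D array, one row per candidate)
    Returns the index of the matched candidate, or None if rejected.
    """
    if len(candidates) < 2:
        return None
    # Euclidean distances (ED) from the query descriptor to every
    # candidate descriptor.
    eds = np.linalg.norm(np.asarray(candidates) - np.asarray(desc), axis=1)
    order = np.argsort(eds)
    ed1, ed2 = eds[order[0]], eds[order[1]]
    # Reject ambiguous matches whose best and second-best distances
    # are too similar.
    return order[0] if ed1 / ed2 < r else None
```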
To obtain a more accurate 3D position, we estimate the sub-pixel disparity by considering the minimum ED that satisfies (3) together with the two neighbouring ED values, instead of simply taking the point of minimum ED as the matching point. We fit a parabola to the three values and analytically solve for its minimum to obtain the sub-pixel correction. The head point is then determined as the centre of the pixels with the largest rounded disparity on the potential head-top segment, rather than the single pixel with the largest disparity. Thus both the disparity and the position of the head point have sub-pixel resolution, making the localization of the 3D head point more robust and accurate. Given the head point in the image and its disparity (not rounded), the 3D head position can be computed by triangulation.
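A minimal sketch of the two steps follows. The parabola-vertex formula comes directly from fitting a quadratic to three equally spaced samples; the triangulation assumes a standard rectified stereo model with focal length f and baseline B, which the paper does not spell out, so treat it as an illustrative assumption rather than the authors' exact formulation.

```python
def subpixel_disparity(d_min, ed_left, ed_min, ed_right):
    """Refine an integer disparity d_min using the ED at d_min and at
    its two neighbouring disparities (d_min - 1 and d_min + 1).
    The vertex of the parabola through the three samples gives an
    offset in (-0.5, 0.5)."""
    denom = ed_left - 2.0 * ed_min + ed_right
    if denom == 0.0:          # flat cost curve: no refinement possible
        return float(d_min)
    offset = 0.5 * (ed_left - ed_right) / denom
    return d_min + offset

def triangulate(x, y, disparity, f, baseline, cx, cy):
    """Rectified-stereo triangulation (an assumed model; the paper only
    states that triangulation is used).
    f        -- focal length in pixels
    baseline -- camera baseline (same unit as the returned coordinates)
    (cx, cy) -- principal point in pixels
    """
    Z = f * baseline / disparity          # depth; disparity must be > 0
    X = (x - cx) * Z / f
    Y = (y - cy) * Z / f
    return X, Y, Z
```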
 
3 PEDESTRIAN TRACKING 
Once the 3D positions of pedestrians, represented by their 3D head points, are obtained in each frame, they are tracked by assuming a constant motion direction and speed over two consecutive frames. The position of each person is predicted at the next time step, and a search is carried out in a neighbourhood around the predicted point. The person's position is then updated with the estimated 3D head point nearest to the predicted point. If no head point is found in the search area, the person's location is updated with the prediction. A person is deleted if not found over a certain extended period of time. Similarly, if a detected object cannot be associated with any object in the previous frames over some number of frame intervals, it is regarded as a new target. A minimal sketch of this predict-search-update loop is given below; the search radius, miss threshold, and the immediate creation of new targets are assumptions for illustration, not values from the paper.
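```python
import numpy as np

class Track:
    """Minimal constant-velocity track for a 3D head point."""
    def __init__(self, obj_id, position):
        self.obj_id = obj_id
        self.position = np.asarray(position, dtype=float)
        self.velocity = np.zeros(3)
        self.misses = 0

def update_tracks(tracks, detections, radius=10.0, max_misses=15):
    """Associate detected 3D head points with predicted positions."""
    detections = [np.asarray(d, dtype=float) for d in detections]
    used = set()
    for t in tracks:
        predicted = t.position + t.velocity    # constant-velocity prediction
        # Nearest unused detection inside the search neighbourhood.
        best, best_dist = None, radius
        for i, d in enumerate(detections):
            if i in used:
                continue
            dist = np.linalg.norm(d - predicted)
            if dist < best_dist:
                best, best_dist = i, dist
        if best is not None:
            used.add(best)
            t.velocity = detections[best] - t.position
            t.position = detections[best]
            t.misses = 0
        else:
            t.position = predicted             # coast on the prediction
            t.misses += 1
    # Drop tracks unseen for too long; unmatched detections become new
    # targets (here immediately, for brevity; the paper waits several
    # frames before confirming a new target).
    tracks[:] = [t for t in tracks if t.misses <= max_misses]
    next_id = max((t.obj_id for t in tracks), default=-1) + 1
    for i, d in enumerate(detections):
        if i not in used:
            tracks.append(Track(next_id, d))
            next_id += 1
    return tracks
```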
4 EXPERIMENTS 
We test our approach using a publicly available visual surveillance simulation test bed, ObjectVideo Virtual Video. Two virtual scenes of a train station concourse are created: one with flat ground and the other with a small bump, whose cross section is a trapezoid, added to the flat ground. Seven people walk in an area of about 180*160 inches, which represents a moderately crowded scene: the blobs of different people do not merge in the overhead view.
The ceiling is 348 inches above the flat part of the ground. Two identical synchronized cameras are installed on the ceiling with views perpendicular to the ground. The baseline is 40 inches. The frame rate is 15 frames per second and the frame size is 640*480 pixels. In both scenes, a PTZ camera with a resolution of 320*240 pixels is installed on the wall at a height of 160 inches. We let a group of people walk on the planar ground and then let the same group walk along the same paths on the non-planar ground.
The images in figure 2 are captured by the left camera while people walk on the planar and non-planar ground, respectively. The foreground centroids are marked in red and the detected head points in white. The detected head points are very close to the head-top centres in both scenes. The dashed square indicates the bump area.
  
   
(a) Planar ground    (b) Non-planar ground
Figure 2: Frames captured by the left camera with people walking on the planar and non-planar ground.
The estimated 3D tracks are projected separately onto the X-Y plane and the Z (height) axis. The X-Y plane tracking results are shown in figure 3, where the solid lines are the ground truth and the dashed lines the estimated trajectories. The square marks the FOV centre. The number at one end of each trajectory denotes its object ID. The trajectories are very close to the ground truth. Since the bump changes people's speeds, the tracks in the two scenes differ slightly even though the same paths are set.
The Z-axis tracking results are not shown due to limited space. The errors of the Z and X-Y plane values in the two scenes are tabulated in table 1. The 3D head position errors can arise from two causes: a) the estimated potential head-top segment is slightly off the head-top centre, because pedestrians' movement makes the foreground blob not perfectly symmetrical about the line from the image centre to the blob centre; b) robust corresponding points (see section 2.2) are not found on the head-top part of the segment or are not established correctly. Cause b) can produce relatively large errors in both the X-Y and Z planes, but rarely occurs. The ellipse in figure 3(b) highlights the relatively large errors due to cause b). The main errors are instead caused by a) and are usually smaller than the head radius (see table 1).