improve performance, the tracker combines detection and prediction of the vehicles' positions in the next frame.
Figure 2: The proposed methodology.
3.1 Object Detection
We use the Faster R-CNN algorithm as a detector. It is
composed of two modules. The first module is a Re-
gion Proposal Network (RPN) which takes an image
as input and proposes regions with a wide range of
scales and aspect ratios. The second module is a clas-
sifier, which takes as input the proposed regions and
returns the positions and the types of vehicles (Ren
et al., 2015). In this work, the Faster R-CNN uses
the intermediate features of ResNet-50 to aid in the
region proposal task.
In our study, the vehicles appear with specific shapes: our cameras were oriented to capture the rear ends of the vehicles, and the heavy vehicles have large dimensions. We therefore modified the scales and aspect ratios used by the RPN, so that the proposed regions match all of the vehicle types that we can observe in the scene.
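As a rough sketch of this anchor design (the function name and the scale/ratio values below are illustrative placeholders, not the configuration used in this work), anchor boxes can be generated from a set of scales and aspect ratios as follows:

```python
import numpy as np

def make_anchors(scales, aspect_ratios, cx=0.0, cy=0.0):
    """Generate anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2; aspect_ratio is height / width,
    so larger ratios produce the taller boxes needed for heavy vehicles.
    """
    anchors = []
    for s in scales:
        for r in aspect_ratios:
            w = s / np.sqrt(r)  # width shrinks as the box gets taller
            h = s * np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

# Placeholder values: 3 scales x 3 ratios = 9 anchors per location.
anchors = make_anchors(scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0))
print(anchors.shape)  # (9, 4)
```

Tuning these two sets shifts the RPN proposals toward the box shapes actually observed in the scene, e.g. tall rear views of trucks and buses.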
The video frames are captured with a camera
placed on an 8m-high highway bridge. The camera is
oriented so that it captures the rear ends of the passing vehicles. To train the vehicle detector in this setting, we built a dataset of frames in which we labeled each vehicle with its type and position. The dataset contains 925 images with 1226 labeled vehicles: 323 trucks, 858 cars, and 45 buses. To increase the dataset size, we applied the following data augmentation techniques: horizontal flip, addition of Gaussian noise, and addition of salt-and-pepper noise. Thus, the detector was trained on 7400 frames.
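These three augmentation techniques can be sketched in a few lines of NumPy (a minimal version; the noise parameters are illustrative assumptions, and the bounding-box labels must be mirrored alongside any horizontal flip):

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    """Mirror the image left-right (box x-coordinates must be mirrored too)."""
    return img[:, ::-1].copy()

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise; sigma is an illustrative value."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_and_pepper(img, amount=0.01):
    """Set a random fraction of pixels to black (pepper) or white (salt)."""
    noisy = img.copy()
    mask = rng.random(img.shape[:2])
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1.0 - amount / 2] = 255  # salt
    return noisy
```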
3.2 Tracker
The detector and tracker are applied to every frame.
The result of this operation gives one of the following four cases: (1) a tracked vehicle is detected in the current frame; (2) a new vehicle is detected but not yet tracked; (3) a tracked vehicle is not detected in the current frame, so its position is predicted (it is then referred to as a predicted vehicle); and (4) a predicted vehicle is again not detected in the current frame.
The tracker has two sources of information: pre-
dicted vehicles and detected vehicles. We use the
Munkres algorithm (Munkres, 1957) to assign each
detection to the appropriate tracked vehicle. The algorithm takes as input the prediction and detection results and builds a cost matrix in which each entry is the distance between the centroids of a predicted and a detected vehicle; the assignment minimizing the total cost is then selected.
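A minimal sketch of this association step (function and variable names are illustrative; SciPy's linear_sum_assignment implements the Munkres/Hungarian method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(predicted, detected):
    """Match predicted track centroids to detected centroids.

    cost[i, j] is the Euclidean distance between predicted centroid i
    and detected centroid j; the Munkres/Hungarian method then finds
    the assignment with minimum total cost.
    """
    pred = np.asarray(predicted, dtype=float)
    det = np.asarray(detected, dtype=float)
    cost = np.linalg.norm(pred[:, None, :] - det[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Two tracks, two detections given in swapped order.
pairs = associate([(10, 10), (50, 50)], [(52, 49), (11, 9)])
print(pairs)  # [(0, 1), (1, 0)]
```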
To predict new positions of vehicles, we assume
that the vehicle speed does not significantly vary from
one frame to the next, so we use a simplified version
of the Kalman filter to construct our predictor. The
state vector consists of the centroids’ coordinates and
velocities along the two axes. The coordinates are ob-
tained directly from the detector, whereas the velocity
is calculated using the previous and the current cen-
troid’s positions.
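This constant-velocity predictor can be sketched as follows (class and method names are illustrative, not taken from our implementation):

```python
import numpy as np

class ConstantVelocityTrack:
    """Simplified constant-velocity predictor for one vehicle.

    State: centroid position (x, y) and velocity (vx, vy)
    in pixels per frame.
    """

    def __init__(self, x, y):
        self.pos = np.array([x, y], dtype=float)
        self.vel = np.zeros(2)  # unknown until a second observation

    def predict(self):
        # Assume the speed barely changes between consecutive frames.
        return self.pos + self.vel

    def update(self, x, y):
        new = np.array([x, y], dtype=float)
        self.vel = new - self.pos  # velocity from previous and current centroids
        self.pos = new

track = ConstantVelocityTrack(100, 200)
track.update(104, 198)  # the vehicle moved (+4, -2) in one frame
print(track.predict())  # [108. 196.]
```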
When a new vehicle is detected, we start to track it; but since the detection may be a false positive, we consider it a temporary vehicle until it is consistently tracked over a predetermined number of successive frames, after which it is considered a real tracked vehicle.
We predict the next positions of tracked vehicles
using the last position and the last velocity. In some cases the detector can fail to detect a vehicle in the scene (e.g., an occluded vehicle); the predicted position is then used as the position of the tracked vehicle, and the velocity is not updated. This prediction in the absence of detection continues over a number of frames, beyond which the vehicle is considered lost and is removed from the list of tracked vehicles. We also remove from this list the tracked vehicles whose coordinates go beyond the region of interest.
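The track life-cycle rules above (temporary confirmation, loss after repeated misses, removal outside the region of interest) can be sketched as follows; the thresholds are placeholder values, not those used in this work:

```python
class TrackState:
    """Minimal life-cycle bookkeeping for one track (placeholder thresholds)."""

    CONFIRM_AFTER = 3  # consecutive detections before a temporary track is real
    MAX_MISSED = 5     # predicted-only frames before a track is declared lost

    def __init__(self):
        self.hits = 0
        self.missed = 0
        self.confirmed = False

    def mark_detected(self):
        self.hits += 1
        self.missed = 0
        if self.hits >= self.CONFIRM_AFTER:
            self.confirmed = True  # the temporary vehicle becomes a real track

    def mark_missed(self):
        self.missed += 1  # position comes from the predictor this frame

    def is_lost(self, inside_roi=True):
        # Remove tracks missed too long or leaving the region of interest.
        return self.missed > self.MAX_MISSED or not inside_roi
```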
3.3 Removing the Projective Distortion
Vehicle detection and tracking are essential in many
road traffic applications, such as vehicle counting,
speed estimation, lane occupation estimation, head-
way estimation, etc. Counting vehicles does not re-
quire precise positioning of the vehicles on the road.
However, to estimate lane occupations, speeds, and
headways, we need to estimate the vehicles’ positions
accurately, and we need to be able to measure real dis-
tances, i.e. values must be converted from the pixel
domain to the real-world domain.
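Assuming the road surface is planar, such a conversion can be expressed with a 3x3 homography mapping image points to road-plane coordinates. A minimal sketch (the toy matrix H below is a pure scale, for illustration only; a real H would be calibrated from known point correspondences on the road):

```python
import numpy as np

def pixel_to_road(H, px, py):
    """Map a pixel coordinate to road-plane coordinates via homography H."""
    p = H @ np.array([px, py, 1.0])  # homogeneous coordinates
    return p[:2] / p[2]              # perspective division

# Toy homography: a pure scale of 0.05 m/pixel.
H = np.diag([0.05, 0.05, 1.0])
print(pixel_to_road(H, 100, 200))  # [ 5. 10.]
```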
Headway and Following Distance Estimation using a Monocular Camera and Deep Learning