To investigate the exact accuracy of the statistics, we conducted a one-time experiment in which we compared our automatic traffic statistics for a street to the ground truth determined by a human observer. We achieved a precision of 96% and a recall of 95%, outperforming the state-of-the-art method in the literature, whose precision and recall are 92% and 87%, respectively.
2 RELATED WORK
The work of (Bochinski et al., 2017) experimentally shows that their simple tracker, based on the intersection-over-union (IOU) of detector responses at sufficiently high frame rates, outperforms state-of-the-art trackers at only a fraction of the computational cost. However, their method assumes that the detector produces a detection in every frame for each tracked object, tolerating only a few missed detections. This assumption is often violated when an object is occluded for a few frames. Moreover, although the computational cost of the tracker itself is very low, its need for high frame rates, which ensure a large overlap between detections in consecutive frames, places a heavy computational load on the CNN-based object detector.
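For concreteness, the core quantity such an IOU tracker relies on is the overlap between a track's last box and a new detection. The sketch below is a minimal illustration; the (x1, y1, x2, y2) box format and the function name are our assumptions, not taken from the cited implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

At low frame rates objects move farther between consecutive frames, this overlap shrinks, and an association based on it alone breaks down, which is why the detector must be run at a high rate.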
The shortcomings of the tracker of (Bochinski et al., 2017) are addressed by the Simple Online and Real-time Tracking (SORT) of (Bewley et al., 2016) while keeping the computational cost low. The SORT tracker deploys Kalman filtering not only to filter noise in trajectories but also to handle missing detections. Similar to the work of (Bochinski et al., 2017), the assignment of detections to existing trajectories is based on the intersection-over-union (IOU) distance between each detection and the predicted bounding boxes of the Kalman filter. If no matching detection is found, i.e., when the detector fails to detect the object because it is occluded or corrupted by image noise, the Kalman filter prediction becomes the estimated state of the object. When there is a matching detection, the estimated state is corrected by incorporating information from that detection. The work of (Tran et al., 2021) uses the SORT tracker in a turning movement counting system designed to be deployed on edge devices.
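The association step that SORT-style trackers perform can be sketched as a linear assignment over an IOU cost matrix built from the Kalman-predicted boxes. The function below reuses the iou helper sketched above; the threshold value and the function names are illustrative rather than taken from the SORT implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(predicted_boxes, detections, iou_threshold=0.3):
    """Match Kalman-predicted track boxes to detections by minimizing the
    total IOU distance (Hungarian algorithm); inputs are lists of
    (x1, y1, x2, y2) boxes, and the threshold is illustrative."""
    if len(predicted_boxes) == 0 or len(detections) == 0:
        return [], list(range(len(predicted_boxes))), list(range(len(detections)))
    cost = np.zeros((len(predicted_boxes), len(detections)))
    for t, pred in enumerate(predicted_boxes):
        for d, det in enumerate(detections):
            cost[t, d] = 1.0 - iou(pred, det)          # IOU distance
    rows, cols = linear_sum_assignment(cost)
    matches = [(t, d) for t, d in zip(rows, cols) if 1.0 - cost[t, d] >= iou_threshold]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(len(predicted_boxes)) if t not in matched_t]
    unmatched_dets = [d for d in range(len(detections)) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Matched tracks then run the usual Kalman correction with the detection as measurement, while unmatched tracks keep the Kalman prediction as their state for that frame, which is how missing detections are bridged.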
Since the detection-to-trajectory assignment of the SORT tracker is based solely on the motion model of the Kalman filter and the IOU distance, SORT suffers from more identity switches between tracked objects than state-of-the-art trackers, although it outperforms them in terms of Multiple Object Tracking Accuracy (MOTA). To tackle the identity switching problem of SORT, (Wojke et al., 2017) extend the detection-to-trajectory assignment method of SORT by integrating appearance information. They experimentally show that their extension, SORT with a deep association metric (DeepSORT), reduces the number of identity switches by 45% while maintaining competitive overall performance at high frame rates. However, identity switches between road users with similar appearance still occur when they are close to each other.
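The combined metric of DeepSORT weighs a motion-based distance against an appearance distance computed from CNN embeddings of the detections. The snippet below is a simplified sketch of that idea, assuming L2-normalized embeddings; the equal weighting and the omission of the gating step are our simplifications.

```python
import numpy as np

def appearance_distance(track_embeddings, det_embedding):
    """Smallest cosine distance between a detection's embedding and a
    track's gallery of recent appearance descriptors (all L2-normalized)."""
    gallery = np.asarray(track_embeddings)
    return float(np.min(1.0 - gallery @ det_embedding))

def combined_cost(motion_dist, appear_dist, lam=0.5):
    """Weighted sum of motion and appearance distances; lam is illustrative."""
    return lam * motion_dist + (1.0 - lam) * appear_dist
```

Because the appearance term depends on how discriminative the embeddings are, two nearby road users that look alike can still receive almost identical costs, which is the failure case noted above.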
Some CNN-based trackers (Xu and Niu, 2021; Gloudemans and Work, 2021) perform detection and association across frames jointly by utilizing feedback information from object tracking. Since the previous object location and appearance information from the tracker are used as region proposals/priors in detection and association to narrow down the search space, this approach is faster than the detect-associate-track approach. (Gloudemans and Work, 2021) follow this approach to generate trajectories for TMC applications. Since object detection is never performed on a full frame, they claim that their method is approximately 50% faster than the state-of-the-art methods in their comparison. However, their evaluation results indicate that its accuracy is lower than that of the DeepSORT-based method of (Lu et al., 2021).
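The speed-up of the joint approach comes from restricting the detector to a small region predicted from the object's previous location. The helper below only illustrates that idea; the margin and the cropping scheme are hypothetical and not taken from the cited trackers.

```python
def roi_from_previous_box(frame, prev_box, margin=0.5):
    """Crop an enlarged region around the object's previous box so that the
    detector only searches a small part of the frame; frame is an HxWxC
    array, prev_box is (x1, y1, x2, y2), and margin is illustrative."""
    x1, y1, x2, y2 = prev_box
    w, h = x2 - x1, y2 - y1
    frame_h, frame_w = frame.shape[:2]
    rx1 = max(0, int(x1 - margin * w))
    ry1 = max(0, int(y1 - margin * h))
    rx2 = min(frame_w, int(x2 + margin * w))
    ry2 = min(frame_h, int(y2 + margin * h))
    return frame[ry1:ry2, rx1:rx2], (rx1, ry1)   # crop and its offset in the frame
```

The returned offset allows detections found in the crop to be mapped back to full-frame coordinates before association.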
The aforementioned trackers assume a very general tracking scenario in which the cameras are not calibrated. However, trajectories on the ground plane are often required in smart traffic applications, e.g., for trajectory clustering, abnormal behavior detection, and analyzing the interaction between road users. The projection of a road user's position from image coordinates to ground coordinates (e.g., GPS coordinates) can be found by determining the transformation (i.e., a homography) between the image plane and the ground plane. Since an image position can then be mapped onto a ground position, Bayesian state estimation can be applied on the ground plane instead of the image plane.
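A minimal sketch of this projection is given below, assuming the 3x3 homography H has already been estimated from image/ground point correspondences (e.g., with cv2.findHomography); the function name is ours.

```python
import numpy as np

def image_to_ground(point_uv, H):
    """Map an image point (u, v) to ground-plane coordinates with a 3x3
    homography H by normalizing the homogeneous result."""
    u, v = point_uv
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w
```

Typically the bottom center of the bounding box is projected, since that point lies on the ground plane.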
Furthermore, a road user moving with constant velocity on the ground can exhibit non-constant velocity in the image plane, and acceleration or deceleration of the road user results in even more complex motion in the image. Therefore, our earlier work (Nyan et al., 2020) utilizes the image-to-ground-plane projection and tracks road users on the ground plane using Bayesian state estimation. However, in this earlier work, only the size and position differences between the prediction of the Bayesian filter and the detector responses are considered in the cost function for track-detection association. Incorporating appearance information into the cost function, as in the work of (Wojke et al., 2017), could further improve performance.
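As a rough illustration of what such an extended cost could look like, the sketch below combines ground-plane position and size differences with an optional appearance term; all weights and names are hypothetical and do not reproduce either cited formulation.

```python
import numpy as np

def association_cost(pred_pos, det_pos, pred_size, det_size,
                     track_emb=None, det_emb=None,
                     w_pos=1.0, w_size=1.0, w_app=1.0):
    """Illustrative track-detection cost: ground-plane position difference,
    relative size difference, and an optional appearance (cosine) term.
    All weights are hypothetical."""
    cost = w_pos * float(np.linalg.norm(np.asarray(pred_pos) - np.asarray(det_pos)))
    cost += w_size * abs(pred_size - det_size) / max(pred_size, det_size)
    if track_emb is not None and det_emb is not None:
        cost += w_app * (1.0 - float(np.dot(track_emb, det_emb)))  # assumes L2-normalized embeddings
    return cost
```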