[Figure 6 plot: average overlap (0.0–1.0) as a function of the image/object ratio (1.0–2.0), comparing CCOT and the naive tracker under our proposed score.]
Figure 6: Performance of the CCOT and the full-frame tracker for relative object sizes using our proposed score. At a lower relative size (larger object), the naive frame tracker outperforms the state-of-the-art CCOT approach, as it is guaranteed to cover the entire object while the CCOT typically has some offset error. At smaller object sizes, our proposed score heavily penalizes the naive frame tracker.
Figure 7: Example frames from the CarBlur sequence, with a naive method that outputs close to the entire image as each detection (red box). The ground-truth annotation is the blue box. Due to severe motion blur and highly irregular movements in the sequence, tracking is difficult. The traditional IoU score for this frame is 0.26 (left), while our new unbiased metric provides a far lower score of 0.11 for both the left and right images. This suggests that using the IoU is not optimal in many cases.
frames from the cropped CarBlur sequence. As the video is extremely unstable, tracking is difficult due to motion blur and sudden movements. Here, a predicted bounding box generated by the naive tracker obtains a decent score of 0.22, despite always outputting the entire frame. When our unbiased score is used instead, the penalty for overestimation of the object size is severe enough that the overlap score is more than halved. Here the IoU gives close to twice the overlap score compared to our own approach.
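To make the bias concrete, the following minimal sketch (our own illustration, not the paper's evaluation code; the 100x100 frame and the box sizes are assumptions) computes the standard IoU for a naive tracker that always reports the whole frame. Since the ground truth lies inside the frame, the IoU reduces to area(object)/area(frame) and therefore grows with the relative object size.

    def iou(box_a, box_b):
        # Intersection over union of two axis-aligned (x, y, w, h) boxes.
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = iw * ih
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    frame = (0, 0, 100, 100)            # naive tracker: always the full frame
    for side in (20, 50, 80):           # ground-truth objects of growing size
        gt = (10, 10, side, side)
        print(f"object {side}x{side}: IoU = {iou(frame, gt):.2f}")
    # prints 0.04, 0.25 and 0.64: the larger the object relative to the
    # frame, the better the naive full-frame prediction scores under IoU.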
5 CONCLUSIONS AND FURTHER WORK
We have proven that the traditionally used IoU score is biased with respect to overestimation of object sizes. We demonstrate this bias theoretically and derive a new unbiased overlap score. We note that most tracking datasets are heavily biased in favor of smaller objects, and construct a new dataset by cropping parts of images at varying sizes (one possible cropping procedure is sketched below). This demonstrates a major issue with current tracking benchmarks, as situations with large objects directly correspond to situations where the tracked objects are close. We demonstrate the effect of using a biased metric in situations where the tracked object covers the majority of the image, and compare it to our new unbiased score. Finally, we have demonstrated the effect of introducing larger objects into tracking sequences by generating such a sequence and comparing the performance of a stationary tracker with that of a state-of-the-art method. While the CCOT significantly outperforms the stationary tracker for smaller objects (as is expected), for larger objects the naive approach of simply outputting the entire image is quite successful. In the future we aim to investigate the effect of this bias in object detection scenarios. It would also be relevant to construct a new tracking dataset where the tracked objects' sizes are more evenly distributed than is currently typical.
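As a hedged illustration of the cropping procedure mentioned above, the following sketch is our own hypothetical reconstruction (the function name, box format, and clamping policy are assumptions, not the paper's exact method). It produces a crop whose side is a chosen multiple of the object's size, yielding a target image/object ratio:

    import numpy as np

    def crop_to_ratio(frame, gt, ratio):
        # Crop a window centred on the object so that the crop's side
        # length is `ratio` times the object's larger side, clamped to
        # the frame. `frame` is an H x W x 3 array, `gt` = (x, y, w, h).
        h, w = frame.shape[:2]
        x, y, bw, bh = gt
        side = int(round(ratio * max(bw, bh)))  # crop side from target ratio
        cx, cy = x + bw / 2.0, y + bh / 2.0     # object centre
        x0 = int(np.clip(cx - side / 2.0, 0, max(w - side, 0)))
        y0 = int(np.clip(cy - side / 2.0, 0, max(h - side, 0)))
        return frame[y0:y0 + side, x0:x0 + side]

    # e.g. regenerate a sequence at image/object ratios from 1.0 to 2.0:
    # crops = [crop_to_ratio(f, box, r) for r in np.arange(1.0, 2.1, 0.2)]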
Acknowledgements: This work was supported by the VR starting Grant (2016-05543), Vetenskapsrådet through the framework grant EMC², and the Wallenberg Autonomous Systems and Software Program (WASP).
REFERENCES
Danelljan, M., Robinson, A., Khan, F. S., and Felsberg, M. (2016). Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision, pages 472–488. Springer.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2008). The PASCAL visual object classes challenge 2007 (VOC2007) results.
Felsberg, M., Berg, A., Häger, G., Ahlberg, J., Kristan, M., Matas, J., Leonardis, A., Čehovin, L., Fernandez, G., Vojir, T., et al. (2015). The thermal infrared visual object tracking VOT-TIR2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 76–88.
Felsberg, M., Kristan, M., Matas, J., Leonardis, A., Pflugfelder, R., Häger, G., Berg, A., Eldesokey, A., Ahlberg, J., Čehovin, L., Vojíř, T., Lukežič, A., and Fernández, G. (2016). The thermal infrared visual object tracking VOT-TIR2016 challenge results. In European Conference on Computer Vision Workshops. Springer.