to more challenging fast-moving objects, with quite
light computational overhead in both settings. Applied
to YOLOv3 (Redmon and Farhadi, 2018), our
system achieves the best speed/accuracy trade-off to
date for offline video object detection and competitive
accuracy improvements for online object detection,
e.g., 80.9% mAP in our offline setting and 78.2%
mAP in the online setting, both at 38 fps on a Titan X
GPU or at 51 fps on a GTX 1080Ti GPU.
The key contributions of this paper are:
• Our proposed box-level post-processing method
brings significant accuracy improvements over the
per-frame detection baseline in both online and offline
settings; e.g., applied to YOLOv3, our method
yields 6.9% mAP gains (from 74% to 80.9%) for
offline object detection and 4.2% mAP gains (from
74% to 78.2%) for online detection on the ImageNet
VID validation set;
• The computational overhead is light in both settings,
i.e., less than 2.5 ms/frame of additional
computation, which makes our method applicable
to most practical vision applications;
• Our proposed method can be applied to detectors
trained on either video-based or image-based
datasets, which makes it more universal and
practical for industrial applications.
This paper is organized as follows. In Section
2, we summarize representative related work on
image and video object detection. In Section 3, we
explain the theoretical details of our proposed box-level
post-processing method. In Section 4, we present our
experimental results on the ImageNet VID dataset and
compare our results with other state-of-the-art methods.
Section 5 concludes the paper.
2 RELATED WORK
In this section we present related work on both
image and video object detection and then position our
contribution with respect to the state of the art.
2.1 Object Detection from Images
Object detection in still images is one of the most central
research areas in computer vision. Detectors based on
deep neural networks greatly exceed the accuracy of
classic object detectors based on hand-designed features,
e.g., Viola-Jones (Viola and Jones, 2004) and DPM
(Felzenszwalb et al., 2010). Modern convolutional
object detectors can be divided into two paradigms:
two-stage object detectors and single-stage
object detectors. According to (Huang et al., 2016),
the former are generally more accurate while the latter
are usually faster.
The dominant paradigm is based on a two-stage
approach, which first generates a sparse set of regions
of interest (RoIs), then classifies these regions
and refines their bounding boxes. Faster R-CNN (Ren et al.,
2015) introduces the Region Proposal Network (RPN),
which generates RoIs and shares full-image convolutional
features with Fast R-CNN (Girshick, 2015). R-FCN
(Dai et al., 2016) proposes position-sensitive
score maps to share almost all computation on the
entire image.
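To make the two-stage control flow concrete, the following is a minimal sketch; the propose_regions and classify_and_refine functions are hypothetical stand-ins for an RPN and a second-stage head, not any published implementation, and the shapes and thresholds are arbitrary toy values.

```python
import numpy as np

def propose_regions(feature_map, num_proposals=300):
    """Hypothetical RPN stand-in: return a sparse set of RoIs
    as (x1, y1, x2, y2) boxes in image coordinates."""
    h, w = feature_map.shape[:2]
    x1 = np.random.uniform(0, w, num_proposals)
    y1 = np.random.uniform(0, h, num_proposals)
    x2 = x1 + np.random.uniform(8, w / 2, num_proposals)
    y2 = y1 + np.random.uniform(8, h / 2, num_proposals)
    return np.stack([x1, y1, x2, y2], axis=1)

def classify_and_refine(feature_map, rois, num_classes=30):
    """Hypothetical second-stage head: per-RoI class scores
    plus a small box refinement added to each RoI."""
    scores = np.random.rand(len(rois), num_classes)
    deltas = np.random.randn(len(rois), 4)
    return scores, rois + deltas

# Two-stage inference: propose first, then classify/refine the proposals.
features = np.random.rand(38, 38, 512)               # toy backbone output
rois = propose_regions(features)                      # stage 1: sparse RoIs
scores, boxes = classify_and_refine(features, rois)   # stage 2
keep = scores.max(axis=1) > 0.5                       # confidence filter
print(f"{keep.sum()} detections kept out of {len(rois)} proposals")
```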
Single-stage detectors perform object classification
and bounding box regression simultaneously on
feature maps. YOLO (Redmon et al., 2015) divides
the image into regions and predicts bounding boxes
and classification scores for each region. SSD (Liu
et al., 2015) predicts classification scores and bbox
adjustments for each default box of different aspect
ratios and scales at every feature map location. RetinaNet
(Lin et al., 2017) proposes a novel loss function, Focal
Loss, to address the class imbalance problem.
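As an illustration of the single-stage formulation, the sketch below decodes a toy YOLO-style output tensor into thresholded boxes. The grid size, box count, class count, and confidence threshold are assumptions chosen for the example; real detectors additionally apply anchor priors, activation functions, and non-maximum suppression.

```python
import numpy as np

S, B, C = 13, 2, 20          # assumed grid size, boxes per cell, classes
# Toy network output: per cell, B boxes of (x, y, w, h, objectness)
# followed by C class scores, all already in [0, 1] for simplicity.
output = np.random.rand(S, S, B * 5 + C)

detections = []
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, obj = cell[b * 5:(b + 1) * 5]
            conf = obj * class_probs.max()   # class-specific confidence
            if conf > 0.6:
                # (x, y) offsets are relative to the cell; (w, h) to the image
                cx, cy = (col + x) / S, (row + y) / S
                detections.append((cx, cy, w, h, conf, class_probs.argmax()))

print(f"{len(detections)} boxes above threshold before NMS")
```

Because classification and regression are read off the feature map in a single pass, there is no per-proposal computation, which is the main source of the speed advantage noted above.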
When evaluated on the Microsoft COCO dataset,
the recently proposed YOLOv3 (Redmon and
Farhadi, 2018) achieves accuracy (60.6% AP50)
competitive with more complicated state-of-the-art
object detectors, e.g., R-FCN (51.9% AP50) or
RetinaNet (57.5% AP50), while detecting at a much
faster speed (YOLOv3 at 20 fps, R-FCN at 12 fps
and RetinaNet at 5 fps on a Titan X GPU).
2.2 Object Detection in Video
Object detection in video has drawn more and more
attention in recent years. The introduction of the
ImageNet (Russakovsky et al., 2014) object detection
from video (VID) challenge made the evaluation of
CNNs designed for video easier. Instead of simply
splitting the video into frames and performing per-frame
object detection, recent works focus on utilizing
contextual and temporal information to accelerate
detection or to improve detection accuracy.
In terms of improving accuracy, recent works operate
at two levels: the box level and the feature level.
For box-level methods: T-CNN (Kang et
al., 2016) uses pre-computed optical flow and
object tracking to propagate high-confidence bounding
boxes to nearby frames; Seq-NMS (Han et al.,
2016) uses high-scoring object detections from
nearby frames to boost the scores of weaker detections
within the same video clip, as sketched below.
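The sketch below illustrates the idea behind such box-level rescoring on toy data: detections in consecutive frames are linked by IoU overlap, and a strong detection lifts the score of a weak, overlapping one. This is a simplified illustration of the principle, not the exact Seq-NMS algorithm, which instead searches for maximum-score sequences across the whole clip before rescoring; the iou_thr value is an assumption.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def rescore(frames, iou_thr=0.5):
    """frames: per-frame (boxes Nx4, scores N) pairs. Each detection is
    linked to its best-overlapping detection in the previous frame and,
    if the overlap is high enough, its score is lifted toward the mean
    of the linked pair."""
    for t in range(1, len(frames)):
        prev_boxes, prev_scores = frames[t - 1]
        boxes, scores = frames[t]
        if len(prev_boxes) == 0:
            continue
        for i, box in enumerate(boxes):
            overlaps = iou(box, prev_boxes)
            j = overlaps.argmax()
            if overlaps[j] >= iou_thr:
                scores[i] = max(scores[i], (scores[i] + prev_scores[j]) / 2)
    return frames

# A confident box in frame 0 lifts an overlapping weak box in frame 1.
f0 = (np.array([[10., 10., 50., 50.]]), np.array([0.9]))
f1 = (np.array([[12., 11., 52., 49.]]), np.array([0.4]))
rescore([f0, f1])
print(f1[1])  # [0.65]
```

Because such methods only touch detector outputs, they add almost no computation and are agnostic to the underlying detector, which is also the property our method exploits.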
Feature-level methods are generally considered to
bring larger improvements than box-level
methods. FGFA (Zhu et al., 2017) uses optical flow