tio with a set of fixed bins. Although using a larger
number of anchors can better represent the continuous
space, the corresponding increase in complexity makes
it difficult for regressors to learn scale-invariant
features and makes deep learning models expensive to
train. Conversely, with only a few anchors it is
difficult to fit all objects. Given the size variations
in the dataset, the number of anchor scales and their
aspect ratios are important hyperparameters, and
detection performance is usually sensitive to improper
settings of them. To find out how these
hyperparameters affect final performance and how to
select the optimal ones, we designed multiple
controlled experiments on the KITTI benchmark dataset
(Geiger et al., 2012). We found that the set of
designed anchors for object detection should adequately
cover the continuous space of object scales and aspect
ratios while keeping the number of anchors minimal.
Subject to these conflicting criteria, selecting the
optimal hyperparameters is a challenging task.
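The trade-off between coverage and anchor count can be illustrated with a minimal sketch. It assumes that an object counts as covered when its shape-only IoU with the best-matching anchor reaches 0.5; the threshold, the object shapes, and the anchor sets below are hypothetical and are not taken from the experiments in this paper.

```python
from typing import Sequence, Tuple

Shape = Tuple[float, float]  # (width, height) in pixels


def shape_iou(a: Shape, b: Shape) -> float:
    """IoU of two boxes that share the same centre, so only shape matters."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def coverage(anchors: Sequence[Shape], objects: Sequence[Shape],
             thresh: float = 0.5) -> float:
    """Fraction of objects whose best-matching anchor reaches the threshold."""
    hits = sum(
        1 for obj in objects
        if max(shape_iou(anchor, obj) for anchor in anchors) >= thresh
    )
    return hits / len(objects)


# Hypothetical object shapes (width, height); not KITTI statistics.
objects = [(20, 50), (40, 100), (60, 40), (120, 80), (240, 160)]
few_anchors = [(64, 64)]
more_anchors = [(32, 64), (64, 64), (128, 64), (256, 128)]

print(coverage(few_anchors, objects))
print(coverage(more_anchors, objects))
```

In this toy setting the larger anchor set covers more of the objects, at the cost of four times as many anchor shapes to regress against.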
To satisfy both requirements for designing anchors, we
explore whether it is possible to estimate a continuous
object scale that covers the whole scale space, instead
of pre-defining multiple discrete anchors with fixed
scales and aspect ratios. To acquire continuous scales,
we use the distance of the object from the sensor to
estimate the coarse scale of the detected objects. A
corresponding CNN-based detection framework is also
proposed to validate how the estimated scales can
improve detection performance. To validate our method,
we conducted extensive evaluations on the KITTI
benchmark with a fine-grained analysis. Our proposed
method can outperform state-of-the-art methods that use
predefined anchors with the same CNN backbone,
especially when detecting difficult objects. The
proposed method can also be integrated into
multi-object detection algorithms with complex
detection frameworks (Ren et al., 2017; Dai et al.,
2017). Our code is open-source and freely available on
GitHub (see footnote 1).
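As a rough illustration of how distance can yield a coarse scale, the sketch below uses the pinhole camera relation, in which the projected height in pixels equals the focal length times the physical height divided by the distance. This is an assumption made for illustration only: this section does not specify the estimation procedure, and the function name, the focal length of 720 pixels, and the 1.5 m object height are hypothetical.

```python
def coarse_scale_px(distance_m: float, object_height_m: float,
                    focal_px: float) -> float:
    """Approximate projected object height in pixels (pinhole camera model).

    All three inputs are assumptions of this sketch: the distance to the
    sensor, a prior on the physical object height, and the focal length
    expressed in pixels.
    """
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return focal_px * object_height_m / distance_m


# Hypothetical values: a 1.5 m tall object seen with a 720 px focal length.
near = coarse_scale_px(distance_m=10.0, object_height_m=1.5, focal_px=720.0)
far = coarse_scale_px(distance_m=40.0, object_height_m=1.5, focal_px=720.0)
print(near, far)
```

Under this model the estimated scale varies continuously with distance, so no fixed set of discrete scales has to be chosen in advance.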
The main contributions of this paper are threefold:
• We designed controlled experiments to answer the
question of how the number of predefined anchors
affects detection performance.
• We proposed a detection method based on the estimated
size of objects.
• We conducted extensive experiments on the KITTI
benchmark to validate our method.
1 Main open-source code can be found at: https://github.com/Benzlxs/Object detection estimated sclales
The rest of the paper is organized as follows.
Section 2 reviews related work, and Section 3
illustrates how multiple anchors affect detection
accuracy. Our proposed detection method is presented in
Section 4, and experiments on the proposed method are
reported in Section 5. Section 6 concludes the paper
and summarizes its contributions.
2 RELATED WORK
2.1 Multi-size Detectors
Scale-space theory (Lindeberg, 1990) is a fundamental
theory in signal processing, and significant research
has been devoted to this field. Multi-size detectors
(Ren et al., 2015) are commonly used to handle objects
at multiple scales. They take a single-size input and
apply detectors of several sizes, each of which finds
objects of the corresponding size (Ren et al., 2015;
Lin et al., 2016; He et al., 2016). Faster-RCNN (Ren et
al., 2015) performs detection on the final feature map
using 9 different anchors with 3 different sizes and 3
different aspect ratios. Each anchor can be seen as a
single-size detector that finds objects of similar
size. However, the final feature map
is usually at a low resolution with high-level semantic
information, which makes small-object detection very
challenging. To improve small-object detection, the
feature pyramid network was proposed to propagate
high-level semantic information in deeper layers back
to shallower layers with high-resolution maps; small
objects are mainly detected from the fused shallower
feature maps (Lin et al., 2016). The recurrent rolling
network extends the feature pyramid network by using a
recurrent neural network to fuse feature maps from
different layers and integrate context information (Ren
et al., 2017). However, even pyramid feature maps may
not be useful for detecting small objects, since the
high-level information does not contain semantic
features of small objects. Therefore, upsampling the
image to increase the resolution of the final feature
map, rather than building an image pyramid, has become
the most common practical technique for detecting small
objects (He et al., 2016).
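The 9-anchor configuration mentioned above (3 sizes times 3 aspect ratios) can be sketched as follows. The scales of 128, 256 and 512 pixels and the ratios 1:2, 1:1 and 2:1 are the defaults commonly cited for Faster-RCNN and are not stated in this section; the helper name make_anchors is hypothetical.

```python
import math
from itertools import product


def make_anchors(scales, ratios):
    """Return (width, height) pairs, one per scale/ratio combination.

    Each anchor keeps an area of scale**2; ratio means height / width.
    """
    anchors = []
    for scale, ratio in product(scales, ratios):
        width = scale / math.sqrt(ratio)
        height = scale * math.sqrt(ratio)
        anchors.append((width, height))
    return anchors


# Assumed defaults: three scales and three aspect ratios give nine anchors.
anchors = make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
for width, height in anchors:
    print(f"{width:7.1f} x {height:7.1f}")
```

Adding one more scale or ratio multiplies the number of anchors at every feature-map location, which is why the anchor count grows quickly as the scale space is sampled more densely.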
Some methods change the size of the receptive field to
accommodate multi-scale objects, including dilated and
deformable convolutional networks (Dai et al., 2017; Yu
and Koltun, 2015). Transformation parameters are learned
by a network, similar to STN (Spatial Transformer
Networks) (Jaderberg et al., 2015), by building the
STN to perform an affine transformation on input fe-
VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications