Two-layer Residual Feature Fusion for Object Detection

Jaeseok Choi

, Kyoungmin Lee

, Jisoo Jeong

and Nojun Kwak

Seoul National University, Seoul, Korea

SK Holdings, Seoul, Korea

Keywords:

Object Detection, Computer Vision, Machine Learning, Neural Network, Deep Learning.

Abstract:

Recently, a lot of single stage detectors using multi-scale features have been actively proposed. They are much

faster than two stage detectors that use region proposal networks (RPN) without much degradation in the de-

tection performances. However, the feature maps in the lower layers close to the input which are responsible

for detecting small objects in a single stage detector have a problem of insufﬁcient representation power be-

cause they are too shallow. There is also a structural contradiction that the feature maps not only have to deliver

low-level information to next layers but also have to contain high-level abstraction for prediction. In this paper,

we propose a method to enrich the representation power of feature maps using a new feature fusion method

which makes use of the information from the consecutive layer. It also adopts a uniﬁed prediction module

which has an enhanced generalization performance. The proposed method enables more precise prediction,

which achieved higher or compatible score than other competitors such as SSD and DSSD on PASCAL VOC

and MS COCO. In addition, it maintains the advantage of fast computation of a single stage detector, which

requires much less computation than other detectors with similar performance.

1 INTRODUCTION

The development of deep neural networks (DNN) in

recent years has achieved remarkable results not only

in object detection but also in many other areas. In the

early researches of object detection using DNN, much

attention has been paid to representation learning that

can replace hand-crafted features without much con-

sideration on the speed of detectors. Recently, real-

time detectors with low computational complexities

have been actively researched.

Researches on two-stage detectors, mostly based

on Faster R-CNN (Ren et al., 2015), applied the re-

gion proposal network (RPN) and RoI pooling to the

feature maps extracted by a state-of-the-art classiﬁer,

such as ResNet-101 (He et al., 2016). On the other

hand, the single-stage methods such as YOLO (Red-

mon et al., 2016) and SSD (Liu et al., 2016) removed

RoI pooling layer and predict bounding boxes and

corresponding class conﬁdences directly while en-

abling faster detection and end-to-end learning.

Especially SSD makes use of multi-scale feature

maps generated from a backbone network such as

VGG-16 (Simonyan and Zisserman, 2014) to detect

objects in various sizes. Since each of the predic-

*J. Choi and K. Lee equally contributed to this work

Figure 1: Box-in-Box problem. Top: SSD300. Bottom:

RUN300 (proposed). SSD detects objects with overlapping

boxes which are redundant. RUN solves this problem by

utilizing the proposed two-layer feature fusion method and

uniﬁed prediction module.

tion modules composed of 3 × 3 convolution ﬁl-

ters detects bounding boxes on each layer separately,

they cannot reﬂect appropriate contextual information

from different scales. It causes the problem named as

“Box-in-Box” (Jeong et al., 2017) as shown in Fig-

ure 1. In the ﬁgure, we can see that SSD often detects

352

Choi, J., Lee, K., Jeong, J. and Kwak, N.

Two-layer Residual Feature Fusion for Object Detection.

DOI: 10.5220/0007306803520359

In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 352-359

ISBN: 978-989-758-351-3

a single object with two overlapping boxes, of which

the smaller box contains partial image such as the up-

per body of a person or the front of a train.

To solve the problem, (Fu et al., 2017; Lin et al.,

2017) used ResNet and feature pyramid network

(FPN) (Lin et al., 2016) structure to inject larger con-

textual information through deep convolutional back-

bone by the use of deconvolution. However, these

structures have the disadvantage of increased com-

putational complexity, thus reduces detection speed,

which is a key advantage of a single-stage detector.

In this paper, we propose a method to solve the es-

sential problems of multi-scale single stage detectors.

More speciﬁcally, we introduce a single-stage object

detector using a feature fusion method which con-

nects just two consecutive layers. The proposed two-

layer feature fusion network has a structure where

the Resblock (He et al., 2016) and the deconvolution

layer are added on the multi-scale feature maps. It

makes detected boxes be determined with larger con-

text and be more reliable. In addition, we also adopt

a uniﬁed prediction module (Jeong et al., 2017; Lin

et al., 2017) by integrating multiple prediction mod-

ules, that had been applied separately to each layer,

into one to boost the information level of feature maps

in the earlier layers. The proposed network is named

as RUN which is an abbreviation for two-layer Resid-

ual feature fusion with Uniﬁed prediction Network.

It is not only very compact and fast compared to

other ResNet-based two-stage detectors , but it also

achieves superior or competitive performance com-

pared to other competitors.

2 RELATED WORKS

Overfeat (Sermanet et al., 2013), SPPNet (He et al.,

2014), R-CNN (Girshick et al., 2014), Fast R-CNN

(Girshick, 2015), Faster R-CNN (Ren et al., 2015)

and R-FCN (Li et al., 2016) which are classiﬁed

as region-based convolutional neural networks (R-

CNN) showed a tremendous improvement in per-

formance compared to the previous object detec-

tion techniques. These region-based approaches have

achieved huge advances over the last few years and

are still the state-of-the-art approaches among many

object detection techniques. Speciﬁcally, these ap-

proaches usually use a two-stage method of generat-

ing a number of bounding boxes and then assigning

a classiﬁcation score to the bounding boxes. Thus, al-

though classiﬁcation may be relatively accurate, these

are too slow to be used for real-time applications.

Redmon (Redmon et al., 2016) proposed a method

named as YOLO to predict bounding boxes and asso-

ciated class probabilities in a single step by framing

object detection as a regression problem. It divides in-

put images to grid maps and regresses bounding boxes

for multiple objects on each grid. This was the be-

ginning of single stage detection and subsequently in-

spired structures such as SSD (Liu et al., 2016). How-

ever, since YOLO uses only the highest-level

feature

maps for object detection, there is a lack of lower-

level information, which results in somewhat inaccu-

rate detections, especially for small objects.

In order to solve this problem, SSD (Liu et al.,

2016) utilized not only the highest-level features but

also lower-level features which have enough resolu-

tion to detect small objects. As mentioned in Inside-

Outside Net (ION) (Bell et al., 2016) and HyperNet

(Kong et al., 2016), each feature maps at different lay-

ers have different abstraction levels for an input im-

age. Therefore, it is clear that using multi-scale fea-

ture maps can improve detection performance for ob-

jects of various scales. In SSD, many default boxes

are created in the feature maps and bounding box re-

gression and classiﬁcation are performed for each box

area using 3× 3 convolutions. This method enables

multi-scale object detection without using RoI pool-

ing. In addition, it can effectively improve the detec-

tion accuracy of small objects which is a disadvantage

of YOLO (Redmon et al., 2016).

However, as mentioned in MS-CNN (Cai et al.,

2016), SSD has the problem that back-propagation

allows the gradient to cause unnecessary deforma-

tions in the feature maps since the feature maps of

the backbone network are used directly in bounding

box regression and classiﬁcation. Then, it can lead

to some instability during learning. In addition, since

each classiﬁer only uses single scale feature maps, it

cannot reﬂect larger or smaller contextual information

other than the one for the corresponding scale.

Recently, various methods have attempted to en-

hance the contextual information of each layer while

taking advantage of SSD (Liu et al., 2016). DSSD (Fu

et al., 2017) could obtain higher accuracy by changing

the base network to ResNet-101 (He et al., 2016) and

combining the FPN (Lin et al., 2016) using deconvo-

lution layers in combination with the existing multiple

layers to reﬂect the large scale context. However, with

the use of deep structure of ResNet-101 and decon-

volution layers, the processing speed degrades much

(under 16.4 images per second), which prohibits the

method to be used for real-time detection problems.

Rainbow SSD (R-SSD) (Jeong et al., 2017) pro-

posed a method to concatenate feature maps not

In this paper, the term level is used interchangeably

with layer. Highest level indicates the the farthest layer

from the input layer.

Two-layer Residual Feature Fusion for Object Detection

353

Image

Multi-scale feature maps

Convolutional backbone

Cls

Loc Regress

Cls

Loc Regress

Cls

Loc Regress

Cls

Loc Regress

Image

Multi-scale feature maps

Unified

Prediction

Module

Convolutional backbone

Cls

Loc Regress

3Way

ResBlock

3Way

ResBlock

3Way

ResBlock

2Way

ResBlock

Figure 2: Architectures of SSD and the proposed RUN. Top: SSD. Bottom: the proposed RUN. Compared with SSD, the

proposed structure has residual blocks and uniﬁed prediction module. The arrow from the bottom to the top indicates the

deconvolution branch.

only in the adjacent layers but also in all the layers

for bounding box regression and classiﬁcation using

pooling and deconvolution. It achieves higher perfor-

mance than SSD by enhancing representation power

of feature maps. Also, by making the dimension of

each layer the same, it was made possible to use a uni-

ﬁed prediction module instead of different prediction

modules for different layers. Woo (Woo et al., 2017)

proposed StairNet which utilizes both FPN (Lin et al.,

2016) structure of base VGG-16 network and uniﬁed

prediction of R-SSD.

3 THE PROPOSED

ARCHITECTURE

3.1 Two-layer Feature Fusion

Recent CNN models designed for object detection

makes use of a backbone network which is origi-

nally devised to solve image classiﬁcation problems.

Although the detection network can be trained end-

to-end, the backbone network is normally initialized

with the weights for the image classiﬁcation prob-

lems. The relation between the features and predic-

tions in the networks used for image classiﬁcation can

be expressed mathematically as follows:

= F

n−1

) = (F

◦ F

n−1

◦ ··· ◦ F

)(I) (1)

Scores = P (x

), (2)

where I is an input image, x

is the n

-level feature

map, P is a prediction function, and F

is a combi-

nation of non-linear transformations such as convolu-

tion, pooling, ReLU, etc. Here, the top feature map,

, learns information on high-level abstraction. On

the other hand, x

(k < n) has more local and low-

level information as k becomes smaller.

SSD (Liu et al., 2016) applies several feature maps

with different scales directly as an input to separate

prediction modules to calculate object positions and

classiﬁcation scores, which can be denoted by the fol-

lowing equation:

Detection =



), P

), . . . , P

)



, (3)

where s

to s

are feature indices for source feature

maps for multi-scale prediction, P

is a function that

outputs multiple objects with different positions and

scores. Combining (1) and (3), it can be expressed as

Detection = {P

), P

)). . . , P

))},

(4)

where F

) , (F

◦ ·· · ◦ F

a+1

)(x

). Here, the ear-

lier feature map x

needs to learn high-level abstrac-

tion to improve the performance of P

). At the

same time, it also needs to learn local features for ef-

ﬁcient information transfer to the next feature maps.

This not only makes learning difﬁcult, but also causes

the overall performance to decrease. It is also a prob-

lem in the above-mentioned hourglass structure.

ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods

354

1x1x256 conv

1x1x128 conv

3x3x128 conv1x1x256 conv

branch2 branch1

HxWxD

feature

HxWx256

feature

1x1x256 conv

1x1x128 conv

3x3x128 conv1x1x256 conv

branch2

1x1x256 conv

3x3x128 conv

d=128 deconv

branch3 branch1

HxWxD

feature H/2xW/2xD

feature

HxWx256

feature

Figure 3: Feature fusion module. Top: 2-way Res-

block. Bottom: 3-way Resblock with deconvolution branch

(branch3).

To resolve this problem, SSD (Liu et al., 2016)

added L2 normalization layer between the conv4 3

layer and the prediction module, which results in a

reduced magnitude of the gradients from the predic-

tion module. Cai (Cai et al., 2016) tried to solve this

problem by adding a convolution layer only to the

conv4

3 layer. Since the above problem is not solely

on the conv4 3 layer, the aforementioned approaches

do not essentially solve the problem. To meet this con-

tradictory requirement of maintaining low-level infor-

mation while having the ﬂexibility to learn high-level

abstraction, it is desired to separate and decouple the

backbone network and the prediction module in the

training phase.

In order to solve the same problem, we propose

a new architecture that decouples backbone network

from the prediction module as shown in Figure 2. In-

stead of directly connecting the feature maps in the

backbone network to the prediction module, we in-

serted a multi-way Resblock

for each level of fea-

ture maps, which acts like a bumper that secures the

information in the backbone network from the pre-

diction modules. The detailed architecture of the pro-

posed multi-way Resblocks are shown in Figure 3.

Convolution layers and nonlinear activation units

are used for all branches of the proposed Resblock.

This prevents the gradients of the prediction mod-

ule from ﬂowing directly into the feature maps of the

backbone network. Also, it clearly distinguishes the

Here, we term the basic components of our feature fu-

sion architecture as Resblocks which are slightly different

from those in ResNets (He et al., 2016) in that instead of

identity mapping, 1 × 1 convolution is used as shown in the

red boxes in Figure 3.

features to be used for prediction from the features to

be delivered to the next layer. In other words, the pro-

posed Resblock takes the role of learning high-level

abstraction for object detection, while the backbone

network containing low-level features is designed to

be intact from the high-level detection information.

This design helps to improve the feature structure of

the SSD (Liu et al., 2016) by forcing it not to learn

high-level abstraction and to keep low-level image

features.

Also, the depths of the earlier layers (eg. conv4

used for small-sized object detection in SSD (Liu

et al., 2016) are very shallow. Therefore, in SSD,

small objects can not be detected well because the

representation power is insufﬁcient to be used in the

prediction as it is. To supplement this problem, we

used a 3 × 3 convolution layer in branch2 of the Res-

block as shown in Figure 3 to reﬂect the peripheral

contextual information.

Branch3 in the right side of Figure 3 contains a

deconvolution layer whose input is the feature maps

of the consecutive layer. This two-layer feature fu-

sion is similar to a structure proposed in (Fu et al.,

2017) and (Ren et al., 2017), and it is a proper method

to propagate large contextual information to a small

scale feature map so that even when detecting a small

object, information about its surroundings is also uti-

lized. This can reduce the cases of detecting a part of

an actual object. Thus, it can be a remedy for the box-

in-box problem described earlier. The effect of this is

intuitively shown in the right side of Figure 1. Finally,

the proposed two-layer feature fusion method in Fig-

ure 2 can be expressed as

Detection =



(

), P

(

), . . . , P

(

)



(5)

where

a,b

= B

) + B

) and

) + B

). Here, B

, B

and B

indicate

branch1, branch2 and branch3, respectively.

3.2 Uniﬁed Prediction Module

Detecting objects of various sizes has been recog-

nized as an important problem in object detection.

Traditionally, (Viola and Jones, 2004; Doll

ar et al.,

2014; Dalal and Triggs, 2005) used a single classiﬁer

to predict multi-scale feature maps extracted from the

image pyramid. There is another approach of using

multiple classiﬁers on a single input image. The latter

has the advantage of reducing the amount of computa-

tion for calculating feature maps. However, it requires

an individual classiﬁer for each object scale.

Since the neural network has been prevalent, the

two-stage detectors applied RoI Pooling to the CNN

output to extract feature maps of the same size from

Two-layer Residual Feature Fusion for Object Detection

355

Faster R-CNNa)

single feature map object-wise cropping uniﬁed prediction

uniﬁed prediction

multi-scale feature maps multi-scale prediction

multi-scale feature maps 3-way residual features

SSD

Proposed

uniﬁed predictionmulti-scale feature maps Rainbow concatenation

R-SSD

with

One Classiﬁer

Figure 4: Comparison of various object detection

schemes: a) R-CNN and its variants need object-wise crop-

ping and the prediction is done by a common uniﬁed clas-

siﬁer. b) SSD does not need any cropping but requires a

separate classiﬁer for each scale of feature maps. c) R-SSD

concatenates feature maps in different layers so that objects

in each scale can be predicted with one uniﬁed classiﬁer

with the same amount of information. d) In the proposed

method, Resblock takes the role of feature map concatena-

tion and one uniﬁed classiﬁer is used for prediction.

objects of different sizes. These feature maps were

used as the input of a single classiﬁer. Meanwhile,

other methods using multi-scale features, such as SSD

(Liu et al., 2016), adopted multiple classiﬁers since

feature maps in each scale differed not only in length

but also in the underlying contextual information. In

order to effectively learn the prediction layers of var-

ious scales, it is necessary to input objects of various

scales. SSD could dramatically increase the detection

performance through augmentation which transforms

the size of input images.

R-SSD (Jeong et al., 2017) proposed the Rainbow

concatenation which combines feature maps in dif-

ferent scales using pooling and deconvolution. This

allows to set the depth of the input features for

each prediction module to be the same. Thus, R-SSD

could use a single classiﬁer that shares the weight

of multi-scale prediction modules. Similarly, the pro-

posed two-layer feature fusion through 3-way Res-

blocks enforces all the feature maps to have the same

depth of 256 as shown in Figure 3. Thus, structurally,

it is possible to unify convolution layers of differ-

ent prediction modules like R-SSD. The idea of the

uniﬁed prediction module is similar to (Jeong et al.,

2017), but our method is different from R-SSD in in-

formation contained in the input feature maps.

This approach makes differently-scaled feature

maps have similar level of information. SSD (Liu

et al., 2016) used multiple features of various scales.

This results in an improved performance of detecting

small objects compared with YOLO (Redmon et al.,

2016; Redmon and Farhadi, 2016) which used only

the last layer of the back-bone network. However,

since its earliest feature map is obtained from much

shallower layers than the later feature maps, it still has

a limitation of insufﬁcient information for prediction.

Because uniﬁed prediction applies equally to feature

maps of all scales, it forces the output of the 3-way

Resblock between the feature map of the backbone

and the prediction module to be learned at a similar

information level. It means that uniﬁed prediction in

combination with the residual feature block makes the

feature maps in the earliest Resblocks rich in context.

In the R-SSD (Jeong et al., 2017) structure, de-

tection performance for small objects was increased

using the uniﬁed prediction module, but the overall

performance decreased. However, as shown in Table

1, the proposed RUN structure shows a meaningful

increase in overall performance due to the uniﬁed pre-

diction. This shows that our two-layer feature fusion

method using 3-way Resblock is more suitable for

uniﬁed prediction than the feature fusion method used

in R-SSD. The work in (Lin et al., 2017) also men-

tions uniﬁed prediction, but it does not include any

details and experiments on it. A brief summary of dif-

ferent object detection schemes is shown in Figure 4.

4 EXPERIMENT

We experimented the proposed method on PASCAL

VOC 2007 (Everingham et al., 2010), PASCAL VOC

2012 and MS COCO datasets (Lin et al., 2014). Our

implementation is based on the publicly available

SSD (Liu et al., 2016)

. All the experimental results

of SSD are the latest scores with data augmentation

mentioned in (Fu et al., 2017). For all the experi-

ments, the reduced VGG-16 model (Simonyan and

Zisserman, 2014) pre-trained on the ILSVRC CLS-

LOC dataset (Russakovsky et al., 2014) is used as the

backbone network. For fair comparison, most of the

settings are set to be the same as those of SSD except

https://github.com/weiliu89/caffe/tree/ssd

ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods

356

Table 1: PASCAL 2007 test detection results.

Method mAP

SSD 300 77.5

SSD 300 + 2WAY 78.3

SSD 300 + 2WAY + Uniﬁed Pred 78.6

SSD 300 + 3WAY 78.8

SSD 300 + 3WAY + Uniﬁed Pred 79.2

the number of proposals. It is different from SSD, be-

cause we used 6 default boxes in all the prediction

layers for uniﬁed prediction while SSD used 4 for the

conv4 3 and the top layer, and 6 for the rest.

4.1 PASCAL VOC2007

We trained our model on VOC2007 trainval and

VOC2012 trainval. We set the batch size as 32. For

the training of the 2-way model, we used a learning

rate of 10

−3

initially, then it is decreased by a fac-

tor of 10 at 80k and 100k iterations, respectively. The

training was terminated at 120k iterations. For the 3-

way model, we froze all the weights of the pre-trained

2-way model except the prediction module, then ﬁne-

tuned the network using the learning rate of 10

−3

for

40k iterations, 10

−4

for the next 20k iterations, and

−5

for the ﬁnal 10k iterations. The end-to-end train-

ing was also applied on the 3-way model, but the re-

sults were worse than the above training strategy.

Table 1 shows our result on PASCAL VOC 2007

test set. Here, Uniﬁed Pred indicates the applica-

tion of the uniﬁed prediction module and the predic-

tion modules for the ones without this indication were

trained separately as in the original SSD. As men-

tioned above, each 3-way model was ﬁne-tuned on

the corresponding 2-way model. In this experiment,

we observed that the proposed model with only 2-way

Resblock without the deconvolution path achieved

1.1% higher mAP than that of SSD. The 3-way model

which further utilizes deconvolution layers was up

to 0.6% higher than the 2-way model. The uniﬁed

prediction module made better advance in the 3-way

model than the 2-way model, which scored 79.2% and

78.4% respectively.

4.2 MS COCO

For fair comparison with SSD (Liu et al., 2016), most

of the hyper-parameters required for training were set

to the same as SSD. For training 2-way models, we

used a learning rate of 10

−3

for the ﬁrst 240k itera-

tions, 10

−4

for the next 120k iterations and 10

−5

for

the last 40k. For training 3-way models, we used a

learning rate of 10

−3

for the ﬁrst 120k iterations, 10

−4

for the next 60k iterations and 10

−5

for the last 20k,

which are exactly half of those for the 2-way models.

Other parameters such as scales and aspect ratios of

the prior box were identical to those of SSD.

Table 2 shows the performance of various methods

on MS COCO test-dev. Despite the proposed meth-

ods use a relatively shallow network, VGG-16 (Si-

monyan and Zisserman, 2014), they achieved enough

performance to compare with other methods which

use a very deep network. The fourth column indicates

that RUN3WAY300 achieved 2.9% better mAP com-

pared to SSD300 (Liu et al., 2016). It was the same

performance with SSD321 and DSSD321 (Fu et al.,

2017), which adopted ResNet-101 (He et al., 2016)

as their back-bone network. Also, RUN3WAY512

achieved 3.6% better mAP than SSD512. In particu-

lar, RUN3WAY512 achieved the highest average pre-

cision and recall for small objects among compared

methods except RetinaNet. It means that the proposed

Resblock is a quite effective module to enhance low-

level feature maps.

5 DISCUSSION

5.1 Speed vs. Accuracy

050 100 150 200

SSD

YOLO v2

DSSD

R-FCN

FPN FRCN

RetinaNet

Proposed

VGG based

ResNet based

21ms faster

2.9%mAP

better

inference time(ms)

COCO AP

Figure 5: Speed vs. Accuracy of recent methods using

public numbers on COCO. Our results (sky blue circles)

are measured on Titan X. (Best viewed in color).

The single stage detectors, which are represented

by YOLO (Redmon et al., 2016) and SSD (Liu et al.,

2016), proposed end-to-end neural networks that re-

moved the RoI Pooling of two-stage detectors. They

have achieved a lot of speed improvements, but they

could not avoid the loss of accuracy. Conversely, re-

cent single stage detectors have been studied to im-

prove performance, while suffering the loss of speed.

Unlike other approaches, the proposed RUN is de-

signed to maximize performance at high speeds on

the VGG-16 (Simonyan and Zisserman, 2014) back-

Two-layer Residual Feature Fusion for Object Detection

357

Table 2: MS COCO test-dev detection results.

Method data network

Avg. Precision, IoU: Avg. Precision, Area: Avg. Recall, #Dets: Avg. Recall, Area:

0.5:0.95 0.5 0.75 S M L 1 10 100 S M L

Faster (Ren et al., 2015) trainval VGG

21.9 42.7 - - - - - - - - - -

ION (Bell et al., 2016) train VGG

23.6 43.2 23.6 6.4 24.1 38.3 23.2 32.7 33.5 10.1 37.7 53.6

R-FCN (Li et al., 2016) trainval ResNet-101

29.9 51.9 - 10.8 32.8 45.0 - - - - - -

RetinaNet (Lin et al., 2017) trainval ResNet-101

39.1 59.1 42.3 21.8 42.7 50.2 - - - - - -

SSD300 (Liu et al., 2016) trainval35k VGG

25.1 43.1 25.8 6.6 25.9 41.4 23.7 35.1 37.2 11.2 40.4 58.4

SSD321 (Fu et al., 2017) trainval35k ResNet-101

28.0 45.4 29.3 6.2 28.3 49.3 25.9 37.8 39.9 11.5 43.3 64.9

DSSD321 (Fu et al., 2017) trainval35k ResNet-101

28.0 46.1 29.2 7.4 28.1 47.6 25.5 37.1 39.4 12.7 42.0 62.6

RUN2WAY300 trainval35k VGG

27.4 46.1 28.4 8.9 27.9 43.8 25.0 37.3 39.5 14.6 42.6 59.8

RUN3WAY300 trainval35k VGG

28.0 47.5 28.9 9.9 28.6 43.9 25.3 38.0 40.5 16.2 43.8 60.2

SSD512 (Liu et al., 2016) trainval35k VGG

28.8 48.5 30.3 10.9 31.8 43.5 26.1 39.5 42.0 16.5 46.6 60.8

SSD513 (Fu et al., 2017) trainval35k ResNet-101

31.2 50.4 33.3 10.2 34.5 49.8 28.3 42.1 44.4 17.6 49.2 65.8

DSSD513 (Fu et al., 2017) trainval35k ResNet-101

33.2 53.3 35.2 13.0 35.4 51.1 28.9 43.5 46.2 21.8 49.1 66.4

RUN2WAY512 trainval35k VGG

31.7 52.1 33.6 13.2 33.9 46.5 27.7 42.2 44.7 22.0 47.9 62.7

RUN3WAY512 trainval35k VGG

32.4 53.5 34.2 14.7 34.0 46.7 28.0 43.0 45.8 24.4 48.1 63.4

Table 3: Speed & Accuracy on PASCAL VOC2007

test. * is measured by ourselves.

Method network mAP FPS GPU

Faster R-CNN (Ren et al., 2015) VGG16 73.2 7 Titan X

Faster R-CNN (He et al., 2016) ResNet-101 76.4 2.4 K40

R-FCN (Li et al., 2016) ResNet-101 80.5 9 Titan X

SSD300 (Liu et al., 2016) VGG16 77.5 54.5* Titan X

SSD321 (Fu et al., 2017) ResNet-101 77.1 16.4 Titan X

DSSD321 (Fu et al., 2017) ResNet-101 78.6 11.8 Titan X

R-SSD300 (Jeong et al., 2017) VGG16 78.5 37.1* Titan X

StairNet (Woo et al., 2017) VGG16 78.8 30 Titan X Pascal

RUN2WAY300 VGG16 78.6

41.8 Titan X

58.4 Titan X Pascal

RUN3WAY300 VGG16 79.2

40.0 Titan X

56.3 Titan X Pascal

SSD512 (Liu et al., 2016) VGG16 79.5 24.5* Titan X

SSD513 (Fu et al., 2017) ResNet-101 80.6 8.0 Titan X

DSSD513 (Fu et al., 2017) ResNet-101 81.5 6.4 Titan X

R-SSD512 (Jeong et al., 2017) VGG16 80.8 15.8* Titan X

RUN2WAY512 VGG16 80.6

20.1 Titan X

31.8 Titan X Pascal

RUN3WAY512 VGG16 80.9

19.5 Titan X

29.8 Titan X Pascal

bone, which has signiﬁcantly fewer layers and param-

eters than ResNet (He et al., 2016). The experimented

results demonstrate the performance improvement of

RUN.

Table 3 shows that our method outperforms other

competitors with less loss of speed. Our experiments

were tested using Titan X GPU, cuDNN v5.1 and In-

tel I7-6700@3.4GHz. For exact comparison, we mea-

sured FPS of some methods on the same environment

and marked * in the table.

In Figure 5, we show the trade-off relation be-

tween the detection accuracy and inference time

by plotting the results of RUN and other meth-

ods on COCO test-dev. The RUN3WAY300 model

(25.0ms, 28.0% mAP) is 36% slower but 2.9% bet-

ter in mAP than the SSD300 (Liu et al., 2016) model

(18.3ms, 25.1% mAP). It is about 60% faster than

ResNet-101 based SSD321 (Fu et al., 2017) model

(61ms, 28.0% mAP) that has a similar performance.

Likewise, the RUN3WAY512 (51.4ms, 32.4% mAP)

is 26% slower but 3.6% better in mAP than the

SSD512 model (40.8ms, 28.8% mAP). It is about

44% faster than RetinaNet-50-500 (Lin et al., 2017)

(73ms, 32.5% mAP) that has a similar performance.

In addition, we measured FPS of our methods on

Titan X Pascal with the other environment kept the

same. Table 3 shows that even the most complex ver-

sion of our method, RUN3WAY512, can works in real

time (29.8 FPS) on Tital X Pascal.

6 CONCLUSION

The proposed RUN architecture for object detection

was originated from the awareness of the contradic-

tory requirements for multi-scale features that they

should contain low-level information on an image

as well as high-level information on objectness. The

proposed two-layer feature fusion method using the

3-way Resblock alleviated the gradient exploitation

problem and enriched contextual information, an im-

portant element for a good prediction performance.

We also showed that the generalization performance

of multi-scale prediction in our architecture can be

improved by adopting the uniﬁed prediction mod-

ule which integrates the separate prediction modules

into a common prediction module. This approach,

which can be seen to be somewhat simple, resulted

in outstanding performance on the PASCAL VOC

test. The results on COCO dataset also show how fast

and efﬁcient our algorithms are. We expect the pro-

posed method be not restricted to SSD-based methods

but also applicable to other structures utilizing multi-

scale features.

ACKNOWLEDGEMENTS

The research was supported by the Green Car devel-

opment project through the Korean MTIE (10063267)

ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods

358

and ICT R&D program of MSIP/IITP (2017-0-

00306).

REFERENCES

Bell, S., Zitnick, C. L., Bala, K., and Girshick, R. (2016).

Inside-outside net: Detecting objects in context with

skip pooling and recurrent neural networks. In Com-

puter Vision and Pattern Recognition (CVPR), 2016

IEEE Conference on, pages 2874–2883. IEEE.

Cai, Z., Fan, Q., Feris, R., and Vasconcelos, N. (2016). A

uniﬁed multi-scale deep convolutional neural network

for fast object detection. In ECCV.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-

dients for human detection. In Computer Vision and

Pattern Recognition, 2005. CVPR 2005. IEEE Com-

puter Society Conference on, volume 1, pages 886–

893. IEEE.

Doll

ar, P., Appel, R., Belongie, S., and Perona, P. (2014).

Fast feature pyramids for object detection. IEEE

Transactions on Pattern Analysis and Machine Intel-

ligence, 36(8):1532–1545.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,

and Zisserman, A. (2010). The pascal visual object

classes (voc) challenge. International journal of com-

puter vision, 88(2):303–338.

Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A. C.

(2017). Dssd: Deconvolutional single shot detector.

arXiv preprint arXiv:1701.06659.

Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE

International Conference on Computer Vision, pages

1440–1448.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).

Rich feature hierarchies for accurate object detec-

tion and semantic segmentation. In Proceedings of

the IEEE conference on computer vision and pattern

recognition, pages 580–587.

He, K., Zhang, X., Ren, S., and Sun, J. (2014). Spatial pyra-

mid pooling in deep convolutional networks for visual

recognition. In European Conference on Computer

Vision, pages 346–361. Springer.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition. In Proceedings of

the IEEE Conference on Computer Vision and Pattern

Recognition, pages 770–778.

Jeong, J., Park, H., and Kwak, N. (2017). Enhancement of

ssd by concatenating feature maps for object detec-

tion. arXiv preprint arXiv:1705.09587.

Kong, T., Yao, A., Chen, Y., and Sun, F. (2016). Hyper-

net: Towards accurate region proposal generation and

joint object detection. In Computer Vision and Pat-

tern Recognition (CVPR), 2016 IEEE Conference on,

pages 845–853. IEEE.

Li, Y., He, K., Sun, J., et al. (2016). R-fcn: Object detec-

tion via region-based fully convolutional networks. In

Advances in Neural Information Processing Systems,

pages 379–387.

Lin, T.-Y., Doll

ar, P., Girshick, R., He, K., Hariha-

ran, B., and Belongie, S. (2016). Feature pyra-

mid networks for object detection. arXiv preprint

arXiv:1612.03144.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll

ar, P.

(2017). Focal loss for dense object detection. arXiv

preprint arXiv:1708.02002.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra-

manan, D., Doll

ar, P., and Zitnick, C. L. (2014). Mi-

crosoft coco: Common objects in context. In Euro-

pean Conference on Computer Vision, pages 740–755.

Springer.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,

C.-Y., and Berg, A. C. (2016). Ssd: Single shot multi-

box detector. In European Conference on Computer

Vision, pages 21–37. Springer.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.

(2016). You only look once: Uniﬁed, real-time object

detection. In Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, pages 779–

788.

Redmon, J. and Farhadi, A. (2016). Yolo9000: Better, faster,

stronger. arXiv preprint arXiv:1612.08242.

Ren, J., Chen, X., Liu, J., Sun, W., Pang, J., Yan, Q., Tai, Y.-

W., and Xu, L. (2017). Accurate single stage detector

using recurrent rolling convolution. In CVPR.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster

r-cnn: Towards real-time object detection with region

proposal networks. In Advances in neural information

processing systems, pages 91–99.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh,

S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,

Bernstein, M., et al. (2014). Imagenet large

scale visual recognition challenge. arXiv preprint

arXiv:1409.0575.

Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus,

R., and LeCun, Y. (2013). Overfeat: Integrated recog-

nition, localization and detection using convolutional

networks. arXiv preprint arXiv:1312.6229.

Simonyan, K. and Zisserman, A. (2014). Very deep con-

volutional networks for large-scale image recognition.

arXiv preprint arXiv:1409.1556.

Viola, P. and Jones, M. J. (2004). Robust real-time face

detection. International journal of computer vision,

57(2):137–154.

Woo, S., Hwang, S., and Kweon, I. S. (2017). Stairnet: Top-

down semantic aggregation for accurate one shot de-

tection. arXiv preprint arXiv:1709.05788.

Two-layer Residual Feature Fusion for Object Detection

359