and ICT R&D program of MSIP/IITP (2017-0-00306).
REFERENCES
Bell, S., Zitnick, C. L., Bala, K., and Girshick, R. (2016).
Inside-outside net: Detecting objects in context with
skip pooling and recurrent neural networks. In Computer
Vision and Pattern Recognition (CVPR), 2016
IEEE Conference on, pages 2874–2883. IEEE.
Cai, Z., Fan, Q., Feris, R., and Vasconcelos, N. (2016). A
unified multi-scale deep convolutional neural network
for fast object detection. In ECCV.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In Computer Vision and
Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, volume 1, pages 886–893. IEEE.
Dollár, P., Appel, R., Belongie, S., and Perona, P. (2014).
Fast feature pyramids for object detection. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 36(8):1532–1545.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2010). The PASCAL visual object
classes (VOC) challenge. International Journal of
Computer Vision, 88(2):303–338.
Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A. C.
(2017). DSSD: Deconvolutional single shot detector.
arXiv preprint arXiv:1701.06659.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE
International Conference on Computer Vision, pages
1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detection
and semantic segmentation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 580–587.
He, K., Zhang, X., Ren, S., and Sun, J. (2014). Spatial
pyramid pooling in deep convolutional networks for visual
recognition. In European Conference on Computer
Vision, pages 346–361. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778.
Jeong, J., Park, H., and Kwak, N. (2017). Enhancement of
SSD by concatenating feature maps for object
detection. arXiv preprint arXiv:1705.09587.
Kong, T., Yao, A., Chen, Y., and Sun, F. (2016). HyperNet:
Towards accurate region proposal generation and
joint object detection. In Computer Vision and Pattern
Recognition (CVPR), 2016 IEEE Conference on,
pages 845–853. IEEE.
Li, Y., He, K., Sun, J., et al. (2016). R-FCN: Object
detection via region-based fully convolutional networks. In
Advances in Neural Information Processing Systems,
pages 379–387.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2016). Feature pyramid
networks for object detection. arXiv preprint
arXiv:1612.03144.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017). Focal loss for dense object detection. arXiv
preprint arXiv:1708.02002.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft COCO: Common objects in context. In European
Conference on Computer Vision, pages 740–755.
Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,
C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox
detector. In European Conference on Computer
Vision, pages 21–37. Springer.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 779–788.
Redmon, J. and Farhadi, A. (2016). YOLO9000: Better, faster,
stronger. arXiv preprint arXiv:1612.08242.
Ren, J., Chen, X., Liu, J., Sun, W., Pang, J., Yan, Q.,
Tai, Y.-W., and Xu, L. (2017). Accurate single stage detector
using recurrent rolling convolution. In CVPR.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
R-CNN: Towards real-time object detection with region
proposal networks. In Advances in Neural Information
Processing Systems, pages 91–99.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh,
S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,
Bernstein, M., et al. (2014). ImageNet large
scale visual recognition challenge. arXiv preprint
arXiv:1409.0575.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus,
R., and LeCun, Y. (2013). OverFeat: Integrated recognition,
localization and detection using convolutional
networks. arXiv preprint arXiv:1312.6229.
Simonyan, K. and Zisserman, A. (2014). Very deep
convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Viola, P. and Jones, M. J. (2004). Robust real-time face
detection. International Journal of Computer Vision,
57(2):137–154.
Woo, S., Hwang, S., and Kweon, I. S. (2017). StairNet:
Top-down semantic aggregation for accurate one shot
detection. arXiv preprint arXiv:1709.05788.