5 CONCLUSION
We proposed a novel method for training a two-stage object detection network from multiple datasets in which no single dataset needs to carry the full label set, i.e. not all object categories have to be labeled in every dataset used for training. The results indicate that, by excluding unlabeled objects from the loss calculation, the novel approach significantly outperforms a regular object detection network. Furthermore, the results indicate that, depending on the task, even regular approaches are quite robust, but they can perform better when extended with the new method, which excludes from the loss calculation those regions that have been identified as objects of a category that is unlabeled for the current training sample. Thus, our method can help to speed up learning of new object sets without the time- and cost-intensive task of labeling all objects in the entire dataset, and it also helps in domains where labeled data is scarce. From a run-time perspective, the proposed method is virtually identical to the original Faster R-CNN implementation.
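The per-sample exclusion can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch-style example for a classification head over region proposals; the function name masked_detection_loss, its arguments, and the way regions of unlabeled categories are flagged are illustrative assumptions rather than our exact implementation.

import torch
import torch.nn.functional as F

def masked_detection_loss(cls_logits, cls_targets, region_categories,
                          labeled_categories, ignore_index=-1):
    # cls_logits:         (N, C) classification scores for N region proposals.
    # cls_targets:        (N,)   class assigned from the annotations of the
    #                            current sample (0 = background).
    # region_categories:  (N,)   category each region has been identified as
    #                            (illustrative assumption).
    # labeled_categories: set of category ids annotated in the dataset the
    #                            current training image comes from.
    targets = cls_targets.clone()
    # Regions that look like objects of a category not labeled for this
    # sample would otherwise act as false background examples, so they
    # are excluded from the loss calculation.
    exclude = torch.tensor(
        [int(c) not in labeled_categories for c in region_categories],
        dtype=torch.bool, device=targets.device,
    ) & (targets == 0)
    targets[exclude] = ignore_index
    return F.cross_entropy(cls_logits, targets, ignore_index=ignore_index)

Such a mask only changes the targets entering the loss, which is consistent with the observation above that the run-time is virtually identical to standard Faster R-CNN.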
In addition to the study presented here, more work is needed. Future studies should evaluate additional dataset configurations and investigate how the method performs with single-stage detectors such as (Redmon et al., 2016; Redmon and Farhadi, 2018; Liu et al., 2016; Lin et al., 2018). It might also be interesting to see whether the approach can be transferred to other domains such as action recognition (Layher et al., 2017). Furthermore, it should be addressed how many datasets can be used simultaneously and how this affects system performance.
ACKNOWLEDGEMENTS
We thank Philippe Chiberre for his work on preliminary versions of the ideas outlined in this paper.
REFERENCES
Brosch, T., Neumann, H., and Roelfsema, P. R. (2015). Rein-
forcement Learning of Linking and Tracing Contours
in Recurrent Neural Networks. PLoS Computational
Biology, 11(10):e1004489.
Brosch, T., Schwenker, F., and Neumann, H. (2013). Attention-Gated Reinforcement Learning in Neural Networks - A Unified View. In ICANN, volume 8131 of LNCS, pages 272–9. Springer.
Csurka, G. (2017). Domain Adaptation for Visual Applications: A Comprehensive Survey. In Csurka, G., editor, Domain Adaptation in Computer Vision Applications, Advances in Computer Vision and Pattern Recognition, pages 1–35. Springer.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV, pages 1440–8. IEEE.
Grondman, I., Buşoniu, L., Lopes, G. A. D., and Babuška, R. (2012). A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(6):1291–1307.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–8. IEEE.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial
Pyramid Pooling in Deep Convolutional Networks for
Visual Recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 37(9):1904–16.
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., and Murphy, K. (2017). Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. https://arxiv.org/pdf/1611.10012.pdf.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Ima-
geNet Classification with Deep Convolutional Neural
Networks. In NIPS.
Layher, G., Brosch, T., and Neumann, H. (2017). Real-Time Biologically Inspired Action Recognition from Key Poses Using a Neuromorphic Architecture. Frontiers in Neurorobotics, 11(13):1–21.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
R. E., Hubbard, W., and Jackel, L. D. (1989). Back-
propagation Applied to Handwritten Zip Code Recog-
nition. Neural Computation, 1(4):541–51.
Lin, M., Chen, Q., and Yan, S. (2014a). Network in Network.
https://arxiv.org/pdf/1312.4400v3.pdf.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and
Belongie, S. (2017). Feature Pyramid Networks for
Object Detection. In Conference on Computer Vision
and Pattern Recognition (CVPR), pages 936–44. IEEE.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2018). Focal Loss for Dense Object Detection. https://arxiv.org/pdf/1708.02002.pdf.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick,
R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L.,
and Dollár, P. (2014b). Microsoft COCO: Common
Objects in Context. In Computer Vision – ECCV 2014,
pages 740–55. Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single Shot MultiBox Detector. https://arxiv.org/abs/1512.02325.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-Level Control Through Deep Reinforcement Learning. Nature, 518:529–33.
Nguyen-Meidine, L. T., Granger, E., Kiran, M., and Blais-Morin, L.-A. (2017). A Comparison of CNN-based Face and Head Detectors for Real-Time Video Surveillance Applications. In Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA).