ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the German Federal Ministry
for Economic Affairs and Energy under the
VIRTUOSE-DE project.
REFERENCES
Andriluka, M., Roth, S., and Schiele, B.
(2008). People-tracking-by-detection and
people-detection-by-tracking. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR
2008), pages 1–8.
Angelova, A., Krizhevsky, A., and Vanhoucke, V. (2015a).
Pedestrian detection with a large-field-of-view deep
network. In Proceedings of ICRA 2015.
Angelova, A., Krizhevsky, A., Vanhoucke, V., Ogale, A.,
and Ferguson, D. (2015b). Real-time pedestrian
detection with deep network cascades. In Proceedings
of BMVC 2015.
Arteta, C., Lempitsky, V., Noble, J. A., and Zisserman, A.
(2014). Interactive object counting, pages 504–518.
Springer International Publishing, Cham.
Baltieri, D., Vezzani, R., and Cucchiara, R. (2011). 3DPES:
3D people dataset for surveillance and forensics. In
Proceedings of the 1st International ACM Workshop
on Multimedia access to 3D Human Objects, pages
59–64, Scottsdale, Arizona, USA.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning.
Book in preparation for MIT Press.
Chan, A. B., Liang, Z.-S. J., and Vasconcelos, N. (2008).
Privacy preserving crowd monitoring: Counting
people without people models or tracking. In
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2008), pages 1–7.
Chan, A. B., Morrow, M., and Vasconcelos, N.
(2009). Analysis of crowded scenes using holistic
properties. In Performance Evaluation of Tracking
and Surveillance workshop at CVPR 2009, pages
101–108, Miami, Florida.
Chen, K., Gong, S., Xiang, T., and Loy, C. C. (2013).
Cumulative attribute space for age and crowd density
estimation. In 2013 IEEE Conference on Computer
Vision and Pattern Recognition, Portland, OR, USA,
June 23-28, 2013, pages 2467–2474.
Chen, K., Loy, C. C., Gong, S., and Xiang, T. (2012).
Feature mining for localised crowd counting. In
British Machine Vision Conference, BMVC 2012,
Surrey, UK, September 3-7, 2012, pages 1–11.
Enzweiler, M. and Gavrila, D. M. (2009). Monocular
pedestrian detection: Survey and experiments. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 31(12):2179–2195.
Fiaschi, L., Koethe, U., Nair, R., and Hamprecht, F. A.
(2012). Learning to count with regression forest and
structured labels. In 21st International Conference on
Pattern Recognition (ICPR), 2012, pages 2685–2688.
Fujii, Y., Yoshinaga, S., Shimada, A., and Taniguchi,
R.-i. (2010). Real-time people counting using blob
descriptor. In The 1st International Conference on
Security Camera Network, Privacy Protection and
Community Safety 2009. Procedia - Social and
Behavioral Sciences, 2(1):143–152.
Girshick, R. B., Donahue, J., Darrell, T., and Malik,
J. (2013). Rich feature hierarchies for accurate
object detection and semantic segmentation. CoRR,
abs/1311.2524.
Golik, P., Doetsch, P., and Ney, H. (2013). Cross-entropy
vs. squared error training: a theoretical and
experimental comparison. In Interspeech, pages
1756–1760, Lyon, France.
Hattori, H., Boddeti, V. N., Kitani, K. M., and
Kanade, T. (2015). Learning scene-specific pedestrian
detectors without real data. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever,
I., and Salakhutdinov, R. (2012). Improving neural
networks by preventing co-adaptation of feature
detectors. CoRR, abs/1207.0580.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,
J., Girshick, R., Guadarrama, S., and Darrell, T.
(2014). Caffe: Convolutional architecture for fast
feature embedding. arXiv preprint arXiv:1408.5093.
Kline, D. M. and Berardi, V. L. (2005). Revisiting squared-error
and cross-entropy functions for training neural
network classifiers. Neural Comput. Appl.,
14(4):310–318.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
ImageNet classification with deep convolutional
neural networks. In Advances in Neural Information
Processing Systems 25: 26th Annual Conference
on Neural Information Processing Systems 2012,
Lake Tahoe, Nevada, United States, December 3-6,
2012, pages 1106–1114.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep
learning. Nature, 521(7553):436–444.
Lempitsky, V. and Zisserman, A. (2010). Learning to count
objects in images. In Lafferty, J. D., Williams, C.
K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A.,
editors, Advances in Neural Information Processing
Systems 23, pages 1324–1332. Curran Associates, Inc.
Liu, W., Wen, Y., Yu, Z., and Yang, M. (2016).
Large-margin softmax loss for convolutional neural
networks. In ICML.
Luo, P., Wang, X., and Tang, X. (2013). Pedestrian
parsing via deep decompositional network. In IEEE
International Conference on Computer Vision, ICCV
2013, Sydney, Australia, December 1-8, 2013, pages
2648–2655.
Merad, D., Aziz, K. E., and Thome, N. (2010). Fast people
counting using head detection from skeleton graph. In
Seventh IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS), 2010,
pages 151–156.
Moody, J. E. (1991). The effective number of parameters:
An analysis of generalization and regularization in