visual object detectors that were trained in a fully supervised manner.
ACKNOWLEDGEMENTS
This work has been supported by the German Re-
search Foundation (DFG) within project Fi799/9-1.
The authors would like to thank Kristian Kersting for
his helpful comments and discussions.
VISAPP 2017 - International Conference on Computer Vision Theory and Applications