Combining Text Semantics and Image Geometry to Improve Scene Interpretation

Dennis Medved, Fangyuan Jiang, Peter Exner, Magnus Oskarsson, Pierre Nugues, Kalle Åström

Abstract

In this paper, we describe a novel system that identifies relations between the objects extracted from an image. We started from the idea that, in addition to the geometric and visual properties of the image objects, we could exploit lexical and semantic information from the text accompanying the image. As an experimental setup, we gathered a corpus of images from Wikipedia together with their associated articles. We extracted two types of objects, human beings and horses, and we considered three relations that could hold between them: Ride, Lead, or None. We used geometric features as a baseline to identify the relations between the entities, and we describe the improvements brought by the addition of bag-of-words features and predicate–argument structures derived from the text. The best semantic model resulted in a relative error reduction of more than 18% over the baseline.
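The "relative error reduction" reported in the abstract is the fraction of the baseline's error eliminated by the improved model, not an absolute drop in error rate. A minimal sketch of the computation, using made-up error rates (the paper reports only the resulting reduction of more than 18%, not these figures):

```python
def relative_error_reduction(baseline_error: float, model_error: float) -> float:
    """Fraction of the baseline's error eliminated by the new model."""
    return (baseline_error - model_error) / baseline_error

# Hypothetical error rates chosen only to illustrate the formula:
baseline = 0.30   # assumed error of a geometry-only baseline
semantic = 0.245  # assumed error of a model with added text features
print(round(relative_error_reduction(baseline, semantic), 3))  # 0.183
```

Note that a modest absolute improvement (here 5.5 percentage points) can correspond to a reduction of over 18% of the baseline error.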

References

  1. Carreira, J. and Sminchisescu, C. (2010). Constrained Parametric Min-Cuts for Automatic Object Segmentation. In IEEE International Conference on Computer Vision and Pattern Recognition.
  2. Chen, N., Zhou, Q.-Y., and Prasanna, V. (2012). Understanding web images by object relation network. In Proceedings of the 21st international conference on World Wide Web, WWW '12, pages 291-300, New York, NY, USA. ACM.
  3. Deschacht, K. and Moens, M.-F. (2007). Text analysis for automatic image annotation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 1000-1007, Prague.
  4. Exner, P. and Nugues, P. (2012). Constructing large proposition databases. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul.
  5. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.
  6. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645.
  7. Gupta, A., Verma, Y., and Jawahar, C. (2012). Choosing linguistics over vision to describe images. In Proc. of the twenty-sixth AAAI conference on artificial intelligence.
  8. Jörgensen, C. (1998). Attributes of images in describing tasks. Information Processing and Management, 34(2-3):161-174.
  9. Kulkarni, G., Premraj, V., Dhar, S., Siming, L., Choi, Y., Berg, A., and Berg, T. (2011). Baby talk: Understanding and generating image descriptions. In Proc. Conf. Computer Vision and Pattern Recognition.
  10. Ladicky, L., Russell, C., Kohli, P., and Torr, P. H. S. (2010). Graph cut based inference with co-occurrence statistics. In Proceedings of the 11th European conference on Computer vision: Part V, ECCV'10, pages 239-253, Berlin, Heidelberg. Springer-Verlag.
  11. Markkula, M. and Sormunen, E. (2000). End-user searching challenges indexing practices in the digital newspaper photo archive. Information retrieval, 1(4):259-285.
  12. Marszalek, M. and Schmid, C. (2007). Semantic hierarchies for visual object recognition. In Proc. Conf. Computer Vision and Pattern Recognition.
  13. Moscato, V., Picariello, A., Persia, F., and Penta, A. (2009). A system for automatic image categorization. In Semantic Computing, 2009. ICSC'09. IEEE International Conference on, pages 624-629. IEEE.
  14. Myeong, H., Chang, J. Y., and Lee, K. M. (2012). Learning object relationships via graph-based context model. In CVPR, pages 2727-2734.
  15. Paek, S., Sable, C., Hatzivassiloglou, V., Jaimes, A., Schiffman, B., Chang, S., and McKeown, K. (1999). Integration of visual and text-based approaches for the content labeling and classification of photographs. In ACM SIGIR, volume 99.
  16. Palmer, M., Gildea, D., and Kingsbury, P. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71-105.
  17. Stamborg, M., Medved, D., Exner, P., and Nugues, P. (2012). Using syntactic dependencies to solve coreferences. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 64-70, Jeju Island, Korea. Association for Computational Linguistics.
  18. Westman, S. and Oittinen, P. (2006). Image retrieval by end-users and intermediaries in a journalistic work context. In Proceedings of the 1st international conference on Information interaction in context, pages 102-110. ACM.
  19. Wikipedia (2012). Wikipedia statistics English. http://stats.wikimedia.org/EN/TablesWikipediaEN.htm.


Paper Citation


in Harvard Style

Medved D., Jiang F., Exner P., Oskarsson M., Nugues P. and Åström K. (2014). Combining Text Semantics and Image Geometry to Improve Scene Interpretation. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 479-486. DOI: 10.5220/0004752004790486


in Bibtex Style

@conference{icpram14,
author={Dennis Medved and Fangyuan Jiang and Peter Exner and Magnus Oskarsson and Pierre Nugues and Kalle Åström},
title={Combining Text Semantics and Image Geometry to Improve Scene Interpretation},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={479-486},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004752004790486},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Combining Text Semantics and Image Geometry to Improve Scene Interpretation
SN - 978-989-758-018-5
AU - Medved D.
AU - Jiang F.
AU - Exner P.
AU - Oskarsson M.
AU - Nugues P.
AU - Åström K.
PY - 2014
SP - 479
EP - 486
DO - 10.5220/0004752004790486