From Text Vocabularies to Visual Vocabularies - What Basis?

Jean Martinet


The popular "bag-of-visual-words" approach to representing and searching visual documents describes images (or video keyframes) with a set of descriptors that correspond to quantized low-level features. Most existing approaches to visual words are inspired by work in text indexing and rest on the implicit assumption that visual words can be handled in the same way as text words. More specifically, these techniques implicitly rely on the same postulate as text information retrieval: that the word distribution of a natural language globally follows Zipf's law, that is, words from a natural language appear in a corpus with a frequency inversely proportional to their rank. However, our study shows that the distribution of visual words depends on the choice of low-level features and, especially, on the choice of clustering method. We also show that when the visual word distribution is close to that of text words, the results of an image retrieval system improve. To the best of our knowledge, no prior study has compared the distributions of text words and visual words with the objective of establishing the theoretical foundations of visual vocabularies.
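
As an illustration of the Zipf's law postulate mentioned above, the following minimal sketch estimates how closely a word-count distribution follows a frequency that is inversely proportional to rank. It is not the procedure used in the paper: the function names `rank_frequency` and `zipf_slope`, the log-log least-squares fit, and the sample data are all assumptions made for this example. A fitted slope near -1 suggests a Zipf-like distribution, while a slope near 0 suggests a flat one.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return occurrence counts sorted from most to least frequent.

    `tokens` may be text words or visual-word identifiers (hypothetical input).
    """
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    `counts` must hold at least two strictly positive values, already
    sorted in decreasing order. A slope near -1 means that frequency is
    roughly proportional to 1 / rank.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Illustrative data only (not taken from the paper):
# an ideal Zipfian vocabulary of 50 words, and a perfectly flat one.
ideal_counts = [1000.0 / rank for rank in range(1, 51)]
flat_counts = [20.0] * 50

ideal_slope = zipf_slope(ideal_counts)
flat_slope = zipf_slope(flat_counts)
```

In practice, the counts would come from `rank_frequency` applied to the visual-word identifiers assigned to all descriptors in a collection, so that vocabularies built from different features or clustering methods can be compared on the same scale.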



Paper Citation

in Harvard Style

Martinet J. (2014). From Text Vocabularies to Visual Vocabularies - What Basis? In Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014) ISBN 978-989-758-004-8, pages 668-675. DOI: 10.5220/0004749606680675

in Bibtex Style

author={Jean Martinet},
title={From Text Vocabularies to Visual Vocabularies - What Basis?},
booktitle={Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014)},

in EndNote Style

JO - Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2014)
TI - From Text Vocabularies to Visual Vocabularies - What Basis?
SN - 978-989-758-004-8
AU - Martinet J.
PY - 2014
SP - 668
EP - 675
DO - 10.5220/0004749606680675