Incorporating Ad Hoc Phrases in LSI Queries

Roger Bradford


Latent semantic indexing (LSI) is a well-established technique for information retrieval and data mining, and it has been incorporated into a wide variety of practical applications. In these applications, LSI provides valuable capabilities for information search, categorization, clustering, and discovery. The technique does, however, have limitations. One is that the classical implementation of LSI provides no flexible mechanism for dealing with phrases, even though phrases can have significant value in specifying user information needs in both information retrieval and data mining applications. In the classical implementation, a phrase can be used in a query only if it has been identified a priori and treated as a unit when the LSI index was created. This requirement has greatly hindered the use of phrases in LSI applications. This paper presents a method for dealing with phrases in LSI-based information systems on an ad hoc basis: at query time, without requiring any prior knowledge of the phrases of interest. The approach is fast enough to be used during real-time query execution. This new capability can enhance the use of LSI in both information retrieval and knowledge discovery applications.
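The index-time limitation described above can be illustrated with a minimal sketch. It is not the method of the paper, and it omits the SVD step of LSI entirely; it only shows how a phrase becomes an index term when it is known while the index is built, and how any other phrase falls apart into independent words at query time. The phrase list, the sample documents, and the function names (`tokenize`, `build_vocabulary`, `query_terms`) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical phrase list, fixed when the index is built (illustrative assumption).
KNOWN_PHRASES = {("information", "retrieval")}


def tokenize(text, phrases=KNOWN_PHRASES):
    """Split text into terms, merging each known two-word phrase into one unit."""
    words = text.lower().split()
    terms, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in phrases:
            terms.append("_".join(pair))  # phrase treated as a single index term
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms


def build_vocabulary(docs):
    """Collect the index terms; in LSI only these would receive term vectors."""
    vocab = set()
    for doc in docs:
        vocab.update(tokenize(doc))
    return vocab


def query_terms(query, vocab):
    """Map a query onto index terms, dropping anything absent from the vocabulary."""
    return Counter(t for t in tokenize(query) if t in vocab)


# Toy corpus (illustrative assumption).
docs = [
    "information retrieval with latent semantic indexing",
    "data mining and knowledge discovery",
    "semantic indexing for data mining",
]
vocab = build_vocabulary(docs)

known = query_terms("information retrieval", vocab)   # phrase known a priori
ad_hoc = query_terms("data mining", vocab)            # phrase not known a priori
reversed_words = query_terms("mining data", vocab)    # same words, different order
```

In this sketch `known` contains the single unit `information_retrieval`, whereas `ad_hoc` contains only the separate words `data` and `mining` and is indistinguishable from `reversed_words`. Making "data mining" a unit would require adding it to the phrase list and rebuilding the index, which is the cost an ad hoc, query-time approach aims to avoid.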


Paper Citation

in Harvard Style

Bradford R. (2014). Incorporating Ad Hoc Phrases in LSI Queries. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 61-70. DOI: 10.5220/0005073300610070
