ENHANCING HIGH PRECISION BY COMBINING OKAPI BM25 WITH STRUCTURAL SIMILARITY IN AN INFORMATION RETRIEVAL SYSTEM

Yaël Champclaux, Taoufiq Dkaki, Josiane Mothe

2009

Abstract

In this paper, we present a new similarity measure in the context of Information Retrieval (IR). The main objective of IR systems is to select, from a collection of documents, those that are relevant to a user's information need. Traditional approaches to document/query comparison use surface similarity, i.e. the comparison engine relies on surface attributes (indexing terms). We propose a new method that combines surface and structural similarities with the aim of enhancing the precision of the top retrieved documents. In previous work, we showed that using structural similarity in combination with cosine improves on a bare cosine ranking. In this paper, we compare our method to Okapi, based on BM25, on the Cranfield collection. We show that structural similarities improve average precision and precision at the top 10 retrieved documents by about 50%. Experiments also address the influence of term weighting on system performance.

In this paper, we present a graph-based model that belongs to the vector space family. A vector space model considers each document as a vector in the term space. Each coordinate of a vector is a value representing the importance of an indexing term in a document or in a query. The vector space is defined by the set of terms that the system collects during the indexing phase. Many similarity measures, such as Cosine, Jaccard and Dice, are used to determine how well a document corresponds to a query. Such measures determine local similarities between a document and a query on the basis of the terms they have in common. Our goal is to exploit another type of similarity, called structural similarity. Structural similarities identify resemblances between elements on the basis of the relationships they have. The structural relationship that we use originates from the fact that documents contain words and that words are contained in documents. The idea is to compare documents through the similarities between the words they contain, while similarities between words themselves depend on similarities between the documents that contain them.

In a previous paper, we showed that structural similarities alone were not sufficient to improve the performance of an IR system. In this paper, we present a different method that combines structural and surface similarities with the aim of enhancing high precision. Surface similarity is computed as an Okapi measure. The selected documents are then stored in a graph and sorted using a SimRank-based score. We call this two-stage method OkaSim. We have performed experiments with different term weightings on the Cranfield corpus and show that structural similarities can improve an Okapi ranking. We show that those similarities can improve the average precision of an Okapi ranking by more than 50% and its precision at the top 10 retrieved documents by about 50%. Tests and experiments also address the influence of term weighting on system performance.
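The two-stage idea described above (an Okapi BM25 selection followed by a SimRank-based reordering) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the graph is built or how the SimRank-based score is computed. The function names (`bm25_scores`, `bipartite_simrank`, `rerank_two_stage`), the parameter values (`k1=1.2`, `b=0.75`, `decay=0.8`, `iterations=5`, `top_k=10`), the non-negative idf variant, and the choice to add the query as an extra node of a bipartite document/term graph are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avg_len)  # length normalisation
        score = 0.0
        for t in set(query):
            if tf[t] == 0:
                continue
            # Assumption: non-negative idf variant (the classic form omits "+ 1").
            idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores


def bipartite_simrank(items, decay=0.8, iterations=5):
    """SimRank on an item/term bipartite graph; items is a list of term sets.

    Two items are similar if they contain similar terms, and two terms are
    similar if they are contained in similar items.
    """
    n = len(items)
    terms = sorted(set().union(*items))
    holders = {t: [i for i in range(n) if t in items[i]] for t in terms}
    item_sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term_sim = {(s, t): 1.0 if s == t else 0.0 for s in terms for t in terms}
    for _ in range(iterations):
        new_item = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if items[i] and items[j]:
                    total = sum(term_sim[s, t] for s in items[i] for t in items[j])
                    value = decay * total / (len(items[i]) * len(items[j]))
                    new_item[i][j] = new_item[j][i] = value
        new_term = {}
        for s in terms:
            for t in terms:
                if s == t:
                    new_term[s, t] = 1.0
                else:
                    total = sum(item_sim[i][j] for i in holders[s] for j in holders[t])
                    new_term[s, t] = decay * total / (len(holders[s]) * len(holders[t]))
        item_sim, term_sim = new_item, new_term
    return item_sim


def rerank_two_stage(query, docs, top_k=10):
    """Stage 1: keep the top_k documents by BM25. Stage 2: reorder by SimRank."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: the query is node 0 of the graph, followed by the candidates.
    items = [set(query)] + [set(docs[i]) for i in candidates]
    sim = bipartite_simrank(items)
    ranked = sorted(zip(candidates, sim[0][1:]), key=lambda p: p[1], reverse=True)
    return [doc_id for doc_id, _ in ranked]
```

Calling `rerank_two_stage(query, docs, top_k=10)` with a tokenized query and a list of tokenized documents returns the indices of the ten best BM25 candidates, reordered by their structural similarity to the query. The sketch ignores term weights in the graph; since the abstract reports that term weighting influences performance, a weighted variant would be a natural extension.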



Paper Citation


in Harvard Style

Champclaux Y., Dkaki T. and Mothe J. (2009). ENHANCING HIGH PRECISION BY COMBINING OKAPI BM25 WITH STRUCTURAL SIMILARITY IN AN INFORMATION RETRIEVAL SYSTEM. In Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 3: ICEIS, ISBN 978-989-8111-86-9, pages 279-285. DOI: 10.5220/0002017202790285


in Bibtex Style

@conference{iceis09,
author={Yaël Champclaux and Taoufiq Dkaki and Josiane Mothe},
title={ENHANCING HIGH PRECISION BY COMBINING OKAPI BM25 WITH STRUCTURAL SIMILARITY IN AN INFORMATION RETRIEVAL SYSTEM},
booktitle={Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 3: ICEIS},
year={2009},
pages={279-285},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002017202790285},
isbn={978-989-8111-86-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 3: ICEIS
TI - ENHANCING HIGH PRECISION BY COMBINING OKAPI BM25 WITH STRUCTURAL SIMILARITY IN AN INFORMATION RETRIEVAL SYSTEM
SN - 978-989-8111-86-9
AU - Champclaux Y.
AU - Dkaki T.
AU - Mothe J.
PY - 2009
SP - 279
EP - 285
DO - 10.5220/0002017202790285