HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY - A Case Study of Pointwise Mutual Information

François Role, Mohamed Nadif

2011

Abstract

Statistical measures of word similarity are widely used in many areas of information retrieval and text mining. Pointwise Mutual Information (PMI) is one of the most popular measures based on word co-occurrence. Although widely used, PMI has a well-known tendency to give excessively high relatedness scores to word pairs that involve low-frequency words. Many variants of PMI have therefore been proposed to correct this bias empirically. In contrast to this empirical approach, we propose formulae and indicators that describe the behavior of these variants precisely, so that researchers and practitioners can make a more informed decision about which measure to use in different scenarios.
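The low-frequency bias mentioned in the abstract can be illustrated with a small sketch. It uses the standard definition PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) with probabilities estimated from raw counts; the function name `pmi`, the corpus size, and all counts below are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Standard PMI in bits, with probabilities estimated as count / n_total.

    n_xy    : number of contexts in which x and y co-occur
    n_x, n_y: number of contexts containing x (resp. y)
    n_total : total number of contexts (hypothetical corpus size)
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

N = 1_000_000  # assumed corpus size, for illustration only

# Two words that each occur once, and happen to occur together.
rare = pmi(n_xy=1, n_x=1, n_y=1, n_total=N)

# Two frequent words that co-occur in half of their occurrences.
frequent = pmi(n_xy=5_000, n_x=10_000, n_y=10_000, n_total=N)

print(f"rare pair:     {rare:.2f} bits")
print(f"frequent pair: {frequent:.2f} bits")
```

When two words always occur together, PMI reduces to -log2 p(x, y), so the rarer the pair, the higher its score. In this sketch the pair observed a single time therefore outranks the pair supported by thousands of co-occurrences, which is the kind of behavior the PMI variants discussed in the paper aim to correct.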



Paper Citation


in Harvard Style

Role F. and Nadif M. (2011). HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY - A Case Study of Pointwise Mutual Information. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 218-223. DOI: 10.5220/0003655102260231


in Bibtex Style

@conference{kdir11,
author={François Role and Mohamed Nadif},
title={HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY - A Case Study of Pointwise Mutual Information},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={218-223},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003655102260231},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY - A Case Study of Pointwise Mutual Information
SN - 978-989-8425-79-9
AU - Role F.
AU - Nadif M.
PY - 2011
SP - 218
EP - 223
DO - 10.5220/0003655102260231