STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES

Antoine Doucet, Helena Ahonen-Myka

Abstract

In this paper, we review statistical techniques for the direct evaluation of descriptive phrases and introduce a new technique based on mutual information. In the experiments, we apply this technique to different types of frequent sequences, thereby finding a mathematical justification for former empirical practice.


Paper Citation


in Harvard Style

Doucet A. and Ahonen-Myka H. (2010). STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 141-149. DOI: 10.5220/0003054801410149


in Bibtex Style

@conference{kdir10,
author={Antoine Doucet and Helena Ahonen-Myka},
title={STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={141-149},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003054801410149},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES
SN - 978-989-8425-28-7
AU - Doucet A.
AU - Ahonen-Myka H.
PY - 2010
SP - 141
EP - 149
DO - 10.5220/0003054801410149