STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES
Antoine Doucet, Helena Ahonen-Myka
2010
Abstract
In this paper, we review statistical techniques for the direct evaluation of descriptive phrases and introduce a new technique based on mutual information. In the experiments, we apply this technique to different types of frequent sequences, thereby finding a mathematical justification for former empirical practice.
Paper Citation
in Harvard Style
Doucet A. and Ahonen-Myka H. (2010). STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 141-149. DOI: 10.5220/0003054801410149
in Bibtex Style
@conference{kdir10,
author={Antoine Doucet and Helena Ahonen-Myka},
title={STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={141-149},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003054801410149},
isbn={978-989-8425-28-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES
SN - 978-989-8425-28-7
AU - Doucet A.
AU - Ahonen-Myka H.
PY - 2010
SP - 141
EP - 149
DO - 10.5220/0003054801410149