AN EXTENSIVE COMPARISON OF METRICS FOR AUTOMATIC EXTRACTION OF KEY TERMS

Luis F. S. Teixeira, Gabriel P. Lopes, Rita A. Ribeiro

Abstract

In this paper we compare twenty language-independent, statistically based metrics for key term extraction from any document collection. While some of these metrics are widely used, others were created recently. Two different document representations are considered in our experiments. One is based on words and multi-words; the other is based on fixed-length word prefixes (5 characters in the experiments reported here) to handle morphologically rich languages, namely Portuguese and Czech. English is also included in the experiments as a non-morphologically rich language. Results are evaluated manually, and agreement between evaluators is assessed using k-Statistics. The metrics based on Tf-Idf and Phi-square proved to have higher precision and recall. The prefix-based representation of documents enabled a significant improvement for documents written in Portuguese.
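
As a rough illustration of the prefix-based representation mentioned in the abstract, the following minimal Python sketch truncates each word to its first 5 characters and then ranks the resulting prefixes with a plain Tf-Idf weighting. The whitespace tokenization, the function names `to_prefixes` and `tf_idf`, and the specific tf * log(N/df) variant are assumptions made for this example; they are not taken from the paper, whose twenty metrics are not reproduced here.

```python
import math
from collections import Counter

PREFIX_LEN = 5  # prefix length reported in the abstract


def to_prefixes(text, n=PREFIX_LEN):
    """Lowercase, split on whitespace, keep the first n characters of each word.

    The whitespace tokenizer is an assumption for this sketch.
    """
    return [word[:n] for word in text.lower().split()]


def tf_idf(docs):
    """Return one {prefix: score} dict per document.

    Uses relative term frequency times log(N / df); this is one common
    Tf-Idf variant, chosen here for illustration only.
    """
    tokenized = [to_prefixes(doc) for doc in docs]
    # Document frequency: number of documents containing each prefix.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores
```

With this representation, inflected forms such as "extraction" and "extracting" collapse to the same prefix "extra", so their counts are pooled before any metric is computed.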


Paper Citation


in Harvard Style

Teixeira, L. F. S., Lopes, G. P. and Ribeiro, R. A. (2012). AN EXTENSIVE COMPARISON OF METRICS FOR AUTOMATIC EXTRACTION OF KEY TERMS. In Proceedings of the 4th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8425-95-9, pages 55-63. DOI: 10.5220/0003720400550063


in Bibtex Style

@conference{icaart12,
author={Luis F. S. Teixeira and Gabriel P. Lopes and Rita A. Ribeiro},
title={AN EXTENSIVE COMPARISON OF METRICS FOR AUTOMATIC EXTRACTION OF KEY TERMS},
booktitle={Proceedings of the 4th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2012},
pages={55-63},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003720400550063},
isbn={978-989-8425-95-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - AN EXTENSIVE COMPARISON OF METRICS FOR AUTOMATIC EXTRACTION OF KEY TERMS
SN - 978-989-8425-95-9
AU - Teixeira, L. F. S.
AU - Lopes, G. P.
AU - Ribeiro, R. A.
PY - 2012
SP - 55
EP - 63
DO - 10.5220/0003720400550063