A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language

Dante Degl'Innocenti, Dario De Nart, Carlo Tasso

Abstract

Associating meaningful keyphrases with text documents and Web pages can significantly increase the accuracy of Information Retrieval, Personalization and Recommender systems, but the growing amount of text data available is too large for extensive manual annotation. Automatic keyphrase generation can significantly support this activity. Several systems proposed in the literature already perform this task with satisfactory results; however, most of them focus solely on the English language, which accounts for more than 50% of Web content. Only a few other languages have been investigated, and Italian, despite being the ninth most used language on the Web, is not among them. To overcome this shortage, we propose a novel multi-language, unsupervised, knowledge-based approach to keyphrase generation. To support our claims, we developed DIKpE-G, a prototype system that integrates several kinds of knowledge for selecting and evaluating meaningful keyphrases, ranging from linguistic to statistical, meta/structural, social, and ontological knowledge. DIKpE-G performs well on English and Italian texts.


Paper Citation


in Harvard Style

Degl'Innocenti D., De Nart D. and Tasso C. (2014). A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 78-85. DOI: 10.5220/0005077100780085


in Bibtex Style

@conference{kdir14,
author={Dante Degl'Innocenti and Dario De Nart and Carlo Tasso},
title={A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={78-85},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005077100780085},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language
SN - 978-989-758-048-2
AU - Degl'Innocenti D.
AU - De Nart D.
AU - Tasso C.
PY - 2014
SP - 78
EP - 85
DO - 10.5220/0005077100780085