Enrichment of Inflection Dictionaries: Automatic Extraction of Semantic Labels from Encyclopedic Definitions

Pawel Chrzaszcz

Abstract

Inflection dictionaries are widely used in natural language processing, especially for inflecting languages. However, they lack semantic information that could increase the accuracy of such processing. This paper describes a method for extracting semantic labels from encyclopedic entries. Adding such labels to an inflection dictionary could eliminate the need for ontologies and similar complex semantic structures in many typical tasks. A semantic label is a single word or a sequence of words that describes the meaning of a headword, so it resembles a semantic category; however, no taxonomy of such categories is known before extraction. Encyclopedic articles consist of headwords and their definitions, so the definitions serve as the sources of semantic labels. The described algorithm has been implemented to extract data from the Polish Wikipedia. It is based on definition structure analysis, heuristic methods, and word form recognition and processing using the Polish Inflection Dictionary. The paper describes the method, presents test results, and discusses possible further development.
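
To make the idea of a semantic label concrete, the following is a minimal sketch of taking the leading phrase of a definition as a candidate label. It is not the algorithm from the paper: the function name extract_label, the stop-word list, and the English examples are all assumptions made for illustration, and a simple stop list stands in for the word form recognition that the paper performs with the Polish Inflection Dictionary.

```python
import re
from typing import Optional

# Hypothetical boundary words that end a label; not taken from the paper.
STOP_WORDS = {"that", "which", "who", "of", "in", "on", "for", "with", "from"}
ARTICLES = {"a", "an", "the"}


def extract_label(headword: str, definition: str) -> Optional[str]:
    """Return the leading phrase of a definition as a candidate semantic label."""
    text = definition.strip()
    # Drop the headword if the definition repeats it at the start.
    if text.lower().startswith(headword.lower()):
        text = text[len(headword):]
    # Drop a leading separator or copula such as "-", ":" or "is".
    text = re.sub(r"^\s*(?:[-–:]|is|are)\s+", "", text, flags=re.IGNORECASE)

    words = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        lowered = token.lower()
        if not words and lowered in ARTICLES:
            continue  # skip a leading article
        if not token[0].isalnum() or lowered in STOP_WORDS:
            break  # punctuation or a boundary word ends the label
        words.append(lowered)
    return " ".join(words) or None


print(extract_label("Kraków", "Kraków is a city in southern Poland."))
print(extract_label("Violin", "Violin – a string instrument, usually with four strings"))
```

In this toy version the label is cut at the first punctuation mark or boundary word, so a label can be a single word or a short sequence of words, as in the definition above. A real implementation for Polish would need morphological analysis to find where the label ends and to bring it to its base form.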



Paper Citation


in Harvard Style

Chrzaszcz P. (2012). Enrichment of Inflection Dictionaries: Automatic Extraction of Semantic Labels from Encyclopedic Definitions. In Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2012) ISBN 978-989-8565-16-7, pages 106-119. DOI: 10.5220/0004100501060119


in Bibtex Style

@conference{nlpcs12,
author={Pawel Chrzaszcz},
title={Enrichment of Inflection Dictionaries: Automatic Extraction of Semantic Labels from Encyclopedic Definitions},
booktitle={Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2012)},
year={2012},
pages={106-119},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004100501060119},
isbn={978-989-8565-16-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2012)
TI - Enrichment of Inflection Dictionaries: Automatic Extraction of Semantic Labels from Encyclopedic Definitions
SN - 978-989-8565-16-7
AU - Chrzaszcz P.
PY - 2012
SP - 106
EP - 119
DO - 10.5220/0004100501060119