A Generic Open World Named Entity Disambiguation Approach for Tweets

Mena B. Morgan, Maurice van Keulen

2013

Abstract

Social media is a rich source of information. To make use of this information, it is sometimes required to extract and disambiguate named entities. In this paper, we focus on named entity disambiguation (NED) in Twitter messages. NED in tweets is challenging in two ways. First, the limited length of a tweet makes it hard to obtain enough context, on which many disambiguation techniques depend. Second, many named entities in tweets do not exist in a knowledge base (KB). We combine ideas from information retrieval (IR) and NED to propose solutions for both challenges. For the first problem, we make use of the gregarious nature of tweets to get the context needed for disambiguation. For the second problem, we look for an alternative home page if no Wikipedia page represents the entity. Given a mention, we obtain a list of Wikipedia candidates from the YAGO KB in addition to the top-ranked pages from the Google search engine. We use a Support Vector Machine (SVM) to rank the candidate pages and find the best representative entities. Experiments conducted on two data sets show better disambiguation results compared with the baselines and a competitor.
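
The candidate-ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed linear scorer stands in for a trained SVM ranker, the two features (title similarity and context overlap) and their weights are hypothetical, and the candidate lists are hand-written dictionaries in place of real YAGO lookups or Google search results.

```python
from typing import Dict, List, Sequence

Candidate = Dict[str, str]  # hypothetical shape: {"url", "title", "text"}


def bigrams(text: str) -> set:
    """Set of lowercase character bigrams of a string."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def merge_candidates(kb: List[Candidate], search: List[Candidate]) -> List[Candidate]:
    """Union of KB and search-engine candidates, de-duplicated by URL."""
    seen, merged = set(), []
    for cand in kb + search:
        if cand["url"] not in seen:
            seen.add(cand["url"])
            merged.append(cand)
    return merged


def features(mention: str, context: str, cand: Candidate) -> List[float]:
    """Two illustrative features; the paper's actual feature set is not shown here."""
    ctx = set(context.lower().split())
    page = set(cand["text"].lower().split())
    overlap = len(ctx & page) / len(ctx) if ctx else 0.0
    return [dice(mention, cand["title"]), overlap]


def rank(mention: str, context: str, candidates: List[Candidate],
         weights: Sequence[float] = (1.0, 1.0)) -> List[str]:
    """Order candidate URLs by a linear score (a stand-in for an SVM decision value)."""
    scored = [
        (sum(w * f for w, f in zip(weights, features(mention, context, c))), c["url"])
        for c in candidates
    ]
    return [url for _, url in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    kb = [{"url": "wiki/Apple_Inc.", "title": "Apple Inc.", "text": "apple iphone maker"}]
    web = [{"url": "https://www.apple.com/", "title": "Apple", "text": "iphone launch store"}]
    print(rank("Apple", "new iphone launch", merge_candidates(kb, web)))
```

In a real system the weights would be learned from labeled mention-page pairs rather than fixed by hand. Keeping non-Wikipedia pages in the same candidate pool is what allows an entity's home page to win when no KB entry fits.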


Paper Citation


in Harvard Style

Morgan, M. B. and van Keulen, M. (2013). A Generic Open World Named Entity Disambiguation Approach for Tweets. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SNAM (IC3K 2013), ISBN 978-989-8565-75-4, pages 267-276. DOI: 10.5220/0004536302670276


in Bibtex Style

@conference{snam13,
author={Mena B. Morgan and Maurice van Keulen},
title={A Generic Open World Named Entity Disambiguation Approach for Tweets},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SNAM, (IC3K 2013)},
year={2013},
pages={267-276},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004536302670276},
isbn={978-989-8565-75-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SNAM, (IC3K 2013)
TI - A Generic Open World Named Entity Disambiguation Approach for Tweets
SN - 978-989-8565-75-4
AU - Morgan, M. B.
AU - van Keulen, M.
PY - 2013
SP - 267
EP - 276
DO - 10.5220/0004536302670276