Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction

Mena B. Habib, Maurice van Keulen

2012

Abstract

Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. This paper addresses two problems with toponym extraction and disambiguation (toponyms serving as a representative example of named entities). First, almost no existing work examines the interdependency between extraction and disambiguation. Second, existing disambiguation techniques mostly take extracted named entities as input without considering the uncertainty and imperfection of the extraction process. The aim of this paper is to investigate both avenues and to show that explicit handling of annotation uncertainty has much potential for making both extraction and disambiguation more robust. We conducted experiments on a set of holiday home descriptions with the aim of extracting and disambiguating toponyms. We show that the extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation. Reciprocally, retraining the extraction models with information automatically derived from the disambiguation results improves the extraction models. This mutual reinforcement is shown to still have an effect after several automatic iterations.
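The reinforcement loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the gazetteer, the confidence-weighted country vote, the fixed update step, and all function names (`disambiguate`, `reinforce`, `iterate`) are assumptions made purely for illustration. In particular, the paper retrains extraction models, whereas this sketch simply nudges per-candidate confidence values.

```python
from collections import defaultdict

# Hypothetical toy gazetteer: surface form -> countries with a place of that name.
GAZETTEER = {
    "Paris": ["France", "United States"],
    "Lyon": ["France"],
    "Nice": ["France"],
    "Victoria": ["Canada", "Australia"],
}


def disambiguate(candidates):
    """Pick the country with the highest confidence-weighted vote."""
    votes = defaultdict(float)
    for name, confidence in candidates.items():
        for country in GAZETTEER.get(name, []):
            votes[country] += confidence
    return max(votes, key=votes.get) if votes else None


def reinforce(candidates, country, step=0.2):
    """Raise confidence of candidates consistent with the chosen country,
    lower it for the rest (a stand-in for retraining the extractor)."""
    updated = {}
    for name, confidence in candidates.items():
        consistent = country in GAZETTEER.get(name, [])
        delta = step if consistent else -step
        updated[name] = min(1.0, max(0.0, confidence + delta))
    return updated


def iterate(candidates, rounds=3):
    """Alternate disambiguation and confidence updates for a few rounds."""
    country = None
    for _ in range(rounds):
        country = disambiguate(candidates)
        candidates = reinforce(candidates, country)
    return country, candidates


# Assumed extraction output: candidate toponym -> extraction confidence.
extracted = {"Paris": 0.6, "Lyon": 0.5, "Nice": 0.3, "Victoria": 0.4}
country, final = iterate(extracted)
```

In this sketch, uncertain candidates that agree with the document-level decision (such as "Nice", which is easily confused with the adjective) gain confidence over the rounds, while candidates that disagree lose it.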


Paper Citation


in Harvard Style

Habib, M. B. and van Keulen, M. (2012). Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012), ISBN 978-989-8565-29-7, pages 399-410. DOI: 10.5220/0004174903990410


in Bibtex Style

@conference{sstm12,
author={Mena B. Habib and Maurice van Keulen},
title={Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)},
year={2012},
pages={399-410},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004174903990410},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)
TI - Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction
SN - 978-989-8565-29-7
AU - Habib, M. B.
AU - van Keulen, M.
PY - 2012
SP - 399
EP - 410
DO - 10.5220/0004174903990410