Explaining Unintelligible Words by Means of their Context

Balázs Pintér, Gyula Vörös, Zoltán Szabó, András Lőrincz

Abstract

Explaining unintelligible words is a practical problem for text obtained by optical character recognition or taken from the Web (e.g., because of misspellings). Wikification, the enrichment of text by linking words to Wikipedia articles, could help solve this problem. However, existing wikification methods assume that the text is correct, so they cannot wikify erroneous text. Errors turn disambiguation (identifying the appropriate article to link to) into a large-scale problem: because the word to be disambiguated is unknown, the target article has to be selected from among hundreds, possibly thousands, of candidate articles. Existing approaches for the case where the word is known build upon the distributional hypothesis: words that occur in the same contexts tend to have similar meanings. The larger number of candidate articles aggravates the difficulty of spuriously similar contexts, where two contexts are similar but belong to different articles. We propose a method to overcome this difficulty by combining the distributional hypothesis with structured sparsity, a rapidly expanding area of research. Empirically, our approach based on structured sparsity compares favorably to various traditional classification methods.
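
The abstract does not spell out the formulation, so the following is only a minimal sketch of one way the two ideas could be combined, not the authors' implementation. It assumes that each candidate article contributes a group of context vectors (the columns of a dictionary D), that the context x of the unintelligible word is coded by minimizing 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2 (a group lasso penalty, one common form of structured sparsity), and that the article whose group carries the most weight is the answer. All names (group_lasso_code, predict_article, the toy data, lam) are hypothetical.

```python
import numpy as np


def group_soft_threshold(v, threshold):
    """Proximal operator of threshold * ||v||_2: shrink a whole block at once."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v


def group_lasso_code(D, x, groups, lam=0.1, n_iter=500):
    """Approximately minimize 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2.

    Uses plain proximal gradient descent (ISTA); `groups[j]` is the
    (assumed) article label of dictionary column j.
    """
    a = np.zeros(D.shape[1])
    # Step size 1/L, with L the squared spectral norm of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    labels = np.unique(groups)
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))  # gradient step on the smooth part
        for g in labels:
            idx = groups == g
            a[idx] = group_soft_threshold(z[idx], step * lam)
    return a


def predict_article(D, x, groups, lam=0.1):
    """Return the label of the group with the largest coefficient norm."""
    a = group_lasso_code(D, x, groups, lam)
    labels = np.unique(groups)
    energy = [np.linalg.norm(a[groups == g]) for g in labels]
    return labels[int(np.argmax(energy))]


# Toy data (made up for illustration): six-word vocabulary, three candidate
# articles, two example contexts per article stored as dictionary columns.
D = np.array([
    [1, 2, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 2, 0, 0],
    [0, 0, 0, 0, 1, 2],
    [0, 0, 0, 0, 1, 1],
], dtype=float)
D /= np.linalg.norm(D, axis=0)          # unit-length context vectors
groups = np.array([0, 0, 1, 1, 2, 2])   # article label of each column

# Context of the unknown word: mostly article 1's vocabulary plus a little noise.
x = np.array([0.1, 0.0, 1.0, 1.0, 0.0, 0.0])
x /= np.linalg.norm(x)

predicted = predict_article(D, x, groups)
print("predicted article:", predicted)
```

The group penalty switches candidate articles on or off as a whole, which is the intuition behind using structure to suppress a single spuriously similar context: one stray context has to justify activating its entire group. Real context vectors, group definitions, and the value of lam would have to come from the paper itself.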


Paper Citation


in Harvard Style

Pintér B., Vörös G., Szabó Z. and Lőrincz A. (2013). Explaining Unintelligible Words by Means of their Context. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8565-41-9, pages 382-387. DOI: 10.5220/0004267003820387


in Bibtex Style

@conference{icpram13,
author={Balázs Pintér and Gyula Vörös and Zoltán Szabó and András Lőrincz},
title={Explaining Unintelligible Words by Means of their Context},
booktitle={Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2013},
pages={382-387},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004267003820387},
isbn={978-989-8565-41-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Explaining Unintelligible Words by Means of their Context
SN - 978-989-8565-41-9
AU - Pintér B.
AU - Vörös G.
AU - Szabó Z.
AU - Lőrincz A.
PY - 2013
SP - 382
EP - 387
DO - 10.5220/0004267003820387