The Computation of Semantically Related Words: Thesaurus Generation for English, German, and Russian

Reinhard Rapp

2007

Abstract

A method for the automatic extraction of semantically similar words is presented, based on the analysis of word distributions in large monolingual text corpora. It involves compiling matrices of word co-occurrences and reducing the dimensionality of the semantic space by means of a singular value decomposition. In this way, problems of data sparseness are reduced and a generalization effect is achieved which considerably improves the results. The method is largely language-independent and has been applied to corpora of English, German, and Russian; the resulting thesauri are freely available. The English thesaurus has been evaluated by comparing it with experimental results obtained from test persons who were asked to judge word similarities. According to this evaluation, the machine-generated results come close to native speakers' performance.
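
The pipeline outlined in the abstract (co-occurrence counting, dimensionality reduction by singular value decomposition, then comparison of the reduced word vectors) can be illustrated with a minimal sketch. It is not the paper's implementation. The toy corpus, the window size of 2, the use of raw counts without any association weighting, the choice of k = 2 dimensions, cosine similarity as the comparison measure, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Count how often two words occur within `window` tokens of each other.

    The window size is an illustrative assumption, not a value from the paper.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts


def reduce_dimensions(counts, k):
    """Project each word onto the k largest singular dimensions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def most_similar(word, vocab, vectors, n=3):
    """Rank the other words by cosine similarity to `word` (assumed measure)."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:n]


# Hypothetical toy corpus; real experiments require large corpora.
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the cat".split(),
]
vocab, counts = cooccurrence_matrix(sentences)
vectors = reduce_dimensions(counts, k=2)
neighbours = most_similar("cat", vocab, vectors)
print(neighbours)
```

On a corpus this small the ranking carries little meaning; the sketch only shows how the three steps fit together. In the reduced space, words can be similar even when they rarely share exactly the same context words, which is the kind of generalization effect the abstract refers to.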


Paper Citation


in Harvard Style

Rapp R. (2007). The Computation of Semantically Related Words: Thesaurus Generation for English, German, and Russian. In Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2007) ISBN 978-972-8865-97-9, pages 71-80. DOI: 10.5220/0002414500710080


in Bibtex Style

@conference{nlpcs07,
author={Reinhard Rapp},
title={The Computation of Semantically Related Words: Thesaurus Generation for English, German, and Russian},
booktitle={Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2007)},
year={2007},
pages={71-80},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002414500710080},
isbn={978-972-8865-97-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2007)
TI - The Computation of Semantically Related Words: Thesaurus Generation for English, German, and Russian
SN - 978-972-8865-97-9
AU - Rapp R.
PY - 2007
SP - 71
EP - 80
DO - 10.5220/0002414500710080