A Multi-Layer System for Semantic Textual Similarity

Ngoc Phuoc An Vo, Octavian Popescu

Abstract

Building a system that can cope with the various phenomena that fall under the umbrella of semantic similarity is far from trivial. In almost all cases, a system's performance does not vary consistently or predictably from one corpus to another. We analyzed the source of this variance and found that it is related to the distribution of word-pair similarity among the topics in the various corpora. We then used this insight to construct a four-module system that takes into account not only string and semantic word similarity, but also word alignment and sentence structure. The system consistently achieves accuracy that is very close to the state of the art or sets a new state of the art. It is based on a multi-layer architecture and can deal with heterogeneous corpora that may not have been generated by the same distribution.



Paper Citation


in Harvard Style

Vo N. and Popescu O. (2016). A Multi-Layer System for Semantic Textual Similarity. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016), ISBN 978-989-758-203-5, pages 56-67. DOI: 10.5220/0006045800560067


in Bibtex Style

@conference{kdir16,
author={Ngoc Phuoc An Vo and Octavian Popescu},
title={A Multi-Layer System for Semantic Textual Similarity},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={56-67},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006045800560067},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - A Multi-Layer System for Semantic Textual Similarity
SN - 978-989-758-203-5
AU - Vo N.
AU - Popescu O.
PY - 2016
SP - 56
EP - 67
DO - 10.5220/0006045800560067