Performance Evaluation of Similarity Measures on Similar and Dissimilar Text Retrieval

Victor U. Thompson, Christo Panchev, Michael Oakes

Abstract

Many Information Retrieval (IR) and Natural Language Processing (NLP) systems depend on measuring textual similarity, and they do so with the help of similarity measures. Similarity measures behave differently: measures that work well on highly similar texts do not always work as well on highly dissimilar texts. In this paper, we evaluated the performance of eight popular similarity measures on four levels (degrees) of textual similarity using a corpus of plagiarised texts. The evaluation was carried out in the context of candidate selection for plagiarism detection. Performance was measured in terms of recall, and the best-performing similarity measure(s) for each degree of textual similarity were identified. Results from our experiments show that most of the measures performed equally well on highly similar texts, with the exception of Euclidean distance and Jensen-Shannon divergence, which performed worse. Cosine similarity and the Bhattacharyya coefficient performed best on lightly reviewed texts; on heavily reviewed texts, Cosine similarity performed best and Pearson Correlation second best. Pearson Correlation had the best performance on highly dissimilar texts. The results also show which term weighting methods and n-gram document representations best optimise the performance of each similarity measure at a particular level of intertextual similarity.
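As a rough illustration of the five measures named in the abstract, the sketch below computes them on raw word n-gram counts for a pair of texts. It is not the authors' implementation: the function names, the whitespace tokenisation, the use of unweighted counts, and the base-2 logarithm in the Jensen-Shannon divergence are all assumptions made for the example.

```python
import math
from collections import Counter


def ngram_counts(text, n=1):
    """Count word n-grams in lowercased, whitespace-tokenised text (assumed tokenisation)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def to_vectors(a, b):
    """Align two count dictionaries on a shared, sorted vocabulary."""
    vocab = sorted(set(a) | set(b))
    return [a.get(t, 0) for t in vocab], [b.get(t, 0) for t in vocab]


def normalise(x):
    """Scale counts so they sum to 1, giving a probability distribution."""
    total = sum(x)
    return [v / total for v in x] if total else list(x)


def cosine(x, y):
    """Cosine of the angle between the two vectors (1 = same direction)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def euclidean(x, y):
    """Straight-line distance between the two vectors (0 = identical)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between the two vectors; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy) if sx and sy else 0.0


def bhattacharyya(x, y):
    """Bhattacharyya coefficient of the normalised vectors (1 = identical)."""
    p, q = normalise(x), normalise(y)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def jensen_shannon(x, y):
    """Jensen-Shannon divergence of the normalised vectors, base-2 logs (0 = identical)."""
    p, q = normalise(x), normalise(y)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i = 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    source = "the cat sat on the mat"
    rewrite = "the cat sat on a mat"
    x, y = to_vectors(ngram_counts(source), ngram_counts(rewrite))
    for measure in (cosine, euclidean, pearson, bhattacharyya, jensen_shannon):
        print(measure.__name__, round(measure(x, y), 4))
```

Note that Euclidean distance and Jensen-Shannon divergence are dissimilarities (lower means more alike), while the other three are similarities, so rankings for candidate selection would need to sort them in opposite directions. Passing `n=2` or higher to `ngram_counts` switches to longer n-gram representations, and a term weighting scheme could be applied to the counts before the vectors are compared.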

Paper Citation


in Harvard Style

Thompson V., Panchev C. and Oakes M. (2015). Performance Evaluation of Similarity Measures on Similar and Dissimilar Text Retrieval. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015) ISBN 978-989-758-158-8, pages 577-584. DOI: 10.5220/0005619105770584


in Bibtex Style

@conference{sstm15,
author={Victor U. Thompson and Christo Panchev and Michael Oakes},
title={Performance Evaluation of Similarity Measures on Similar and Dissimilar Text Retrieval},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015)},
year={2015},
pages={577-584},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005619105770584},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015)
TI - Performance Evaluation of Similarity Measures on Similar and Dissimilar Text Retrieval
SN - 978-989-758-158-8
AU - Thompson V.
AU - Panchev C.
AU - Oakes M.
PY - 2015
SP - 577
EP - 584
DO - 10.5220/0005619105770584