Applying Information-theoretic and Edit Distance Approaches to Flexibly Measure Lexical Similarity

Thi Thuy Anh Nguyen, Stefan Conrad

2014

Abstract

Measurement of similarity plays an important role in data mining and information retrieval. Several techniques for calculating the similarities between objects have been proposed so far, for example, lexical-based, structure-based and instance-based measures. Existing lexical similarity measures are usually based on either n-grams or Dice’s approach to obtain correspondences between strings. Although these measures are efficient, they are inadequate when strings are quite similar or when the strings contain the same set of characters in different positions. In this paper, a lexical similarity approach that combines an information-theoretic model with edit distance is developed to determine correspondences among concept labels. Precision, Recall and F-measure, together with a subset of the OAEI 2008 benchmark tests, are used to evaluate the proposed method. The results show that our approach is flexible and has some prominent features compared to other lexical-based methods.
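
The limitation noted in the abstract can be illustrated with a small sketch. It is not the measure proposed in the paper: the function names `dice_char_set`, `levenshtein` and `edit_similarity` are hypothetical, and the normalisation by the longer string's length is an assumption chosen only for illustration. It shows that a Dice coefficient over character sets cannot tell two anagrams apart, whereas an edit distance is sensitive to character positions.

```python
def dice_char_set(a: str, b: str) -> float:
    """Dice coefficient over the sets of characters of two strings."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance mapped to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Same character set, different positions.
    print(dice_char_set("listen", "silent"))
    print(edit_similarity("listen", "silent"))
```

For the anagram pair the set-based score reports a perfect match, while the edit-based score drops well below 1. This is the kind of case in which a position-aware component adds information.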

References

  1. Algergawy, A., Schallehn, E., and Saake, G. (2008). A sequence-based ontology matching approach. In The 10th International Conference on Information Integration and Web-based Applications & Services, pages 131-136. ACM.
  2. Batet, M., Sánchez, D., and Valls, A. (2011). An ontology-based measure to compute semantic similarity in biomedicine. Journal of Biomedical Informatics, 44(1):118-125.
  3. Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3):297-302.
  4. Euzenat, J. and Shvaiko, P. (2013). Ontology Matching. Springer, 2nd edition.
  5. Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147-160.
  6. Ichise, R. (2008). Machine learning approach for ontology mapping using multiple concept similarity measures. In The 7th IEEE/ACIS International Conference on Computer and Information Science, pages 340-346. IEEE.
  7. Jaccard, P. (1912). The distribution of the flora in the alpine zone. The New Phytologist, 11(2):37-50.
  8. Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420.
  9. Kondrak, G. (2005). N-gram similarity and distance. In The 12th International Conference on String Processing and Information Retrieval, pages 115-126. Springer.
  10. Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707-710.
  11. Lin, D. (1998). An information-theoretic definition of similarity. In The 15th International Conference on Machine Learning, pages 296-304. Morgan Kaufmann.
  12. Maedche, A. and Staab, S. (2002). Measuring similarity between ontologies. In The 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pages 251-263. Springer-Verlag.
  13. Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453.
  14. Nguyen, T. T. A. and Conrad, S. (2013). Combination of lexical and structure-based similarity measures to match ontologies automatically. In Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 415, pages 101-112. Springer.
  15. Pirró, G. and Euzenat, J. (2010). A feature and information theoretic framework for semantic similarity and relatedness. In The 9th International Semantic Web Conference on The Semantic Web, pages 615-630. Springer-Verlag.
  16. Pirró, G. and Seco, N. (2008). Design, implementation and evaluation of a new semantic similarity metric combining features and intrinsic information content. In On the Move to Meaningful Internet Systems: OTM 2008, pages 1271-1288. Springer.
  17. Sánchez, D., Batet, M., Isern, D., and Valls, A. (2012). Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications, 39(9):7718-7728.
  18. Tversky, A. (1977). Features of similarity. Psychological Review, 84(4):327-352.
  19. Wang, X., Ding, Y., and Zhao, Y. (2006). Similarity measurement about ontology-based semantic web services. In The Workshop on Semantics for Web Services.
  20. Winkler, W. E. (1990). String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In The Section on Survey Research, pages 354-359.


Paper Citation


in Harvard Style

Nguyen T. and Conrad S. (2014). Applying Information-theoretic and Edit Distance Approaches to Flexibly Measure Lexical Similarity. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014) ISBN 978-989-758-048-2, pages 505-511. DOI: 10.5220/0005170005050511


in Bibtex Style

@conference{sstm14,
author={Thi Thuy Anh Nguyen and Stefan Conrad},
title={Applying Information-theoretic and Edit Distance Approaches to Flexibly Measure Lexical Similarity},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014)},
year={2014},
pages={505-511},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005170005050511},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014)
TI - Applying Information-theoretic and Edit Distance Approaches to Flexibly Measure Lexical Similarity
SN - 978-989-758-048-2
AU - Nguyen T.
AU - Conrad S.
PY - 2014
SP - 505
EP - 511
DO - 10.5220/0005170005050511