Approximate String Matching Techniques

Taoxin Peng, Calum Mackay

2014

Abstract

Data quality is key to the success of any business that relies on information applications, such as data integration for data warehouses, text and web mining, information retrieval, and web search engines. In such applications, string matching is a common task. A number of approximate string matching techniques are available. However, one question remains unanswered: for a given dataset, how should an appropriate technique, and the threshold value it requires, be selected for string matching? To address this question, this paper analyses and evaluates a set of popular token-based string matching techniques on several carefully designed datasets. A thorough experimental comparison confirms that there is no clear overall best technique, although some techniques do perform significantly better in some cases. Suggestions are presented that researchers and practitioners can use as guidance when selecting a string matching technique and a corresponding threshold value for a given dataset.
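As an illustration only, the interplay between a token-based similarity measure and its threshold can be sketched as follows. The abstract does not name the techniques evaluated, so Jaccard similarity over whitespace-separated word tokens is an assumed example, and the function names and the default threshold of 0.5 are hypothetical.

```python
def tokenize(text):
    """Lower-case the string and split on whitespace into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        # Two empty strings are treated as identical (an assumed convention).
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_match(a, b, threshold=0.5):
    """Declare a match when the similarity reaches the chosen threshold."""
    return jaccard(a, b) >= threshold


if __name__ == "__main__":
    # Token order does not matter, but the extra token "A" lowers the score.
    print(jaccard("John A Smith", "Smith John"))
    print(is_match("John A Smith", "Smith John", threshold=0.5))
    print(is_match("John A Smith", "Smith John", threshold=0.7))
```

In this sketch the same pair of strings is accepted at a threshold of 0.5 and rejected at 0.7, which is why the threshold has to be chosen together with the technique for a given dataset.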

Paper Citation


in Harvard Style

Peng T. and Mackay C. (2014). Approximate String Matching Techniques. In Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-027-7, pages 217-224. DOI: 10.5220/0004892802170224


in Bibtex Style

@conference{iceis14,
author={Taoxin Peng and Calum Mackay},
title={Approximate String Matching Techniques},
booktitle={Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2014},
pages={217-224},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004892802170224},
isbn={978-989-758-027-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Approximate String Matching Techniques
SN - 978-989-758-027-7
AU - Peng T.
AU - Mackay C.
PY - 2014
SP - 217
EP - 224
DO - 10.5220/0004892802170224