
techniques. A comprehensive comparison of the 
twelve techniques was carried out through a series of 
experiments on 63 carefully designed datasets with 
different characteristics, such as the rate of errors, the 
type of error, the number, the length of tokens in a 
string, and the size of a dataset. The comparison 
results confirm the claim that there is no clear 
best technique. All of the characteristics considered, 
except the size of a dataset, have a significant effect 
on the performance of these techniques. In general, 
HMM and BM25 perform better than the others, 
especially on smaller datasets, but take 
much more time. Cosine TF-IDF and TF-IDF 
perform better on larger datasets with a higher error 
rate. Results also show that techniques that 
perform well on datasets containing mixed 
types of errors do not achieve a similar performance on 
datasets containing a single type of error. For 
example, BM25 did not perform well on datasets with 
a low error rate that contain only insertion 
errors. Similarly, HMM did not perform well on 
datasets with a low error rate that contain only 
deletion errors. The token length also has an effect on 
the performance. For example, some techniques, 
such as Affine Gap, WHIRL and SoftTFIDF, 
performed much better on tokens of medium length 
than on long tokens. 
Regarding the threshold value, the results show 
that the level of “dirtiness” in a dataset has a 
significant effect on threshold selection. In general, 
the higher the error rate in the dataset, the lower the 
threshold value required to achieve the 
maximum F1-score. 
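The relationship between error rate and threshold can be illustrated with a minimal sketch. It assumes that each candidate record pair already has a similarity score in [0, 1] from some matching technique and a ground-truth match label, and it sweeps a grid of candidate thresholds to find the one with the highest F1-score. The function names, the threshold grid and the two small score lists are hypothetical and chosen only for illustration; they are not data or code from the experiments reported here.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool],
                    threshold: float) -> float:
    """F1-score when every pair with score >= threshold is declared a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[bool],
                   candidates: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, F1) with the highest F1 over the candidate grid.

    Ties are broken in favour of the higher threshold.
    """
    best_t, best_f1 = candidates[0], -1.0
    for t in sorted(candidates):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 >= best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical similarity scores for six candidate pairs (illustration only).
# The first three pairs are true matches, the last three are non-matches.
labels: List[bool] = [True, True, True, False, False, False]
clean_scores = [0.95, 0.92, 0.90, 0.40, 0.35, 0.30]  # low error rate
dirty_scores = [0.75, 0.70, 0.62, 0.40, 0.35, 0.30]  # high error rate

grid = [i / 10 for i in range(1, 10)]  # candidate thresholds 0.1 .. 0.9

clean_t, clean_f1 = best_threshold(clean_scores, labels, grid)
dirty_t, dirty_f1 = best_threshold(dirty_scores, labels, grid)
print("clean:", clean_t, clean_f1)
print("dirty:", dirty_t, dirty_f1)
```

In this toy setting the errors lower the similarity scores of the true matches, so the threshold selected for the dirty list is lower than the one selected for the clean list. Such a sweep requires labelled pairs, which is why selecting a threshold on an unlabelled dataset remains an open question.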
This work suggests a number of further 
investigations, including: 1) running more experiments 
on datasets with additional characteristics, such as the 
number of tokens in strings; 2) further 
analysis to determine whether there is a 
method for selecting a threshold value for any of the 
matching techniques on a given dataset. 
ICEIS 2014 - 16th International Conference on Enterprise Information Systems