techniques. A comprehensive comparison of the
twelve techniques was carried out through a series of
experiments on 63 carefully designed datasets with
different characteristics, such as the error rate, the
type of error, the number and length of tokens in a
string, and the size of the dataset. The results
confirm that there is no clear best technique. All of
the characteristics considered, except the size of the
dataset, have a significant effect on the performance
of these techniques. In general,
HMM and BM25 perform better than the others,
especially on smaller datasets, but take
much more time. Cosine TF-IDF and TF-IDF perform
better on larger datasets with a higher error
rate. The results also show that techniques that
perform well on datasets containing mixed error
types do not achieve similar performance on
datasets containing a single error type. For
example, BM25 did not perform well on datasets with
a low error rate that contain only insertion
errors. Similarly, HMM did not perform well on
datasets with a low error rate that contain only
deletion errors. Token length also affects
performance. For example, some techniques,
such as Affine Gap, WHIRL and SoftTFIDF,
performed much better on medium-length tokens
than on long tokens.
Regarding the threshold value, the results show
that the level of “dirtiness” in a dataset has a
significant effect on threshold selection. In general,
the higher the error rate in the dataset, the lower the
threshold value required to achieve the
maximum F1-score.
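The relationship between threshold and F1-score can be illustrated
with a minimal sketch. It is not the experimental code used in this
work: the similarity function is a stand-in based on Python's
difflib rather than any of the twelve techniques compared, and the
names similarity, f1_at_threshold and best_threshold, as well as the
sample name pairs, are hypothetical. The sketch sweeps candidate
thresholds over a set of labelled string pairs and reports the one
that gives the highest F1-score.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: not one of the twelve techniques)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def f1_at_threshold(scored_pairs, threshold):
    """F1-score when every pair scoring >= threshold is declared a match.

    scored_pairs is a list of (score, is_true_match) tuples.
    """
    tp = sum(1 for s, m in scored_pairs if s >= threshold and m)
    fp = sum(1 for s, m in scored_pairs if s >= threshold and not m)
    fn = sum(1 for s, m in scored_pairs if s < threshold and m)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labelled_pairs, steps=100):
    """Sweep thresholds 0, 1/steps, ..., 1 and return (threshold, F1).

    On ties, the lowest threshold reaching the maximum F1 is returned.
    """
    scored = [(similarity(a, b), m) for a, b, m in labelled_pairs]
    candidates = [i / steps for i in range(steps + 1)]
    return max(((t, f1_at_threshold(scored, t)) for t in candidates),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical labelled pairs: (string 1, string 2, is a true match).
    pairs = [
        ("John Smith", "Jon Smith", True),
        ("John Smith", "Jhn Smth", True),
        ("Mary Jones", "Mary Jonse", True),
        ("John Smith", "Jane Smyth", False),
        ("Mary Jones", "Mark Johnson", False),
    ]
    threshold, f1 = best_threshold(pairs)
    print(f"best threshold = {threshold:.2f}, F1 = {f1:.3f}")
```

Running the same sweep on datasets generated with different error
rates would be one way to observe how the best threshold shifts as
a dataset becomes dirtier.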
This work suggests several directions for further
investigation, including: 1) running more experiments
on datasets with additional characteristics, such as
the number of tokens in a string; 2) carrying out
further analysis to determine whether there is a
method for selecting a threshold value for any of the
matching techniques on a given dataset.
ICEIS 2014 - 16th International Conference on Enterprise Information Systems