Authors:
Paulo Lima; Douglas Santana; Wellington Santos Martins and Leonardo Ribeiro
Affiliation:
Instituto de Informática (INF), Universidade Federal de Goiás (UFG), Goiânia, GO, Brazil
Keyword(s):
Data Cleaning, Integration, Deep Learning, Entity Matching, Experiments, Analysis.
Abstract:
Application data inevitably contains inconsistencies that may cause malfunctions in daily operations and compromise analytical results. A particular type of inconsistency is the presence of duplicates, i.e., multiple, non-identical representations of the same information. Entity matching (EM) refers to the problem of determining whether two data instances are duplicates. Two deep learning solutions, DeepMatcher and Ditto, have recently achieved state-of-the-art results in EM. However, neither solution considered duplicates with character-level variations, which are pervasive in real-world databases. This paper presents a comparative evaluation of DeepMatcher and Ditto on datasets from a diverse array of domains that contain such variations, as well as textual patterns that were previously ignored. The results showed that both solutions experienced a considerable drop in accuracy on these datasets, although Ditto was more robust than DeepMatcher.