Authors:
Leonardo Andrade Ribeiro ¹; Alfredo Cuzzocrea ²; Karen Aline Alves Bezerra ³ and Ben Hur Bahia do Nascimento ³
Affiliations:
¹ Universidade Federal de Goiás, Brazil; ² University of Trieste and ICAR-CNR, Italy; ³ Universidade Federal de Lavras, Brazil
Keyword(s):
Data Integration, Data Cleaning, Duplicate Identification, Set Similarity Joins, Clustering.
Related Ontology Subjects/Areas/Topics:
Coupling and Integrating Heterogeneous Data Sources; Databases and Information Systems Integration; Enterprise Information Systems; Performance Evaluation and Benchmarking; Query Languages and Query Processing
Abstract:
A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm to group together records that refer to the same entity. However, the clustering algorithm is used strictly as a post-processing step, which degrades overall performance and produces results only at the end of the whole process. In this paper, we propose SjClust, a framework to integrate similarity join and clustering into a single operation. Our approach smoothly accommodates a variety of cluster representation and merging strategies within set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
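To make the idea of interleaving the join with clustering concrete, the following is a minimal illustrative sketch, not the SjClust algorithm itself. It assumes whitespace tokenization, Jaccard similarity, a naive all-pairs join with no filtering optimizations, and transitive merging via union-find; the names `jaccard`, `UnionFind`, and `join_and_cluster` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class UnionFind:
    """Disjoint-set structure holding the current clusters."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Follow parent links to the root, halving the path on the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def join_and_cluster(records, threshold):
    """Naive set similarity join that merges clusters as pairs are found.

    Assumption: clusters are merged transitively; this is only one of
    many possible merging strategies.
    """
    sets = [frozenset(r.lower().split()) for r in records]
    uf = UnionFind(len(sets))
    for i, j in combinations(range(len(sets)), 2):
        # Records already in the same cluster need no further comparison.
        if uf.find(i) == uf.find(j):
            continue
        if jaccard(sets[i], sets[j]) >= threshold:
            uf.union(i, j)  # cluster membership is updated during the join
    clusters = {}
    for i in range(len(sets)):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


records = [
    "john smith new york",
    "john smith york",
    "jon smith new york",
    "alice brown boston",
]
print(join_and_cluster(records, 0.5))
```

Because cluster membership is updated as soon as a similar pair is found, the sketch can skip comparisons between records that are already grouped, and clusters are available before the join finishes. This only hints at the kind of benefit an integrated operation might bring.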