Authors: Leonardo Andrade Ribeiro 1 ; Alfredo Cuzzocrea 2 ; Karen Aline Alves Bezerra 3 and Ben Hur Bahia do Nascimento 3

Affiliations: 1 Universidade Federal de Goiás, Brazil ; 2 University of Trieste and ICAR-CNR, Italy ; 3 Universidade Federal de Lavras, Brazil

Keyword(s): Data Integration, Data Cleaning, Duplicate Identification, Set Similarity Joins, Clustering.

Related Ontology Subjects/Areas/Topics: Coupling and Integrating Heterogeneous Data Sources ; Databases and Information Systems Integration ; Enterprise Information Systems ; Performance Evaluation and Benchmarking ; Query Languages and Query Processing

Abstract: A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm to group together records that refer to the same entity. However, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance and produces results only at the end of the whole process. In this paper, we propose SJClust, a framework that integrates similarity join and clustering into a single operation. Our approach smoothly accommodates a variety of cluster representation and merging strategies within set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
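
To make the two-step baseline criticized in the abstract concrete, the following is a minimal sketch of it, not the authors' SJClust implementation. It assumes whitespace tokenization, Jaccard similarity, a naive all-pairs join, and connected components as the clustering step; the names `tokenize`, `jaccard`, `UnionFind`, and `join_then_cluster` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def tokenize(record: str) -> frozenset:
    """Assumed tokenization: lowercase, split on whitespace."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class UnionFind:
    """Disjoint-set structure used to group record indices."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)


def join_then_cluster(records: list, threshold: float) -> list:
    """Two-step baseline: similarity join first, clustering afterwards."""
    sets = [tokenize(r) for r in records]

    # Step 1: naive set similarity join (all pairs, no filtering).
    pairs = [
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]

    # Step 2: clustering as a post-processing step over the join result.
    # Here, connected components of the similarity graph (an assumption).
    uf = UnionFind(len(records))
    for i, j in pairs:
        uf.union(i, j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


records = ["John A. Smith", "john a. smith jr", "Jon Smith",
           "Mary Jones", "mary jones"]
print(join_then_cluster(records, 0.7))
```

In this sketch no cluster is available until the full list of pairs has been computed, which is the drawback the abstract points out. An integrated design would update cluster state while the join is still producing pairs; how SJClust does this is described in the paper itself, not here.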

CC BY-NC-ND 4.0

Paper citation in several formats:
Ribeiro, L.; Cuzzocrea, A.; Bezerra, K. and Nascimento, B. (2016). SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering. In Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS; ISBN 978-989-758-187-8; ISSN 2184-4992, SciTePress, pages 75-80. DOI: 10.5220/0005868700750080

@conference{iceis16,
author={Leonardo Andrade Ribeiro and Alfredo Cuzzocrea and Karen Aline Alves Bezerra and Ben Hur Bahia do Nascimento},
title={SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering},
booktitle={Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2016},
pages={75--80},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005868700750080},
isbn={978-989-758-187-8},
issn={2184-4992},
}

TY - CONF

JO - Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering
SN - 978-989-758-187-8
IS - 2184-4992
AU - Ribeiro, L.
AU - Cuzzocrea, A.
AU - Bezerra, K.
AU - Nascimento, B.
PY - 2016
SP - 75
EP - 80
DO - 10.5220/0005868700750080
PB - SciTePress