Paper

Authors: Andrea Colombo and Francesco Invernici

Affiliation: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via G. Ponzio 34, Milan, Italy

Keyword(s): Record Linkage, Entity Resolution, Apache Spark, Hadoop, Big Data Integration.

Abstract: Linking entities from different datasets is a crucial task for the success of modern businesses. However, aligning entities becomes challenging when common identifiers are missing. In that case, the process must rely on string-based attributes, such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, introducing a bottleneck in the data integration activity and a burden on end users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data arrive daily from external sources. This work presents a novel record linkage tool, BeRTo, that addresses the problem of linking a specific type of data source, i.e., business registries, which contain information about companies and corporations. While being domain-specific limits its usability in other contexts, it reaches a new frontier in terms of both precision and scalability, as it is built on Spark. Integrating the pre-processing and cleaning steps in the same tool creates a user-friendly end-to-end pipeline that requires users only to input the raw data and set their preferred configuration, allowing them to focus on either recall or precision.

CC BY-NC-ND 4.0

Paper citation in several formats:
Colombo, A. and Invernici, F. (2024). BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-707-8; ISSN 2184-285X, SciTePress, pages 259-268. DOI: 10.5220/0012718000003756

@conference{data24,
author={Andrea Colombo and Francesco Invernici},
title={BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA},
year={2024},
pages={259-268},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012718000003756},
isbn={978-989-758-707-8},
issn={2184-285X},
}

TY - CONF

JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA
TI - BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
SN - 978-989-758-707-8
IS - 2184-285X
AU - Colombo, A.
AU - Invernici, F.
PY - 2024
SP - 259
EP - 268
DO - 10.5220/0012718000003756
PB - SciTePress