BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments

Andrea Colombo, Francesco Invernici

2024

Abstract

Linking entities from different datasets is a crucial task for the success of modern businesses. However, aligning entities becomes challenging when common identifiers are missing. The process must then rely on string-based attributes, such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, which creates a bottleneck in the data integration activity and places a burden on users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data arrive daily from external sources. This work presents a novel record linkage tool, BeRTo, that addresses the problem of linking a specific type of data source, namely business registries, which contain information about companies and corporations. Although its domain-specific design limits its usability in other contexts, BeRTo reaches a new frontier in terms of both precision and scalability, as it is built on Spark. By integrating the pre-processing and cleaning steps in the same tool, it provides a user-friendly end-to-end pipeline that requires users only to input the raw data and set their preferred configuration, allowing them to favor either recall or precision.
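
To make the idea of string-based linkage with a precision/recall setting concrete, the following is a minimal single-machine sketch in plain Python. It is not BeRTo's implementation or API: the `LEGAL_FORMS` list, the `normalize` and `link` functions, and the `threshold` parameter are all hypothetical names chosen for illustration, and the standard-library `difflib` similarity stands in for whatever matching BeRTo performs on Spark.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens to strip; not taken from BeRTo.
LEGAL_FORMS = {"inc", "ltd", "llc", "spa", "srl", "gmbh", "corp", "co"}


def normalize(name: str) -> str:
    """Lowercase a company name, drop punctuation and legal-form tokens."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_FORMS)


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link(left, right, threshold=0.9):
    """For each name in `left`, keep its best match in `right` (assumed
    non-empty) if the similarity reaches `threshold`.

    A higher threshold favors precision; a lower one favors recall.
    """
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        score = similarity(a, best)
        if score >= threshold:
            links.append((a, best, score))
    return links


if __name__ == "__main__":
    registry_a = ["Acme S.p.A.", "Globex Corp."]
    registry_b = ["ACME SPA", "Globex Corporation", "Initech LLC"]
    # Strict setting: only near-identical names are linked.
    print(link(registry_a, registry_b, threshold=0.9))
    # Lenient setting: more candidate links, at the cost of precision.
    print(link(registry_a, registry_b, threshold=0.4))
```

In this sketch the cleaning step and the matching step live in the same function call, so the caller supplies only raw names and a threshold. The all-pairs comparison is quadratic and is only meant to show the logic; it says nothing about how BeRTo distributes the work.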


Paper Citation


in Harvard Style

Colombo A. and Invernici F. (2024). BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-707-8, SciTePress, pages 259-268. DOI: 10.5220/0012718000003756


in Bibtex Style

@conference{data24,
author={Andrea Colombo and Francesco Invernici},
title={BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2024},
pages={259-268},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012718000003756},
isbn={978-989-758-707-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
SN - 978-989-758-707-8
AU - Colombo A.
AU - Invernici F.
PY - 2024
SP - 259
EP - 268
DO - 10.5220/0012718000003756
PB - SciTePress