BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
Andrea Colombo, Francesco Invernici
2024
Abstract
Linking entities from different datasets is a crucial task for the success of modern businesses. However, aligning entities becomes challenging when common identifiers are missing. In that case, the process must rely on string-based attributes, such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, which introduces a bottleneck in the data integration activity and places a burden on users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data arrive daily from external sources. This work presents a novel record linkage tool, BeRTo, that addresses the problem of linking a specific type of data source, i.e., business registries, which contain information about companies and corporations. While being domain-specific limits its usability in other contexts, BeRTo reaches a new frontier in terms of precision and also scalability, as it is built on Spark. By integrating the pre-processing and cleaning steps in the same tool, it provides a user-friendly end-to-end pipeline that requires users only to input the raw data and set their preferred configuration, allowing them to focus on either recall or precision.
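To make the idea of an integrated cleaning and matching pipeline with a precision/recall setting concrete, the following is a minimal sketch in plain Python. It is not BeRTo's algorithm and does not use Spark: the legal-form list, the function names `normalize`, `similarity`, and `link`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher

# Hypothetical set of legal-form tokens to strip; not taken from the paper.
LEGAL_FORMS = {"inc", "ltd", "llc", "spa", "srl", "gmbh", "corp", "co"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and remove legal-form tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two cleaned company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link(left, right, threshold):
    """Return (left, right) name pairs whose similarity meets the threshold.

    All pairs are compared, which is quadratic; a distributed tool would
    need blocking or a similar strategy to scale.
    """
    return [(l, r) for l in left for r in right if similarity(l, r) >= threshold]
```

In this sketch, a higher `threshold` keeps only near-identical cleaned names and so favors precision, while a lower one admits looser matches and favors recall. The cleaning step is applied inside `similarity`, so callers pass raw names only.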
Paper Citation
in Harvard Style
Colombo A. and Invernici F. (2024). BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-707-8, SciTePress, pages 259-268. DOI: 10.5220/0012718000003756
in Bibtex Style
@conference{data24,
author={Andrea Colombo and Francesco Invernici},
title={BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2024},
pages={259-268},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012718000003756},
isbn={978-989-758-707-8},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
SN - 978-989-758-707-8
AU - Colombo A.
AU - Invernici F.
PY - 2024
SP - 259
EP - 268
DO - 10.5220/0012718000003756
PB - SciTePress