BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments

Andrea Colombo; Francesco Invernici

Topics: Big Data Applications; Data Fusion; Data Science; Information Integration

In Proceedings of the 13th International Conference on Data Science, Technology and Applications (DATA) - Volume 1, pages 259-268, 2024, Dijon, France

Authors: Andrea Colombo and Francesco Invernici

Affiliation: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via G. Ponzio 34, Milan, Italy

Keyword(s): Record Linkage, Entity Resolution, Apache Spark, Hadoop, Big Data Integration.

Abstract: Linking entities from different datasets is a crucial task for the success of modern businesses. However, aligning entities becomes challenging when common identifiers are missing. In that case, the process must rely on string-based attributes, such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, introducing a bottleneck in the data integration activity and a burden on users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data flow in daily from external sources. This work presents a novel record linkage tool, BeRTo, that addresses the problem of linking a specific type of data source, i.e., business registries, which contain information about companies and corporations. While being domain-specific limits its usability in other contexts, it reaches a new frontier in terms of both precision and scalability, as it is built on Spark. Integrating the pre-processing and cleaning steps in the same tool creates a user-friendly end-to-end pipeline that requires users only to input the raw data and set their preferred configuration, allowing them to prioritize either recall or precision.
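The abstract's point about matching on string-based attributes, with cleaning built into the same step and a configurable trade-off between recall and precision, can be illustrated with a minimal sketch. This is not BeRTo's implementation, which the abstract only describes as Spark-based. It is a plain-Python illustration using the standard library, and the legal-form list, the function names `normalize` and `link`, and the default threshold are all hypothetical choices made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical set of legal-form tokens to strip before comparing names.
# An assumption for illustration only; not taken from BeRTo.
LEGAL_FORMS = {"spa", "srl", "ltd", "inc", "llc", "gmbh", "corp", "co"}


def normalize(name: str) -> str:
    """Lowercase a company name, drop punctuation, and remove legal-form tokens."""
    lowered = name.lower().replace(".", "")      # "S.p.A." -> "spa"
    cleaned = re.sub(r"[^a-z0-9 ]", "", lowered)  # drop remaining punctuation
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def link(left, right, threshold=0.85):
    """Return (left_name, right_name, score) for pairs scoring at or above threshold.

    A higher threshold favors precision; a lower one favors recall.
    The all-pairs comparison is quadratic and only suitable for small inputs.
    """
    matches = []
    for a in left:
        for b in right:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 3)))
    return matches
```

Calling `link(["Acme Widgets S.p.A."], ["ACME WIDGETS SPA", "Beta Logistics Ltd"])` pairs the first two names, since both normalize to the same string, and leaves the third unmatched at the default threshold. Lowering `threshold` admits more candidate pairs at the cost of more false matches.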

CC BY-NC-ND 4.0

Paper citation in several formats:

Colombo, A. and Invernici, F. (2024). BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-707-8; ISSN 2184-285X, SciTePress, pages 259-268. DOI: 10.5220/0012718000003756

@conference{data24,
author={Andrea Colombo and Francesco Invernici},
title={BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA},
year={2024},
pages={259-268},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012718000003756},
isbn={978-989-758-707-8},
issn={2184-285X},
}

TY - CONF

JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA
TI - BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
SN - 978-989-758-707-8
IS - 2184-285X
AU - Colombo, A.
AU - Invernici, F.
PY - 2024
SP - 259
EP - 268
DO - 10.5220/0012718000003756
PB - SciTePress