used as an additional resource to improve the final results.
We have also developed a plain Python version of BeRTo for
smaller record linkage tasks, which is also available in our
repository. While not scalable, it is suitable for linking
small business registry datasets. In future work, we plan to
develop a graphical user interface for BeRTo to make the
tool more user-friendly.
ACKNOWLEDGEMENTS
Andrea Colombo kindly acknowledges INPS for
funding his Ph.D. program.