EM-Join: Efficient Entity Matching Using Embedding-Based Similarity Join
Douglas Rolins Santana, Paulo Lima, Leonardo Ribeiro
2025
Abstract
Entity matching in textual data remains challenging because of variations in data representation and high computational cost. In this paper, we propose an efficient pipeline for entity matching that combines text preprocessing, embedding-based data representation, and similarity joins with a heuristic-driven method for threshold selection. Our approach simplifies the matching process by concatenating attribute values and using specialized language models to generate embeddings, followed by a fast similarity join evaluation. We compare our method against state-of-the-art techniques, namely Ditto, Ember, and DeepMatcher, across 13 publicly available datasets. Our solution achieves superior performance on 3 datasets while maintaining competitive accuracy on the others, and it significantly reduces execution time, running up to 3x faster than Ditto. The results demonstrate the potential for high-speed, scalable entity matching in practical applications.
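To make the pipeline in the abstract concrete, the following is a minimal sketch of its three stages: attribute concatenation, embedding, and a threshold-based similarity join. It is not the authors' implementation. The function names (`serialize`, `toy_embed`, `similarity_join`) are hypothetical, the hashed character-trigram encoder is only a stand-in for the specialized language models the paper uses, and the fixed `threshold` argument takes the place of the heuristic-driven threshold selection, which is not reproduced here.

```python
import zlib

import numpy as np


def serialize(record: dict) -> str:
    """Concatenate all attribute values of a record into one lowercase string."""
    return " ".join(str(v).lower().strip() for v in record.values())


def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in encoder (assumption): hashed character trigrams, L2-normalised.

    A real pipeline would call a language-model encoder here instead.
    """
    vec = np.zeros(dim)
    padded = f"  {text} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(trigram) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def similarity_join(left: list, right: list, threshold: float) -> list:
    """Return (left_index, right_index, cosine) for pairs scoring >= threshold."""
    a = np.stack([toy_embed(serialize(r)) for r in left])
    b = np.stack([toy_embed(serialize(r)) for r in right])
    # Rows are unit vectors, so the dot product equals cosine similarity.
    sims = a @ b.T
    rows, cols = np.nonzero(sims >= threshold)
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(rows, cols)]


if __name__ == "__main__":
    # Illustrative records only; they are not taken from the paper's datasets.
    left = [{"title": "Apple iPhone 13 128GB", "brand": "Apple"}]
    right = [
        {"title": "apple iphone 13 128 gb", "brand": "apple"},
        {"title": "Sony WH-1000XM4 headphones", "brand": "Sony"},
    ]
    for i, j, score in similarity_join(left, right, threshold=0.5):
        print(i, j, round(score, 3))
```

The sketch compares every pair with a dense matrix product, which is only practical for small inputs. The threshold of 0.5 in the usage example is an arbitrary illustrative value, not one reported in the paper.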
Paper Citation
in Harvard Style
Santana D., Lima P. and Ribeiro L. (2025). EM-Join: Efficient Entity Matching Using Embedding-Based Similarity Join. In Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS; ISBN 978-989-758-749-8, SciTePress, pages 402-409. DOI: 10.5220/0013483700003929
in Bibtex Style
@conference{iceis25,
author={Douglas Santana and Paulo Lima and Leonardo Ribeiro},
title={EM-Join: Efficient Entity Matching Using Embedding-Based Similarity Join},
booktitle={Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2025},
pages={402-409},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013483700003929},
isbn={978-989-758-749-8},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - EM-Join: Efficient Entity Matching Using Embedding-Based Similarity Join
SN - 978-989-758-749-8
AU - Santana D.
AU - Lima P.
AU - Ribeiro L.
PY - 2025
SP - 402
EP - 409
DO - 10.5220/0013483700003929
PB - SciTePress