
6 CONCLUSION
This paper proposed a new entity matching (EM) technique
that combines text embeddings generated by pre-trained
language models with a similarity join mechanism. By
selecting the similarity threshold heuristically, our
method achieved competitive accuracy, outperforming
Ditto, the state-of-the-art EM solution, on 3 of the 13
tested datasets, while significantly reducing execution
time: it ran up to 3 times faster than Ditto. These
results indicate that our approach balances accuracy and
speed, making it suitable for large-scale, real-time
applications.
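The general idea of a threshold-based similarity join over embeddings can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the function names `cosine` and `threshold_join`, the toy vectors, and the threshold value 0.8 are all assumptions chosen for illustration, and the exhaustive nested loop stands in for whatever indexing strategy an efficient join would use.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_join(
    left: List[Sequence[float]],
    right: List[Sequence[float]],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for every cross-table pair at or above the threshold.

    Hypothetical helper: compares all pairs exhaustively, which is only
    practical for small inputs.
    """
    matches = []
    for i, u in enumerate(left):
        for j, v in enumerate(right):
            sim = cosine(u, v)
            if sim >= threshold:
                matches.append((i, j, sim))
    return matches


# Toy 3-dimensional vectors standing in for record embeddings (made-up values).
left_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
right_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

# The threshold 0.8 is an arbitrary illustrative choice, not a tuned value.
matches = threshold_join(left_embeddings, right_embeddings, threshold=0.8)
print(matches)
```

In this sketch, raising the threshold trades recall for precision, which is why the choice of threshold matters for matching accuracy.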
For future work, we plan to refine the threshold
selection process to further improve accuracy,
particularly on textual and dirty datasets. We also
intend to evaluate our method in other application
domains and on larger datasets. Additionally, we will
focus on integrating more advanced language models and
on optimizing computational efficiency, in order to
expand the versatility, robustness, and scalability of
our proposed solution.
ACKNOWLEDGEMENTS
This work was partially supported by CAPES/Brazil
and LaMCAD/UFG.
REFERENCES
Barlaug, N. and Gulla, J. A. (2021). Neural Networks for
Entity Matching: A Survey. ACM Transactions on
Knowledge Discovery from Data, 15(3):52:1–52:37.
Clark, K. and Manning, C. D. (2016). Improving Corefer-
ence Resolution by Learning Entity-Level Distributed
Representations. In Proceedings of the Association for
Computational Linguistics, pages 643–653.
Das, S., G.C., P. S., Doan, A., Naughton, J. F., Krishnan,
G., Deep, R., Arcaute, E., Raghavendra, V., and Park,
Y. (2017). Falcon: Scaling up hands-off crowdsourced
entity matching to build cloud services. SIGMOD ’17,
pages 1431–1446, New York, NY, USA. ACM.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019).
BERT: Pre-training of Deep Bidirectional Transform-
ers for Language Understanding. In Proceedings of
the ACL, pages 4171–4186.
Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S.
(2007). Duplicate Record Detection: A Survey. IEEE
Transactions on Knowledge and Data Engineering,
19(1):1–16.
Johnson, J., Douze, M., and Jégou, H. (2019). Billion-scale
similarity search with GPUs. IEEE Transactions on Big
Data, 7(3):535–547.
Leone, M., Huber, S., Arora, A., García-Durán, A., and
West, R. (2022). A Critical Re-evaluation of Neural
Methods for Entity Alignment. Proceedings of the
VLDB Endowment, 15(8):1712–1725.
Li, Y., Li, J., Suhara, Y., Doan, A., and Tan, W.-C. (2023).
Effective entity matching with transformers. The
VLDB Journal, 32(6):1215–1235.
Lima, P. H. S., Santana, D. R., Martins, W. S., and Ribeiro,
L. A. (2023). Evaluation of Deep Learning Tech-
niques for Entity Matching. In International Confer-
ence on Enterprise Information Systems, pages 247–
254.
Malkov, Y. A. and Yashunin, D. A. (2020). Efficient and
Robust Approximate Nearest Neighbor Search Using
Hierarchical Navigable Small World Graphs. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 42(4):824–836.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Ef-
ficient Estimation of Word Representations in Vector
Space. In Bengio, Y. and LeCun, Y., editors, Interna-
tional Conference on Learning Representations.
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Kr-
ishnan, G., Deep, R., Arcaute, E., and Raghavendra,
V. (2018). Deep Learning for Entity Matching: A De-
sign Space Exploration. In Proceedings of the SIG-
MOD Conference, pages 19–34. ACM.
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N.
(2023). MTEB: Massive Text Embedding Bench-
mark. In Proceedings of the ACL, pages 2014–2037,
Dubrovnik, Croatia. ACL.
Newcombe, H., Kennedy, J., Axford, S., and James, A.
(1959). Automatic Linkage of Vital Records. Science,
130(3381):954–959.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sen-
tence Embeddings using Siamese BERT-Networks.
Proceedings of the Conference on Empirical Methods
in Natural Language Processing, pages 3982–3992.
Santana, D. R. and Ribeiro, L. A. (2023). Approx-
imate Similarity Joins over Dense Vector Embed-
dings. In Proceedings of the Brazilian Symposium on
Databases, pages 51–62. SBC.
Shen, W., Wang, J., and Han, J. (2015). Entity Linking
with a Knowledge Base: Issues, Techniques, and So-
lutions. IEEE Transactions on Knowledge and Data
Engineering, 27(2):443–460.
Suri, R., Fischer, J., Madden, S., and Stonebraker, M.
(2022). Ember: No-code context enrichment via
similarity-based keyless joins. Proceedings of the
VLDB Endowment, 15:699–712.
Thirumuruganathan, S., Li, H., Tang, N., Ouzzani, M.,
Govind, Y., Paulsen, D., Fung, G., and Doan, A.
(2021). Deep Learning for Blocking in Entity Match-
ing: A Design Space Exploration. Proceedings of the
VLDB Endowment, 14(11):2459–2472.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is All you Need. In Proceedings
of the Conference on Neural Information Processing
Systems, pages 5998–6008.
EM-Join: Efficient Entity Matching Using Embedding-Based Similarity Join