Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering

Alieh Saeedi, Lucie David, Erhard Rahm

2021

Abstract

We propose extensions to Hierarchical Agglomerative Clustering (HAC) to match and cluster entities from multiple sources, each of which can be either duplicate-free or dirty. The proposed scheme is evaluated against standard HAC and other entity clustering approaches with respect to efficiency and efficacy. All proposed algorithms can run in parallel on a distributed cluster to improve scalability to large data volumes. The evaluation on diverse datasets shows that the new approach can exploit duplicate-free sources and achieves better match quality than previous methods.
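To make the idea of exploiting duplicate-free sources concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes single-linkage HAC over pairwise similarities and reads "duplicate-free" as meaning that a cluster should never contain two records from the same clean source. The function name `constrained_hac`, its parameters, and the toy data are all hypothetical.

```python
from itertools import combinations


def constrained_hac(entities, sim, threshold, clean_sources):
    """Greedy single-linkage HAC with a source-consistency constraint.

    entities:      dict mapping entity id -> source name
    sim:           dict mapping frozenset({a, b}) -> similarity in [0, 1]
    threshold:     minimum linkage similarity required to merge
    clean_sources: set of source names assumed to be duplicate-free
    (All names here are assumptions made for this sketch.)
    """
    clusters = [frozenset([e]) for e in sorted(entities)]

    def linkage(c1, c2):
        # Single linkage: the best similarity between any two members.
        return max(sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)

    def compatible(c1, c2):
        # Assumed constraint: two clusters must not share a clean source.
        s1 = {entities[e] for e in c1 if entities[e] in clean_sources}
        s2 = {entities[e] for e in c2 if entities[e] in clean_sources}
        return not (s1 & s2)

    while True:
        best = None
        for c1, c2 in combinations(clusters, 2):
            if not compatible(c1, c2):
                continue
            s = linkage(c1, c2)
            if s >= threshold and (best is None or s > best[0]):
                best = (s, c1, c2)
        if best is None:
            return clusters  # no admissible merge left
        _, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]


# Toy data: source "A" is treated as duplicate-free, "B" and "C" as dirty.
entities = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
sim = {
    frozenset(("a1", "a2")): 0.95,
    frozenset(("a1", "b1")): 0.90,
    frozenset(("a2", "b1")): 0.85,
    frozenset(("b1", "c1")): 0.80,
}
with_constraint = constrained_hac(entities, sim, 0.7, {"A"})
without_constraint = constrained_hac(entities, sim, 0.7, set())
```

With source "A" marked as clean, `a1` and `a2` stay in separate clusters despite their high similarity. Without the constraint, all four records collapse into one cluster.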


Paper Citation


in Harvard Style

Saeedi A., David L. and Rahm E. (2021). Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering. In Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 2: KEOD; ISBN 978-989-758-533-3, SciTePress, pages 40-50. DOI: 10.5220/0010649600003064


in Bibtex Style

@conference{keod21,
author={Alieh Saeedi and Lucie David and Erhard Rahm},
title={Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering},
booktitle={Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 2: KEOD},
year={2021},
pages={40-50},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010649600003064},
isbn={978-989-758-533-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 2: KEOD
TI - Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering
SN - 978-989-758-533-3
AU - Saeedi A.
AU - David L.
AU - Rahm E.
PY - 2021
SP - 40
EP - 50
DO - 10.5220/0010649600003064
PB - SciTePress