An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases
Emiel Caron
2020
Abstract
Many databases contain ambiguous and unstructured data, which makes the information they contain difficult to use for further analysis. For these databases to be a reliable point of reference, the data needs to be cleaned. Entity resolution focuses on disambiguating records that refer to the same entity. In this paper we propose a generic optimization method for disambiguating large databases. We apply this method to a table of scientific references from the Patstat database. The table holds ambiguous information on citations to scientific references. The method creates clusters of records that refer to the same bibliographic entity. It starts by pre-cleaning the records and extracting bibliographic labels. Next, we construct rules based on these labels and use the tf-idf algorithm to compute string similarities. We create clusters by means of a rule-based scoring system. Finally, we perform precision-recall analysis using a golden set of clusters and optimize our parameters with simulated annealing. We show that it is possible to optimize the performance of a disambiguation method using a global optimization algorithm.
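The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the smoothed idf formula, the single "same year" rule, the connected-components clustering, the pairwise F1 objective, the cooling schedule, and all names (`tfidf_vectors`, `cluster`, `pairwise_f1`, `anneal`) are assumptions chosen for illustration, and the records are invented toy strings rather than Patstat data.

```python
import math
import random
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Return one L2-normalised tf-idf vector (as a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(tok for toks in tokenised for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        # Smoothed idf (an assumption; other tf-idf variants would work too).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cluster(records, sim_weight, year_weight, threshold):
    """Link record pairs whose rule-based score reaches the threshold.

    Clusters are the connected components of the resulting links.
    """
    vecs = tfidf_vectors([r["text"] for r in records])
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        score = sim_weight * cosine(vecs[i], vecs[j])
        if records[i]["year"] == records[j]["year"]:  # hypothetical label rule
            score += year_weight
        if score >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(records))]

def pairwise_f1(predicted, golden):
    """F1 over record pairs: a pair is positive when both share a cluster."""
    pairs = list(combinations(range(len(golden)), 2))
    pred_pos = sum(1 for i, j in pairs if predicted[i] == predicted[j])
    gold_pos = sum(1 for i, j in pairs if golden[i] == golden[j])
    tp = sum(1 for i, j in pairs
             if predicted[i] == predicted[j] and golden[i] == golden[j])
    if tp == 0:
        return 0.0
    precision, recall = tp / pred_pos, tp / gold_pos
    return 2 * precision * recall / (precision + recall)

def anneal(records, golden, steps=200, seed=0):
    """Simulated annealing over (sim_weight, year_weight, threshold)."""
    rng = random.Random(seed)
    params = [1.0, 0.5, 1.0]
    current = pairwise_f1(cluster(records, *params), golden)
    best, best_params = current, params[:]
    temp = 1.0
    for _ in range(steps):
        cand = [max(0.0, p + rng.gauss(0, 0.2)) for p in params]
        f1 = pairwise_f1(cluster(records, *cand), golden)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if f1 >= current or rng.random() < math.exp((f1 - current) / temp):
            params, current = cand, f1
            if current > best:
                best, best_params = current, params[:]
        temp *= 0.98  # geometric cooling (an assumption)
    return best_params, best

# Invented toy records and golden clusters, for illustration only.
records = [
    {"text": "smith j 1999 nature 400 123", "year": 1999},
    {"text": "j smith nature vol 400 p 123 1999", "year": 1999},
    {"text": "doe a 2005 science 310 45", "year": 2005},
    {"text": "a doe science vol 310 pp 45 2005", "year": 2005},
    {"text": "lee k 2010 cell 140 900", "year": 2010},
]
golden = [0, 0, 1, 1, 2]
best_params, best_f1 = anneal(records, golden)
print(best_params, best_f1)
```

The sketch scores every record pair, which is quadratic in the number of records; a database-scale run would need some form of candidate pre-selection, which is left out here.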
Paper Citation
in Harvard Style
Caron E. (2020). An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases. In Proceedings of the 15th International Conference on Software Technologies - Volume 1: ICSOFT, ISBN 978-989-758-443-5, pages 625-632. DOI: 10.5220/0009972406250632
in Bibtex Style
@conference{icsoft20,
author={Emiel Caron},
title={An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases},
booktitle={Proceedings of the 15th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2020},
pages={625-632},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009972406250632},
isbn={978-989-758-443-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 15th International Conference on Software Technologies - Volume 1: ICSOFT
TI - An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases
SN - 978-989-758-443-5
AU - Caron E.
PY - 2020
SP - 625
EP - 632
DO - 10.5220/0009972406250632