computing as the rules could be split over multiple
cores of a computer and constructed simultaneously.
This would significantly decrease the computation time
of the disambiguation method. Moreover, exploring
global optimization techniques other than simulated
annealing would provide an interesting comparison. A
potential candidate is Tabu search.
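To make the suggested comparison concrete, the following is a minimal sketch of Tabu search, not the method used in this work. It assumes a search space of binary vectors with single-bit-flip moves, and the names `tabu_search`, `toy_cost`, and `TARGET` are hypothetical; `toy_cost` is only a stand-in for the actual disambiguation objective.

```python
from collections import deque

# Hypothetical toy objective: Hamming distance to a fixed target vector.
# It stands in for the real cost function, which is not reproduced here.
TARGET = (1, 0, 1, 1, 0, 1)


def toy_cost(x):
    return sum(a != b for a, b in zip(x, TARGET))


def tabu_search(objective, start, n_iter=100, tabu_size=5):
    """Minimise `objective` over binary tuples using single-bit-flip moves."""
    current = tuple(start)
    best, best_val = current, objective(current)
    tabu = deque(maxlen=tabu_size)  # positions flipped most recently
    for _ in range(n_iter):
        candidates = []
        for i in range(len(current)):
            neighbour = current[:i] + (1 - current[i],) + current[i + 1:]
            val = objective(neighbour)
            # Aspiration criterion: a tabu move is still allowed
            # if it improves on the best value found so far.
            if i not in tabu or val < best_val:
                candidates.append((val, i, neighbour))
        if not candidates:
            break
        # Move to the best admissible neighbour, even if it is worse than
        # the current point; this is what lets the search leave local optima.
        val, i, current = min(candidates)
        tabu.append(i)
        if val < best_val:
            best, best_val = current, val
    return best, best_val


best, best_val = tabu_search(toy_cost, (0, 0, 0, 0, 0, 0))
```

Unlike simulated annealing, this sketch is deterministic: it always takes the best admissible neighbour and relies on the tabu list, rather than on random acceptance, to avoid cycling back to recently visited solutions.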
ACKNOWLEDGEMENTS
We kindly acknowledge Wen Xin Lin and Prof.
H.A.M. Daniels for their contribution to this work.
An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases