point to the same bibliographic entity. The method
begins by pre-cleaning the records and extracting bib-
liographic labels. Subsequently, rules are developed
based on the labels combined with string similarity
measures, and clusters are created by a rule-based
scoring system. Lastly, precision-recall analysis is
performed using a golden set of clusters, to optimize
the rule weights and thresholds. The results demonstrate
that it is feasible to optimize the overall F1-score of the
disambiguation method using a global optimization algorithm,
and to obtain the best parameters for disambiguating the
whole database of scientific references. By changing the
rules, the method can be applied directly to similar ER
problems. The method is therefore generic in nature.
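The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the two rules, their weights, the 0.9 similarity cut-off, and the use of pairwise F1 against the golden clusters are all assumptions made for illustration. Clustering here uses connected components over the links whose rule score reaches the threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; "title" and "year" stand in for extracted labels.
RECORDS = {
    "a": {"title": "graph theory with applications", "year": "1976"},
    "b": {"title": "graph theory with aplications", "year": "1976"},
    "c": {"title": "optimization by simulated annealing", "year": "1983"},
    "d": {"title": "optimisation by simulated annealing", "year": "1983"},
    "e": {"title": "entity resolution survey", "year": "1983"},
}
# Hypothetical golden set of clusters used for evaluation.
GOLDEN = [{"a", "b"}, {"c", "d"}, {"e"}]


def similar(x, y):
    """String similarity in [0, 1] (difflib ratio, an assumed measure)."""
    return SequenceMatcher(None, x, y).ratio()


# Assumed rules: each returns True when the pair of records satisfies it.
RULES = {
    "same_year": lambda r, s: r["year"] == s["year"],
    "title_similar": lambda r, s: similar(r["title"], s["title"]) >= 0.9,
}


def score(r, s, weights):
    """Sum the weights of all rules that fire for the pair (r, s)."""
    return sum(w for name, w in weights.items() if RULES[name](r, s))


def cluster(records, weights, threshold):
    """Link pairs scoring >= threshold; return connected components."""
    parent = {k: k for k in records}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for p, q in combinations(records, 2):
        if score(records[p], records[q], weights) >= threshold:
            parent[find(p)] = find(q)
    groups = {}
    for k in records:
        groups.setdefault(find(k), set()).add(k)
    return list(groups.values())


def pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(predicted, golden):
    """Pairwise F1 of predicted clusters against the golden clusters."""
    pred, gold = pairs(predicted), pairs(golden)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


WEIGHTS = {"same_year": 1.0, "title_similar": 2.0}
strict = pairwise_f1(cluster(RECORDS, WEIGHTS, 3.0), GOLDEN)
loose = pairwise_f1(cluster(RECORDS, WEIGHTS, 1.0), GOLDEN)
```

In this toy setting, lowering the threshold lets the year rule alone merge records "c", "d", and "e", which costs precision. That trade-off is what an optimizer over weights and thresholds would be searching.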
In future research, several directions might be ex-
plored to obtain the optimal configuration for the
method. Firstly, our method can be analyzed on other
datasets for ER, to study whether the results are sta-
ble and to compare the evaluations. Secondly, this
work revealed additional challenges worth investigat-
ing with respect to the incorporated rules. A possi-
ble future direction is to check the algorithm’s be-
haviour when increasing the number of used rules,
and another direction is moving towards rules that can
evolve over time. Thirdly, more needs to be known about
which clustering algorithm is best suited to merging similar
name variants, e.g. through an in-depth comparison between
the connected-components and max-clique algorithms.
Fourthly, alternative optimization
techniques might be used that produce similar or even
better results in terms of efficiency and/or effective-
ness. A comparison between simulated annealing,
Tabu search, and a genetic algorithm is therefore en-
visaged.
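As a point of reference for such a comparison, the following is a minimal sketch of simulated annealing over a bounded parameter vector. It is a generic textbook form, not the configuration used in this work: the function names, the geometric cooling schedule, the step size, and the toy objective `toy_f1` (a smooth stand-in for the F1-score as a function of one weight and one threshold) are all assumptions.

```python
import math
import random


def simulated_annealing(objective, start, bounds, steps=2000,
                        t0=1.0, cooling=0.995, step_size=0.1, seed=0):
    """Maximize `objective` over a box; returns (best_params, best_value)."""
    rng = random.Random(seed)  # seeded for reproducibility
    current = list(start)
    current_val = objective(current)
    best, best_val = list(current), current_val
    temp = t0
    for _ in range(steps):
        # Propose a neighbour by perturbing each parameter, clipped to bounds.
        cand = [min(hi, max(lo, x + rng.uniform(-step_size, step_size)))
                for x, (lo, hi) in zip(current, bounds)]
        val = objective(cand)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if val >= current_val or rng.random() < math.exp((val - current_val) / temp):
            current, current_val = cand, val
            if val > best_val:
                best, best_val = list(cand), val
        temp *= cooling  # geometric cooling (assumed schedule)
    return best, best_val


def toy_f1(params):
    """Hypothetical objective peaking at weight 0.7 and threshold 0.4."""
    w, t = params
    return 1.0 / (1.0 + (w - 0.7) ** 2 + (t - 0.4) ** 2)


best_params, best_value = simulated_annealing(
    toy_f1, start=[0.0, 0.0], bounds=[(0.0, 1.0), (0.0, 1.0)])
```

In an actual comparison, the objective would be the F1-score obtained by clustering with the candidate weights and thresholds. Tabu search or a genetic algorithm could then be compared on the same objective and evaluation budget.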
ACKNOWLEDGEMENTS
We kindly acknowledge Wen Xin Lin, Colin de
Ruiter, Mark Nijland, and Prof. Dr. H.A.M. Daniels
for their contributions to this work.
Entity Resolution in Large Patent Databases: An Optimization Approach