organisations the name variants are spread over a
number of accurate clusters. Clusters with a lower
precision are, in general, clusters belonging to very
large cities, where multiple research institutes can be
found in a relatively small area, which makes name
normalisation more difficult.
4 CONCLUSIONS
In this research we have presented an efficient, general
rule-based scoring method for clustering name
variants of organisations in large databases. The rules
are based on organisation name similarity and on
metadata in the context of the organisation, such as
country, postal code, email domain, and organisation type.
In principle, the method can work with any piece of
relevant metadata, as long as it is shared between
records. Because of the scoring system, multiple rules
can be combined to link organisation names.
The more rules that hold for a pair of organisation
names, the more evidence there is that the
names are indeed valid variants of
each other. In other words, the rules in the system
strengthen each other. Moreover, the rules are easy to
understand and combine. Incorrect matching of
organisation names is partly prevented by lowering
the scores for certain sensitive rules and by increasing
the threshold values, for example for geographic
locations with a high number of organisations.
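As an illustration, the scoring idea summarised above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the case study: the rule scores, the two threshold values, the list of dense cities, the field names and the example records are all hypothetical, only metadata rules are shown (name similarity rules are left out), and forming clusters as connected components of the linked pairs is one possible choice that is assumed here.

```python
from itertools import combinations

# Hypothetical rule scores and thresholds, chosen only for illustration.
RULE_SCORES = {"postal_code": 3, "email_domain": 4, "org_type": 1}
DEFAULT_THRESHOLD = 5
DENSE_CITY_THRESHOLD = 8   # stricter threshold where many organisations share an area
DENSE_CITIES = {"London"}  # hypothetical list of such locations


def pair_score(a, b):
    """Sum the scores of all metadata rules that hold for a pair of records."""
    return sum(
        score
        for field, score in RULE_SCORES.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def threshold_for(a, b):
    """Use the stricter threshold if either record lies in a dense city."""
    if a.get("city") in DENSE_CITIES or b.get("city") in DENSE_CITIES:
        return DENSE_CITY_THRESHOLD
    return DEFAULT_THRESHOLD


def cluster(records):
    """Link pairs whose score reaches the threshold; return connected components."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        if a["country"] != b["country"]:
            continue  # only compare records within the same country
        if pair_score(a, b) >= threshold_for(a, b):
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec["name"])
    return sorted(sorted(names) for names in groups.values())


# Invented example records.
records = [
    {"name": "Leiden Univ", "country": "NL", "city": "Leiden",
     "postal_code": "2333", "email_domain": "leidenuniv.nl", "org_type": "university"},
    {"name": "Univ Leiden", "country": "NL", "city": "Leiden",
     "postal_code": "2333", "email_domain": "leidenuniv.nl", "org_type": "university"},
    {"name": "Delft Univ Technol", "country": "NL", "city": "Delft",
     "postal_code": "2628", "email_domain": "tudelft.nl", "org_type": "university"},
    {"name": "Inst A London", "country": "GB", "city": "London",
     "postal_code": "SE5", "email_domain": "a.example", "org_type": "institute"},
    {"name": "Hosp A London", "country": "GB", "city": "London",
     "postal_code": "SE5", "email_domain": "a.example", "org_type": "hospital"},
]

clusters = cluster(records)
```

In this sketch the two Leiden records satisfy three rules and are linked, whereas the two London records satisfy only two rules and stay apart because of the raised threshold; with the default threshold they would have been merged.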
Based on the results of the case study, the clustering
method can be characterised as careful: it values
precision (on average 95%) over recall (on average
80%). In general, precision and recall are lower for
areas with a high number of scientific organisations.
Name variants of an organisation might be split over
multiple clusters if there is not enough evidence for
coupling the name variants together. However, these
alternative clusters do have a high precision and are
therefore useful for analysis.
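The trade-off between precision and recall, and the effect of a split cluster, can be made concrete with a small sketch. The definitions below are an assumption made for illustration (precision as the share of names in a cluster that belong to the organisation, recall as the share of the organisation's verified name variants that the cluster contains); the function name and the example names are hypothetical.

```python
def precision_recall(cluster, gold):
    """Precision and recall of one cluster against the verified variants of one organisation."""
    cluster, gold = set(cluster), set(gold)
    correct = len(cluster & gold)
    precision = correct / len(cluster) if cluster else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: four verified name variants of organisation A.
gold = {"Univ A", "A Univ", "Univ of A", "A State Univ"}

# The main cluster contains three correct variants and one wrong name.
main_cluster = {"Univ A", "A Univ", "Univ of A", "Inst B"}

# One variant ended up alone in an alternative cluster.
split_cluster = {"A State Univ"}

main_pr = precision_recall(main_cluster, gold)
split_pr = precision_recall(split_cluster, gold)
```

Under these assumed definitions the alternative cluster has perfect precision but low recall, which matches the observation that split clusters remain usable for analysis.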
In conclusion, the method can be viewed as a
general method for data cleaning, because it can be
applied to other types of data, e.g. person or author name
disambiguation (Caron and Van Eck, 2014), as long
as relevant metadata is available. In future
research, the cleaning method should be tested on
multiple databases with name variants to find optimal
values for scores and thresholds, and to improve the
quality of the method for very large cities. In addition,
we want to improve recall by
further integrating string similarity measures (Cohen
et al., 2003) in the method.
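One way a string similarity measure could enter the scoring system is as an additional rule that contributes a score when two names are sufficiently similar. The sketch below is an assumption about how such a rule might look, not a planned design: it uses the ratio from Python's standard difflib module as a stand-in for the measures compared by Cohen et al. (2003), and the cutoff and score values are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_CUTOFF = 0.75  # hypothetical cutoff above which the rule holds
SIMILARITY_SCORE = 2      # hypothetical score contributed by the rule


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_rule_score(a, b):
    """Return the rule's score if the two names are similar enough, else 0."""
    return SIMILARITY_SCORE if name_similarity(a, b) >= SIMILARITY_CUTOFF else 0


close_pair = similarity_rule_score("Leiden Univ", "Leiden University")
distant_pair = similarity_rule_score("Leiden Univ", "Delft Univ")
```

Because the rule only adds to the total score, it would strengthen pairs that already share metadata without linking names on string similarity alone, provided its score stays below the threshold.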
ACKNOWLEDGEMENTS
I thank Vasileios Stathias and Nees Jan van Eck for
their contributions to this research. In this study I used
the database facilities of the Centre for Science and
Technology Studies (CWTS, 2015).
REFERENCES
Caron, E., van Eck, N.J., 2014. Large scale author name
disambiguation using rules-based scoring and
clustering. In Proceedings of the 19th International
Conference on Science and Technology Indicators,
pages 79-86, Leiden, The Netherlands.
Cohen, W., Ravikumar, P., & Fienberg, S., 2003. A
comparison of string metrics for matching names and
records. In KDD Workshop on Data Cleaning and
Object Consolidation, Vol. 3, pp. 73-78.
CWTS, 2015. Centre for Science and Technology Studies,
http://www.cwts.nl, Leiden, The Netherlands.
De Bruin, R., Moed, H., 1990. The unification of addresses
in scientific publications. Informetrics, 89/90, 65-78.
Koudas, N., Marathe, A., & Srivastava, D., 2004.
Flexible string matching against large databases in
practice. Proceedings of the 30th VLDB Conference.
Leiden Ranking, 2015. CWTS Leiden Ranking 2015,
http://www.leidenranking.nl, The Netherlands.
Leiden University, 2015. http://www.leidenuniv.nl,
Leiden, The Netherlands.
Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D., 2012.
Citation-based bootstrapping for large-scale author
disambiguation. Journal of the Association for
Information Science and Technology, 63(5),
1030-1047.
Maletic, J. I., & Marcus, A., 2010. Data cleansing: A
prelude to knowledge discovery. In Data Mining and
Knowledge Discovery Handbook (pp. 19-36). Springer.
Morillo, F., Santabárbara, I., & Aparicio, J., 2013. The
automatic normalisation challenge: detailed addresses
identification. Scientometrics, 95(3), 953-966.
Patstat, 2015. EPO Worldwide Patent Statistical Database,
http://www.epo.org.
Song, Y., Huang, J., Councill, I., Li, J., & Giles, C., 2007.
Efficient topic-based unsupervised name
disambiguation. In Proceedings of the 7th
ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '07).
ACM, New York, NY, USA, 342-351.
Web of Science, 2015. Thomson Reuters, United States.
http://www.webofscience.com.