Authors: Emiel Caron 1 and Hennie Daniels 2
Affiliations: 1 Erasmus University Rotterdam, Netherlands; 2 Tilburg University and Erasmus University Rotterdam, Netherlands
Keyword(s): Large Scale Databases, Data Warehousing, Database Integration, Data Cleaning, Data Mining, Clustering.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Coupling and Integrating Heterogeneous Data Sources; Data Engineering; Data Mining; Data Warehouses and OLAP; Databases and Data Security; Databases and Information Systems Integration; Enterprise Information Systems; Large Scale Databases; Query Languages and Query Processing; Sensor Networks; Signal Processing; Soft Computing
Abstract:
This research describes a general method for automatically cleaning variants of organizational and business names in large databases, such as patent databases, bibliographic databases, databases in business information systems, or any other database containing organizational name variants. The method clusters name variants of organizations based on similarities in their associated metadata, such as postal code and email domain. It consists of two parts: a rule-based scoring system and a clustering system. The method is tested on the cleaning of research organizations in the Web of Science database for the purpose of bibliometric analysis and scientific performance evaluation. The clustering results are evaluated with precision and recall on a verified data set. The evaluation shows that the method performs well and is conservative: it values precision over recall, with on average 95% precision and 80% recall for clusters.
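The two-part structure described in the abstract (rule-based scoring followed by clustering) can be illustrated with a minimal Python sketch. The abstract does not state the actual rules, weights, threshold, or clustering algorithm, so everything here is an assumption made for illustration: the `RULE_WEIGHTS` values, the `THRESHOLD`, the helper names `pair_score` and `cluster`, the use of connected components as the clustering step, and the sample records themselves.

```python
from itertools import combinations

# Hypothetical rule weights; the abstract does not give the actual rules or scores.
RULE_WEIGHTS = {"postal_code": 2, "email_domain": 3}
THRESHOLD = 3  # assumed minimum score for linking two name variants


def pair_score(a, b):
    """Sum the weights of every metadata field on which two records agree."""
    return sum(
        weight
        for field, weight in RULE_WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def cluster(records):
    """Link pairs scoring at or above THRESHOLD and return connected components."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if pair_score(records[i], records[j]) >= THRESHOLD:
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec["name"])
    return sorted(sorted(g) for g in groups.values())


# Made-up records for illustration only; not taken from the Web of Science.
records = [
    {"name": "Example Univ", "postal_code": "1000 AA", "email_domain": "example-a.org"},
    {"name": "Example University", "postal_code": "1000 AA", "email_domain": "example-a.org"},
    {"name": "Univ. of Example", "postal_code": None, "email_domain": "example-a.org"},
    {"name": "Other Institute", "postal_code": "2000 BB", "email_domain": "example-b.org"},
]
clusters = cluster(records)
print(clusters)
```

In this sketch, raising `THRESHOLD` links fewer pairs, which mirrors the conservative trade-off the abstract reports: fewer false merges (higher precision) at the cost of more missed variants (lower recall).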