Authors:
Ricardo Almeida 1; Paulo Maio 2; Paulo Oliveira 2 and João Barroso 3
Affiliations:
1 ISEP-IPP and School of Engineering of Polytechnic of Porto, Portugal
2 ISEP-IPP, School of Engineering of Polytechnic of Porto, and GECAD – Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development, Portugal
3 UTAD – University of Trás-os-Montes and Alto Douro, Portugal
Keyword(s):
Data Quality, Data Cleaning, Knowledge Reuse, Vocabulary, Ontologies.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Expert Systems; Health Information Systems; Knowledge Engineering and Ontology Development; Knowledge Representation; Knowledge-Based Systems; Symbolic Systems
Abstract:
Organizations' demand to integrate several heterogeneous data sources, together with an ever-increasing volume of data, is revealing the presence of data quality problems. Currently, most data cleaning approaches (for detecting and correcting data quality problems) are tailored to data sources that have the same schema and share the same data model (e.g., the relational model). Moreover, these approaches depend heavily on a domain expert to specify the data cleaning operations. This paper extends a previously proposed data cleaning methodology that reuses cleaning knowledge specified for other data sources. The methodology is further detailed and refined by specifying the requirements that a data cleaning operations vocabulary must satisfy. Ontologies in RDF/OWL are proposed as the data model for an abstract representation of the data schemas, regardless of the underlying data model (e.g., relational or graph). Existing approaches, methods and techniques that support the implementation of the proposed methodology in general, and of the data cleaning operations vocabulary in particular, are also presented and discussed.