Authors:
Mario Mezzanzanica 1; Roberto Boselli 1; Mirko Cesarini 1 and Fabio Mercorio 2
Affiliations:
1 University of Milan-Bicocca, Italy; 2 University of Milano-Bicocca, Italy
Keyword(s):
Data and Information Quality, Data Cleansing, Data Accuracy, Weakly-structured Data.
Related Ontology Subjects/Areas/Topics:
Data Engineering; Data Management and Quality; Data Management for Analytics; Information Quality
Abstract:
Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and then value) from data. Information Systems generate large amounts of data useful for studying the dynamics of subjects’ behaviours or phenomena over time, making the quality of data a crucial aspect for guaranteeing the believability of the overall knowledge discovery process. In such a scenario, data cleansing techniques, i.e., automatic methods to cleanse a dirty dataset, are paramount. However, when multiple cleansing alternatives are available, a policy is required for choosing between them. The policy design task still relies on the experience of domain experts, and this makes the automatic identification of accurate policies a significant issue. This paper extends the Universal Cleaning Process, enabling the automatic generation of an accurate cleansing policy derived from the dataset to be analysed. The proposed approach has been implemented and tested on an on-line benchmark dataset, a real-world instance of the Labour Market Domain. Our preliminary results show that our approach would represent a contribution towards the generation of data-driven policies, significantly reducing the intervention of domain experts in policy specification. Finally, the generated results have been made publicly available for download.