Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context

Antonella Zanzi, Alberto Trombetta

2013

Abstract

In this work we present the rule-based approach used to evaluate the quality of scientific datasets in a policy support context. The case study uses real datasets from a context where low data quality limits the accuracy of the analysis results and, consequently, the significance of the policy advice provided. The solution consists of identifying types of constraints that are useful as data quality rules and developing a software tool that evaluates a dataset against a set of rules expressed in XML. As rule types we selected several kinds of data constraints and dependencies already proposed in the data quality literature, and we also experimented with order dependencies and existence constraints. The case study was used to develop and test the solution, which is nevertheless generally applicable to other contexts.
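
As a rough illustration of the kind of check the abstract describes, the following Python sketch evaluates a small dataset against rules written in XML. It is not the authors' tool or rule format: the element and attribute names (`rules`, `existence`, `order`, `if`, `then`, `lhs`, `rhs`), the function names, and the sample rows are all assumptions made up for this example. It covers only two rule types, an existence constraint (if one attribute has a value, another must too) and an order dependency (ordering rows by one attribute must also order them by another).

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: tag and attribute names are invented for this sketch.
RULES_XML = """
<rules>
  <existence id="r1" if="gear" then="mesh_size"/>
  <order id="r2" lhs="length" rhs="weight"/>
</rules>
"""

# Invented sample rows; None stands for a missing value.
ROWS = [
    {"gear": "trawl", "mesh_size": 80, "length": 10, "weight": 5},
    {"gear": "trawl", "mesh_size": None, "length": 20, "weight": 9},
    {"gear": None, "mesh_size": None, "length": 30, "weight": 7},
]


def check_existence(rows, if_attr, then_attr):
    """Return indexes of rows where if_attr is present but then_attr is missing."""
    return [i for i, row in enumerate(rows)
            if row.get(if_attr) is not None and row.get(then_attr) is None]


def check_order(rows, lhs, rhs):
    """Return adjacent row-index pairs that break 'lhs orders rhs'.

    Rows are sorted by (lhs, rhs); a pair violates the rule when rhs
    decreases, or when lhs is equal but rhs differs.
    """
    complete = [(i, row) for i, row in enumerate(rows)
                if row.get(lhs) is not None and row.get(rhs) is not None]
    ordered = sorted(complete, key=lambda p: (p[1][lhs], p[1][rhs]))
    violations = []
    for (i, a), (j, b) in zip(ordered, ordered[1:]):
        if b[rhs] < a[rhs] or (a[lhs] == b[lhs] and a[rhs] != b[rhs]):
            violations.append((i, j))
    return violations


def evaluate(rules_xml, rows):
    """Apply every rule in the XML document and map rule id to its violations."""
    report = {}
    for rule in ET.fromstring(rules_xml):
        if rule.tag == "existence":
            report[rule.get("id")] = check_existence(
                rows, rule.get("if"), rule.get("then"))
        elif rule.tag == "order":
            report[rule.get("id")] = check_order(
                rows, rule.get("lhs"), rule.get("rhs"))
    return report


if __name__ == "__main__":
    for rule_id, found in evaluate(RULES_XML, ROWS).items():
        print(rule_id, found)
```

Keeping the rules in a separate XML document means the set of checks can change without touching the evaluation code, which is the general motivation for a rule-based design; a real tool would also need rule validation and support for further dependency types.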



Paper Citation


in Harvard Style

Zanzi A. and Trombetta A. (2013). Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context. In Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA, ISBN 978-989-8565-67-9, pages 167-174. DOI: 10.5220/0004476401670174


in Bibtex Style

@conference{data13,
author={Antonella Zanzi and Alberto Trombetta},
title={Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context},
booktitle={Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA},
year={2013},
pages={167-174},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004476401670174},
isbn={978-989-8565-67-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA
TI - Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context
SN - 978-989-8565-67-9
AU - Zanzi A.
AU - Trombetta A.
PY - 2013
SP - 167
EP - 174
DO - 10.5220/0004476401670174