Data Quality Sensitivity Analysis on Aggregate Indicators

Mario Mezzanzanica, Roberto Boselli, Mirko Cesarini, Fabio Mercorio

Abstract

Decision-making activities place strict requirements on data and information quality. The quality of data sources is frequently very poor, so a cleansing process is required before such data can be used for decision making. When alternative (and more trusted) data sources are not available, data can be cleansed only by using business rules derived from domain knowledge. Business rules focus on fixing inconsistencies, but an inconsistency can often be cleansed in different ways (i.e., the correction may be non-deterministic), so the choice of how to cleanse the data can affect, even strongly, the aggregate values computed for decision-making purposes. The paper proposes a methodology that exploits Finite State Systems to estimate quantitatively how computed variables and indicators might be affected by the uncertainty related to low data quality, independently of the data cleansing methodology used. The methodology has been implemented and tested on a real-case scenario, providing effective results.
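To make the idea of non-deterministic correction concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical two-state system (unemployed, employed) driven by "start" and "end" events, and a hypothetical indicator, total days worked. When an event violates the state system, a missing event is assumed to lie somewhere between the previous event and the current one, and the earliest and latest possible placements give a lower and an upper bound on the indicator. The names `TRANSITIONS` and `worked_days_bounds` are illustrative assumptions.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (day, kind), kind is "start" or "end"

# Hypothetical finite state system: state -> {allowed event -> next state}
TRANSITIONS = {
    "unemployed": {"start": "employed"},
    "employed": {"end": "unemployed"},
}


def worked_days_bounds(events: List[Event]) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the total days worked.

    Consistent events add the same amount to both bounds. An event that is
    not allowed in the current state signals a missing event whose date is
    unknown, so only the upper bound grows over the uncertain interval.
    """
    state = "unemployed"
    low = high = 0
    last_day = None  # day of the previously processed event
    for day, kind in sorted(events):
        if kind in TRANSITIONS[state]:
            if kind == "end":
                # Fully documented working period.
                low += day - last_day
                high += day - last_day
            state = TRANSITIONS[state][kind]
        else:
            if last_day is None:
                raise ValueError("cannot bound a gap before the first event")
            # Missing "end" (two starts in a row) or missing "start"
            # (two ends in a row): the person worked between 0 and
            # (day - last_day) days in this interval.
            high += day - last_day
            # The state is unchanged: the missing event and the current
            # one together bring the system back to the same state.
        last_day = day
    return low, high


# Two consecutive "start" events: the "end" between day 20 and 50 is missing.
career = [(0, "start"), (10, "end"), (20, "start"), (50, "start"), (60, "end")]
print(worked_days_bounds(career))
```

The width of the resulting interval is one simple way to express how sensitive an aggregate indicator is to the choice of correction; a consistent sequence yields identical lower and upper bounds.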



Paper Citation


in Harvard Style

Mezzanzanica M., Boselli R., Cesarini M. and Mercorio F. (2012). Data Quality Sensitivity Analysis on Aggregate Indicators. In Proceedings of the International Conference on Data Technologies and Applications - Volume 1: DATA, ISBN 978-989-8565-18-1, pages 97-108. DOI: 10.5220/0004040300970108


in Bibtex Style

@conference{data12,
author={Mario Mezzanzanica and Roberto Boselli and Mirko Cesarini and Fabio Mercorio},
title={Data Quality Sensitivity Analysis on Aggregate Indicators},
booktitle={Proceedings of the International Conference on Data Technologies and Applications - Volume 1: DATA},
year={2012},
pages={97-108},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004040300970108},
isbn={978-989-8565-18-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Data Technologies and Applications - Volume 1: DATA
TI - Data Quality Sensitivity Analysis on Aggregate Indicators
SN - 978-989-8565-18-1
AU - Mezzanzanica M.
AU - Boselli R.
AU - Cesarini M.
AU - Mercorio F.
PY - 2012
SP - 97
EP - 108
DO - 10.5220/0004040300970108