Improving Data Cleansing Accuracy - A Model-based Approach

Mario Mezzanzanica, Roberto Boselli, Mirko Cesarini, Fabio Mercorio

Abstract

Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and then value) from data. Information systems generate large amounts of data useful for studying the dynamics of subjects' behaviours or phenomena over time, which makes data quality crucial for guaranteeing the believability of the overall knowledge discovery process. In such a scenario, data cleansing techniques, i.e., automatic methods to cleanse a dirty dataset, are paramount. However, when multiple cleansing alternatives are available, a policy is required for choosing between them. Policy design still relies on the experience of domain experts, which makes the automatic identification of accurate policies a significant issue. This paper extends the Universal Cleaning Process, enabling the automatic generation of an accurate cleansing policy derived from the dataset to be analysed. The proposed approach has been implemented and tested on an online benchmark dataset, a real-world instance of the labour market domain. Our preliminary results suggest that the approach contributes towards the generation of data-driven policies, significantly reducing the intervention of domain experts in policy specification. Finally, the generated results have been made publicly available for download.

Paper Citation


in Harvard Style

Mezzanzanica M., Boselli R., Cesarini M. and Mercorio F. (2014). Improving Data Cleansing Accuracy - A Model-based Approach. In Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-035-2, pages 189-201. DOI: 10.5220/0005004901890201


in Bibtex Style

@conference{data14,
author={Mario Mezzanzanica and Roberto Boselli and Mirko Cesarini and Fabio Mercorio},
title={Improving Data Cleansing Accuracy - A Model-based Approach},
booktitle={Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2014},
pages={189-201},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005004901890201},
isbn={978-989-758-035-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - Improving Data Cleansing Accuracy - A Model-based Approach
SN - 978-989-758-035-2
AU - Mezzanzanica M.
AU - Boselli R.
AU - Cesarini M.
AU - Mercorio F.
PY - 2014
SP - 189
EP - 201
DO - 10.5220/0005004901890201