Automatic Synthesis of Data Cleansing Activities
Mario Mezzanzanica, Roberto Boselli, Mirko Cesarini, Fabio Mercorio
2013
Abstract
Data cleansing is growing in importance among both public and private organisations, mainly because of the large amount of data exploited to support decision-making processes. This paper shows how model-based verification algorithms (namely, model checking) can contribute to addressing data cleansing issues; furthermore, it introduces a new benchmark problem focusing on labour market dynamics. The consistent evolution of the data is checked using a model defined on the basis of domain knowledge. We then formally introduce the concept of a universal cleanser, i.e. an object which summarises the set of all cleansing actions for each feasible data inconsistency (according to a given consistency model), and we provide an algorithm which synthesises it. The universal cleanser can be seen as a repository of corrective interventions useful for developing cleansing routines. We applied our approach to a dataset derived from Italian labour market data, and we make the whole dataset and the outcomes publicly available to the community, so that the results we present can be shared and compared with other techniques.
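The idea of a universal cleanser can be illustrated with a minimal sketch, which is not the algorithm of the paper. It assumes a hypothetical two-state consistency model of a career (the state names, the event names and the functions `reachable_states`, `synthesise_cleanser` and `first_inconsistency` are all illustrative assumptions), explores the reachable states, and records for each inconsistent (state, event) pair the single events that would restore consistency if inserted just before it.

```python
from collections import deque

# Hypothetical toy consistency model: a deterministic finite-state system
# over career events. States and events are illustrative assumptions,
# not the model used in the paper.
TRANSITIONS = {
    ("unemployed", "start"): "employed",
    ("employed", "cease"): "unemployed",
}
INITIAL_STATE = "unemployed"
EVENTS = ("start", "cease")


def reachable_states(initial, transitions):
    """Breadth-first exploration of the states reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for (source, _event), target in transitions.items():
            if source == state and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def synthesise_cleanser(initial, transitions, events):
    """Map each (state, inconsistent event) pair to the single events that,
    inserted just before the inconsistent event, make it consistent."""
    cleanser = {}
    for state in sorted(reachable_states(initial, transitions)):
        for event in events:
            if (state, event) in transitions:
                continue  # the event is consistent in this state
            fixes = []
            for candidate in events:
                mid = transitions.get((state, candidate))
                if mid is not None and (mid, event) in transitions:
                    fixes.append(candidate)
            cleanser[(state, event)] = fixes
    return cleanser


def first_inconsistency(sequence, initial, transitions):
    """Return (index, state, event) of the first inconsistent event, or None."""
    state = initial
    for index, event in enumerate(sequence):
        nxt = transitions.get((state, event))
        if nxt is None:
            return index, state, event
        state = nxt
    return None


CLEANSER = synthesise_cleanser(INITIAL_STATE, TRANSITIONS, EVENTS)
```

In this toy setting, a cleansing routine would call `first_inconsistency` on a career sequence and look up the returned (state, event) pair in `CLEANSER` to obtain the candidate corrections. A realistic model would have many more states and would generally admit several alternative corrections per inconsistency.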
Paper Citation
in Harvard Style
Mezzanzanica M., Boselli R., Cesarini M. and Mercorio F. (2013). Automatic Synthesis of Data Cleansing Activities. In Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA, ISBN 978-989-8565-67-9, pages 138-149. DOI: 10.5220/0004491101380149
in Bibtex Style
@conference{data13,
author={Mario Mezzanzanica and Roberto Boselli and Mirko Cesarini and Fabio Mercorio},
title={Automatic Synthesis of Data Cleansing Activities},
booktitle={Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA},
year={2013},
pages={138-149},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004491101380149},
isbn={978-989-8565-67-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA
TI - Automatic Synthesis of Data Cleansing Activities
SN - 978-989-8565-67-9
AU - Mezzanzanica M.
AU - Boselli R.
AU - Cesarini M.
AU - Mercorio F.
PY - 2013
SP - 138
EP - 149
DO - 10.5220/0004491101380149