Yingying Wen, Kevin B. Korb, Ann E. Nicholson


Real data is often incomplete: some values are simply not present. Evaluating the relative performance of machine learners on incomplete data is therefore important. DataZapper is a tool for uncreating data: given a dataset containing joint samples over variables, DataZapper makes a specified percentage of observed values disappear, replacing each with an indication that the measurement failed. Because the causal mechanisms that produce failed measurements may depend in arbitrary ways upon the system under study, it is important to be able to produce incomplete datasets that allow for such arbitrary dependencies. DataZapper is the only tool that allows any kind of dependence, and any degree of dependence, in its generation of missing data. We illustrate its use in a machine learning experiment and offer it to the data mining and machine learning communities.
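As a rough illustration of the idea only, the sketch below removes a fixed percentage of values from one column, first uniformly at random and then with a probability that depends on another variable. It is not DataZapper's interface or algorithm: the function names `zap_uniform` and `zap_dependent`, the `MISSING` marker, the `weight` callback, and the weighted-key sampling scheme are all assumptions made for this example.

```python
import random

MISSING = None  # hypothetical marker standing in for "measurement failed"


def zap_uniform(rows, column, fraction, seed=0):
    """Blank out `fraction` of the values in `column`, chosen uniformly at random."""
    rng = random.Random(seed)
    out = [dict(r) for r in rows]  # copy, so the complete data is untouched
    k = round(fraction * len(out))
    for i in rng.sample(range(len(out)), k):
        out[i][column] = MISSING
    return out


def zap_dependent(rows, column, fraction, weight, seed=0):
    """Blank out `fraction` of `column`, favouring rows where weight(row) is large.

    `weight` must return a positive number. Rows are drawn without replacement
    using weighted random keys u ** (1 / w); the k largest keys are selected.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in rows]
    k = round(fraction * len(out))
    keys = [rng.random() ** (1.0 / weight(r)) for r in out]
    chosen = sorted(range(len(out)), key=keys.__getitem__, reverse=True)[:k]
    for i in chosen:
        out[i][column] = MISSING
    return out


if __name__ == "__main__":
    # Toy dataset: x is binary, y is the value whose measurement may "fail".
    data = [{"x": i % 2, "y": float(i)} for i in range(100)]
    uniform = zap_uniform(data, "y", 0.2)
    dependent = zap_dependent(data, "y", 0.2, lambda r: 9.0 if r["x"] == 1 else 1.0)
    for name, rows in (("uniform", uniform), ("dependent on x", dependent)):
        missing = [r for r in rows if r["y"] is MISSING]
        # Print how many values are missing and how many of those rows have x == 1.
        print(name, len(missing), sum(r["x"] for r in missing))
```

Both functions remove exactly the requested share of values; they differ only in which rows are chosen. Varying the `weight` callback changes the kind and strength of the dependence, which is the property the abstract argues a test-data generator needs.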



Paper Citation

in Harvard Style

Wen, Y., Korb, K. B. and Nicholson, A. E. (2009). DataZapper: Generating Incomplete Datasets. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8111-66-1, pages 69-76. DOI: 10.5220/0001660700690076
