Classification of Datasets with Missing Values: Two Level Approach

Ivan Bruha

Abstract

Datasets with missing attribute values are a common problem in pattern recognition (PR). PR algorithms should therefore include routines for processing these missing values. Several such routines exist for each PR paradigm. Quite a few experiments have revealed that each dataset has, more or less, its own 'favourite' routine for processing missing attribute values. In this paper, we use the machine learning algorithm CN4, a large extension of the well-known CN2, which contains six routines for processing missing attribute values. Our system runs these routines independently (at the base level); afterwards, a meta-combiner (at the second level) generates a meta-classifier that makes the overall decision about the class of input objects. This knowledge combination algorithm splits the training set into S subsets for training purposes. The parameter S (called 'foldness') is the crucial one in the process of meta-learning, and the paper focuses on its optimal value. The routines used here for processing missing attribute values therefore serve only as vehicles (in the role of base classifiers); in fact, any PR algorithm could be used for the base classifiers. In other words, the paper does not compare various techniques for processing missing attributes; its target is the parameter S.
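The two-level scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the CN4 system or the paper's combiner: the two toy base learners (`learn_missing_as_value`, `learn_ignore_missing`), the table-lookup meta-classifier, and the round-robin split are all hypothetical stand-ins, and the sketch assumes that the S subsets are used in the usual combiner fashion, where each subset is classified by base models trained on the remaining S-1 subsets.

```python
from collections import Counter, defaultdict

def split_into_folds(examples, s):
    """Partition the examples into s disjoint subsets (round-robin)."""
    return [examples[i::s] for i in range(s)]

def _value_table_model(known, default):
    """Classifier mapping an attribute value to its most frequent class."""
    table = defaultdict(Counter)
    for value, label in known:
        table[value][label] += 1
    def model(value):
        if value in table:
            return table[value].most_common(1)[0][0]
        return default
    return model

def learn_missing_as_value(examples):
    """Hypothetical routine: treat a missing value (None) as an ordinary value."""
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    return _value_table_model(examples, default)

def learn_ignore_missing(examples):
    """Hypothetical routine: skip examples whose value is missing (None)."""
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    known = [(v, label) for v, label in examples if v is not None]
    return _value_table_model(known, default)

def build_meta_training_set(examples, s, base_learners):
    """Pair base-level predictions on each held-out subset with the true class."""
    if not 2 <= s <= len(examples):
        raise ValueError("s must satisfy 2 <= s <= number of examples")
    folds = split_into_folds(examples, s)
    meta_rows = []
    for i, held_out in enumerate(folds):
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        models = [learn(rest) for learn in base_learners]
        for value, label in held_out:
            meta_rows.append((tuple(m(value) for m in models), label))
    return meta_rows

def train_meta_classifier(meta_rows):
    """Map each tuple of base predictions to its most frequent true class."""
    table = defaultdict(Counter)
    for predictions, label in meta_rows:
        table[predictions][label] += 1
    def classify(predictions):
        if predictions in table:
            return table[predictions].most_common(1)[0][0]
        # Unseen combination: fall back to a plain vote of the base models.
        return Counter(predictions).most_common(1)[0][0]
    return classify

def train_two_level(examples, s, base_learners):
    """Train the meta level on s-fold predictions, the base level on all data."""
    meta = train_meta_classifier(build_meta_training_set(examples, s, base_learners))
    models = [learn(examples) for learn in base_learners]
    return lambda value: meta(tuple(m(value) for m in models))
```

Each example here is a pair `(value, label)` with a single attribute whose missing value is written as `None`. Calling `train_two_level(data, s, [learn_missing_as_value, learn_ignore_missing])` returns a classifier, so the effect of the foldness parameter can be explored by repeating the call with different values of `s` and comparing accuracy on held-back data.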



Paper Citation


in Harvard Style

Bruha I. (2010). Classification of Datasets with Missing Values: Two Level Approach. In Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010) ISBN 978-989-8425-14-0, pages 90-98. DOI: 10.5220/0003017800900098


in Bibtex Style

@conference{pris10,
author={Ivan Bruha},
title={Classification of Datasets with Missing Values: Two Level Approach},
booktitle={Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)},
year={2010},
pages={90-98},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003017800900098},
isbn={978-989-8425-14-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)
TI - Classification of Datasets with Missing Values: Two Level Approach
SN - 978-989-8425-14-0
AU - Bruha I.
PY - 2010
SP - 90
EP - 98
DO - 10.5220/0003017800900098