Classification of Datasets with Missing Values: Two Level Approach
Ivan Bruha
2010
Abstract
One of the problems of pattern recognition (PR) is datasets with missing attribute values. PR algorithms should therefore include routines for processing these missing values. Several such routines exist for each PR paradigm. Quite a few experiments have revealed that each dataset has, more or less, its own 'favourite' routine for processing missing attribute values. In this paper, we use the machine learning algorithm CN4, a large extension of the well-known CN2, which contains six routines for processing missing attribute values. Our system runs these routines independently (at the base level); afterwards, a meta-combiner (at the second level) generates a meta-classifier that makes the overall decision about the class of input objects. This knowledge combination algorithm splits the training set into S subsets for training purposes. The parameter S (called ‘foldness’) is the crucial one in the process of meta-learning, and the paper focuses on its optimal value. The routines used here for processing missing attribute values are therefore only vehicles (serving as the base classifiers); in fact, any PR algorithm could be used for the base classifiers. In other words, the paper does not compare various techniques for processing missing attribute values; its target is the parameter S.
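The two-level scheme described in the abstract can be illustrated with a minimal sketch. It is not the CN4 system: the three base learners below are hypothetical stand-ins (simple per-attribute voting tables with the strategies "ignore", "most_common" and "as_value"), the meta-classifier is assumed to be a plain lookup table over the base predictions, and the S subsets are assumed to be formed by round-robin assignment. All function and variable names are illustrative only.

```python
from collections import Counter, defaultdict

MISSING = None  # assumed marker for an unknown attribute value


def majority(labels):
    """Return the most frequent item of a non-empty sequence."""
    return Counter(labels).most_common(1)[0][0]


def fit_base(X, y, strategy):
    """Train a per-attribute voting classifier (hypothetical stand-in).

    strategy: "ignore"      - skip missing values,
              "most_common" - replace them by the attribute's most common value,
              "as_value"    - treat "missing" as an ordinary attribute value.
    """
    n_attrs = len(X[0])
    fill = []
    for j in range(n_attrs):
        known = [row[j] for row in X if row[j] is not MISSING]
        fill.append(majority(known) if known else MISSING)

    def prepare(row):
        if strategy == "most_common":
            return [fill[j] if v is MISSING else v for j, v in enumerate(row)]
        return list(row)

    tables = [defaultdict(list) for _ in range(n_attrs)]
    for row, label in zip(X, y):
        for j, v in enumerate(prepare(row)):
            if v is MISSING and strategy == "ignore":
                continue
            tables[j][v].append(label)
    rules = [{v: majority(ls) for v, ls in t.items()} for t in tables]
    default = majority(y)

    def predict(row):
        votes = [rules[j][v] for j, v in enumerate(prepare(row)) if v in rules[j]]
        return majority(votes) if votes else default

    return predict


def fit_two_level(X, y, strategies, S):
    """Base level: one classifier per strategy. Second level: a meta-classifier
    trained on base predictions collected over S held-out subsets."""
    if not 2 <= S <= len(X):
        raise ValueError("S must satisfy 2 <= S <= number of training objects")
    folds = [list(range(i, len(X), S)) for i in range(S)]  # round-robin split
    groups = defaultdict(list)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        bases = [fit_base([X[i] for i in train], [y[i] for i in train], s)
                 for s in strategies]
        for i in held_out:
            groups[tuple(b(X[i]) for b in bases)].append(y[i])
    meta_table = {preds: majority(ls) for preds, ls in groups.items()}
    final_bases = [fit_base(X, y, s) for s in strategies]

    def predict(row):
        preds = tuple(b(row) for b in final_bases)
        # Unseen combination of base predictions: fall back to a plain vote.
        return meta_table.get(preds, majority(preds))

    return predict


# Toy data, invented for illustration only.
X = [["sunny", "hot"], ["sunny", None], ["rain", "cool"], [None, "cool"],
     ["sunny", "hot"], ["rain", None], ["rain", "cool"], ["sunny", "hot"]]
y = ["no", "no", "yes", "yes", "no", "yes", "yes", "no"]
model = fit_two_level(X, y, ["ignore", "most_common", "as_value"], S=4)
print(model(["sunny", None]))
```

In this sketch, S controls how much data each base classifier sees while the meta-level training examples are being produced: a larger S leaves more objects for base training in every round but requires more training runs. This is the kind of trade-off that makes the choice of S worth studying.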
Paper Citation
in Harvard Style
Bruha I. (2010). Classification of Datasets with Missing Values: Two Level Approach. In Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010) ISBN 978-989-8425-14-0, pages 90-98. DOI: 10.5220/0003017800900098
in Bibtex Style
@conference{pris10,
author={Ivan Bruha},
title={Classification of Datasets with Missing Values: Two Level Approach},
booktitle={Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)},
year={2010},
pages={90-98},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003017800900098},
isbn={978-989-8425-14-0},
}