SELECTING CATEGORICAL FEATURES IN MODEL-BASED CLUSTERING

Cláudia M. V. Silvestre, Margarida M. G. Cardoso, Mario A. T. Figueiredo

2009

Abstract

There has been relatively little research on feature/variable selection in unsupervised clustering. In fact, feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. The methods proposed for addressing this problem are mostly focused on numerical data. In this work, we propose an approach to selecting categorical features in clustering. We assume that the data come from a finite mixture of multinomial distributions and implement a new expectation-maximization (EM) algorithm that estimates the parameters of the model and selects the relevant variables. The results obtained on synthetic data clearly illustrate the capability of the proposed approach to select the relevant features.
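As a rough illustration of the model family named in the abstract, the following sketch runs plain EM on a finite mixture in which each feature is categorical and the features are independent within a component. It is not the algorithm proposed in the paper: the feature-selection step is omitted entirely, and the function name `em_categorical_mixture`, its parameters, the random initialization, and the small probability floor are all assumptions made for this example.

```python
import math
import random


def em_categorical_mixture(data, n_components, n_categories, n_iter=50, seed=0):
    """Plain EM for a mixture of independent categorical features.

    data: list of rows, each a list of integer category codes in
    range(n_categories). Returns (weights, theta, resp), where
    theta[k][j][c] estimates P(feature j = c | component k).
    Hypothetical helper for illustration; no feature selection is done.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    floor = 1e-10  # assumption: keeps probabilities away from exact zero

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    weights = [1.0 / n_components] * n_components
    # Random start so that the components are not identical.
    theta = [[normalize([rng.random() + 0.5 for _ in range(n_categories)])
              for _ in range(d)] for _ in range(n_components)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row,
        # computed in log space and shifted by the maximum for stability.
        resp = []
        for row in data:
            log_joint = [math.log(weights[k] + floor)
                         + sum(math.log(theta[k][j][row[j]] + floor)
                               for j in range(d))
                         for k in range(n_components)]
            peak = max(log_joint)
            resp.append(normalize([math.exp(v - peak) for v in log_joint]))
        # M-step: re-estimate mixing weights and category probabilities
        # from responsibility-weighted counts.
        weights = [sum(r[k] for r in resp) / n for k in range(n_components)]
        for k in range(n_components):
            for j in range(d):
                counts = [floor] * n_categories
                for r, row in zip(resp, data):
                    counts[row[j]] += r[k]
                theta[k][j] = normalize(counts)
    return weights, theta, resp


# Toy data: the first two features move together, the third does not.
rows = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
weights, theta, resp = em_categorical_mixture(rows, n_components=2, n_categories=2)
```

In a sketch like this, a feature whose estimated category probabilities are nearly the same in every component contributes little to separating the clusters, which is the intuition behind treating such a feature as irrelevant.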



Paper Citation


in Harvard Style

Silvestre, C. M. V., Cardoso, M. M. G. and Figueiredo, M. A. T. (2009). SELECTING CATEGORICAL FEATURES IN MODEL-BASED CLUSTERING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 303-306. DOI: 10.5220/0002303203030306


in Bibtex Style

@conference{kdir09,
author={Cláudia M. V. Silvestre and Margarida M. G. Cardoso and Mario A. T. Figueiredo},
title={SELECTING CATEGORICAL FEATURES IN MODEL-BASED CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={303-306},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002303203030306},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - SELECTING CATEGORICAL FEATURES IN MODEL-BASED CLUSTERING
SN - 978-989-674-011-5
AU - Silvestre, Cláudia M. V.
AU - Cardoso, Margarida M. G.
AU - Figueiredo, Mario A. T.
PY - 2009
SP - 303
EP - 306
DO - 10.5220/0002303203030306