Table 1: Category probabilities of the five variables: true (synthetic) values and parameters estimated by the proposed EM variant.
                        Synthetic data                  Estimated parameters
                   Component 1   Component 2        Component 1   Component 2
No. of samples         400           500
Mixing prob.       α1 = 0.4444   α2 = 0.5556        α̂1 = 0.4444   α̂2 = 0.5556
Variable 1             0.7           0.1               0.6953        0.0988
                       0.2           0.3               0.2013        0.3024
                       0.1           0.6               0.1034        0.5988
Variable 2             0.2           0.7               0.2007        0.6936
                       0.8           0.3               0.7994        0.3064
Variable 3             0.4           0.6               0.4029        0.5996
                       0.6           0.4               0.5971        0.4004
Variable 4             0.5           0.49                  0.4946 (common)
                       0.2           0.22                  0.2049 (common)
                       0.3           0.29                  0.3005 (common)
Variable 5             0.3           0.31                  0.3119 (common)
                       0.3           0.30                  0.2999 (common)
                       0.4           0.39                  0.3882 (common)
(For the irrelevant variables 4 and 5, a single common distribution, shared by
the two components, is estimated.)
features are also distributed according to a multinomial distribution. The
numerical experiments refer to 8 simulated data sets. With the proposed EM
variant, the estimated probabilities of the categorical features almost exactly
match the true (simulated) probabilities. In Table 1 we present the results for
one data set with 900 observations and 5 categorical variables (features). The
first three variables are relevant and the last two are irrelevant, with
“similar” distributions between the components. Variables 1, 4, and 5 have
three categories each, whereas variables 2 and 3 have two categories.
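To make the experimental setup concrete, the following is a minimal sketch (not the authors' code) of how a data set with the characteristics of Table 1 could be generated: 900 observations from two components with 400 and 500 samples, three relevant categorical variables, and two irrelevant variables whose distributions are similar across the components. The NumPy-based snippet and its variable names are ours, introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# True category probabilities per variable: one row per mixture component
# (values taken from the "synthetic data" columns of Table 1).
variable_probs = [
    np.array([[0.7, 0.2, 0.1],          # Variable 1 (relevant), component 1
              [0.1, 0.3, 0.6]]),        # Variable 1, component 2
    np.array([[0.2, 0.8],               # Variable 2 (relevant)
              [0.7, 0.3]]),
    np.array([[0.4, 0.6],               # Variable 3 (relevant)
              [0.6, 0.4]]),
    np.array([[0.50, 0.20, 0.30],       # Variable 4 (irrelevant, "similar")
              [0.49, 0.22, 0.29]]),
    np.array([[0.30, 0.30, 0.40],       # Variable 5 (irrelevant, "similar")
              [0.31, 0.30, 0.39]]),
]

# 400 observations from component 1 and 500 from component 2 (900 in total).
labels = np.repeat([0, 1], [400, 500])

# Draw each categorical variable independently, given the component label.
data = np.column_stack([
    np.array([rng.choice(p.shape[1], p=p[k]) for k in labels])
    for p in variable_probs
])
print(data.shape)  # (900, 5): one categorical variable per column
```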
4 CONCLUSIONS AND FUTURE
RESEARCH
In this work, we describe a feature selection method for clustering categorical
data. Our work is based on the commonly used framework which assumes that the
data come from a multinomial mixture model (we assume that the number of
components of the mixture is known). We adopt a specific definition of feature
irrelevancy, based on the work of (Law et al., 2004), which we believe is more
adequate than alternative formulations (Talavera, 2005), which tend to discard
uncorrelated features. We use a new variant of the EM algorithm, together with
an MML (minimum message length) criterion, to estimate the parameters of the
mixture and to select the relevant variables.
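For reference, the feature-saliency model adopted from (Law et al., 2004) can be written, for categorical data, as the following sketch. The symbols follow that reference: α_j are the mixing probabilities, ρ_l is the saliency of feature l, θ_{jl} are the component-specific multinomial parameters of feature l, and λ_l are the parameters of a common distribution; the exact formulation used in this paper may differ in details.

$$
p(\mathbf{y} \mid \boldsymbol{\theta}) \;=\; \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{L} \Big[ \rho_l\, p(y_l \mid \theta_{jl}) + (1-\rho_l)\, q(y_l \mid \lambda_l) \Big].
$$

A feature l with saliency ρ_l = 0 is irrelevant: it is modeled by the single common distribution q(· | λ_l) for all components, which is why variables 4 and 5 in Table 1 receive one common estimated distribution.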
The reported results clearly illustrate the ability of the proposed approach to
recover, from the data, the ground truth concerning the features' saliency. In
future work, we will address the simultaneous selection of the features and of
the number of components, following a similar approach, and we will illustrate
the method on real data sets.
REFERENCES
Constantinopoulos, C., Titsias, M. K., and Likas, A. (2006).
Bayesian feature and model selection for Gaussian
mixture models. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 28:1013–1018.
Dempster, A., Laird, N., and Rubin, D. (1977). Maxi-
mum likelihood estimation from incomplete data via
the EM algorithm. Journal of the Royal Statistical So-
ciety, Series B, 39:1–38.
Dy, J. and Brodley, C. (2004). Feature selection for unsu-
pervised learning. Journal of Machine Learning Re-
search, 5:845–889.
Figueiredo, M. and Jain, A. (2002). Unsupervised learning
of finite mixture models. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 24:381–396.
Law, M., Figueiredo, M., and Jain, A. (2004). Simultaneous
feature selection and clustering using mixture models.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26:1154–1166.
Steinley, D. and Brusco, M. (2008). Selection of variables
in cluster analysis: An empirical comparison of eight
procedures. Psychometrika, 73:125–144.
Talavera, L. (2005). An evaluation of filter and wrapper
methods for feature selection in categorical clustering.
Advances in Intelligent Data Analysis VI, 3646:440–
451.
Wallace, C. and Boulton, D. (1968). An information
measure for classification. The Computer Journal,
11:195–209.