AN HYBRID APPROACH TO FEATURE SELECTION FOR MIXED
CATEGORICAL AND CONTINUOUS DATA
Gauthier Doquire and Michel Verleysen
ICTEAM Institute, Machine Learning Group, Université catholique de Louvain
pl. du Levant 3, 1348 Louvain-la-Neuve, Belgium
Keywords:
Feature selection, Categorical features, Continuous features, Mutual information.
Abstract:
This paper proposes an algorithm for feature selection in the case of mixed data. It consists in ranking inde-
pendently the categorical and the continuous features before recombining them according to the accuracy of
a classifier. The popular mutual information criterion is used in both ranking procedures. The proposed algo-
rithm thus avoids the use of any similarity measure between samples described by continuous and categorical
attributes, which can be ill-suited to many real-world problems. It is able to effectively detect the most useful
features of each type, as experimentally demonstrated on four real-world data sets.
1 INTRODUCTION
Feature selection is a key problem in many machine
learning, pattern recognition or data mining applica-
tions. Indeed, the means to acquire and store data grow
every day. A lot of features are thus typically
gathered for a specific problem, while many of them
can be either redundant or irrelevant. These useless
features often tend to decrease the performance of
the learning (classification or regression) algorithms
(Guyon and Elisseeff, 2003) and to slow down the whole
learning process. Moreover, reducing the number of
attributes leads to a better interpretability of the prob-
lem and of the models, which is of crucial importance
in many industrial and medical applications. Feature
selection thus plays a major role both from a learning
and from an application point of view.
Due to the importance of the problem, many fea-
ture selection algorithms have been proposed in the
past few years. However, the great majority of them
are designed to work only with continuous or cate-
gorical features and are thus not well suited to handle
data sets with both types of features, while mixed data
are encountered in many real-world situations. To il-
lustrate this, two examples are given. First, the results
of medical surveys can include continuous attributes
such as the height or the blood pressure of a patient, together
with categorical ones such as the sex or the presence or
absence of a symptom. In another field, socio-economic
data can contain discrete variables about individuals
such as their kind of job or the city they come from,
as well as continuous ones like their income.
Algorithms dealing with continuous and discrete
attributes are thus needed. Two obvious ways to han-
dle problems with mixed attributes are turning the
problem into a categorical or a continuous one. Un-
fortunately, both approaches have strong drawbacks.
The first idea would consist in coding the categor-
ical attributes into discrete numerical values. It would
then be possible to compute distances between ob-
servations as if all features were continuous. How-
ever, this approach is not likely to work well. In-
deed, permuting the code for two categorical values
could lead to different values of distance. To cir-
cumvent this problem, Bar-Hen and Daudin (1995)
proposed to use a generalized Mahalanobis distance,
while Kononenko (1994) employs the Euclidean dis-
tance for continuous features and the Hamming dis-
tance for categorical ones. The second idea is to dis-
cretize continuous features before running an algo-
rithm designed for discrete data (Hall, 2000). Even
if appealing, this approach may lead to a loss of in-
formation and makes the feature selection efficiency
extremely dependent on the discretization technique.
Recently, Tang and Mao proposed a method based
on the error probability (Tang and Mao, 2007) while
Hu et al. reported very satisfactory results using rough
set models generalized to the mixed case (Hu et al.,
2008). In this last paper, the authors base their work
on neighborhood relationships between mixed sam-
ples, defined in the following way. First, to be consid-
ered as neighbors, two samples must have the same
values for all their discrete attributes. Then, depend-
ing on the approach chosen and according to the con-
tinuous features, the Euclidean distance between the
samples has to be below a fixed threshold or one of
the samples has to belong to the k nearest neighbors of
the other. The method thus makes a strong hypoth-
esis about the notion of proximity between samples,
which can be totally inconsistent with some problems
as will be illustrated later in this work.
In contrast, the approach proposed in this paper
does not consider any notion of relationship between
mixed samples. Instead, the objective is to correctly
detect the most useful features of each kind and to
combine them to optimize the performance of predic-
tion models. More precisely, the features of each type
are first ranked independently; two independent lists
are produced. The lists are then combined accord-
ing to the accuracy of a classifier. Mutual information
(MI) based feature selection is employed for the rank-
ing of both continuous and categorical features.
The rest of the paper is organized as follows. Sec-
tion 2 briefly recalls basic notions about MI. The pro-
posed methodology is described in Section 3 and ex-
perimental results are given in Section 4. Conclusions
are drawn in Section 5 which also contains some fu-
ture work perspectives.
2 MUTUAL INFORMATION
In this section, basic concepts about MI are intro-
duced and a few words are given about its estimation.
2.1 Definitions
MI (Shannon, 1948) is a criterion from information
theory which has proven to be very efficient in
feature selection (Battiti, 1994; Fleuret, 2004), mainly
because it is able to detect nonlinear relationships
between variables, while other popular criteria such as the
well-known correlation coefficient are limited to
linear relationships. Moreover, MI can handle groups of
vectors, i.e. multidimensional variables.
MI is intuitively a symmetric measure of the in-
formation two random variables X and Y carry about
each other and is formally defined as follows:
\[
I(X;Y) = H(X) + H(Y) - H(X,Y) \tag{1}
\]
where H(X) is the entropy of X:
\[
H(X) = -\int f_X(x)\,\log f_X(x)\,dx \tag{2}
\]
with $f_X$ being the probability density function (pdf) of
X. H(X,Y) is the entropy of the joint variable (X,Y),
defined in the same way.
The MI can be reformulated as:
\[
I(X;Y) = \int\!\!\int f_{X,Y}(x,y)\,\log\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)}\,dx\,dy. \tag{3}
\]
This last equation defines MI as the Kullback-Leibler
divergence between the joint distribution $f_{X,Y}$ and the
product of the marginal distributions $f_X$ and $f_Y$, these
quantities being equal for independent variables.
As in practice none of the pdfs $f_X$, $f_Y$ and $f_{X,Y}$ are
known, MI cannot be computed analytically but has
to be estimated from the data set.
2.2 MI Estimation
Traditional MI estimators are based on histogram or
kernel (Parzen, 1962) density estimators, which are
used to approximate the value of the MI according, for
example, to (1) (Kwak and Choi, 2002). Despite its
popularity, this approach has the huge drawback that
it is unreliable for high-dimensional data. Indeed, as
the dimension of the space increases, if the number of
available samples remains constant, these points will
not be sufficient to sample the space with an accept-
able resolution. For histograms, most of the boxes
will be empty and the estimates are likely to be in-
accurate. Things will not be different for kernel es-
timators which are essentially smoothed histograms.
These problems are a direct consequence of the curse
of dimensionality (Bellman, 1961; Verleysen, 2003),
stating that the number of points needed to sample
a space at a given resolution increases exponentially
with the dimension of the space: if p points are needed
to sample a one-dimensional space at a given resolution,
$p^n$ points will be needed if the dimension is n.
Since in this paper MI estimation is needed for
multi-dimensional data points, other estimators have
to be considered. To this end, a recently introduced
family of estimators based on the principle of nearest
neighbors is used (Kraskov et al., 2004; Gómez-
Verdejo et al., 2009). These estimators have the ad-
vantage that they do not estimate the probability density
functions directly and are thus expected to be more robust
as the dimension of the space increases. They are inspired
by the Kozachenko-Leonenko estimator of entropy
(Kozachenko and Leonenko, 1987):
\[
\hat{H}(X) = -\psi(k) + \psi(N) + \log(c_d) + \frac{d}{N}\sum_{n=1}^{N}\log(\varepsilon_X(n,k)) \tag{4}
\]
where k is the number of nearest neighbors consid-
ered, N the number of samples of a random variable
X, d the dimensionality of these samples, $c_d$ the vol-
ume of a unitary ball of dimension d and $\varepsilon_X(n,k)$
twice the distance from the n-th observation in X to
its k-th nearest neighbor; ψ is the digamma function:
\[
\psi(k) = \frac{\Gamma'(k)}{\Gamma(k)} = \frac{d}{dk}\ln\Gamma(k), \qquad \Gamma(k) = \int_0^{\infty} x^{k-1} e^{-x}\,dx.
\]
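As an illustration of (4), the following Python sketch computes the estimate with a brute-force nearest-neighbor search; the function name and the use of NumPy/SciPy are choices of this illustration, not part of the original paper.

import numpy as np
from scipy.special import digamma, gamma

def kl_entropy(X, k=4):
    """Kozachenko-Leonenko entropy estimate of eq. (4).
    X: (N, d) array of samples; k: number of nearest neighbors."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    c_d = np.pi ** (d / 2) / gamma(d / 2 + 1)      # volume of the unit ball in dimension d
    # brute-force pairwise Euclidean distances (fine for moderate N)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbor
    eps = 2.0 * np.sort(dists, axis=1)[:, k - 1]   # twice the distance to the k-th neighbor
    return -digamma(k) + digamma(N) + np.log(c_d) + d / N * np.sum(np.log(eps))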
Using (4), Kraskov et al. (Kraskov et al., 2004) de-
rived two slightly different estimators for regression
problems (i.e. for problems with a continuous out-
put). The most widely used one is:
\[
\hat{I}(X;Y) = \psi(N) + \psi(k) - \frac{1}{k} - \frac{1}{N}\sum_{n=1}^{N}\big(\psi(\tau_x(n)) + \psi(\tau_y(n))\big) \tag{5}
\]
where $\tau_x(n)$ is the number of points located no further
than $\varepsilon_X(n,k)$ from the n-th observation in the X space;
$\tau_y(n)$ is defined similarly in the Y space with $\varepsilon_Y(n,k)$.
In case of a classification problem, Y is a dis-
crete vector representing the class labels. Calling L
the number of classes, Gómez et al. (Gómez-Verdejo
et al., 2009) took into account the fact that the prob-
ability distribution of Y is estimated by $p(y = y_l) = n_l / N$,
where $n_l$ is the number of points whose class
value is $y_l$, and proposed to estimate the MI as:
\[
\hat{I}_{cat}(X;Y) = \psi(N) - \frac{1}{N}\sum_{l} n_l\,\psi(n_l) + \frac{d}{N}\left[\sum_{n=1}^{N}\log(\varepsilon_X(n,k)) - \sum_{l=1}^{L}\;\sum_{n:\,y_n = y_l}\log(\varepsilon_l(n,k))\right]. \tag{6}
\]
In this last equation, $\varepsilon_l(n,k)$ is defined in the same
way as $\varepsilon_X(n,k)$ in (4), but the neighbors are limited to
the points having the class label $y_l$.
If both X and Y are categorical features, equations
(2) and (3) become sums in which the probabilities can
be estimated from the samples in the learning set by
simple counting, and no estimator is needed. Assume
X (resp. Y) takes $s_x$ ($s_y$) different values $x_1 \ldots x_{s_x}$
($y_1 \ldots y_{s_y}$), each with a probability $p_{x_i}$ ($p_{y_j}$), and de-
note by $p_{x_i,y_j}$ the joint probability of $x_i$ and $y_j$; then:
\[
I(X;Y) = \sum_{i=1}^{s_x}\sum_{j=1}^{s_y} p_{x_i,y_j}\,\log\frac{p_{x_i,y_j}}{p_{x_i}\,p_{y_j}}. \tag{7}
\]
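For the fully categorical case, (7) translates directly into a few lines of code. The sketch below, with a function name chosen for this illustration, estimates all probabilities by counting as described above.

import numpy as np

def discrete_mi(x, y):
    """MI of eq. (7) between two categorical variables, with all
    probabilities estimated by simple counting over the learning set."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        p_x = np.mean(x == xv)
        for yv in np.unique(y):
            p_y = np.mean(y == yv)
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0.0:                      # 0 * log(0) terms contribute nothing
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi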
3 METHODOLOGY
This section presents the proposed feature selection
procedure. It ends with a few comments on the filter /
wrapper dilemma.
3.1 Lists Ranking
As already discussed, this paper suggests avoiding the
use of any similarity measure between mixed data
points. To this end, the proposed procedure starts
by separating the continuous and the categorical fea-
tures. Both groups of features are then ranked inde-
pendently, according to the following strategies.
3.1.1 Continuous Features
For continuous features, the multivariate MI criterion
is considered, meaning that the MI is directly esti-
mated between a set X of features and the output Y .
The nearest-neighbor-based MI estimators
(Kraskov et al., 2004; Gómez-Verdejo et al., 2009)
described previously are particularly well suited for
multivariate MI estimation. Indeed, as already ex-
plained, they do not require the estimation of multi-
variate probability density functions. This crucial ad-
vantage allows us to evaluate the MI between groups
of features robustly with a limited number of samples.
As an example, the estimator described in (Kraskov
et al., 2004) has been used successfully in feature se-
lection for regression problems (Rossi et al., 2006).
In this paper, the multivariate MI estimator is com-
bined with a greedy forward search procedure; at each
step of the selection procedure, the feature whose ad-
dition to the set of already selected features leads to
the largest multivariate MI with the output is selected.
This choice is never questioned again, hence the name
forward. Algorithm 1 illustrates a greedy forward
search procedure for a relevance criterion c to be max-
imized, with R{i} being the i-th element of R.
Obviously, in such a procedure, the possible re-
dundancy between the features is implicitly taken
into account, since selecting a feature that carries
no more information about the output than the already
selected ones results in no increase of the MI.
3.1.2 Categorical Features
It is important to note that the multivariate MI esti-
mators (Kraskov et al., 2004; G
´
omez-Verdejo et al.,
2009) should not be considered for categorical fea-
tures. Indeed, for categorical data it is likely that
the distances between a sample and several others are
identical, especially in the first steps of the forward
selection procedure. These ex-aequos could bring
confusion in the determination of the nearest neigh-
bors and harm the MI estimation. Moreover, using di-
rectly equation (7) can be untractable in practice. As
an example, if X consists of 20 features, each taking 3
possible discrete values, the total number of possible
values s
x
for points in X is 3
20
> 3 × 10
9
.
A criterion other than the multivariate MI thus has
to be chosen. In this paper, the minimal-Redundancy
maximal-Relevance (mRmR) principle is used, since
it has proven to be very efficient in feature selection
when combined with the MI criterion (Peng et al.,
2005). The idea is to select a set of maximally in-
formative but not redundant features.
This principle is also combined with a greedy for-
ward search strategy: suppose a subset of features
has already been selected; one searches for the un-
selected feature which maximises D − R, where D, the
estimated relevance, is the MI between the new fea-
ture and the output, and R, the estimated redundancy,
is measured by the average MI between the
new feature and each of the already selected features.
Denote by S the set of indices of already selected fea-
tures; the mRmR criterion D − R for feature i (i ∉ S),
given an output vector Y, is:
\[
\mathrm{mRmR}(f_i) = I(f_i;Y) - \frac{1}{|S|}\sum_{j\in S} I(f_i;f_j). \tag{8}
\]
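A minimal sketch of this mRmR forward ranking is given below; it reuses the discrete_mi helper sketched in Section 2 and is an illustration under those assumptions, not the authors' implementation.

import numpy as np

def mrmr_rank(F_cat, Y):
    """Greedy mRmR ranking (eq. (8)) of categorical features.
    F_cat: (N, n_cat) array of categorical features; Y: class labels.
    Returns the feature indices in selection order."""
    n_cat = F_cat.shape[1]
    relevance = [discrete_mi(F_cat[:, i], Y) for i in range(n_cat)]
    remaining, selected = list(range(n_cat)), []
    while remaining:
        def score(i):
            # average MI with the already selected features measures redundancy
            red = np.mean([discrete_mi(F_cat[:, i], F_cat[:, j])
                           for j in selected]) if selected else 0.0
            return relevance[i] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected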
All MI estimations or computations in the mRmR
procedure are thus bivariate (i.e. involve only two
variables). Of course, bivariate methods are not ex-
pected to perform as well as multivariate ones since
they only consider pairwise redundancy or relevance.
A simple example showing this is the well-known
XOR problem. It consists of two random binary vec-
tors $X_1$, $X_2$ and an output Y whose i-th element is 1 if
the i-th elements of $X_1$ and $X_2$ are different and 0 other-
wise. Individually, both vectors carry no information
about the output Y. However, together they entirely
determine it. Thus, even if $X_1$ is selected, an mRmR
procedure will not be able to identify $X_2$ as relevant,
while a multivariate approach will.
3.2 Combination of the Lists
Once established, the two lists are combined accord-
ing to the accuracy (the percentage of well-classified
samples) of a classification model. First, the accura-
cies of a model built on the first continuous feature and
of a model built on the first categorical feature are compared.
The feature leading to the best result is chosen and removed
from the list it belongs to. The selected feature is then
combined with the best continuous or the best categorical
feature still belonging to its respective list, i.e.
not yet selected; the subset for which
a model performs best is selected, and so on un-
til all features have been selected. The whole feature
selection procedure is described in Algorithm 2.
Input: A set F of features f_i, i = 1 : n_f
       A class labels vector Y
Output: A list L of sorted indices of features
begin
    R ← 1 : n_f   // R is the set of indices of not yet selected features
    L ← ∅
    for k = 1 : n_f do
        foreach i ∈ R do
            set ← L ∪ R{i}
            score{i} ← c(set, Y)
        end
        winner ← argmax_j score{j}
        L ← [L; R{winner}]
        R ← R \ R{winner}
        clear score
    end
end
Algorithm 1: Forward search procedure to maximize a criterion c.
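Algorithm 1 can be written compactly in Python; the sketch below assumes a criterion callable (for example, a multivariate MI estimate of the candidate subset against the output) and only illustrates the search loop.

def forward_search(n_f, criterion):
    """Greedy forward search of Algorithm 1: repeatedly append the remaining
    feature whose addition maximizes criterion(subset)."""
    remaining, L = list(range(n_f)), []
    for _ in range(n_f):
        scores = {i: criterion(L + [i]) for i in remaining}
        winner = max(scores, key=scores.get)
        L.append(winner)
        remaining.remove(winner)
    return L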
3.3 Filter or Wrapper Feature Selection
As can be seen from the previous developments, two
different approaches to feature selection are succes-
sively used to produce a global algorithm. First,
the two lists are built using filter methods.
This means that no classification model is used and
that the selection is rather based on a relevance crite-
rion, such as MI in this paper. On the other hand, the
combination of the lists does require a specific classi-
fier and is thus a wrapper procedure.
Wrappers are generally expected to lead to better
results than filters since they are designed to optimize
the performance of a specific classifier. Of course,
wrappers are also usually much slower than filters,
precisely because they have to build a huge
number of classification models, possibly with hyper-
parameters to tune.
As an example, the exhaustive wrapper approach
consisting in testing all possible feature subsets
would require building $2^{n_f}$ models, $n_f$ being the num-
ber of features. If there are 20 features, about $10^6$
classifiers must be built. This method thus quickly
becomes intractable as the number of features grows.
An alternative is to use heuristics such as the greedy
forward search presented in Algorithm 1. The num-
ber of models to build is then $n_f(n_f+1)/2 - 1$, which is
still unrealistic for complex models.
On the contrary, the approach proposed in this pa-
per only requires the construction of at most $n_f - 1$
classifiers. Indeed, Algorithm 2 emphasizes the fact
that a classification model is needed only
if none of the lists is empty; in practice the num-
ber of models to build will thus often be smaller than
$n_f - 1$. In addition, $n_{cont}(n_{cont}+1)/2 + n_{cat}(n_{cat}+1)/2 - 2$ eval-
uations of the MI are necessary, $n_{cat}$ and $n_{cont}$ being
respectively the number of categorical and continuous
features. A compromise between both approaches is
thus found, which avoids the use of any similarity
measure between mixed samples while keeping the
computational burden of the procedure relatively low.
Input: A set of categorical features F_cat{i}, i = 1 : n_cat
       A set of continuous features F_cont{i}, i = 1 : n_cont
       A class labels vector Y
Output: A list L of sorted indices of features
begin
    InCat ← SortCat(F_cat, Y)   // sorted list of indices for the categorical features
    InCon ← SortCon(F_cont, Y)  // sorted list of indices for the continuous features
    L ← ∅
    for k = 1 : n_cat + n_cont do
        if InCat ≠ ∅ and InCon ≠ ∅ then
            AccCat ← Acc(L ∪ InCat{1}, Y)
            AccCon ← Acc(L ∪ InCon{1}, Y)
            // Acc(.) gives the accuracy of a classifier
            if AccCat < AccCon then
                L ← L ∪ InCon{1}; delete InCon{1}
            else
                L ← L ∪ InCat{1}; delete InCat{1}
            end
        else
            if InCat = ∅ then
                L ← L ∪ InCon{1}; delete InCon{1}
            else
                L ← L ∪ InCat{1}; delete InCat{1}
            end
        end
    end
end
Algorithm 2: Proposed feature selection algorithm.
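The merging step of Algorithm 2 can be sketched in Python as follows; the accuracy callable stands for the cross-validated classifier accuracy on a feature subset and, like the function name, is an assumption of this illustration.

def combine_lists(InCat, InCon, accuracy):
    """Merge the two ranked lists (Algorithm 2): at each step, take the head
    of whichever list yields the best classifier accuracy when added to L.
    InCat and InCon are assumed to hold disjoint (global) feature indices."""
    InCat, InCon = list(InCat), list(InCon)
    L = []
    while InCat or InCon:
        if InCat and InCon:
            acc_cat = accuracy(L + [InCat[0]])
            acc_con = accuracy(L + [InCon[0]])
            source = InCon if acc_cat < acc_con else InCat
        else:
            source = InCat if InCat else InCon   # only one list is non-empty
        L.append(source.pop(0))
    return L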
4 EXPERIMENTAL RESULTS
To assess the performance of the proposed feature se-
lection algorithm, experiments are conducted on ar-
tificial and real-world data sets. The limitations of
methods based on a given similarity measure between
mixed samples are first emphasized on a very simple
toy problem. Results obtained on four UCI (Asuncion
and Newman, 2007) data sets then confirm the inter-
est of the proposed approach.

Table 1: Description of the data sets used in the experiments.

Name | samples | cont. features | cat. features | classes
Heart | 270 | 6 | 7 | 2
Hepatitis | 80 | 6 | 13 | 2
Australian Credit | 690 | 6 | 8 | 2
Contraception | 1473 | 2 | 7 | 3
Two classification models are used in this study.
The first one is a Naive Bayes classifier with probabil-
ities for continuous attributes estimated using Parzen
window density estimation (Parzen, 1962) and those
for categorical attributes estimated by counting.
The second one is a 5-nearest neighbors classifier,
with distances between samples computed by the Het-
erogeneous Euclidean-Overlap Metric (HEOM) (Wil-
son and Martinez, 1997) while other choices could
as well have been made (see e.g. (Boriah et al.,
2008)). This metric uses different distance functions
for categorical and continuous attributes. It is defined
for two vectors X = [X
1
.. .X
m
] and Y = [Y
1
.. .Y
m
] as
d
heom
(X,Y ) =
p
m
a=1
d
a
(X
a
,Y
a
)
2
where
d
a
(x,y) :=
(
overlap(x,y) if a is categorial
|
xy
|
max
a
min
a
if a is continuous
with max
a
and min
a
, respectively the maximal and
minimal values observed for the a
th
feature in the
training set, and overlap(x,y) = 1 δ(x,y) (δ denot-
ing the Kronecker delta, δ(x,y) = 1 if x = y and 0
otherwise). These models have mainly been chosen
because they are both known to suffer dramatically
from the presence of irrelevant features in compari-
son with, for example, decision trees.
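A minimal Python sketch of the HEOM distance, assuming that categorical attributes are numerically coded and that the per-feature extrema have been computed on the training set (function and argument names are choices of this illustration):

import numpy as np

def heom(x, y, is_cat, col_min, col_max):
    """HEOM distance: overlap on categorical attributes, range-normalized
    absolute difference on continuous ones, combined Euclidean-style."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cat = np.asarray(is_cat, bool)
    d = np.empty(x.shape)
    d[cat] = (x[cat] != y[cat]).astype(float)                       # overlap(x, y)
    d[~cat] = np.abs(x[~cat] - y[~cat]) / (col_max[~cat] - col_min[~cat])
    return np.sqrt(np.sum(d ** 2))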
In this section, we compare the proposed feature
selection approach with the algorithm by Hu et al.
(Hu et al., 2008). As already explained, in that paper,
the authors consider two points as neighbors if their
categorical attributes are equal and if one is among
the k nearest neighbors of the other or if the distance
between them is not too large according to their nu-
merical attributes (there are thus two versions of the
algorithm). Then, they look for the features for which
the largest number of points share their class label
with at least a given fraction of their neighbors. The
methodology is thus extremely dependent on the cho-
sen definition of neighborhood. Even if this definition
can be modified, it is not easy, given an unknown data
set, to determine a priori which relation can be a good
choice. Among the two versions of Hu et al.'s al-
gorithm, only the nearest-neighbor-based one will be
considered in this work, since it has been shown more
efficient in practice (Hu et al., 2008), which was con-
firmed by our experiments.
4.1 Toy Problem
To underline the aforementioned drawbacks, a toy
problem is considered that shows the limitations of
methods based on dissimilarities between mixed data.
It consists of a data set X containing two categori-
cal variables $(X_1, X_2)$, each taking two possible discrete
values with equal probability, and two continuous vari-
ables $(X_3, X_4)$ uniformly distributed over [0, 1]. The
sample size is 100. A class labels vector Y is also
built from X: the points whose value of $X_3$ is below
0.15 or above 0.85 are given the class label 1 and the
other points are given the class label −1. The only
relevant variable is thus $X_3$, which should be selected
first by accurate feature selection algorithms.
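A minimal sketch generating such a toy data set (the random seed and variable names are choices of this illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 100
X_cat = rng.integers(0, 2, size=(n, 2))      # X1, X2: binary categorical, equal probability
X_con = rng.uniform(0.0, 1.0, size=(n, 2))   # X3, X4: uniform on [0, 1]
# only X3 (first continuous column) carries class information
Y = np.where((X_con[:, 0] < 0.15) | (X_con[:, 0] > 0.85), 1, -1)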
However, the problem with Hu et al.'s method is
that many points with class label 1 do not have enough
neighbors with the same label. Feature $X_3$ will thus
not be detected as relevant by the algorithm. More
precisely, over 50 repetitions, $X_3$ has been selected
first in only 28 cases. With the approach
proposed in this paper, $X_3$ has been selected first in
all 50 repetitions of the experiment.
Interestingly, if the problem is modified such that
points for which the value of $X_3$ is between 0.4 and
0.7 have class label 1 and other points have class label
−1, then Hu et al.'s algorithm always detects $X_3$ as
the most relevant variable. Although the proportions
of both classes in the two problems are the same, the
second one is more compatible with the chosen def-
inition of neighborhood, explaining the better results
obtained. Of course, this definition of neighborhood
could be modified to better fit the first problem, but it
would then likely be inaccurate in other situations.
4.2 Real-world Data Sets
Four classification benchmark data sets from the UCI
Machine Learning Repository (Asuncion and New-
man, 2007) are used in the study to further illustrate
the interest of the proposed approach. All contain
continuous and categorical attributes and are summa-
rized in Table 1. Two come from the medical world,
one is concerned with whether an applicant should re-
ceive a credit card or not and the last one is about the
choice of a contraceptive method. The data sets used
in this work are not the same as those considered in
(Hu et al., 2008), since many of the latter do not actu-
ally contain mixed features.
Figure 1: Error rate as a function of the number of selected features: Heart data set and naive Bayes classifier.

Figure 2: Error rate as a function of the number of selected features: Hepatitis data set and naive Bayes classifier.

As a preprocessing step, observations containing miss-
ing values are deleted. Moreover, continuous at-
tributes are normalized by removing their mean and
dividing by their standard deviation, so that each
attribute contributes equally to the Euclidean distance
used in the MI estimation. As suggested in (Gómez-
Verdejo et al., 2009), for continuous features the MI is
estimated over a range of different values of the param-
eter k and then averaged, to prevent strong underfitting
or overfitting. In this paper, 4 to 12 neighbors are
considered, except for the Hepatitis data set for which
4 to 6 neighbors are taken into account; this is because
one class is represented by only a few samples in this data set.
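A hedged sketch of this averaging over k, assuming an mi_estimator callable that implements (5) or (6) for a given k (the callable and function name are assumptions of this illustration):

import numpy as np

def averaged_mi(X, Y, mi_estimator, k_values=range(4, 13)):
    """Average an MI estimate over several neighborhood sizes k, as suggested
    in (Gómez-Verdejo et al., 2009), to limit the sensitivity to a single k."""
    return float(np.mean([mi_estimator(X, Y, k=k) for k in k_values]))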
The criterion of comparison is the classification
error rate. It is estimated through a 5-fold cross vali-
dation procedure repeated 5 times with different ran-
dom shufflings of the data set. The presented error
rates are obtained on the test set, independent of the
training set used to train the classifier. The feature to
choose at each wrapper step is determined by a 5-fold
cross validation on the training set.
Figures 1 to 8 show the average classification er-
ror rate achieved by the proposed method (referred to
as Hybrid MI) and by Hu et al.'s one with respect to
the number of features selected. As can be seen, the
proposed approach is very competitive with both classifiers
and leads to the overall smallest misclassification rate in 7
of the 8 experiments. Moreover, for the Hepatitis and
the Contraception data sets, it is the only approach
selecting a subset of features leading to better perfor-
mance than the set of all features. The improvement
in classification accuracy is thus obvious.

Figure 3: Error rate as a function of the number of selected features: Australian credit data set and naive Bayes classifier.

Figure 4: Error rate as a function of the number of selected features: Contraception data set and naive Bayes classifier.

Figure 5: Error rate as a function of the number of selected features: Heart data set and k-nn classifier.
Figure 6: Error rate as a function of the number of selected features: Australian credit data set and k-nn classifier.

Figure 7: Error rate as a function of the number of selected features: Contraception data set and k-nn classifier.

Figure 8: Error rate as a function of the number of selected features: Hepatitis data set and k-nn classifier.

5 CONCLUSIONS
This paper introduces a new feature selection method-
ology for mixed data, i.e. for data with both categori-
cal and continuous attributes. The idea of the method
is to independently rank both types of features before
recombining them guided by the accuracy of a clas-
sifier. The proposed algorithm is thus a combination
of a filter and a wrapper approach to feature selec-
tion. The well-known MI criterion is used to produce
both ranked lists. For continuous features, multidi-
mensional MI estimation is used while a mRmR ap-
proach is considered for categorical features.
One of the most problematic issues for feature se-
lection algorithms dealing with mixed data is to choose
an appropriate relationship between categorical and
continuous features, leading to a sound similarity mea-
sure or neighborhood definition between observations.
The new method simply alleviates this problem by
ignoring this unknown relationship. The hope is to
compensate for the loss of information induced by this
hypothesis with a more accurate ranking of features of
each type and with the use of a classification model.
Even if the approach requires the explicit building
of prediction models, the number of models to build
is small compared to a pure wrapper approach. More-
over, experimental results on four data sets show its
interest in terms of the accuracy of two classifiers.
All the developments presented in this paper could
be applied to regression problems (problems with a
continuous output); the only modifications needed
would be to use the MI estimator (5) instead of (6)
for the continuous features and the mRmR approach
for the categorical ones. It would thus be interesting
to test the proposed approach on such problems.
REFERENCES
Asuncion, A. and Newman, D. (2007). UCI machine learn-
ing repository. University of California, Irvine, School
of Information and Computer Sciences, available at
http://www.ics.uci.edu/mlearn/MLRepository.html.
Battiti, R. (1994). Using mutual information for select-
ing features in supervised neural net learning. IEEE
Transactions on Neural Networks, 5:537–550.
Bellman, R. (1961). Adaptive Control Processes: A Guided
Tour. Princeton University Press.
Boriah, S., Chandola, V., and Kumar, V. (2008). Similarity
measures for categorical data: A comparative evalua-
tion. In SDM’08, pages 243–254.
Fleuret, F. (2004). Fast binary feature selection with con-
ditional mutual information. J. Mach. Learn. Res.,
5:1531–1555.
Gómez-Verdejo, V., Verleysen, M., and Fleury, J. (2009).
Information-theoretic feature selection for functional
data classification. Neurocomputing, 72:3580–3589.
Guyon, I. and Elisseeff, A. (2003). An introduction to
variable and feature selection. J. Mach. Learn. Res.,
3:1157–1182.
Hall, M. A. (2000). Correlation-based feature selection
for discrete and numeric class machine learning. In
Proceedings of ICML 2000, pages 359–366. Morgan
Kaufmann Publishers Inc.
Hu, Q., Liu, J., and Yu, D. (2008). Mixed feature selec-
tion based on granulation and approximation. Know.-
Based Syst., 21:294–304.
Kozachenko, L. F. and Leonenko, N. (1987). Sample es-
timate of the entropy of a random vector. Problems
Inform. Transmission, 23:95–101.
Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Es-
timating mutual information. Physical Review E, Sta-
tistical, nonlinear, and soft matter physics, 69(6 Pt 2).
Kwak, N. and Choi, C.-H. (2002). Input feature selection
by mutual information based on parzen window. IEEE
Trans. Pattern Anal. Mach. Intell., 24:1667–1671.
Parzen, E. (1962). On the estimation of a probability density
function and mode. Annals of Mathematical Statistics,
33:1065–1076.
Peng, H., Long, F., and Ding, C. (2005). Fea-
ture selection based on mutual information: Cri-
teria of max-dependency, max-relevance and min-
redundancy. IEEE Trans. Pattern Anal. Mach. Intell.,
27:1226–1238.
Rossi, F., Lendasse, A., François, D., Wertz, V., and Verley-
sen, M. (2006). Mutual information for the selection
of relevant variables in spectrometric nonlinear mod-
elling. Chemometrics and Intelligent Laboratory Sys-
tems, 80(2):215–226.
Shannon, C. E. (1948). A mathematical theory of commu-
nication. The Bell system technical journal, 27:379–
423.
Tang, W. and Mao, K. Z. (2007). Feature selection algo-
rithm for mixed data with both nominal and continu-
ous features. Pattern Recogn. Lett., 28:563–571.
Verleysen, M. (2003). Learning high-dimensional data.
Limitations and Future Trends in Neural Computa-
tion, 186:141–162.
Wilson, D. R. and Martinez, T. R. (1997). Improved hetero-
geneous distance functions. J. Artif. Int. Res., 6:1–34.