Adaptive Sequential Feature Selection for Pattern Classification

Liliya Avdiyenko¹, Nils Bertschinger¹ and Juergen Jost¹,²

¹Max Planck Institute for Mathematics in the Sciences, Inselstr. 22, 04103 Leipzig, Germany
²Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, U.S.A.

Keywords: Adaptivity, Feature Selection, Mutual Information, Multivariate Density Estimation, Pattern Recognition.

Abstract: Feature selection helps to focus resources on relevant dimensions of input data. Usually, reducing the input dimensionality to the most informative features also simplifies subsequent tasks, such as classification. This is, for instance, important for systems operating in online mode under time constraints. However, when the training data is of limited size, it becomes difficult to define a single small subset of features sufficient for classification of all data samples. In such situations, one should select features in an adaptive manner, i.e. use a different feature subset for every testing sample. Here, we propose a sequential adaptive algorithm that, for a given testing sample, selects the features maximizing the expected information about its class. We provide experimental evidence that, especially for small data sets, our algorithm outperforms the two most similar information-based static and adaptive feature selectors.

1 INTRODUCTION

Machine learning is often confronted with high-dimensional data. A common problem is the so-called "curse of dimensionality", meaning that the amount of data required to find good model parameters grows exponentially with the dimension of the input space. For this reason, as well as for computational reasons, feature selection is often used to reduce the data dimensionality to the features relevant for solving a given problem, such as classification. Moreover, when the training set is of limited size, a classifier built on a smaller number of features usually has better generalization ability.

Basically, one can distinguish between two types of feature selection algorithms: filters and wrappers (Webb, 1999). The former try to reduce the dimensionality of the data while keeping potential clusters in the data well separated. In this case, the relevance of each feature is evaluated using different measures of the distance between classes, e.g. probabilistic distance measures. However, the involved probabilities are difficult to estimate and approximate methods are often used. Wrappers also preprocess the data but directly take into account that the resulting features should be useful for a certain classifier. Therefore, features are selected based on the prediction accuracy of the classifier employing these features. This might lead to better results but is usually computationally demanding and prone to overfitting.

In each case, one can look for the best feature subset of a certain cardinality using an optimal search strategy. Since the number of possible subsets is exponentially large, testing all of them is infeasible. A good example is the branch and bound method (Narendra and Fukunaga, 1977), which assumes monotonicity of the selection criterion to avoid an exhaustive search. If such an assumption is not valid and the number of features is large, suboptimal methods have to be used. This class of algorithms includes forward and backward sequential feature selection, e.g. (Ding and Peng, 2005; Abe, 2005). In both cases, the relevance of each feature is evaluated together with the current feature subset.

Among the probabilistic criteria used by filters, selection criteria based on the Shannon entropy are widely used (Duch et al., 2004). Such criteria select features so as to reduce the uncertainty about the output class. Battiti was one of the first to use mutual information, a concept closely related to the Shannon entropy, for sequential feature selection (Battiti, 1994). However, this involves estimation of the conditional mutual information (CMI), i.e. the amount of information between a feature and the class given the already selected features, which requires multivariate density estimation. To circumvent this problem, Battiti approximated CMI by pairwise mutual information. Kernel density estimation (discussed below


in Subsection 2.2.1) is a non-parametric technique widely used for multivariate density estimation. It was successfully applied to estimate CMI for an exhaustive search procedure (Bonnlander and Weigend, 1994) and for forward feature selection (Kwak and Choi, 2002; Bonnlander, 1996).

Ideally, it should be possible to describe all observations by the same small subset of features. However, when the amount of available training data is limited and the number of features exceeds the number of training samples, it is very likely that no single feature subset is good enough for classifying all observations. For example, one may need different features to discriminate between different classes, or even different objects belonging to the same class may have different discriminative features. One can partially overcome this problem by keeping a collection of all relevant feature subsets. This, however, increases the complexity of the classifier, which in turn degrades its performance, since there is not enough data for training the classifier in a high-dimensional space, e.g. see (Raudys and Jain, 1991). Thus, conventional feature selection schemes, which select a fixed subset of features before they are handed to a classifier, can be inefficient.

Therefore, in the case of small data sets, we propose to use a different subset of features for every testing sample, i.e. to select the relevant features in an "adaptive" manner. Here, by adaptivity we mean that for a given testing sample every selected feature should yield the maximum additional information about the class, given the values that the already selected features take on this testing sample.

The idea of adaptivity was used by Geman and Jedynak in their active testing model (1996), where they sequentially select tests in order to reduce the uncertainty about the true hypothesis. For their problem domain, they assumed that features are conditionally independent given the class. Jiang also used an adaptive scheme (Jiang, 2008), however, without conditioning on the already selected features, which are employed only to update a set of currently active classes. In contrast to these schemes, we adaptively select features taking into account high-order dependencies between them.

Here, we propose an adaptive feature selection algorithm based on CMI that sequentially adds features, one by one, to a subset of features relevant for a given testing sample. Even though multivariate probability densities are hard to estimate in general, and from small data sets especially, the algorithm is still able to select informative features in high dimensions.

Our model is also inspired by sequential visual processing, i.e. the pattern of eye movements when performing a task. Since a human can foveate only on a small part of an image at a time, the scene is perceived sequentially. Moreover, only a few eye fixations are usually enough to analyze the whole scene. This might suggest that optimal saccades for a certain task follow a sequence of the most informative scene-specific locations. For experimental support see (Renninger et al., 2007; Najemnik and Geisler, 2005).

Sec. 2.1 explains the mathematical basis and the general idea of our method, whereas Sec. 2.2 gives implementation details. Then, in Sec. 3, we provide results for two image classification tasks using artificially constructed bitmap images of digits and real-world data from the MNIST database of handwritten digits. We show that our method outperforms the Parzen window feature selector (Kwak and Choi, 2002) and the active testing model (Geman and Jedynak, 1996), which are static and adaptive CMI-based feature selectors, respectively. Finally, in Sec. 4 we discuss the benefits of our approach and future extensions.

2 MODEL

For our model, we start with a standard classification setup. Suppose we have a space of possible inputs $\mathcal{F} = \times_{i=1}^{n} \mathcal{F}_i$, i.e. each input is an $n$-dimensional feature vector $f = (f_1, \ldots, f_n)$, where the $i$th feature takes values $f_i \in \mathcal{F}_i$. Our notion of a feature is rather general, ranging from simple ones, such as the gray value of a certain pixel, to sophisticated ones, such as counting faces in an image. Feature combinations are considered as a random variable $F$ with a joint distribution on $\mathcal{F}_1 \times \cdots \times \mathcal{F}_n$, and the observation $f$ is drawn from that distribution.

Furthermore, each observation has an associated class label $c \in \mathcal{C} = \{c_1, \ldots, c_m\}$. The task of the classifier is to assign a class label to each observation $f$. Thus, formally it is considered as a map $\phi : \mathcal{F} \to \mathcal{C}$ or, more generally, as assigning to each $f$ the conditional probabilities $p(c|f)$ of the classes $c$. To learn such a classification, we are given a training set $X = \{(x_i, c_i)\}_{i=1}^{T}$ of labeled observations, which are assumed to be drawn independently from the distribution relating feature vectors and class labels. The goal is then to find a classification rule $\phi$ that correctly predicts the class of future samples with unknown class label, called testing samples. That is, confronted with a feature vector $\xi$ we would classify it as $c = \phi(\xi)$. Feature selection then means that for this particular task only a subset of features rather than the full feature vector is used.


2.1 Adaptive Feature Selection

Adaptivity: For classification problems with small training sets, we suggest selecting features adaptively. Thus, we do not predefine a single subset of relevant features but rather select a specific one for every new testing sample. The proposed feature selection scheme is a sequential feedforward algorithm. Every feature added to the subset should be discriminative together with the already selected features, which take the particular values observed on the current testing sample.

Sequential feedforward feature selection algorithms use a greedy search strategy, which avoids evaluating the relevance of every possible feature subset in a full search. Such feedforward algorithms start from the empty set and add features one by one, so that every next feature maximizes some selection criterion $S$ given the features selected in the previous steps. Thus, conventionally the feature $F_{\alpha_{i+1}}$ selected in the $(i+1)$th step should satisfy the following:
\[
\alpha_{i+1} = \arg\max_k S(F_{\alpha_1}, \ldots, F_{\alpha_i}, F_k), \quad
F_k \in \{F_1, \ldots, F_n\} \setminus \{F_{\alpha_1}, \ldots, F_{\alpha_i}\},
\tag{1}
\]
where $F_{\alpha_1}, \ldots, F_{\alpha_i}$ is the subset of features selected before the $(i+1)$th iteration.

Let us now consider the adaptive case. Suppose that we have a testing sample $\xi$. Suppose also that after $i$ steps we have selected the features $F_{\alpha_1}, \ldots, F_{\alpha_i}$ and observed their values $\xi_{\alpha_1}, \ldots, \xi_{\alpha_i}$ on this testing sample. Then, for this testing sample the next feature $F_{\alpha_{i+1}}$ is selected according to the adaptive criterion:
\[
\alpha_{i+1} = \arg\max_k S(F_{\alpha_1} = \xi_{\alpha_1}, \ldots, F_{\alpha_i} = \xi_{\alpha_i}, F_k).
\tag{2}
\]
In contrast to the static criterion (1), the adaptive criterion also takes into account the values of the already selected features, which are observed on the current testing sample.
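To make the distinction between (1) and (2) concrete, the following Python sketch shows the greedy forward loop shared by both schemes; `score` is a placeholder for the selection criterion $S$, which is made precise below and in Sec. 2.2.

```python
def forward_selection(n_features, n_select, score):
    """Greedy forward selection: add one feature at a time.

    score(selected, k) must return the value of the selection criterion S
    for adding candidate feature k to the already selected features.
    """
    selected, remaining = [], set(range(n_features))
    for _ in range(n_select):
        best = max(remaining, key=lambda k: score(selected, k))
        selected.append(best)
        remaining.remove(best)
    return selected

# Static scheme (1): score(selected, k) depends only on which features were chosen.
# Adaptive scheme (2): score additionally uses the values xi[alpha_1..alpha_i] that
# the selected features take on the current testing sample, so the selection has to
# be re-run for every testing sample.
```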

Probabilistic Selection Criterion: The feature selection scheme proposed here uses a probabilistic selection criterion and is based on the mutual information between the feature and class variables (Cover and Thomas, 1991).

The mutual information between two continuous random variables $A$ and $B$ measures the amount of information between them and is defined as follows:
\[
I(A;B) = \int_{A}\int_{B} p(a,b) \log\frac{p(a,b)}{p_A(a)\,p_B(b)}\,db\,da,
\tag{3}
\]
where $p(a,b)$ is the joint probability density function (pdf) of $A$ and $B$, and $p_A(a) = \int_B p(a,b)\,db$ and $p_B(b) = \int_A p(a,b)\,da$ are their marginal densities. In the case of discrete variables, the integration is replaced by summation over the values of the variables.

Our goal is a sequential selection of features that bring the maximum additional information about the classes, i.e. those that are both discriminative and non-redundant with respect to the already selected features. Thus, we propose the adaptive mutual information feature selector (AMIFS), which is based on the expected mutual information between the classes and a feature candidate $F_k$ conditioned on the outcome of the selected features observed on the testing sample, $I(C;F_k|\xi^i)$. Then, according to AMIFS, every next selected feature should satisfy the following:
\[
\alpha_{i+1} = \arg\max_k S(F_{\alpha_1} = \xi_{\alpha_1}, \ldots, F_{\alpha_i} = \xi_{\alpha_i}, F_k)
= \arg\max_k \left( \int_{\mathcal{F}_k} \sum_{c \in C} p(f_k, c\,|\,\xi^i) \log\frac{p(f_k, c\,|\,\xi^i)}{p(f_k|\xi^i)\,p(c|\xi^i)}\,df_k \right),
\tag{4}
\]
where the variable $C$ represents the classes, $C = \{c_1, \ldots, c_m\}$, and $\xi^i = \{F_{\alpha_1} = \xi_{\alpha_1}, \ldots, F_{\alpha_i} = \xi_{\alpha_i}\}$ is a shorthand for the set of values observed on the selected features of the sample $\xi$.

Note that the expression (4) is not a conventional CMI, since we do not average over all possible outcomes of the features $F_{\alpha_1}, \ldots, F_{\alpha_i}$, but rather condition on the specific values that we observe on the particular testing sample. This implies that we look for the feature $F_{\alpha_{i+1}}$ that is informative for the particular region of the input space specified by the observed values of the already selected features. Therefore, we adaptively select a different subset of the relevant features for every sample we want to classify.

Using the definition of the Kullback-Leibler divergence
\[
D(p\,\|\,q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx
\tag{5}
\]
for two distributions $p$ and $q$, (4) can be rewritten as follows:
\[
\alpha_{i+1} = \arg\max_k \left( \sum_{c \in C} p(c|\xi^i)\, D\big(p(f_k|c,\xi^i)\,\|\,p(f_k|\xi^i)\big) \right).
\tag{6}
\]
This is the average distance between the pdf of the feature $F_k$ given a certain class and its marginal pdf, where both pdfs are updated after observing the current feature subset on the sample $\xi$. Thus, the selection criterion favors features with distinctive posterior distributions for data drawn from the different classes, that is, features that in the $(i+1)$th step are expected to best discriminate between the classes.
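For intuition, the following sketch evaluates criterion (6) in a purely discrete toy setting, where the involved conditional distributions are given as arrays rather than estimated from data (the estimation actually used is described in Sec. 2.2):

```python
import numpy as np

def amifs_score_discrete(p_c_given_xi, p_fk_given_c_xi):
    """Criterion (6) for one candidate feature with discrete values.

    p_c_given_xi    : shape (m,)   -- p(c | xi^i) for the m classes
    p_fk_given_c_xi : shape (m, v) -- p(f_k | c, xi^i) over v feature values
    Returns the expected KL divergence between the class-conditional and the
    marginal distribution of the candidate feature.
    """
    # marginal p(f_k | xi^i) = sum_c p(c | xi^i) p(f_k | c, xi^i)
    p_fk_given_xi = p_c_given_xi @ p_fk_given_c_xi
    kl = np.sum(p_fk_given_c_xi * np.log(p_fk_given_c_xi / p_fk_given_xi), axis=1)
    return float(p_c_given_xi @ kl)

# A feature that separates two equally likely classes scores higher than an
# uninformative one.
p_c = np.array([0.5, 0.5])
informative = np.array([[0.9, 0.1], [0.1, 0.9]])
uninformative = np.array([[0.5, 0.5], [0.5, 0.5]])
assert amifs_score_discrete(p_c, informative) > amifs_score_discrete(p_c, uninformative)
```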


In our algorithm, the first feature is selected independently of the testing sample $\xi$ and should maximize the mutual information with the classes:
\[
\alpha_1 = \arg\max_k I(C;F_k), \quad F_k \in \{F_1, \ldots, F_n\}.
\tag{7}
\]
The scheme becomes adaptive only after the first feature is selected and the value it takes on the testing sample is known.

Stopping Rule: Ideally, the algorithm could be stopped as soon as one of the classes has been unambiguously identified. In practice, this is not possible and other stopping criteria have to be used, e.g. a minimum amount of additional information that the next feature brings, or simply a maximum number of iterations. However, in this paper we shall not address the issue of stopping rules.

2.2 Estimation of the Selection Criterion

The selection criterion (4) can be rewritten as
\[
\alpha_{i+1} = \arg\max_k \left( \sum_{j=1}^{m} p(c_j|\xi^i) \int p(f_k|\xi^i, c_j) \log\frac{p(f_k, \xi^i|c_j)\,p(c_j)\,p(\xi^i)}{p(c_j, \xi^i)\,p(f_k, \xi^i)}\,df_k \right).
\tag{8}
\]

The pdfs under the logarithm that do not depend on $f_k$, and therefore do not contribute to the $\arg\max_k$, can be dropped. Thus, we obtain
\[
\alpha_{i+1} = \arg\max_k \left( \sum_{j=1}^{m} p(c_j|\xi^i)\, \mathbb{E}_{p(f_k|\xi^i, c_j)}\!\left[ \log\frac{p(f_k, \xi^i|c_j)}{p(f_k, \xi^i)} \right] \right).
\tag{9}
\]
The expression (9) requires the estimation of multivariate pdfs as well as of a conditional expectation over a multivariate pdf.

2.2.1 Kernel Density Method

In our case, we solve both problems with the kernel method, a nonparametric smoothing technique developed by Rosenblatt (Rosenblatt, 1956) and Parzen (Parzen, 1962).

Density Estimation: For a training set consisting of $T$ independently and identically distributed (iid) $n$-dimensional samples $X = \{x_1, \ldots, x_T\}$, $x_i \in \mathbb{R}^n$, the kernel density estimate (KDE) of the pdf, $\hat{p}(y)$, is
\[
\hat{p}(y) = \Big(T \prod_{j=1}^{n} h_j\Big)^{-1} \sum_{x_i \in X} \prod_{j=1}^{n} K\!\left(\frac{y_j - x_{i,j}}{h_j}\right),
\tag{10}
\]
where $K(\cdot)$ is a univariate kernel function, $h_j$ is a kernel bandwidth parameter and $x_{i,j}$ is the value of the $j$th feature of the sample $x_i$. Here, we use a so-called product kernel, which is a commonly used simplification of the general multivariate kernel. Since the quality of the density estimate does not particularly depend on the choice of the kernel, for convenience we restrict ourselves to Gaussian kernels $K(w) = \frac{1}{\sqrt{2\pi}} \exp\!\big(-\frac{w^2}{2}\big)$.
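A direct NumPy transcription of the product-kernel estimate (10) with Gaussian kernels might look as follows (a sketch; the bandwidth vector `h` is assumed to be given, e.g. by rule (11) below):

```python
import numpy as np

def gaussian_kernel(w):
    return np.exp(-0.5 * w ** 2) / np.sqrt(2.0 * np.pi)

def kde(y, X, h):
    """Product-kernel density estimate (10) at a single query point.

    y : (n,)   query point
    X : (T, n) training samples
    h : (n,)   per-dimension bandwidths
    """
    T = X.shape[0]
    # product over dimensions of the univariate kernels, then sum over samples
    contributions = np.prod(gaussian_kernel((y - X) / h), axis=1)
    return contributions.sum() / (T * np.prod(h))
```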

Bandwidth Selection: The bandwidth parameters $h_j$ control the smoothness of the estimated density. Setting them too large, all details of the density structure are lost, whereas setting them too small leads to a highly variable estimate with many false peaks around every sample point. Therefore, a proper choice of the bandwidth parameters is important. We only briefly mention the bandwidth selection method that we used; for details and an overview of other methods see (Turlach, 1993).

The normal reference rule (Silverman, 1986) is one of the simplest methods; it is based on the asymptotic mean integrated squared error between the true and estimated densities and assumes that the data is Gaussian. The method produces good estimates for univariate densities but tends to oversmooth in multivariate cases. Among the more sophisticated methods that can easily be extended to multivariate densities are Markov chain Monte Carlo (MCMC) methods. They estimate a bandwidth matrix through the data likelihood using cross-validation and are reported to perform well, e.g. see (Zhang et al., 2004).

In higher dimensions data become sparser and tend to move away from the modes of the distribution (Scott, 1992). Therefore, the bandwidth parameters of the kernel functions should be adjusted to the data dimensionality so that the estimates are based on a sufficient number of data points. In our case, the dimension of the estimated densities grows iteratively. Moreover, we estimate joint densities of different feature subsets. Ideally, one would have to select a unique optimal bandwidth vector for every feature combination of every cardinality. Since this is computationally infeasible, we pick the normal reference rule, which does not require any optimization and automatically gives a bandwidth depending on the dimension of the estimated density. So the bandwidth for feature $F_i$ is defined as
\[
h_i = \left(\frac{4}{d+2}\right)^{\frac{1}{d+4}} \sigma_i\, T^{-\frac{1}{d+4}},
\tag{11}
\]
where $d$ is the dimension of the estimated multivariate density, $\sigma_i$ is the standard deviation of the data points and $T$ is the number of training samples.
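In code, rule (11) amounts to one line per feature; a minimal sketch:

```python
import numpy as np

def normal_reference_bandwidths(X, d):
    """Bandwidths according to the normal reference rule (11).

    X : (T, n) training samples; one bandwidth is returned per column.
    d : dimension of the multivariate density that is being estimated
        (here, the size of the current feature subset plus the candidate).
    """
    T = X.shape[0]
    sigma = X.std(axis=0, ddof=1)
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * sigma * T ** (-1.0 / (d + 4))
```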


2.2.2 Conditional Expectation

We estimate the conditional expectation over the multivariate pdf $p(f_k|\xi^i, c_j)$ using a kernel-based estimator as well. Let us consider a training set $X = \{(x_1, y_1), \ldots, (x_T, y_T)\}$, where $x_i$ and $y_i$ are realizations of $n_x$- and $n_y$-dimensional continuous random variables $x$ and $y$, respectively. Suppose one needs to estimate the expectation of some function $g(x)$ over the conditional distribution $p(x|y=a)$, where $a$ is a particular observation of the variable $y$. Then, using the nonparametric kernel regression estimator proposed by Nadaraya (1964) and Watson (1964), the conditional expectation of $g(x)$ is:
\[
\mathbb{E}_{p(x|y=a)}[g(x)] =
\frac{\Big(T \prod_{j=1}^{n_y} h_j\Big)^{-1} \sum_{x_i \in X} \prod_{j=1}^{n_y} K_j(a, y_i)\, g(x_i)}
     {\Big(T \prod_{j=1}^{n_y} h_j\Big)^{-1} \sum_{x_i \in X} \prod_{j=1}^{n_y} K_j(a, y_i)},
\tag{12}
\]
where $(h_1, \ldots, h_{n_y})$ is a bandwidth vector of the kernel for the variable $y$ and $K_j(a, y_i)$ denotes $K\!\left(\frac{a_j - y_{i,j}}{h_j}\right)$. Note that the denominator is the KDE of $\hat{p}(y = a)$.
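A sketch of (12) with product Gaussian kernels; the kernel normalization constants and the factor $(T\prod_j h_j)^{-1}$ cancel between numerator and denominator and are therefore omitted:

```python
import numpy as np

def conditional_expectation(g_values, Y, a, h):
    """Nadaraya-Watson estimate of E[g(x) | y = a], Eq. (12).

    g_values : (T,)     g(x_i) evaluated at the training samples
    Y        : (T, ny)  conditioning variables y_i of the training samples
    a        : (ny,)    observed value of y
    h        : (ny,)    bandwidth vector for y
    """
    weights = np.prod(np.exp(-0.5 * ((a - Y) / h) ** 2), axis=1)
    return float(weights @ g_values / weights.sum())
```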

Plugging (12) into the selection criterion (9), we have:
\[
\alpha_{i+1} = \arg\max_k \Bigg( \sum_{j=1}^{m} \frac{p(c_j|\xi^i)}{p(\xi^i|c_j)} \Big(T_j \prod_{q=1}^{i} h_{\alpha_q}\Big)^{-1}
\sum_{x_r \in X_j} \prod_{q=1}^{i} K_{\alpha_q}(\xi, x_r) \log\frac{p(f_k = x_{r,k}, \xi^i|c_j)}{p(f_k = x_{r,k}, \xi^i)} \Bigg),
\tag{13}
\]
where $X_j$ is the subset of the training samples belonging to the class $c_j$ and $T_j = |X_j|$.

Note that the expression in the first fraction simplifies just to $p(c_j)$, because $\frac{p(c_j|\xi^i)}{p(\xi^i|c_j)} = \frac{p(c_j)}{p(\xi^i)}$ and $p(\xi^i)$ can be dropped as it does not influence the $\arg\max_k$. Finally, using the kernel method to estimate the densities, and after some simple algebraic transformations, the expression (13) takes the form:
\[
\alpha_{i+1} = \arg\max_k \Bigg( \sum_{j=1}^{m} p(c_j)\, T_j^{-1}
\sum_{x_r \in X_j} \prod_{q=1}^{i} K_{\alpha_q}(\xi, x_r)
\log\frac{\sum_{x_s \in X_j} K_k(x_r, x_s) \prod_{q=1}^{i} K_{\alpha_q}(\xi, x_s)}
         {\sum_{x_u \in X} K_k(x_r, x_u) \prod_{q=1}^{i} K_{\alpha_q}(\xi, x_u)} \Bigg).
\tag{14}
\]

The expression under the logarithm measures the ratio between the values of two pdfs at the point $x_r$. When the pdfs are estimated from small training sets, unreliable estimates can lead to large ratios even though there is no real evidence for them. To cope with this, we add a small value to both pdfs, which can be interpreted as smoothing them with an improper base distribution. The smoothing should be adjusted to the current dimension of the pdf; thus, we take it to be proportional to the maximum response of the product kernel of the selected features over all training points. Such a simple smoothing works fine for our problem, since we do not need precise values of the criterion in (14), but rather want to find the feature that maximizes it.
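Putting the pieces together, a compact (and deliberately unoptimized) sketch of the criterion (14) with the smoothing described above could look as follows; the smoothing factor `eps_scale` is a hypothetical knob, not a value taken from our experiments:

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2)  # unnormalized; constants cancel in the argmax of (14)

def amifs_scores(X, y, xi, selected, h, eps_scale=1e-3):
    """Score every candidate feature according to criterion (14).

    X : (T, n) training samples, y : (T,) class labels, xi : (n,) testing sample,
    selected : indices alpha_1..alpha_i, h : (n,) per-feature bandwidths.
    Already selected features receive a score of -inf.
    """
    T, n = X.shape
    classes = np.unique(y)
    # product-kernel responses of the selected features, prod_q K_{alpha_q}(xi, x)
    if selected:
        w_xi = np.prod(gauss((xi[selected] - X[:, selected]) / h[selected]), axis=1)
    else:
        w_xi = np.ones(T)
    eps = eps_scale * w_xi.max()          # smoothing proportional to the maximum response
    scores = np.full(n, -np.inf)
    for k in range(n):
        if k in selected:
            continue
        total = 0.0
        for c in classes:
            idx_c = np.flatnonzero(y == c)
            p_c = len(idx_c) / T
            # K_k(x_r, x_.) for all x_r in X_j (rows) and all training samples (columns)
            Kk = gauss((X[idx_c, k][:, None] - X[:, k][None, :]) / h[k])
            num = Kk[:, idx_c] @ w_xi[idx_c] + eps    # sum over x_s in X_j
            den = Kk @ w_xi + eps                     # sum over x_u in X
            total += p_c / len(idx_c) * np.sum(w_xi[idx_c] * np.log(num / den))
        scores[k] = total
    return scores
```

The next feature for the sample `xi` is then `int(np.argmax(amifs_scores(...)))`; re-running this scoring for every testing sample is exactly what makes the scheme adaptive.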

3 EXPERIMENTS

Here, we provide an experimental comparison of our method with two feature selection algorithms based on CMI: the Parzen window feature selector (PWFS) (Kwak and Choi, 2002) and the active testing model (ATM) (Geman and Jedynak, 1996). In our terminology, PWFS is a static selection scheme (1). It is based on the conventional CMI estimated with the kernel method. ATM is a feature selector based on the adaptive CMI which uses the simplifying assumption that features are conditionally independent given a class. Since the estimation of the selection criterion proposed by Geman and Jedynak was problem-specific, here we use just the general idea of their method. That is, in our experiments ATM selects features as follows:
\[
\alpha_{i+1} = \arg\max_k \sum_{j=1}^{m} \frac{p(c_j) \prod_{q=1}^{i} p(f_{\alpha_q} = \xi_{\alpha_q}|c_j)}{T_j}
\sum_{x_r \in X_j} \log\frac{p(f_k = x_{r,k}|c_j)}{\sum_{v=1}^{m} p(c_v)\, p(f_k = x_{r,k}|c_v) \prod_{q=1}^{i} p(f_{\alpha_q} = \xi_{\alpha_q}|c_v)}.
\tag{15}
\]
To make a fair comparison, all criteria are estimated using KDE with the same bandwidth vector as chosen by the normal reference rule (11).
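Under the conditional-independence assumption only one-dimensional class-conditional densities are needed, which makes ATM cheap to evaluate. A sketch consistent with (15), using one-dimensional Gaussian KDEs (constants that are identical for all candidates k are dropped, since they do not change the argmax):

```python
import numpy as np

def kde1d(query, data, h):
    """1-D Gaussian KDE of `data`, evaluated at each entry of `query`."""
    u = (np.atleast_1d(query)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def atm_scores(X, y, xi, selected, h):
    """Score candidate features with the ATM criterion (15)."""
    T, n = X.shape
    classes = np.unique(y)
    p_c = np.array([np.mean(y == c) for c in classes])
    # prod_q p(f_{alpha_q} = xi_{alpha_q} | c_v) for every class v
    lik = np.array([np.prod([kde1d(xi[q], X[y == c, q], h[q])[0] for q in selected])
                    for c in classes])
    scores = np.full(n, -np.inf)
    for k in range(n):
        if k in selected:
            continue
        total = 0.0
        for j, c in enumerate(classes):
            xr_k = X[y == c, k]            # candidate-feature values of the class c_j samples
            # p(f_k = x_{r,k} | c_v) for all x_r in X_j and all classes v
            pk = np.stack([kde1d(xr_k, X[y == cv, k], h[k]) for cv in classes], axis=1)
            ratio = pk[:, j] / (pk @ (p_c * lik))
            total += p_c[j] * lik[j] / len(xr_k) * np.sum(np.log(ratio))
        scores[k] = total
    return scores
```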

3.1 Artificial Data Set

For the first experiment, we artificially constructed a data set for image classification. It contains pixel-based black-and-white images of digits belonging to 10 different classes. First, we constructed four distinct examples of every class. From this data set


we generated a new one with 1000 samples by randomly adding 5 pixels of noise to the original images (Fig. 1). Further, we formed 20 training sets, with 30 and 300 samples each, and one testing set containing 100 samples, by randomly selecting an equal number of samples from each class.

Figure 1: Examples of original and noisy digits.

In our setup, each image is described by a vector of complex features. These, in turn, are functions of simple features of the image. Our simple features are inspired by the complex cells in the primary visual cortex discovered by D. Hubel and T. Wiesel in the 1960s (Hubel and Wiesel, 2005). Both are responsive to primitive stimuli independently of their spatial location. Here, each simple feature corresponds to a 3×3 image patch and is activated proportionally to the frequency with which the corresponding patch occurs in the image. For normalization and smoothing purposes, the patch frequencies are squashed into the interval [−1, 1] via a sigmoidal function.

The complex features correspond to 3×3 image patches as well. Their activation value is computed as a weighted sum of the activations of the simple features. The weight of the simple feature responding to the same patch is 1. For the others, the weight drops with the number of pixels that differ between the corresponding image patches, according to a Gaussian. Thus, the complex features react more robustly to pixel noise than the simple features. Since there are 9 binary pixels in each 3×3 patch, an image is described by a vector of $2^9 = 512$ complex feature values.
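As an illustration of this feature construction, the following sketch follows our reading of the description; the exact squashing function and the Gaussian width `sigma` are assumptions, not values given above:

```python
import numpy as np
from itertools import product

PATCHES = np.array(list(product([0, 1], repeat=9)))   # all 512 binary 3x3 patches

def simple_features(img):
    """Relative patch frequencies squashed into [-1, 1] by a sigmoidal function."""
    h, w = img.shape
    counts = np.zeros(len(PATCHES))
    for r in range(h - 2):
        for c in range(w - 2):
            patch = img[r:r + 3, c:c + 3].astype(int).ravel()
            counts[int("".join(map(str, patch)), 2)] += 1
    freq = counts / counts.sum()
    return 2.0 / (1.0 + np.exp(-freq)) - 1.0           # sigmoidal squashing (assumed form)

def complex_features(simple, sigma=1.0):
    """Gaussian-weighted mixing of simple features by patch Hamming distance."""
    hamming = (PATCHES[:, None, :] != PATCHES[None, :, :]).sum(axis=2)
    weights = np.exp(-hamming ** 2 / (2.0 * sigma ** 2))  # weight 1 for identical patches
    return weights @ simple
```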

As a classifier we used the weighted k-nearest neighbor algorithm (wk-NN). It assigns a class to a testing sample based on a distance-weighted vote of the k nearest training samples. The wk-NN is one of the simplest classifiers, but the fact that it does not need training is useful here because the adaptive scheme requires running the classifier multiple times with different features. We used k = 20, hand-tuned on validation sets.
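A minimal sketch of such a distance-weighted vote (the inverse-distance weighting is our assumption; the exact weighting function is not specified above):

```python
import numpy as np

def wknn_predict(X_train, y_train, x, k=20, eps=1e-12):
    """Distance-weighted k-NN vote for a single testing sample x."""
    d = np.linalg.norm(X_train - x, axis=1)
    votes = {}
    for i in np.argsort(d)[:k]:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)
```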

To investigate the usefulness of the proposed AMIFS, we ran experiments on training sets with T = 30 and 300 samples. Note that all sets have fewer training samples than features, which easily leads to overfitting. The classification errors were evaluated on separate testing samples and compared with the cases when feature selection was done using PWFS or ATM, and when the classifier was run on the full feature vector, i.e. without feature selection (Fig. 2). All results are averaged over 20 runs with the different training sets.

One clearly sees the advantage of using an adaptive scheme for feature selection. Not only does the error rate drop very quickly with an increasing number of features, it even goes below the error that the classifier achieves when using all available features. In all our simulations, this effect never occurred for the static scheme PWFS and was particularly pronounced when using an extremely small number of training samples (T = 30), i.e. when the classifier is prone to overfitting. Furthermore, our algorithm outperforms the ATM scheme, which assumes conditional independence of the features. Thus, especially at the beginning, i.e. when selecting the first few features, it is beneficial to take dependencies between features into account.

Figure 2: Error against the number of features for digit classification (panel A: T = 30; panel B: T = 300; error in % for AMIFS, PWFS, ATM and the full feature vector). The black markers indicate regions where AMIFS is significantly better than ATM according to the Wilcoxon signed-rank test at the p-level 0.05.

Further, we test the ability of the considered schemes to select informative features in high dimensions for the case T = 300. For this, we start with initial feature subsets of size 50 and 100, which are preselected by PWFS, and then select further features according to the different algorithms. The results (Fig. 3) show that both adaptive schemes find additional features that are markedly better than the


statically selected ones. One can also see that at some point ATM, the adaptive scheme assuming conditional independence of the features given a class, starts outperforming AMIFS. This suggests that beyond a certain dimension AMIFS is no longer able to correctly estimate the high-order dependencies between the features. Interestingly, when AMIFS selects the features from the beginning (see Fig. 2), it performs better than ATM almost up to 200 features, meaning that the first good features can compensate for unreliable pdf estimates further on in higher dimensions. Based on this observation, one could think of a combined scheme that starts with AMIFS and, after selecting some features, switches to ATM.

Figure 3: Comparison of the ability to add informative features to subsets of 50 (panel A) and 100 (panel B) features preselected by PWFS; error in % against the number of features for AMIFS, ATM and PWFS.

3.2 MNIST Data Set

We compared the performance of PWFS, ATM and our AMIFS on a real-world data set, the MNIST database of handwritten digits (LeCun and Cortes, n.d.). The images are 28×28 pixels, black and white, size-normalized and centered. The original training and testing sets consist of 60,000 and 10,000 samples, respectively.

The features were learned by LeNetConvPool (Bergstra et al., n.d.), a convolutional neural network based on the LeNet5 architecture, which was originally proposed by LeCun (LeCun et al., 1998). Convolutional networks are biologically inspired multilayered neural networks. In order to achieve some degree of invariance to location, scale and distortion, they imitate the arrangement and properties of simple and complex cells in the primary visual cortex by implementing local filters of increasing size, shared weights and spatial subsampling.

LeNetConvPool consists of 6 layers: 4 successive convolutional and down-sampling layers (C- and S-layers), a hidden fully connected layer and a logistic regressor as a classifier. The C-layers consist of several feature maps with overlapping 5×5 linear filters, so every filter receives its input from a 5×5 region of the previous layer, computes a weighted sum and passes it through a sigmoidal function. The S-layers perform max-pooling with 2×2 non-overlapping filters; that is, the output of such a filter is the maximum activation of the units in the 2×2 region of the corresponding feature map in the previous C-layer. For both types of layers, all filters share the same weight parameters within one feature map. The first C- and S-layers have 20 feature maps, the next ones 50. The succeeding hidden layer, which is fully connected to all units of all feature maps in the previous S-layer, has 500 units with a sigmoidal activation function. The last, classification layer consists of 10 units, according to the number of classes, and performs a logistic regression. The weight parameters of all layers are learned using gradient descent. For all implementation details see (Bergstra et al., n.d.).
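For illustration, an analogous architecture can be written down in a few lines of PyTorch; this is our own sketch of the description above, not the Theano-based LeNetConvPool code of the tutorial:

```python
import torch.nn as nn

class LeNetConvPoolLike(nn.Module):
    """Sketch of the described architecture for 28x28 single-channel inputs."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.Sigmoid(),   # C1: 20 maps, 24x24
            nn.MaxPool2d(2),                                 # S1: 20 maps, 12x12
            nn.Conv2d(20, 50, kernel_size=5), nn.Sigmoid(),  # C2: 50 maps, 8x8
            nn.MaxPool2d(2),                                 # S2: 50 maps, 4x4
            nn.Flatten(),
            nn.Linear(50 * 4 * 4, 500), nn.Sigmoid(),        # hidden layer: the 500 features
        )
        self.classifier = nn.Linear(500, n_classes)          # removed after training

    def forward(self, x):
        return self.classifier(self.features(x))
```

After training, `model.features(x)` yields the 500-dimensional vector that serves as the pool of initial features for the feature selectors.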

We trained LeNetConvPool on 15 training sets with 5,000 samples each. After that, the last classification layer was removed and the resulting networks with 500 output units were used as feature extractors. These units are the initial features for the feature selectors. Then, from every training set we formed 2 sets of different size, with T = 100 and T = 300 samples, which were used for feature selection and for classification. We use a different amount of training data for feature extraction than for the subsequent feature selection and classification in order to model a situation when one has good features but not enough training data to build an efficient classifier. As a classifier, we used an unweighted k-NN with k = 5 (again, hand-tuned on validation sets), which in contrast to wk-NN uses a simple majority vote. For computational reasons, the testing set was reduced to 500 samples, randomly selected from the original MNIST testing set with an equal number of samples per class.

Overall, all algorithms show a similar behavior as on the artificial data set (see Fig. 4). The smaller differences can be attributed to the better available features, as reflected in the much lower error rates achieved with the features tuned by LeNetConvPool.


Figure 4: Error against the number of features for MNIST classification (panel A: T = 100, k-NN on 100 samples; panel B: T = 300, k-NN on 300 samples; error in % for AMIFS, PWFS, ATM and the full feature vector). k-NN was run on the same set as feature selection; the black markers indicate regions where AMIFS is significantly better than ATM according to the Wilcoxon signed-rank test at the p-level 0.05.

Again, AMIFS outperforms ATM on the first selected features, and both adaptive schemes provide some robustness against overfitting.

To see whether feature selection is as beneficial when the classifier is well trained, we repeated the experiments with a training set of 5,000 samples. However, as in the previous experiment, the feature selection was done on the small sets of 100 and 300 samples for computational reasons.

Fig. 5 shows that for this particular example one needs approximately 200 features to achieve the minimum error. However, there is no advantage in using any sophisticated feature selection algorithm, and one can see that the size of the training set used for selecting features does not have much influence either. Moreover, even random selection works about as well as the other methods. We do not want to generalize the results of this test by saying that for large data sets one can always select features randomly. We rather emphasize that for small data sets one can achieve better performance with features selected adaptively with our AMIFS.

Figure 5: Error against the number of features for MNIST classification with k-NN run on 5,000 training samples; feature selection on A: 100 and B: 300 training samples (error in % for AMIFS, PWFS, ATM, random selection and the full feature vector).

4 DISCUSSION

Feature selection is a standard technique to reduce data dimensionality. In high-dimensional spaces this can be an efficient way to cope with limited amounts of training data. Usually, features are selected in a preprocessing step. In contrast, we propose an adaptive scheme for feature selection, where each feature is selected so as to maximize the expected mutual information with the class for the given data point, i.e. conditioned on the values of the features already considered.

Despite the fact that estimating mutual information in high-dimensional spaces is a difficult problem in its own right, we find that adaptive feature selection robustly improves the classification performance. In the considered examples, a small number of features is sufficient to achieve a good classification. Since the first few features can be detected reliably, our method does not overfit and can even compensate for shortcomings of the classifier. Our results on both artificial and real-world data show that in the case of limited training data, when a classifier is usually prone to overfitting, AMIFS can even improve the error rate compared to using all available features.

Even though the algorithm is less advantageous


on large data sets, we believe that this is not a shortcoming, but merely shows that the need to select features is less pressing if enough data are available. From the point of view of computational expense, in order to make AMIFS more applicable to large amounts of data, one would have to devise an approximate implementation that cuts down the computational complexity or, for example, consider a hybrid scheme, i.e. starting with AMIFS and after some iterations switching to ATM, which does not require estimating multivariate densities and is therefore computationally cheaper.

In the future, we want to develop a neural implementation of our feature selection scheme. The brain certainly faces a similar problem when it has to decide which features are really relevant to classify a new observation. A neural model could thus provide insights into how this ability can be achieved. Furthermore, we would like to investigate to what extent information theory provides guiding principles for information processing in the brain. In addition, adaptive feature selection could be accomplished via recurrent processing interleaving bottom-up and top-down processes.

REFERENCES

Abe, S. (2005). Modified backward feature selection by cross validation. In Proc. of the Thirteenth European Symposium on Artificial Neural Networks, pages 163–168, Bruges, Belgium.

Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks, 5(4):537–550.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (n.d.). Deep learning tutorials. Retrieved from: http://deeplearning.net/tutorial/lenet.html.

Bonnlander, B. V. (1996). Nonparametric selection of input variables for connectionist learning. PhD thesis, University of Colorado at Boulder.

Bonnlander, B. V. and Weigend, A. S. (1994). Selecting input variables using mutual information and nonparametric density estimation. In International Symposium on Artificial Neural Networks, pages 42–50, Taiwan.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York, NY, USA. pp. 12–49.

Ding, C. H. Q. and Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. J. Bioinformatics and Computational Biology, 3(2):185–206.

Duch, W., Wieczorek, T., Biesiada, J., and Blachnik, M. (2004). Comparison of feature ranking methods based on information entropy. In Proc. of the IEEE International Joint Conference on Neural Networks, pages 1415–1419, Budapest, Hungary.

Geman, D. and Jedynak, B. (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(1):1–14.

Hubel, D. and Wiesel, T. (2005). Brain and Visual Perception: The Story of a 25-Year Collaboration. Oxford University Press US. p. 106.

Jiang, H. (2008). Adaptive feature selection in pattern recognition and ultra-wideband radar signal analysis. PhD thesis, California Institute of Technology.

Kwak, N. and Choi, C. (2002). Input feature selection by mutual information based on Parzen window. IEEE Trans. Pattern Analysis and Machine Intelligence, 24:1667–1671.

LeCun, Y. and Cortes, C. (n.d.). The MNIST database of handwritten digits. Retrieved from: http://yann.lecun.com/exdb/mnist/.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324.

Najemnik, J. and Geisler, W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434:387–391.

Narendra, P. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Trans. Comput., 28(2):917–922.

Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 35:1065–1076.

Raudys, S. J. and Jain, A. K. (1991). Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13:252–264.

Renninger, L. W., Verghese, P., and Coughlan, J. (2007). Where to look next? Eye movements reduce local uncertainty. Journal of Vision, 7(3):1–17.

Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27:832–837.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley. pp. 125–206.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.

Turlach, B. A. (1993). Bandwidth selection in kernel density estimation: a review. In CORE and Institut de Statistique, pages 23–493.

Webb, A. (1999). Statistical Pattern Recognition. Arnold, London. pp. 213–226.

Zhang, X., King, M. L., and Hyndman, R. J. (2004). Bandwidth selection for multivariate kernel density estimation using MCMC. Technical report, Monash University.
