3 EXPERIMENTS
To illustrate the differences between simpler and more
complex FS methods, we have collected experimental
results under various settings, covering two different
classifiers, three FS search algorithms and eight datasets
with dimensionalities ranging from 13 to 65 and numbers
of classes ranging from 2 to 6. We used three
different mammogram datasets, the wine and
wave datasets from the UCI Repository (Asuncion and
Newman, 2007), the satellite image dataset from the ELENA
database (ftp.dice.ucl.ac.be), speech data from British
Telecom and sonar data (Gorman and Sejnowski,
1988). For details see Tables 1 to 8.
Note that the choice of classifier and/or FS setup
may not be optimal for each dataset, and thus the reported
results may be inferior to those reported in the literature;
the purpose of our experiments is solely the mutual
comparison of FS methods. All experiments were
performed with 10-fold cross-validation used to split
the data into training and testing parts (denoted
“Outer CV” in the following), while the training parts
were further split by means of another 10-fold
CV into actual training and validation parts for the
purpose of feature selection and classifier training
(denoted “Inner CV”).
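For illustration only, this nested protocol can be sketched as follows (Python; select_features is a hypothetical placeholder standing for any of the wrapper FS searches, not the implementation actually used in our experiments):

# Sketch of the Outer/Inner CV protocol described above.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def nested_cv_accuracy(X, y, select_features, n_folds=10, seed=0):
    outer = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X):             # Outer CV split
        X_tr, y_tr = X[train_idx], y[train_idx]
        inner = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
        subset = select_features(X_tr, y_tr, inner)        # FS guided by Inner CV
        clf = KNeighborsClassifier(n_neighbors=3)          # e.g. the 3-NN classifier
        clf.fit(X_tr[:, subset], y_tr)
        scores.append(clf.score(X[test_idx][:, subset], y[test_idx]))
    return float(np.mean(scores))                          # Outer-CV accuracy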
The application of SFS and SFFS was straightforward.
The OS algorithm, being the most flexible procedure,
has been used in two set-ups: a slower randomized
version and a faster deterministic version. In both
cases the cycle depth was set to 1 [see (Somol and Pudil,
2000) for details]. The randomized version, denoted
in the following as OS(1,r3), is called repeatedly with
random initialization until no improvement has
been found in the last 3 runs. The deterministic version,
denoted as OS(1,IB) in the following, is initialized by
means of Individually Best (IB) feature selection.
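The restart logic of OS(1,r3) can be summarized by the following sketch (oscillating_search is a hypothetical black box standing for the OS search itself; the parameter names are ours):

# Rerun OS from random initial subsets; stop once the last 3 runs
# (patience) have brought no improvement of the criterion value.
import random

def os_1_r3(oscillating_search, n_features, subset_size, patience=3):
    best_subset, best_value = None, float("-inf")
    runs_without_improvement = 0
    while runs_without_improvement < patience:
        init = random.sample(range(n_features), subset_size)     # random init
        subset, value = oscillating_search(init, cycle_depth=1)  # one OS run
        if value > best_value:
            best_subset, best_value = subset, value
            runs_without_improvement = 0
        else:
            runs_without_improvement += 1
    return best_subset, best_value

# OS(1,IB) would instead call oscillating_search once, initialized with
# the subset of individually best-ranked features.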
The problem of determining the optimal feature subset
size was solved in all experiments by brute force:
whenever needed, each algorithm was applied repeatedly
for all possible subset sizes. The final result
was taken to be the subset yielding the highest classification
accuracy (and the lowest subset size in case of
ties).
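This size-selection step amounts to the following loop (fs_search(d) is a hypothetical interface returning a subset of size d together with its Inner-CV accuracy):

# Run the FS search for every target size and keep the best result;
# iterating from the smallest size upward with a strict comparison
# implements the "lowest subset size in case of ties" rule.
def best_over_all_sizes(fs_search, n_features):
    best_subset, best_acc = None, float("-inf")
    for d in range(1, n_features + 1):       # all possible subset sizes
        subset, acc = fs_search(d)
        if acc > best_acc:
            best_subset, best_acc = subset, acc
    return best_subset, best_acc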
3.1 Notes on Obtained Results
All tables clearly show that the more modern methods
are capable of finding criterion values closer to the
optimum – see the Inner-CV column in each table.
The effect pointed out by Reunanen (Reunanen,
2003), namely the simple SFS outperforming all more
complex procedures in terms of the ability to generalize,
can be observed in Table 4, column Outer-CV, with the
Gaussian classifier. Note the low consistency in this case.
Conversely, Table 2 shows an equally outstanding
performance of OS with the 3-Nearest Neighbor classifier
(3-NN), with better consistency and the smallest subsets
found, while Table 3 shows the top performance of SFFS
with both the Gaussian and 3-NN classifiers. Although it
is impossible to draw decisive conclusions from such a
limited set of experiments, it is of interest to
extract some statistics (all on independent test data –
results in the column Outer-CV):
• Best result among FS methods for each given clas-
sifier: SFS 11×, SFFS 17×, OS 11×.
• Best achieved overall classification accuracy for
each dataset: SFS 1×, SFFS 5×, OS 2×.
Average classifier accuracies:
• Gaussian: SFS 0.652, SFFS 0.672, OS 0.663.
• 1-NN: SFS 0.361, SFFS 0.361, OS 0.349.
• 3-NN: SFS 0.762, SFFS 0.774, OS 0.765.
4 DISCUSSION AND
CONCLUSIONS
With respect to FS we can distinguish the following
entities, all of which affect the resulting classification
performance: search algorithms, stopping criteria,
feature subset evaluation criteria, the data and the
classifier. The impact of the FS process on the final
classifier performance (with our interest naturally
targeted at its generalization performance, i.e., its
ability to classify previously unseen data) depends on
all of these entities.
When comparing pure search algorithms as such,
there is enough ground (both theoretical and experimental)
to claim that newer, often more complex, methods
have a better potential of finding better solutions.
This often follows directly from the method
definition, as newer methods are often designed to
remedy some particular weakness of older ones. (Unlike
IB, SFS takes into account inter-feature dependencies.
Unlike SFS, +L-R does not suffer from the nesting
problem. Unlike +L-R, Floating Search does not depend
on pre-specified user parameters. Unlike Floating
Search, OS may avoid local extrema by means of
randomized initialization, etc.) A better solution,
however, means in this context merely being closer to the
optimum with respect to the adopted criterion. This may
not tell much about the final classifier quality, while
the choice of criterion has proved to be a considerable
problem in itself. The vast majority of practically used
criteria have only an insufficient relation to the correct
classification rate,