the experiments carried out to validate the model, the
results and a comparison with other techniques for at-
tribute selection. Finally, Section 5 presents the con-
clusions and suggests possible future work.
2 MULTIOBJECTIVE
OPTIMISATION MODEL FOR
ATTRIBUTE SELECTION
The selection of variables concerns finding the small-
est subset of variables in a database to obtain the most
accurate classification possible (Pappa et al., 2002).
Described more formally, with X being the number
of variables in an initial set T, the algorithm finds a
subset P of Y variables from the set T, where Y ≤ X,
with the aim of removing the irrelevant or redundant
variables, and obtaining good accuracy in the classifi-
cation (Aguilera et al., 2007). Therefore, the problem
of attribute selection can be approached as a multiob-
jective optimisation problem (Deb, 2001), whose solution
comprises a set of solutions called non-dominated
solutions (or Pareto solutions). A solution x dominates
another solution y if (Deb, 2001) (a minimal code sketch
of this test is given after the list):
• Solution x is not worse than y for any of the
objectives considered;
• Solution x is strictly better than y for at least one
of the objectives.
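For concreteness, the dominance test can be expressed in a
few lines of code. The following is a minimal Python sketch,
written under the assumption that every objective is expressed
so that smaller values are better; the function name dominates
and the tuple representation of objective vectors are ours,
not the paper's.

    def dominates(fx, fy):
        # fx, fy: tuples of objective values for solutions x and y,
        # with every objective to be minimised.
        no_worse = all(a <= b for a, b in zip(fx, fy))
        strictly_better = any(a < b for a, b in zip(fx, fy))
        return no_worse and strictly_better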
For the variable selection problem considered here, two
optimisation criteria have been used: accuracy
and compactness. To formulate these criteria, the fol-
lowing quantitative measures have been defined (a short
sketch of both measures is given after the list).
Given a solution x = {x_i | x_i ∈ T}:
• Accuracy. Based on the classification ratio
CR(x) = Φ(x)/N, where Φ(x) is the number of data
correctly classified for a set of variables x by a
given classification algorithm, and N is the total
number of data.
• Compactness. Given by the cardinality card(x) of the
set x, that is, the number of variables used to
construct the model.
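As an illustration, the two measures can be computed as below.
This is a minimal Python sketch with assumed helper names
(classification_ratio, compactness); Φ(x) is taken to be supplied
by an external classification algorithm, as in the text.

    def classification_ratio(num_correct, num_total):
        # CR(x) = Phi(x) / N: fraction of data correctly classified
        # using only the variables in the subset x.
        return num_correct / num_total

    def compactness(x):
        # card(x): number of variables used to construct the model.
        return len(x)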
In this way, the optimisation model proposed with
the criteria defined is the following:

    Maximize CR(x)
    Minimize card(x)        (1)
The objectives were to increase the accuracy of
the model and to reduce the number of variables as far
as possible. In some cases, as will be shown, it was
worth sacrificing accuracy slightly when the number of
variables was reduced significantly, in order to simplify
the model. As can be seen, the objectives in the optimi-
sation model (1) are contradictory, since a lower number
of significant variables means a lower classification
rate and vice versa: the greater the number of
variables, the greater the classification rate. The so-
lution to model (1) is a set of m ≤ X non-dominated
solutions C = {x_k, k ∈ S}, S = {1, ..., X}, where each
solution x_k of C represents the best collection of k sig-
nificant variables. For example, for X = 5 (5 vari-
ables to be selected), a set of non-dominated solutions
C = {x_3, x_5} means that the Pareto front is composed
of non-dominated solutions of 3 and 5 variables, re-
spectively. The solutions with 1, 2 and 4 significant
variables are not on the Pareto front and are there-
fore dominated.
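The following Python sketch shows how such a non-dominated set
could be filtered out of a pool of evaluated candidate subsets.
The candidate pool, the (subset, CR, card) tuple layout and the
function name are illustrative assumptions, not part of the
method described in the paper.

    def non_dominated(candidates):
        # candidates: list of (subset, cr, card) tuples, where cr is
        # to be maximised and card is to be minimised.
        def dominates(a, b):
            no_worse = a[1] >= b[1] and a[2] <= b[2]
            strictly_better = a[1] > b[1] or a[2] < b[2]
            return no_worse and strictly_better
        return [c for c in candidates
                if not any(dominates(o, c) for o in candidates if o is not c)]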
3 MULTIOBJECTIVE
EVOLUTIONARY
COMPUTATION FOR
ATTRIBUTE SELECTION
Three elements can be distinguished in a variable se-
lection algorithm (Aguilera et al., 2007):
• A search algorithm, which explores the space of
the variables available.
• An evaluation function, which provides a measure
of the fitness of the variables chosen. According
to how this function is designed, the selection al-
gorithms can be classified as filter models or em-
bedded models. The former use measures that
take into account the separation of classes, based
on information distance metrics, dependency met-
rics, etc., while the latter use an estimate of the ac-
curacy attained by a classification algorithm on the
selected variables (a sketch of such an embedded
evaluation is given after this list).
• A fitness function that validates the subset of vari-
ables finally chosen.
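As an illustration of an embedded evaluation function, the sketch
below estimates the classification ratio of a candidate subset by
cross-validating a classifier trained only on the selected variables.
scikit-learn and the k-nearest-neighbours classifier are arbitrary
choices made here for the example; the paper does not prescribe a
particular library or classifier.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def embedded_fitness(data, labels, subset):
        # data: NumPy array of shape (N, X); labels: class labels;
        # subset: indices of the selected variables (genes set to 1).
        if len(subset) == 0:
            return 0.0                          # an empty subset cannot classify
        clf = KNeighborsClassifier()
        scores = cross_val_score(clf, data[:, list(subset)], labels, cv=5)
        return scores.mean()                    # estimate of CR(x)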
Evolutionary Computation has been used for both
filter and embedded models. The work described here
falls into the latter category, since the accuracy and
the simplicity of the classification obtained are among
its fundamental objectives. The NSGA-II algorithm (Deb
et al., 2002), the principal components of
which are briefly described below, is used to solve
the problem described in (1).
Representation of Solutions. A binary codification
of fixed length equal to the number of variables in the
problem is used. In this way, a gene of value 1 in the