2 BACKGROUND
This section briefly reviews some of the most common unsupervised and supervised FD and FS techniques that have proven effective for many learning problems. The description is far from exhaustive, as FD and FS are two fields with a long research history. The interested reader is referred to the works of (Dougherty et al., 1995), (Kotsiantis and Kanellopoulos, 2006), (Liu et al., 2002), and (Witten and Frank, 2005) for reviews of FD methods. Reviews of FS methods can be found in (Guyon and Elisseeff, 2003), (Guyon et al., 2006), (Hastie et al., 2009), and (Escolano et al., 2009); see also the special issue at jmlr.csail.mit.edu/papers/special/feature03.html.
2.1 Feature Discretization
FD can be performed in supervised or unsupervised mode, i.e., with or without using the class labels, and aims at reducing the amount of memory needed to represent the data as well as improving classification accuracy (Witten and Frank, 2005). The supervised mode may lead, in principle, to better classifiers. In the context of unsupervised scalar FD (Witten and Frank, 2005), two efficient techniques are commonly used:
• equal-interval binning (EIB), i.e., uniform quantization with a given number of bits per feature;
• equal-frequency binning (EFB), i.e., non-uniform quantization yielding intervals such that, for each feature, the number of occurrences in each interval is the same, leading to a uniform (i.e., maximum entropy) distribution; this technique is also known as maximum entropy quantization.
In EIB, the range of values of each feature is divided into bins of equal width. It is simple and easy to implement, but it is very sensitive to outliers and may therefore lead to inadequate discrete representations. The EFB method is less sensitive to outliers: the quantization intervals are narrower in regions where the values of each feature occur more often.
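To make the distinction concrete, the following short sketch (in Python/NumPy, and not part of the cited works) discretizes a single feature with EIB and EFB; the function names and the choice of 2^b bins for b bits per feature are illustrative assumptions.

import numpy as np

def equal_interval_binning(x, n_bits):
    # EIB: uniform quantization of one feature into 2**n_bits equal-width bins.
    n_bins = 2 ** n_bits
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Passing only the interior edges to digitize yields bin indices in {0, ..., n_bins - 1}.
    return np.digitize(x, edges[1:-1])

def equal_frequency_binning(x, n_bits):
    # EFB: non-uniform quantization; each bin receives roughly the same number of samples.
    n_bins = 2 ** n_bits
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    return np.digitize(x, edges[1:-1])

Note that EFB's bin edges follow the empirical quantiles of the feature, which is why it is robust to outliers, whereas EIB's equal-width edges are stretched by extreme values.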
Recently, we have proposed (Ferreira and Figueiredo, 2011) an unsupervised scalar discretization method based on the well-known Linde-Buzo-Gray (LBG) algorithm (Linde et al., 1980). The LBG algorithm is applied individually to each feature and stopped when the MSE distortion falls below a threshold ∆ or when the maximum number of bits q per feature is reached (setting ∆ to 5% of the range of each feature and q ∈ {4, . . . , 10} were found to be adequate choices). That algorithm, named unsupervised LBG (U-LBG1), which produces a variable number of bits per feature, has been shown to lead to better classification results than EFB on different kinds of (sparse and dense) data (Ferreira and Figueiredo, 2011). The key idea of using the LBG algorithm in this context is that if a feature can be represented with low MSE, then its discrete version approximates the continuous version well, and this representation should thus be adequate for learning. Algorithm 1 presents the U-LBG1 procedure.
Algorithm 1: U-LBG1.
Input: X, n × p matrix training set (p features, n patterns).
       ∆: maximum expected distortion.
       q: the maximum number of bits per feature.
Output: X̃: n × p matrix, discrete feature training set.
        Q_1, . . . , Q_p: set of p quantizers (one per feature).
1:  for i = 1 to p do
2:    for b = 1 to q do
3:      Apply the LBG algorithm to the i-th feature to obtain a b-bit quantizer Q_b(·);
4:      Compute MSE_i = (1/n) ∑_{j=1}^{n} (X_ij − Q_b(X_ij))^2;
5:      if (MSE_i ≤ ∆ or b = q) then
6:        Q_i(·) = Q_b(·); {/* Store the quantizer. */}
7:        X̃_i = Q_i(X_i); {/* Quantize the feature. */}
8:        break; {/* Proceed to the next feature. */}
9:      end if
10:   end for
11: end for
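A minimal sketch of this procedure is given below (Python/NumPy, not the original implementation); it uses a plain one-dimensional Lloyd (k-means style) quantizer as a stand-in for the full LBG codebook-splitting algorithm, and the function and parameter names are ours.

import numpy as np

def lloyd_quantizer_1d(x, n_levels, n_iter=50):
    # Simple 1-D Lloyd quantizer, used here as a stand-in for LBG.
    codebook = np.quantile(x, np.linspace(0.0, 1.0, n_levels + 2)[1:-1])
    for _ in range(n_iter):
        idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
    return codebook, codebook[idx]

def u_lbg1(X, delta_frac=0.05, q=4):
    # Per feature, grow the number of bits b until the MSE distortion drops
    # below Delta (here 5% of the feature's range, as suggested in the text)
    # or the maximum number of bits q is reached.
    n, p = X.shape
    X_disc = np.empty_like(X, dtype=float)
    quantizers = []
    for i in range(p):
        x = X[:, i]
        delta = delta_frac * (x.max() - x.min())
        for b in range(1, q + 1):
            codebook, xq = lloyd_quantizer_1d(x, 2 ** b)
            mse = np.mean((x - xq) ** 2)
            if mse <= delta or b == q:
                X_disc[:, i] = xq
                quantizers.append(codebook)
                break
    return X_disc, quantizers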
It has been found that unsupervised FD methods tend to perform well in conjunction with several classifiers; in particular, the EFB method in conjunction with naïve Bayes (NB) classification produces very good results (Witten and Frank, 2005). It has also been found that applying FD with both EIB and EFB to microarray data, in conjunction with support vector machine (SVM) classifiers, yields good results (Meyer et al., 2008).
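As a usage illustration only (not taken from the cited works), equal-frequency discretization followed by a naïve Bayes classifier can be assembled from off-the-shelf components; the scikit-learn classes and the choice of 16 bins per feature below are assumptions about one possible setup.

import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy continuous features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy binary labels

# Equal-frequency binning (strategy='quantile') into 16 bins per feature.
disc = KBinsDiscretizer(n_bins=16, encode='ordinal', strategy='quantile')
X_disc = disc.fit_transform(X).astype(int)

# Naive Bayes over the resulting discrete (categorical) features.
clf = CategoricalNB().fit(X_disc, y)
print(clf.score(X_disc, y))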
There are also many supervised approaches to feature discretization. (Fayyad and Irani, 1993) applied an entropy minimization heuristic to choose the cut points, and thus the discretization intervals; their experimental results show that this method leads to better decision trees than earlier methods. An efficient FD algorithm for use in the construction of Bayesian belief networks (BBN) was proposed by (Clarke and Barton, 2000); the partitioning minimizes the information loss, relative to the number of intervals used to represent the variable, and can be carried out prior to BBN construction or extended to repartitioning during construction. A supervised static, global, incremental, and top-down discretization algorithm based on the class-attribute contingency coefficient was proposed by (Tsai et al., 2008).
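To illustrate the general idea behind entropy-based supervised discretization (a simplified sketch, not the exact MDL criterion of (Fayyad and Irani, 1993)), the following function selects a single binary cut point for one feature by minimizing the weighted class entropy of the two induced intervals; the helper names are ours.

import numpy as np

def class_entropy(y):
    # Empirical entropy (in bits) of the class labels y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_binary_cut(x, y):
    # One step of an entropy-minimization discretizer: choose the cut point on
    # feature x that minimizes the weighted class entropy of the two resulting
    # intervals; recursing on each interval yields a full discretization.
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut, best_h = None, np.inf
    for j in range(1, len(x)):
        if x[j] == x[j - 1]:
            continue  # candidate cuts lie between distinct consecutive values
        left, right = y[:j], y[j:]
        h = (len(left) * class_entropy(left) + len(right) * class_entropy(right)) / len(y)
        if h < best_h:
            best_h, best_cut = h, (x[j] + x[j - 1]) / 2.0
    return best_cut, best_h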
Very recently, a supervised discretization algo-