and $\varepsilon_X(n,K)$, twice the distance (usually chosen as the Euclidean distance) from the $n$-th observation in $X$ to its $K$-th NN. Two slightly different estimators are then derived, the most popular of which is:
$$\hat{I}(X;Y) = \psi(N) + \psi(K) - \frac{1}{K} - \frac{1}{N}\sum_{i=1}^{N}\left(\psi(\tau_i^x) + \psi(\tau_i^y)\right) \qquad (10)$$
where $\tau_i^x$ is the number of points whose distance from $x_i$ is not greater than $0.5 \times \varepsilon(n,K) = 0.5 \times \max(\varepsilon_X(n,K), \varepsilon_Y(n,K))$. By avoiding the evaluation of high-dimensional pdfs, the hope is to reach better results than with the previously introduced estimators.
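Concretely, Eq. (10) can be computed with a KD-tree over the joint space. Below is a minimal Python sketch of this computation; the function name, the use of SciPy, and the handling of self-neighbors are our assumptions rather than the reference implementation of (Kraskov et al., 2004). For simplicity, the sketch uses the max-norm in the marginal spaces as well, whereas the text mentions the Euclidean distance:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kraskov_mi(x, y, k=6):
    """Sketch of the estimator of Eq. (10).

    x, y: (N, d_x) and (N, d_y) arrays of paired observations.
    """
    n = len(x)
    z = np.hstack([x, y])
    # 0.5 * eps(i, K): max-norm distance from the i-th joint point to its
    # K-th NN, i.e. max(eps_X, eps_Y) as in the text (k + 1 because the
    # query returns the point itself at distance 0).
    radius = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # tau_x[i] / tau_y[i]: number of points (excluding the point itself)
    # whose distance from x_i (resp. y_i) is not greater than 0.5 * eps(i, K).
    tau_x = np.array([len(tree_x.query_ball_point(x[i], radius[i], p=np.inf)) - 1
                      for i in range(n)])
    tau_y = np.array([len(tree_y.query_ball_point(y[i], radius[i], p=np.inf)) - 1
                      for i in range(n)])
    return (digamma(n) + digamma(k) - 1.0 / k
            - np.mean(digamma(tau_x) + digamma(tau_y)))
```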
It is also important to note that other NN-based density estimators have been proposed in the literature, recent examples being (Wang et al., 2009; Li et al., 2011). However, as they are less popular than (Kraskov et al., 2004) for feature selection, they are not used in the present comparison.
3 EXPERIMENTS
Three sets of experiments are carried out in this section. The objective is to assess the suitability of the different estimators for incremental feature selection algorithms. The criteria of comparison and the experimental setup are thus very different from those used in previous papers focused solely on MI estimation (see e.g. (Walters-Williams and Li, 2009)). First, a suitable estimator should be accurate, i.e. it should reflect the true dependency between groups of features and increase (resp. decrease) when the dependence between groups of features increases (resp. decreases). It should also be able to detect uninformative features and return a value close to zero when two independent groups of features are given. Finally, a good estimator should be largely insensitive to the value of its parameters, or fast heuristics to set them should be available.
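As a quick illustration of the second criterion, the kraskov_mi sketch given above should return a value near zero on two independent groups of features (a hypothetical check, not an experiment from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))  # two groups of three features,
y = rng.normal(size=(1000, 3))  # generated independently of each other
print(kraskov_mi(x, y, k=6))    # expected: close to 0
```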
From a practical point of view, the implementation by Alexander Ihler has been used for KDE¹. For the NN-based estimator, the parameter K is set to 6 unless stated otherwise. For the B-splines estimator, the degree of the splines is set to 3 and the number of bins to 3. These values correspond to those advised in the respective original papers (Kraskov et al., 2004; Daub et al., 2004).

¹ http://www.ics.uci.edu/~ihler/code/
3.1 Accuracy of the Estimators
The first set of experiments consists in comparing the precision of the MI estimators as the dimension of the data set increases. To this end, they will be used to estimate the MI between $n$ correlated Gaussians $X_1, \ldots, X_n$ with zero mean and unit variance. This way, the experimental results can be compared with exact analytical expressions, as the MI for $n$ such Gaussians is given by (Darbellay and Vajda, 1999):

$$I(X_1, \ldots, X_n) = -0.5 \times \log[\det(\sigma)] \qquad (11)$$

where $\sigma$ is the covariance matrix.
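For reference, Eq. (11) is straightforward to evaluate numerically; below is a minimal NumPy sketch (the function name gaussian_mi is ours) for zero-mean, unit-variance Gaussians whose pairwise correlations all equal a constant r, the setting used in the experiments below:

```python
import numpy as np

def gaussian_mi(n_dim, r):
    """Exact MI of Eq. (11) for n_dim zero-mean, unit-variance
    Gaussians with all pairwise correlations equal to r."""
    sigma = np.full((n_dim, n_dim), r)  # covariance = correlation matrix
    np.fill_diagonal(sigma, 1.0)
    return -0.5 * np.log(np.linalg.det(sigma))

print(gaussian_mi(2, 0.9))  # about 0.83 nats
```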
All the correlation coefficients are set to the same value r, chosen to be either 0.1 or 0.9. The estimation is repeated 100 times on randomly generated datasets of 1000 instances, and the results are shown for n = 1...9. Even if this can be seen as a relatively small number of dimensions, there are practical limitations when using splines and histogram-based estimators in higher dimensions. Indeed, the generalization of the B-splines-based estimator to handle vectors of dimension $d$ involves the tensor product of $d$ univariate B-splines, a vector of size $M^d$, where $M$ is the number of bins. Histogram-based methods are limited in the same way, since they require the storage of $k^d$ bin values, where $k$ is the number of bins per dimension. Nearest-neighbors-based methods are not affected by this kind of problem and face only a less restrictive limitation on the number $n$ of data points, since they require the computation of $O(n^2)$ pairwise distances. As will be seen,
the small number of dimensions used in the experi-
ments is sufficient to underline the drawbacks and ad-
vantages of the compared estimators.
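To fix ideas about the storage argument above, here is a two-line illustration of the exponential growth of grid-based estimators (the values of d below are arbitrary examples; M = 3 matches the B-splines setting used here):

```python
M = 3  # bins per dimension
for d in (3, 9, 20):
    # size of the tensor-product / histogram grid: 27, 19683, ~3.5e9
    print(d, M ** d)
```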
Figure 1 shows that, as far as precision is concerned, Kraskov et al.'s estimator largely outperforms its competitors for the two values of r (r = 0.1 and r = 0.9). The estimated values are always very close to the true ones and show small variations across the 100 repetitions. The adaptive histogram provides on average accurate estimations up to dimension 8 for r = 0.1 and dimension 6 for r = 0.9, with, however, very strong fluctuations observed across the experiments. The B-spline estimator is also extremely accurate for the first five dimensions and r = 0.1. For r = 0.9 (and thus for higher values of MI), it severely underestimates the true values, although the shape of the true MI curve is preserved. This cannot be considered a major drawback in a feature selection context, where we are interested in the comparison of MI between groups of features. The results achieved by the kernel density estimator are very poor as soon as n exceeds 1, largely overestimating the true values for r = 0.1 while immediately decreasing for r = 0.9. Finally, as one could expect, the basic histogram produces the worst results; the estimated values are too