a redundant feature in a feature set has a strong correlation with other features in the same set. Following Battiti’s recommendation (Battiti, 1994), this correlation is measured using mutual information.
In fact, Maximum-Relevance-Minimum-Redundancy (mRMR) (Peng et al., 2005) is a forward-selection algorithm that iteratively selects the feature with the best balance between its mutual information with the class labels (relevance) and the sum of its mutual information with the features selected so far (redundancy). This greedy algorithm has improved on the efficiency of previously known feature selection algorithms, partly because it avoids evaluating correlations over feature sets: the
number of pairs of distinct features is n(n − 1)/2,
while the number of feature subsets is 2^n.
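For concreteness, the following is a minimal sketch of this kind of greedy forward selection, assuming discretized features are stored as the columns of a NumPy array and using scikit-learn's mutual_info_score to estimate mutual information; the function name, the parameter k, and the use of the mean (rather than the raw sum) of pairwise mutual information are illustrative choices, not details taken from Peng et al.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_forward(X, y, k):
    """Greedy mRMR-style forward selection (sketch).

    X : 2-D array of discretized feature values (instances x features)
    y : 1-D array of class labels
    k : number of features to select
    """
    n_features = X.shape[1]
    # relevance: mutual information between each feature and the class labels
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        best_j, best_score = None, -np.inf
        for j in remaining:
            # redundancy: average mutual information with the already-selected features
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy  # balance relevance against redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```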
One problem with this approach is that it does not incorporate interaction among features into the determination of relevance. Two or more features are said to mutually interact when no individual feature correlates strongly with the class labels but all of the features together do. Zhao and Liu (Zhao and Liu, 2007a) propose a practically fast algorithm that incorporates such interaction into the results of selection, while Shin et al. (Shin et al., 2017) further improve the efficiency and propose significantly fast algorithms that can scale to real big data.
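As a toy illustration of such interaction (not taken from the cited papers), consider two random binary features whose XOR determines the class label: each feature alone carries almost no mutual information about the labels, while the pair determines them completely.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, size=10000)
f2 = rng.integers(0, 2, size=10000)
labels = f1 ^ f2                         # class is the XOR of the two features

print(mutual_info_score(f1, labels))     # ~0: f1 alone looks irrelevant
print(mutual_info_score(f2, labels))     # ~0: f2 alone looks irrelevant
joint = f1 * 2 + f2                      # encode the pair (f1, f2) as one variable
print(mutual_info_score(joint, labels))  # ~log 2: together they determine the class
```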
The study of unsupervised feature selection is, on the other hand, more challenging, because class labels cannot be used to guide selection. As a substitute for class labels, pseudo-labels generated by clustering can be used to convert unsupervised problems into supervised ones (Qian and Zhai, 2013; Li et al., 2014; Liu et al., 2016). Also, some studies use preservation of manifold structures (He et al., 2005; Cai et al., 2010; Zhao and Liu, 2007b) or data-specific structures (Wei et al., 2016; Wei et al., 2017) as selection criteria. In many cases, however, computationally intensive procedures such as matrix decomposition are used to solve the resulting optimization problems. More importantly, the proposed algorithms aim to find a single answer, which is merely a local solution. Since pseudo-labels and structures are derived from the entire feature set, which can include data that should be treated as noise or outliers for the purpose of selection, the solution can be inappropriate.
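A minimal sketch of the pseudo-label idea is given below, under the assumption that k-means clustering and a mutual-information ranking are acceptable stand-ins for the specific methods of the cited papers; the cluster count and ranking criterion are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def pseudo_label_ranking(X, n_clusters=2):
    """Rank features by mutual information with clustering-based pseudo-labels (sketch)."""
    # cluster the instances and treat the cluster assignments as class labels
    pseudo = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    scores = [mutual_info_score(X[:, j], pseudo) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]  # feature indices, most relevant first
```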
In contrast, this paper aims to develop a significantly fast algorithm for unsupervised feature selection that is equipped with an adjustable parameter for changing the local solutions that the algorithm selects. By leveraging these properties, we can test a number of different parameter values and then choose better solutions from the pool of solutions that the algorithm finds.
Figure 1: Eleven datasets used in our experiment.
2 PRELIMINARY ASSUMPTIONS AND NOTATIONS
In this paper, we assume that all continuous values
specified in a dataset are discretized beforehand, and
a feature always takes a finite number of categorical
values.
For the purpose of analysis, we use 11 relatively large datasets of various types taken from the literature (Fig. 1): five from the NIPS 2003 Feature Selection Challenge, five from the WCCI 2006 Performance Prediction Challenge, and one from the KDD-Cup. For continuous features included in the datasets, we discretize their values into five equal-width intervals before using them. The instances of all of the datasets are annotated with binary labels.
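This discretization step can be sketched as follows; np.digitize and the equal-width bin edges are one plausible implementation, not necessarily the exact procedure used here.

```python
import numpy as np

def discretize_equal_width(values, n_bins=5):
    """Map continuous values to categorical bin indices over equal-width intervals."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # use only the interior edges; digitize then returns indices in {0, ..., n_bins - 1}
    return np.digitize(values, edges[1:-1])
```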
In this paper, a dataset D is a set of instances and
F denotes the entire set of the features that describe
D. A feature f ∈ F is a function f : D → R(f), where R(f) denotes the range of f, which is a finite set of values. Also, we often treat f as a random variable with the empirical probability distribution derived from the dataset. That is, when N(f = v) denotes the number of instances in a dataset D that have the value v at the feature f, Pr(f = v) = N(f = v)/|D| determines the empirical probability.
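A minimal sketch of this empirical distribution, assuming a feature is represented as a sequence of the categorical values observed over the dataset:

```python
from collections import Counter

def empirical_distribution(feature_values):
    """Pr(f = v) = N(f = v) / |D| for every value v observed in the dataset."""
    n = len(feature_values)
    counts = Counter(feature_values)
    return {v: c / n for v, c in counts.items()}
```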
A feature set S ⊆ F can be viewed as a random
variable associated with the joint probability for the
features that belong to S: for a value vector v = (v_1, ..., v_n) ∈ R(f_1) × ··· × R(f_n), Pr(S = v) = N(f_1 = v_1, ..., f_n = v_n)/|D| determines the joint probability for S = {f_1, ..., f_n}. Furthermore, we introduce a random variable C to represent class labels of instances,
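Analogously, the joint empirical distribution of a feature set can be sketched by counting value vectors; the representation of instances as dictionaries is an illustrative assumption.

```python
from collections import Counter

def joint_empirical_distribution(dataset, feature_set):
    """Pr(S = v) = N(f_1 = v_1, ..., f_n = v_n) / |D| for a feature set S.

    dataset     : list of instances (e.g., dicts mapping feature name -> value)
    feature_set : list of feature names constituting S
    """
    n = len(dataset)
    counts = Counter(tuple(instance[f] for f in feature_set) for instance in dataset)
    return {vec: c / n for vec, c in counts.items()}
```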