The $D^P$-hardness of the CFDP indicates that it is
both NP-hard and coNP-hard; therefore, it’s most
likely to be intractable (that is, unless P = NP).
2.2 Heuristic Solution for CFDP
From the analysis above it is clear that even deciding
whether a given number k is a CFD (for the given
performance threshold T) is intractable, so
determining that number for a dataset is
certainly even more difficult. Nevertheless, a simple
heuristic method is proposed in the following as
a practical approach to finding
the CFD of a given dataset for a given performance
threshold with respect to a fixed learning machine.
Though the heuristic method described below
can be seen as actually pertaining to a different
definition of the CFD, we argue that it serves to
validate the concept that a CFD exists for many
datasets, if not for all; and we show that for most
datasets with which experiments were conducted a
CFD indeed exists. Finally, the value determined by
this heuristic method is hopefully close to the
theoretically-defined CFD.
In the heuristic method, the CFD of a dataset is
defined as that number (of features) where the
performance of the learning machine would begin to
drop notably below an acceptable threshold, and
would not rise again to exceed the threshold. The
features are initially sorted in descending order of
significance and the feature set is reduced by
deleting the least significant feature during each
iteration of the experiment while performance of the
machine is observed. (For cross-validation purposes,
multiple runs of experiments can be
conducted: the same machine is used in conjunction
with different feature ranking algorithms, and the
same feature ranking algorithm is used in
conjunction with different machines. We can then
compare whether different experiments result in similar
values of the CFD; if so, the notion that the dataset
possesses a CFD becomes arguably more apparent.)
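The iterative procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `performance_curve` and the `evaluate` callback (which stands in for training and testing the learning machine on a given feature subset) are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Sequence


def performance_curve(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Dict[int, float]:
    """Record performance while dropping the least significant feature.

    ranked_features: feature names sorted in descending order of significance.
    evaluate: hypothetical callback returning the machine's performance
              (e.g. accuracy) when it uses exactly the given features.
    Returns a mapping {number of features used: observed performance}.
    """
    curve: Dict[int, float] = {}
    current = list(ranked_features)
    while current:
        curve[len(current)] = evaluate(current)
        current.pop()  # the last element is the least significant feature
    return curve
```

The resulting curve can then be inspected for the point below which performance drops under the acceptable threshold and does not recover.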
2.2.1 Critical Dimension Empirically Defined
Let $A = \{a_1, a_2, \ldots, a_p\}$ be the feature set where
$a_1, a_2, \ldots, a_p$ are listed in order of decreasing
importance as determined by some feature ranking algorithm R.
Let $A_m = \{a_1, a_2, \ldots, a_m\}$, where $m \le p$, be the set of
m most important features. For a learning machine M
and a feature ranking method R, we call µ ($\mu \le p$) the
T-Critical Dimension of ($D_p$, M) if the following
conditions are satisfied: when M uses feature set $A_\mu$
the performance of M is T, and whenever M uses
less than µ features its performance drops below T.
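As a hedged illustration of this definition, the sketch below locates µ in a list of observed performances. It assumes that "the performance of M is T" can be read as "at least T", which is an interpretation rather than something stated in the text; the function name and the toy scores are invented for the example.

```python
from typing import Optional, Sequence


def t_critical_dimension(perf: Sequence[float], T: float) -> Optional[int]:
    """Return mu, or None if no feature count reaches the threshold T.

    perf[m - 1] is the performance of the machine when it uses the m most
    important features A_m. Under the "at least T" reading, mu is the
    smallest m whose performance reaches T, because every smaller feature
    count then has performance below T.
    """
    for m, score in enumerate(perf, start=1):
        if score >= T:
            return m
    return None


# Toy scores for A_1 .. A_5 (illustrative values only, not experimental data).
toy_perf = [0.52, 0.61, 0.78, 0.80, 0.79]
mu = t_critical_dimension(toy_perf, T=0.75)  # mu == 3
```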
2.2.2 Learning and Ranking Algorithms
In the experiments the dataset is first classified
using six different types of learning machine
algorithms, namely Bayes net, function, rule-based,
meta, lazy, and decision tree. The machine with the
best prediction accuracy is chosen as the classifier to
find the CFD for that dataset.
For the experiments reported below, the ranking
algorithm is based on the chi-squared ($\chi^2$) statistic:
it evaluates the worth of a feature by computing
the value of the $\chi^2$ statistic with respect to the class.
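For readers unfamiliar with this ranking criterion, the following sketch computes the chi-squared statistic of a single feature against the class and ranks features by it. It assumes categorical (already discretized) feature values, and the names `chi_squared` and `rank_features` are hypothetical; it is not necessarily how the ranking tool used in the experiments computes the statistic.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def chi_squared(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Chi-squared statistic of one categorical feature w.r.t. the class."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # counts per (value, class) cell
    value_totals = Counter(feature)
    class_totals = Counter(labels)
    stat = 0.0
    for value, value_count in value_totals.items():
        for cls, class_count in class_totals.items():
            # Expected count if feature and class were independent.
            expected = value_count * class_count / n
            stat += (observed[(value, cls)] - expected) ** 2 / expected
    return stat


def rank_features(
    columns: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable]
) -> List[str]:
    """Feature names sorted by decreasing chi-squared score."""
    return sorted(
        columns, key=lambda name: chi_squared(columns[name], labels), reverse=True
    )
```

A feature that is independent of the class scores 0, while a feature that separates the classes well scores high and is ranked first.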
Note that in the heuristic method the performance
threshold T will not be specified beforehand but will
be determined during the iterative process where a
learning machine classifier’s performance is
observed as the number of features is decreased.
2.3 Results
Three large datasets are used in the experiments; each
is divided into 60% for training and 40% for testing.
Six different models are built and retrained to get the
best accuracy. The model that achieves the best
accuracy is used to find the CFD.
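The experimental setup can be sketched as below. The text does not say how the 60/40 split was drawn, so the seeded random shuffle is an assumption, and the names `split_60_40` and `best_model` are hypothetical helpers introduced only for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

X = TypeVar("X")


def split_60_40(
    rows: Sequence[X], train_fraction: float = 0.6, seed: int = 0
) -> Tuple[List[X], List[X]]:
    """Shuffle the rows (assumed random split) and cut them 60% / 40%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def best_model(accuracies: Dict[str, float]) -> str:
    """Name of the model with the highest test accuracy."""
    return max(accuracies, key=accuracies.get)
```

With 1500 instances, for example, this yields 900 training and 600 testing instances; the model returned by `best_model` would then be the one used to search for the CFD.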
2.3.1 Amazon 10,000 Dataset
The Amazon commerce reviews dataset (Frank 2013)
is a writeprint dataset useful for purposes such as
authorship identification of online texts.
Experiments were conducted to identify fifty
authors in the dataset of online reviews. For each
author 30 reviews were collected, totaling 1500.
There are 10,000 attributes, which capture the authors’
linguistic style, such as the usage of digits and
punctuation, word and sentence length, and the usage
frequency of words. This becomes a multiclass
classification problem with 50 classes, where the
dataset contains numerical values for all features.
The results are shown in Figure 1, where a CFD
is found at 2486 features. The justifications that this
is the CFD are, firstly, from 2486 downward, the
performance drops quickly and (unlike the situation
at around 9000) the performance never rises
thereafter; secondly, the performance at feature size
2486 is only slightly lower than the highest observed
performance (at around 9000 features). Another point
at around 6000 may also be taken as the CFD;
however, 2486 is deemed more “critical” since there
is a big difference between 6000 and 2486 but very
ICPRAM 2015 - International Conference on Pattern Recognition Applications and Methods