The following are the two one-class classifiers we use:
Gaussian Model (Tax, 2001). Fits a unimodal multivariate normal distribution to the positives. When applied to 1-dimensional data, this classifier simply returns the distance to the mean.
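A minimal sketch of this classifier follows, assuming the Mahalanobis distance to the fitted mean and covariance is used as the outlier score (higher means more anomalous); the function names and the small regularization term are our own illustration, not details taken from the paper.

```python
import numpy as np

def fit_gaussian_occ(X_pos):
    """Fit a unimodal multivariate normal to the positive class only.

    X_pos is an (n_samples, n_features) array of positive examples.
    Returns the mean and a (slightly regularized) inverse covariance.
    """
    mu = X_pos.mean(axis=0)
    cov = np.cov(X_pos, rowvar=False) + 1e-6 * np.eye(X_pos.shape[1])
    return mu, np.linalg.inv(cov)

def gaussian_occ_score(X, mu, cov_inv):
    """Mahalanobis distance to the positive mean (higher = more anomalous).

    For 1-dimensional data this reduces to the (scaled) distance to the mean,
    as noted in the text.
    """
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
```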
One-class SVM. We use the one-class ν-SVM (Schölkopf et al., 2001) method, which computes hypersurfaces enclosing (most of) the positive data. We set ν, the regularization parameter that controls how much we expect our training data to be contaminated with outliers, to 0.05. As is common practice in OCC, we use the Gaussian kernel, initializing the width of the kernel to the average pairwise Euclidean distance in the training set.
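The following sketch shows one way to set this up with scikit-learn's OneClassSVM; the mapping from the kernel width σ (the average pairwise Euclidean distance) to scikit-learn's gamma parameter via gamma = 1/(2σ²) is our assumption about the RBF parameterization, not a detail given in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.svm import OneClassSVM

def fit_ocsvm(X_pos, nu=0.05):
    """One-class nu-SVM with a Gaussian (RBF) kernel.

    The kernel width sigma is initialized to the average pairwise
    Euclidean distance in the (positive-only) training set; scikit-learn
    writes the RBF kernel as exp(-gamma * ||x - y||^2), hence
    gamma = 1 / (2 * sigma^2).
    """
    sigma = pdist(X_pos).mean()          # average pairwise Euclidean distance
    gamma = 1.0 / (2.0 * sigma ** 2)
    return OneClassSVM(kernel='rbf', nu=nu, gamma=gamma).fit(X_pos)
```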
In order to select an operating point for the classifiers we compute a threshold by assuming that 5% of the training data are outliers, a common choice in the one-class literature. Threshold selection by train rejection rests on one or both of two assumptions: (a) the training data contain some noise and counterexamples; (b) the classifier is not powerful enough to accommodate all positive examples. A further underlying assumption is that the training data contain boundary cases, so that the threshold will not be so tight as to reject too many positives. A more practical view is that this is probably the most straightforward way of selecting the operating point.
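As a concrete picture of train rejection, the sketch below places the threshold at the 95th percentile of the training-score distribution, so that roughly 5% of the positive training examples fall on the reject side. This is our reading of the procedure, assuming outlier scores where higher means more anomalous; the helper names and the usage comment are illustrative only.

```python
import numpy as np

def train_rejection_threshold(train_scores, reject_fraction=0.05):
    """Operating point that rejects `reject_fraction` of the training positives.

    `train_scores` are the outlier scores of the positive training examples
    (higher = more anomalous). Test points scoring above the returned
    threshold are classified as outliers (negatives).
    """
    return np.quantile(train_scores, 1.0 - reject_fraction)

# Hypothetical usage with the Gaussian classifier sketched earlier:
# mu, cov_inv = fit_gaussian_occ(X_pos)
# thr = train_rejection_threshold(gaussian_occ_score(X_pos, mu, cov_inv))
# y_pred = gaussian_occ_score(X_test, mu, cov_inv) <= thr   # True = target class
```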
Threshold selection directly affects the robustness of one-class classifiers and is crucial for their generalization capabilities. If the threshold is too tight, the number of false negatives increases; this can happen if the noise level specified by the user is too high. If it is too loose, the number of false positives increases; this happens if the specified noise level is too low. In either case the one-class classifier degenerates into a reject-all or accept-all machine, a very common and undesirable effect.
For each target class we perform a 10-fold cross-validation, except for classes with fewer than 10 examples, which we ignore, and classes with sample sizes between 10 and 15, for which we perform a leave-one-out cross-validation (in OCC this means building one model on all positives to classify all negatives, and one model leaving out each of the positives in turn). Of course, the ANG sampling and DR computations are also included in the cross-validation loop, granting them access only to the training data in each fold. We report the area under the ROC curve (AUC) and the Balanced Accuracy Rate (BAR), defined as the average of the True Positive (sensitivity) and True Negative (specificity) Rates.
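For concreteness, a short sketch of the two reported metrics, assuming labels encoded as 1 for the target class and 0 for outliers; AUC is computed here with scikit-learn's roc_auc_score, and BAR is the average of sensitivity and specificity as defined above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_accuracy_rate(y_true, y_pred):
    """BAR: average of the True Positive Rate (sensitivity) and the
    True Negative Rate (specificity). Labels: 1 = target class, 0 = outlier."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)   # sensitivity
    tnr = np.mean(y_pred[y_true == 0] == 0)   # specificity
    return 0.5 * (tpr + tnr)

def auc(y_true, scores):
    """AUC from continuous scores, oriented so that higher scores
    mean the point is more likely to belong to the target class."""
    return roc_auc_score(y_true, scores)
```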
4.2 Text Classification
We use a suite of text classification problems provided by Forman (Forman, 2003)¹. These come from several well-known text classification corpora (ohsumed, reuters, trec...). In total this accounts for 265 different classification tasks. These are high-dimensional (from 2000 to 26832 features), low-sample-size datasets, and therefore the data are sparse. We use the Bag-of-Words (BoW) representation, which embodies a simplistic assumption of word independence, and normalize each document to unit L2 norm, as is usual practice in information retrieval.
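A minimal sketch of this document representation, assuming scikit-learn's CountVectorizer for the Bag-of-Words counts and Normalizer for the unit-L2 scaling; the datasets distributed by Forman are already vectorized, so this is only an illustration of the preprocessing, not the exact pipeline used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

def bow_l2(documents):
    """Bag-of-Words representation with each document scaled to unit L2 norm."""
    counts = CountVectorizer().fit_transform(documents)   # sparse term-count matrix
    return Normalizer(norm='l2').fit_transform(counts)     # row-wise L2 normalization
```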
There is a fundamental trap when working with dimensionality reduction for text classification in OCC. Due to the sparsity, many of the words do not appear at all in any of the documents of the class. These words are unobserved features, features that are constant zero in the training set of a class. Unobserved features are highly discriminative, but cannot be used in a principled way for training one-class classifiers. This phenomenon is pervasive, with unobserved ratios per class ranging between 5% and 95% of the features in the datasets evaluated. Unobserved features can make a big difference in performance. For example, using the Gaussian classifier the average AUC varies from 0.9 when allowing unobserved features in the training set to 0.68 when using only observed features. In the present experiments we only use observed features.
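The sketch below illustrates the restriction to observed features: columns that are constant zero over the positive training documents are dropped before fitting, and the same column mask is applied at test time. This is our reading of the procedure, using dense arrays for simplicity; it is not the paper's actual code.

```python
import numpy as np

def observed_feature_mask(X_pos_train):
    """Boolean mask of features that appear at least once in the positive
    training set; the remaining columns are the 'unobserved' features."""
    X = np.asarray(X_pos_train)
    return (X != 0).any(axis=0)

# Hypothetical usage:
# mask = observed_feature_mask(X_pos_train)
# X_train_obs = X_pos_train[:, mask]   # used to fit the one-class classifier
# X_test_obs  = X_test[:, mask]        # same columns kept at evaluation time
```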
The results are shown in figure 4. The baseline AUC with no dimensionality reduction is a poor 0.68. Neither PCA nor LPP provides useful projections when trained with positive examples only; they are even harmful, performing worse than random projection, which also performs poorly in this evaluation. In the ANG realm we find that both the Uniform and the Marginal samplers, while still improving over the LPP baseline, do not provide the best performance. Therefore we focus on the three best techniques: Normalizer + LPP, LeftRight + LPP, and the StdDevPr.
The StdDevPr is the best technique in our test-bench. Its computation is extremely efficient (O(mn)), requiring only a single pass over the positive examples. To the best of our knowledge it is novel and has not been used before, although related biases can be found in the literature (e.g., the term frequency variance, where in a feature selection context each word is scored by its variance in the whole corpus).
¹ Available for download at http://jmlr.csail.mit.edu/papers/v3/forman03a.html. We used an extra dataset, new3s, also supplied by Forman and available at http://prdownloads.sourceforge.net/weka/19MclassTextWc.zip?download