The samples generated by this procedure are not
homogeneously distributed in the Rosenbrock valley
and they do not represent all Hyperbanana “arms”
equally.
The 96-dimensional infeasible examples near the
class boundary are sampled in the same way as the
feasible ones, but starting from the feasible Hyperbanana
samples and accepting only candidates with 100 ≤ f(x) ≤ 500.
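To make the acceptance step concrete, the following is a minimal sketch. It assumes a standard Rosenbrock function as a stand-in for the Hyperbanana f(x) (the actual function is defined earlier in the paper); the function names and the noise scale are illustrative only, not the implementation used here.

```python
import numpy as np

def f(x):
    # Stand-in: standard Rosenbrock; the paper's Hyperbanana f(x) differs.
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def sample_infeasible_near_boundary(feasible, sigma=0.02, lo=100.0, hi=500.0,
                                    rng=None):
    """Perturb feasible 96-dimensional samples and keep only candidates
    whose function value lies in the infeasible band lo <= f(x) <= hi."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = feasible + rng.normal(0.0, sigma, size=feasible.shape)
    values = np.apply_along_axis(f, 1, candidates)
    return candidates[(values >= lo) & (values <= hi)]
```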
5.2 Experimental Setting
The experimental setting is divided into two parts:
data preprocessing and classification. All calculations
are done in Python. The first part, data preprocessing
(selection of feasible examples and generation of
infeasible examples), is done according to Sect. 4.1
and Sect. 4.2.
Selection of feasible examples is parametrized differently
for the two data sets as a result of pre-studies.
The pre-studies were conducted with different minimal
distances ε and ε_b and evaluated according to
the number of resulting examples and their distribution
in the 2-dimensional data subset. For the µCHP
data set, instance selection is parametrized as follows:
the minimal distance between feasible examples is
set to ε = 0.001 and the number of new examples
used in each iteration is set to t = 1000. Generation
of artificial infeasible examples is parametrized
with n = 15000 initially feasible examples,
disturbance = N(0, 0.01) · α with α = 1, and minimal
distance between infeasible examples and their nearest
feasible neighbors ε_b = 0.025. For the Hyperbanana
data set, the instance selection parameters are set to
ε = 0.002 and t = 1000, and the parameters for generating
artificial infeasible examples are set to n = 20000,
disturbance = N(0, 0.02) · α with α = 1, and ε_b = 0.002.
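A compact sketch of both preprocessing steps under the µCHP parametrization may help; the helper names, the is_feasible predicate, and the exact iteration scheme are assumptions here, not the implementation from Sect. 4.1 and 4.2.

```python
import numpy as np
from scipy.spatial import cKDTree

def select_feasible(candidates, selected, eps=0.001):
    """Instance selection (cf. Sect. 4.1): keep a candidate only if it is
    at least eps away from all previously selected feasible examples."""
    kept = list(selected)
    for x in candidates:
        if not kept or min(np.linalg.norm(np.array(kept) - x, axis=1)) >= eps:
            kept.append(x)
    return kept

def generate_infeasible(feasible, is_feasible, sigma=0.01, alpha=1.0,
                        eps_b=0.025, rng=None):
    """Outlier generation (cf. Sect. 4.2): disturb feasible examples with
    N(0, sigma) * alpha noise, keep the infeasible results, and discard
    those closer than eps_b to their nearest feasible neighbor."""
    rng = np.random.default_rng() if rng is None else rng
    cand = feasible + alpha * rng.normal(0.0, sigma, size=feasible.shape)
    cand = cand[~np.apply_along_axis(is_feasible, 1, cand)]
    dist, _ = cKDTree(feasible).query(cand)
    return cand[dist >= eps_b]
```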
The second part of the experimental study, the
three classification experiments, is done with the
cascade classifier (see Sect. 3) with different baseline
classifiers from SCIKIT-LEARN (Pedregosa et al.,
2011): a One-Class SVM (OCSVM) and two binary
classifiers, k-nearest neighbors (kNN) and Support
Vector Machines (SVMs). The OCSVM baseline
classifier is used in all three experiments. The two
binary classifiers, kNN and binary SVM, are used in
the third experiment with both preprocessing methods
(fs + infs).
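The baseline classifiers come directly from scikit-learn; a sketch of their instantiation follows. The parameter values are placeholders drawn from the grids below, not the tuned values, and the cascade wrapping of Sect. 3 is omitted.

```python
from sklearn.svm import OneClassSVM, SVC
from sklearn.neighbors import KNeighborsClassifier

# OCSVM: baseline classifier for all three experiments; trained on
# feasible examples only.
ocsvm = OneClassSVM(kernel="rbf", nu=0.001, gamma=100)

# Binary baselines for the third experiment (fs + infs); they also need
# the N artificial infeasible examples as the negative training class.
knn = KNeighborsClassifier(n_neighbors=5)
svm = SVC(kernel="rbf", C=100, gamma=10)
```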
All experiments are conducted identically on both
data sets except for the parametrization. For all
experiments the number of feasible training examples N
is varied in the range N = {1000, 2000, ..., 5000} for
the µCHP data set and N = {1000, 2000, ..., 10000}
for the Hyperbanana data set. For binary classification,
N infeasible examples are added to the N feasible
training examples.
Parameter optimization is done with grid search
on separate validation sets with the same number of
feasible examples N as the training sets, plus N
artificial infeasible examples for the third experiment.
For the first experiment (no prepro.) and the second
experiment (fs), the parameters are optimized according
to the true positive rate (TP rate or simply TP), where
TP rate = (true positives) / (number of feasible examples).
For the third experiment, where the validation
is done with N additional infeasible examples,
parameters are optimized according to accuracy, where
acc = (true positives + true negatives) / (number of
positive examples + number of negative examples). The
OCSVM parameters are optimized in the ranges
ν ∈ {0.0001, 0.0005, 0.001, 0.002, ..., 0.009, 0.01} and
γ ∈ {50, 60, ..., 200}, the SVM parameters in
C ∈ {1, 10, 50, 100, 500, 1000, 2000} and γ ∈ {1, 5, 10, 15, 20},
and the kNN parameter in k ∈ {1, 2, ..., 26}.
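Since validation uses a separate set rather than cross-validation, the grid search can be written by hand. A sketch for the OCSVM under the TP-rate criterion of the first two experiments (function and variable names are illustrative):

```python
import itertools
import numpy as np
from sklearn.svm import OneClassSVM

# Grids from the text: nu in {0.0001, 0.0005, 0.001, ..., 0.01},
# gamma in {50, 60, ..., 200}.
NUS = [0.0001, 0.0005] + [i / 1000 for i in range(1, 11)]
GAMMAS = list(range(50, 201, 10))

def tune_ocsvm(X_train, X_val_feasible):
    """Pick the (nu, gamma) pair maximizing the TP rate, i.e. the fraction
    of feasible validation examples predicted as inliers (+1)."""
    best_clf, best_tp = None, -1.0
    for nu, gamma in itertools.product(NUS, GAMMAS):
        clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
        tp_rate = np.mean(clf.predict(X_val_feasible) == 1)
        if tp_rate > best_tp:
            best_clf, best_tp = clf, tp_rate
    return best_clf, best_tp
```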
Evaluation of the trained classifiers is done on
a separate independent data set with 10000 feasible
and 10000 real infeasible 96-dimensional examples
according to TP and TN rates for varying numbers
of training examples N. The classification results
could be evaluated with more advanced measures, see
e.g. (He and Garcia, 2009; Japkowicz, 2013). For better
comparability of the results on the two data sets,
and to distinguish effects on the classification
of feasible and infeasible examples, we use the simple
TP and TN rates. TN rates on the two data sets are
difficult to compare, because the infeasible µCHP power
output time series are distributed in the whole region
of infeasible examples, while the infeasible Hyper-
banana examples are distributed only near the class
boundary. Since most classification errors occur
near the class boundary, the TN rates on the Hyperbanana
set are expected to be lower than those
on the µCHP data set.
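Both evaluation measures follow directly from their definitions; a sketch assuming the scikit-learn label convention of +1 for feasible (inlier) and -1 for infeasible (outlier):

```python
import numpy as np

def tp_tn_rates(clf, X_feasible, X_infeasible):
    """TP rate over the 10000 feasible test examples, TN rate over the
    10000 real infeasible ones."""
    tp_rate = np.mean(clf.predict(X_feasible) == 1)
    tn_rate = np.mean(clf.predict(X_infeasible) == -1)
    return tp_rate, tn_rate
```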
5.3 Results
The proposed data preprocessing methods, selection
of feasible examples and generation of artificial
infeasible examples, increase the classification
performance of the cascade classifier in the
experiments.
On both data sets (µCHP and Hyperbanana), data
preprocessing leads to more precise decision boundaries
than classification without preprocessing, see Fig. 5
and Fig. 7. This can also be seen in the TP and TN
rates of the classification results, see Fig. 6 and Fig. 8.
For the µCHP data set, all three experiments lead
to TN rates of 1; therefore, only the TP rates are
plotted in Fig. 6. But high TN rates for the µCHP data set