This relationship is deduced using linear regression or regression model trees with the gain of stacking G as the target field. The feature vector is composed of the diversity measures d_1, d_2, ..., d_n previously computed. The regression models show how much each measure contributes to the gain of stacking.
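To make this step concrete, the following is a minimal sketch of such a regression using the Weka Java API. It assumes the measures and gains have been collected into a hypothetical ARFF file (diversity_gain.arff) with one row per dataset, attributes d_1, ..., d_n, and a numeric target G; the file name and layout are our assumptions, not part of the original setup.

```java
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DiversityRegression {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file: one row per dataset, attributes d1..dn
        // (diversity measures) and a numeric class G (gain of stacking).
        Instances data = DataSource.read("diversity_gain.arff");
        data.setClassIndex(data.numAttributes() - 1); // G is the last attribute

        // Linear regression: the coefficients indicate how much each
        // diversity measure contributes to the gain of stacking.
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);

        // Regression model tree (M5P) as the alternative model.
        M5P tree = new M5P();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```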
4 EXPERIMENTAL EVALUATION
This section describes the experiments conducted to evaluate the proposed method for analyzing the impact of diversity on stacking supervised classifiers. Each algorithm cited in Section 3 was also used to train the meta-classifier. Both the base classifiers and the stacking were evaluated in terms of accuracy (Tan et al., 2005), which estimates the quality of the classification, i.e., the predictive capacity of the model.
The experiments were performed on a personal computer using the data mining tool Weka (http://www.cs.waikato.ac.nz/ml/weka) (Witten and Frank, 2011). The algorithms were parameterized with the default values of this tool, using 10 partitions in the cross-validation.
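As an illustration of this protocol, below is a minimal sketch of evaluating one base learner with Weka's Java API under 10-fold cross-validation and default parameters; the dataset file name is hypothetical.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset file; the class is the last attribute.
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Random forest with Weka's default parameters, as in the experiments.
        RandomForest cls = new RandomForest();

        // 10-fold cross-validation; pctCorrect() is the accuracy in %.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cls, data, 10, new Random(1));
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```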
4.1 Datasets
We have used 54 classification datasets extracted from the UCI machine learning repository (http://archive.ics.uci.edu/ml): Abalone, Annealing, Audiology (Std.), Balance Scale, Banknote Authent., Blood Transf. Serv. Center, Breast Cancer Wisconsin, Car Evaluation, Chess (K-R vs. K-P), Chronic Kidney Disease, Congressional Voting Rec., Connect. Bench (S,M vs. R), Connect. Bench (VR-DD), Contrac. Method Choice, Credit Approval, Dermatology, Diabetic Retinopat. Debrec., Dresses Attribute Sales, Ecoli, Forest type mapping, Glass Identification, Hill-Valley, ILPD - Indian Liver Patient, Ionosphere, Leaf, Low Resolution Spectrometer, Mammographic Mass, Molecular Bio. (S-junction), Multiple Features, Nursery, Opt. Recog. Handwrit. Dig., Page Blocks Classification, Pen-based Recog. Handwrit., Phishing Websites, Primary Tumor, QSAR biodegradation, Qualitative Bankruptcy, Seismic-Bumps, Solar Flare, Soybean (Large), Spambase, SPECT Heart, Statlog (Vehicle Silh.), Thoracic Surgery, Thyroid Disease (Hypothyr.), Thyroid Disease (Sick), Tic-Tac-Toe Endgame, Turkiye Student Eval., Vertebral Column, Waveform Database Gen. (V2), Wholesale customers, Wilt, Wine Quality, and Yeast.
The chosen datasets cover several areas of knowledge: business, computer, financial, game, life, physical and social. Many of them are widely cited in the scientific literature, and they have varied objectives. The field data types can be integer, real or categorical. The number of instances ranges from 187 to 12,960. The number of fields and of class labels varies from 5 to 217 and from 2 to 48, respectively. These datasets were deposited in the UCI repository between 1987 and 2015.
A set of preprocessing operations was applied in order to standardize the content and make the datasets suitable for running the algorithms in Weka. The main operations were the removal of double spaces between instances, the naming of data fields, and changes of the field delimiter and of data types from numeric to nominal. After preprocessing, the datasets were used to train heterogeneous classification models, i.e., models built with the different algorithms described in the previous section. The base classifiers' predictions are stacked, composing the level-1 training set on which the final classification model is learned, as sketched below.
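The sketch below shows this setup with Weka's Stacking meta-learner, which builds the level-1 training set from the level-0 predictions internally; the particular base learners and the SMO meta-classifier are illustrative choices, not necessarily the exact combination used in the experiments.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Heterogeneous level-0 learners; their predictions compose the
        // level-1 training set inside the Stacking meta-learner.
        Stacking stack = new Stacking();
        stack.setClassifiers(new Classifier[] {
            new MultilayerPerceptron(), new SMO(), new RandomForest()
        });
        stack.setMetaClassifier(new SMO()); // level-1 learner (illustrative)

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stack, data, 10, new Random(1));
        System.out.printf("Stacking accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```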
4.2 Results
The experimental results are summarized in Table 2, which shows the following information for each dataset: the computed diversity measures double fault (df), disagreement (Dis), Q statistic (Q), correlation coefficient (ρ), interrater agreement (k), Kohavi-Wolpert variance (KW) and entropy (E); the algorithm used to learn the best base classifier (L0) and its accuracy in percentage (A_L0); the algorithm used to learn the best meta-classifier (L1) and its accuracy (A_L1); and the gain of stacking (G), used to sort the results, also in percentage. Values of df, Dis, Q and ρ are averages of the values computed for each pair of base classifiers. Moreover, this table presents Q′, ρ′, k′ and KW′, which are the original diversity measures standardized so that they lie, like df, Dis and E, in the closed range [0,1].
We show results only for the datasets that reached the worst and the best G values, i.e., we have omitted the results whose gain of stacking is not significant, ranging between -1 and 1%. Observing Table 2, we notice that stacking worked well for only 8 out of the 54 datasets, with gains ranging from 1.2 to 5.1% (lines 1-8). The best gain of stacking was reached by the Balance Scale dataset, on top of an already very accurate result (90.7%), which is very difficult to improve. The algorithm that most frequently reached the best level-0 accuracy was MLP, with 26.6 ≤ A_L0 ≤ 90.7, followed by RF with 84.8 ≤ A_L0 ≤ 92.9. At level 1, the best meta-classifiers were trained with SMO (26.9 ≤ A_L1 ≤ 94.9) and RF (83.7 ≤ A_L1 ≤ 95.4).