Figure 2 summarizes the proposed algorithm for
generating data for calibration. This generated data
set was then used to train an isotonic regression model
that was used for calibrating the prediction scores of
previously unseen data. We call this the Data Genera-
tion (DG) calibration model. We also wanted to test if
grouping together calibration data points with similar
prediction scores before feeding them into isotonic re-
gression would further increase the calibration perfor-
mance. For this purpose, the 5 000 generated calibra-
tion data points were aggregated into groups of 100
data points and these aggregated data samples were
instead fed to the calibration algorithm. We call this
model the Data Generation and Grouping (DGG) cal-
ibration model. In essence, each aggregated sample
represents an average calibration score and an asso-
ciated fraction of positive samples in the aggregate.
The number of data points to aggregate into a sample
is a compromise between the resolution of prediction
scores and the resolution of the fraction of positives
in the sample.
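As a sketch, the DGG grouping step could look as follows; the function name, the use of scikit-learn's IsotonicRegression, and sorting by score before grouping (so that points with similar scores land in the same group) are our illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_dgg_calibrator(scores, labels, group_size=100):
    """Fit a DGG-style calibrator: sort calibration points by prediction
    score, aggregate them into groups of `group_size`, and fit isotonic
    regression on (mean score, fraction of positives) pairs."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]

    n_groups = len(scores) // group_size
    used = n_groups * group_size  # drop any leftover tail
    mean_scores = scores[:used].reshape(n_groups, group_size).mean(axis=1)
    frac_pos = labels[:used].reshape(n_groups, group_size).mean(axis=1)

    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(mean_scores, frac_pos)
    return iso
```

With 5 000 points and groups of 100, this yields 50 aggregated samples, illustrating the resolution trade-off described above.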
(Figure 2 diagram: training data set → train classifier → predict → calibration data set.)
Figure 2: Calibration data set generation. Cross validation was repeated until the calibration data set size reached 5 000 samples.
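The generation procedure of Figure 2 could be sketched as follows; the choice of GaussianNB, 10-fold stratified splits, and all names here are our assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

def generate_calibration_data(X, y, target_size=5000, n_splits=10, seed=0):
    """Repeat cross-validation, collecting out-of-fold prediction scores
    and their true labels, until `target_size` calibration samples
    have been gathered."""
    X, y = np.asarray(X), np.asarray(y)
    scores, labels = [], []
    repetition = 0
    while len(scores) < target_size:
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=seed + repetition)
        for train_idx, held_out_idx in cv.split(X, y):
            clf = GaussianNB().fit(X[train_idx], y[train_idx])
            # Out-of-fold scores serve as calibration data.
            scores.extend(clf.predict_proba(X[held_out_idx])[:, 1])
            labels.extend(y[held_out_idx])
            if len(scores) >= target_size:
                break
        repetition += 1
    return np.array(scores[:target_size]), np.array(labels[:target_size])
```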
5 EXPERIMENTS
To test the algorithm we developed, an experiment
was set up as follows. Each data set was split into
training and test data sets. 30 % of the samples were
used as the test data while the rest served as the training data. Using only the training data set, a Naïve Bayes classifier was trained and four different calibration schemes were run:
control (no calibration, raw prediction scores), tra-
ditional isotonic regression calibration, and our two
developed algorithms (DG and DGG). For the tradi-
tional isotonic regression, 10 % of the training data
was put aside for calibration and the rest was used to
train the prediction model. For our developed algo-
rithms, cross validation was used to create the sep-
arate calibration dataset, as described in Section 4,
and the whole training data set was used to train
the prediction model. Next, the test data set sam-
ples were predicted and the prediction scores were
calibrated using the algorithms tuned in the previous
step. Threshold value used as prediction boundary
was tuned with the calibrated training data to max-
imize classification rate. This was done separately
for each calibration scheme. Using the threshold
from the previous step as the cut-off prediction score,
the following metrics for classification and calibra-
tion performance were calculated for each calibration
scheme: classification rate, logarithmic loss (logloss),
and mean squared error (MSE). For each data set, this procedure was repeated 10 times with a different split into training and test data sets, and the average performance over the repetitions is reported in the results to reduce the effect of chance.
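The threshold tuning and evaluation steps of this protocol might be sketched as follows; the helper names and the exhaustive search over observed scores are our assumptions.

```python
import numpy as np

def tune_threshold(calibrated_scores, labels):
    """Pick the cut-off that maximizes classification rate (accuracy)
    on the calibrated training scores."""
    calibrated_scores = np.asarray(calibrated_scores)
    labels = np.asarray(labels)
    candidates = np.unique(calibrated_scores)
    accs = [np.mean((calibrated_scores >= t) == labels) for t in candidates]
    return candidates[int(np.argmax(accs))]

def evaluate(calibrated_scores, labels, threshold, eps=1e-15):
    """Classification rate, logloss, and MSE for one calibration scheme."""
    p = np.clip(np.asarray(calibrated_scores, dtype=float), eps, 1 - eps)
    labels = np.asarray(labels)
    cr = np.mean((p >= threshold) == labels)
    logloss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    mse = np.mean((p - labels) ** 2)
    return cr, logloss, mse
```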
The experiments were run with the data sets
whose properties are presented in Table 1. All of the problems either already were binary classification problems or were converted into one, as described below. The
prediction task with QSAR biodegradation data set
(Mansouri et al., 2013) (Biodegradation) is to clas-
sify chemicals into ready or not ready biodegradable
categories based on molecular descriptors. In the Blood Transfusion Service Center data set (Yeh et al., 2009)
(Blood donation), the task is to predict whether pre-
vious blood donors donated blood again in March
2007. The Contraceptive Method Choice data set (Contraceptive) is a subset of the 1987 National Indonesia
Contraceptive Prevalence Survey. The prediction task
is to predict the current contraceptive method choice.
A combination of the short-term and long-term classes was used as the positive class, while the no-use class served as the negative class. The Letter Recognition data
set (Letter) is a database for letter identification based
on predetermined image features. We used a variation of the data set, reducing it to two similar letters. The letter Q served as the positive class and
the letter O as the negative class. The Mushroom data
set contains descriptions of physical characteristics of
mushrooms and the prediction task is to determine if
the mushrooms are edible or poisonous. All data sets
are freely available from the UCI machine learning
repository (Lichman, 2013).
A comparison of the traditional isotonic regression, Data Generation, and Data Generation and Grouping calibration algorithms on the Mushroom data set is shown in Figure 3. Figures 3a-3c show
the traditional isotonic regression, Data Generation,
and Data Generation and Grouping calibration mod-
els, respectively. Also, a calibration plot with the four
calibration algorithms is shown in Figure 3d.
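A calibration plot such as Figure 3d is typically built from binned prediction scores; a minimal sketch, assuming equal-width bins (the function name and bin count are illustrative):

```python
import numpy as np

def reliability_curve(scores, labels, n_bins=10):
    """Return (mean score, fraction of positives) per non-empty bin,
    i.e. the points drawn in a calibration (reliability) plot."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each score to its bin; clip so a score of exactly 1.0
    # falls into the last bin.
    bin_idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            xs.append(scores[mask].mean())
            ys.append(labels[mask].mean())
    return np.array(xs), np.array(ys)
```

A well-calibrated scheme produces points close to the diagonal, where the mean score in each bin matches the observed fraction of positives.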
Classification rates (CR), loglosses, and MSEs for
each calibration scheme are presented in Tables 2-6.
ICAART 2018 - 10th International Conference on Agents and Artificial Intelligence