Table 3: Results of DT global surrogates.

                      MLP global surrogate          SVM global surrogate
Dataset  Encoding     Spear. fid.  Depth  Leaves    Spear. fid.  Depth  Leaves    Borda Winner
BC       Ordinal      0.285        15     67        0.348        14     83        SVM
BC       One-hot      0.524        12     77        0.344        15     75        MLP
BC       Dummy        0.671        18     72        0.375        16     65        SVM
Lymph    Ordinal      0.368        6      16        0.454        8      25        MLP
Lymph    One-hot      0.716        6      17        0.715        6      17        MLP
Lymph    Dummy        0.629        7      19        0.607        7      17        MLP and SVM
dressing class imbalances in the Lymph dataset by removing exceptionally small classes and applying SMOTE to balance the training-validation sets, resulting in 159 no-recurrence and 63 recurrence cases for the BC dataset, and 63 metastases and 50 malignant cases for the Lymph dataset. The model hyperparameters were optimised using PSO on the basis of accuracy with a 10-fold cross-validation using only the training-validation set.
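The oversampling step above can be illustrated with a minimal, pure-Python sketch of the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours). This is an illustrative simplification, not the implementation used in the study, and the toy 2-D points are hypothetical:

```python
import math
import random

def smote(minority, n_synthetic, k=5, rng=None):
    """Minimal SMOTE sketch: for each synthetic sample, pick a random
    minority point, take one of its k nearest minority neighbours, and
    interpolate a new point at a random position between the two.
    `minority` is a list of numeric feature vectors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` by Euclidean distance
        # (excluding `base` itself, compared by identity).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

# Hypothetical 2-D minority class with 4 points, oversampled by 6:
new_pts = smote([[0, 0], [1, 0], [0, 1], [1, 1]], n_synthetic=6)
print(len(new_pts))  # 6
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region spanned by the minority class.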
5.1 RQ1: What Is the Impact of the CEs on Accuracy?
The main concern of this empirical evaluation is the impact of the CEs on interpretability. Nevertheless, it starts by investigating their impact on accuracy, so that similarities between the effects of the different CEs on accuracy and on interpretability can help explain changes in the trade-off between the two.
Table 1 shows the performance of SVM alongside that of MLP from our previous study (Hakkoum et al., 2023). The Wilcoxon test was performed on the accuracies of both models with the three CEs for: 1) each dataset, and 2) both datasets together. To further assess the differences between the models (MLP vs. SVM), the Borda count voting system was applied to discover whether one model outperformed the other across all the metrics.
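The Borda aggregation described above can be sketched as follows: each metric contributes one ranking of the candidates, and a candidate in position i of an n-candidate ranking receives n - 1 - i points. The per-metric rankings in the example are hypothetical, not the study's actual results:

```python
from collections import defaultdict

def borda_winner(rankings):
    """Aggregate per-metric rankings with the Borda count.

    rankings: list of orderings of candidate names, each from best
    to worst for one metric. Returns all candidates tied for the
    top total score (so a draw is representable)."""
    scores = defaultdict(int)
    for order in rankings:
        n = len(order)
        for i, cand in enumerate(order):
            scores[cand] += n - 1 - i  # points decrease with rank
    top = max(scores.values())
    return sorted(c for c, s in scores.items() if s == top)

# Hypothetical rankings for four metrics (Accuracy, F1, AUC, Spearman):
votes = [["SVM", "MLP"], ["SVM", "MLP"], ["MLP", "SVM"], ["SVM", "MLP"]]
print(borda_winner(votes))  # ['SVM']
```

Returning the full set of top-scoring candidates is what allows outcomes such as the "MLP and SVM" draw reported in Table 3.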
For each dataset, as well as regardless of the dataset, the Wilcoxon test yielded a p-value higher than 5%, which means the null hypothesis of identical distributions cannot be rejected. Table 1 also shows the Borda winner: SVM always outperformed MLP on Lymph, while MLP won only once on BC, with the ordinal CE. Although the Wilcoxon test showed that the models are not significantly different, it can be considered that, according to the Borda count voting system, SVM slightly outperforms MLP.
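The paired comparison above can be sketched with a pure-Python two-sided Wilcoxon signed-rank test using the normal approximation; a production analysis would use a library routine (e.g. with exact small-sample p-values), so this is only a didactic sketch:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test, normal approximation.

    Returns (W, p). Zero differences are discarded; tied absolute
    differences receive average ranks. Assumes at least one non-zero
    difference; the approximation is only reasonable for n >= ~10."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    ranks = {}
    i = 0
    while i < n:  # average rank for each group of tied |d|
        j = i
        while j + 1 < n and abs_sorted[j + 1] == abs_sorted[i]:
            j += 1
        ranks[abs_sorted[i]] = (i + j) / 2 + 1  # ranks are 1-based
        i = j + 1
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    W = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (W - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return W, p
```

A p-value above 0.05, as observed here, means the paired accuracy samples are consistent with coming from the same distribution.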
Investigating the influence of CEs on model accuracy, the SK test was used to evaluate and rank the performance of the ordinal, one-hot, and dummy CEs, first for each dataset, then across both datasets. The evaluation considered four metrics: Accuracy, F1-score, AUC, and Spearman correlation. The number of appearances of each CE in a SK rank (cluster) was computed by considering the ranks of the CEs for each metric in Table 1, and these are presented in Table 2.
The appearances of each CE in a SK rank were computed by considering one performance metric at a time in different settings (each dataset/model, and both datasets/models combined). The ordinal CE generally outperformed the others, since it came first 8 times, second 4 times, and last 4 times, followed by one-hot and then dummy. Despite these differences, an aggregated SK analysis across both models and both datasets deemed all three CEs similarly effective.
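The core of the SK (Scott-Knott) procedure used above can be sketched as follows: order the treatments by mean and find the partition point that maximises the between-group sum of squares. The full test then checks that split against a chi-squared-like significance criterion before accepting it and recursing; that check is omitted here, and the mean accuracies in the example are hypothetical:

```python
def sk_best_split(means):
    """One Scott-Knott step: order treatments by mean and find the
    split maximising the between-group sum of squares B. The full SK
    test would accept the split only if B is significant, then recurse
    into each group; that significance check is omitted in this sketch.
    `means` maps treatment name -> mean performance."""
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    values = [v for _, v in ordered]
    overall = sum(values) / len(values)
    best_b, best_k = -1.0, None
    for k in range(1, len(values)):  # try every split point
        g1, g2 = values[:k], values[k:]
        m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
        b = len(g1) * (m1 - overall) ** 2 + len(g2) * (m2 - overall) ** 2
        if b > best_b:
            best_b, best_k = b, k
    names = [n for n, _ in ordered]
    return names[:best_k], names[best_k:], best_b

# Hypothetical mean accuracies of the three CEs:
top, rest, b = sk_best_split({"ordinal": 0.86, "one-hot": 0.84, "dummy": 0.71})
print(top, rest)  # ['ordinal', 'one-hot'] ['dummy']
```

Treatments that end up in the same group after recursion share a SK rank, which is what the appearance counts in Table 2 tally.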
5.2 RQ2: Which CE Is Best for Global Interpretability?
Exploring global interpretability, we assessed the performance of DT surrogates constructed with the various CEs. Although ordinal encoding achieved the best accuracy in RQ1, this phase evaluated global surrogate efficacy via Spearman fidelity, tree depth, and leaf count, as shown in Table 3.
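The Spearman fidelity used here measures how well the surrogate's outputs track the black-box model's outputs on the same inputs. A minimal, pure-Python version (Pearson correlation on average ranks, handling ties) can be sketched as follows; the example outputs are hypothetical:

```python
def spearman(a, b):
    """Spearman rank correlation: Pearson correlation computed on
    average ranks, with tied values sharing their average rank."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):  # assign average ranks to tied groups
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # ranks are 1-based
            for t in range(i, j + 1):
                r[order[t]] = avg
            i = j + 1
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Hypothetical black-box outputs vs. surrogate outputs on 4 inputs:
print(spearman([0.9, 0.1, 0.4, 0.7], [0.8, 0.2, 0.3, 0.9]))  # 0.8
```

A fidelity of 1.0 would mean the surrogate ranks every input exactly as the black box does; the values in Table 3 show the surrogates fall well short of that.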
Based on Table 3, surrogate performance is higher on the Lymph dataset than on the BC dataset, as was also observed for model performance (RQ1). Comparing SVM to MLP, the Borda count voting system shows that MLP slightly outperformed SVM, by 3 wins to 2 with 1 draw. Nevertheless, according to the Wilcoxon test, this difference was not significant, since it led to very high p-values (28%, 100%, and 46% on BC, Lymph, and both datasets, respectively).
The SK clustering statistical test was used to rank the three CEs on the basis of the global surrogate metrics. The number of appearances of each CE in each SK rank was computed according to Spearman fidelity, depth, and number of leaves, and is summarised in Table 4. The one-hot CE generally outperformed the others, since it came first 6 times, second 4 times, and last twice, followed by dummy and then ordinal.
A Comparative Study on the Impact of Categorical Encoding on Black Box Model Interpretability