(CE), which maps each categorical value to an integer so that every category is represented numerically. Label CE disregards any order a feature might have, such as the degree of malignancy in a cancer prognosis dataset. This can negatively affect the relevance of the feature and, consequently, the performance of the model, which is why ordinal CE is often used instead.
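To illustrate the difference, consider the minimal Python sketch below, which encodes a hypothetical ordered malignancy feature both ways; the feature name and levels are illustrative and not taken from the dataset used in this study:

```python
import pandas as pd

# Hypothetical ordered feature from a cancer prognosis dataset.
degree = pd.Series(["low", "high", "medium", "low"])

# Label CE: pandas assigns codes in alphabetical order
# (high=0, low=1, medium=2), destroying the clinical ordering.
label_encoded = degree.astype("category").cat.codes

# Ordinal CE: the analyst supplies the true order explicitly.
ordinal_encoded = degree.map({"low": 1, "medium": 2, "high": 3})
```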
There is no doubt that data pre-processing (DP) methods (Benhar et al. 2020), such as CE, have a significant impact on model accuracy. According to Crone et al. (2006), the influence of DP is widely overlooked, as shown in their SLR of studies investigating data mining applications for direct marketing. In particular, this SLR showed that only one publication discussed the treatment and use of CE, even though categorical variables were used and documented in 71% of all the studies and are commonly encountered in the application and ML domains in general. The same authors investigated the impact of different DP techniques, including CE with four encoding schemes: one-hot, ordinal, dummy, and thermometer encoding. Tests performed on decision trees (DT) and a multilayer perceptron (MLP) showed that CE can have a significant influence on model performance.
Motivated by these findings on the impact of DP methods on accuracy, and by the lack of studies on their effect on interpretability (Hakkoum et al. 2021a), we investigated how interpretability techniques are affected. Therefore, this study compares two well-known interpretability techniques, global surrogates using DT and Shapley Additive exPlanations (SHAP) (Lundberg and Lee 2017), when used with an MLP trained for breast cancer (BC) prognosis (Dua and Graff 2017). Following the application of two different CEs, namely ordinal and one-hot, the MLP was optimised using the particle swarm optimisation (PSO) algorithm to maximise accuracy. The performance of the MLP with the different CEs was first compared using the Wilcoxon statistical test and the Borda count voting system, after which the same comparison was performed at the global and local interpretability levels.
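As a rough illustration of this pipeline, the Python sketch below trains an MLP on two differently encoded versions of a dataset, fits a DT surrogate, computes SHAP values, and compares per-fold accuracies with the Wilcoxon test. It uses synthetic stand-in data, scikit-learn's MLPClassifier with near-default hyperparameters, and model-agnostic KernelSHAP; in the actual study, the inputs are the two encoded versions of the BC dataset and the MLP hyperparameters are tuned by PSO (not shown here).

```python
import numpy as np
import shap
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ordinal- and one-hot-encoded BC data.
rng = np.random.default_rng(0)
X_ordinal, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_onehot = np.hstack([X_ordinal, rng.integers(0, 2, size=(200, 4))])

scores = {}
for name, X in {"ordinal": X_ordinal, "one-hot": X_onehot}.items():
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    scores[name] = cross_val_score(mlp, X, y, cv=10)  # per-fold accuracy

    mlp.fit(X, y)
    # Global surrogate: a shallow DT trained to mimic the MLP's outputs.
    surrogate = DecisionTreeClassifier(max_depth=4).fit(X, mlp.predict(X))

    # Local explanations via model-agnostic KernelSHAP.
    explainer = shap.KernelExplainer(mlp.predict_proba, shap.sample(X, 50))
    shap_values = explainer.shap_values(X[:5])

# Paired Wilcoxon signed-rank test over the two sets of fold accuracies.
stat, p = wilcoxon(scores["ordinal"], scores["one-hot"])
print(f"Wilcoxon p-value: {p:.3f}")
```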
The key contributions of this study are the identification of the impact of CEs on accuracy and interpretability, as well as a quantitative evaluation of SHAP. Accordingly, the following research questions (RQs) are addressed:
RQ1: What is the overall performance of MLP?
Which CE is the best?
RQ2: What is the overall global interpretability of
MLP? Which CE is the best?
RQ3: What is the overall local interpretability of
MLP? Which CE is the best?
The remainder of this paper is organised as follows: Section 2 provides an overview of the chosen black-box model (MLP) as well as the interpretability techniques (global surrogate and SHAP) used in this study. Section 3 describes the BC dataset as well as the performance metrics and statistical tests used to identify the best-performing CEs. The experimental design used in the empirical evaluation is detailed in Section 4. Section 5 presents and discusses the findings. Section 6 discusses the threats to the validity of the study, and Section 7 summarises the findings and outlines future directions.
2 METHODS
This section defines the models and methods employed in this empirical evaluation, namely CEs, the MLP, PSO, and the global and local interpretability techniques.
2.1 Categorical Encodings (CEs)
Data transformation tasks are additional DP procedures that help ML models perform better. In this step, data are transformed into forms appropriate for the mining process, yielding more efficient results or more understandable patterns (Esfandiari et al. 2014). CE is a common data-transformation method: it is the process of converting categorical data into a numerical (integer) format so that they can be used by the various ML models, which are essentially mathematical operations that rely entirely on numbers.
Ordinal CE is the most basic strategy for categorical features: the levels observed in the training set are mapped onto the integers 1 to N (the number of categories) according to their original order. In contrast, indicator CE groups one-hot and dummy CEs. One-hot encoding transforms the categorical feature into N binary indicator columns, in which the active category is represented by 1. Dummy encoding, meanwhile, produces only N-1 indicator columns; a reference feature level is chosen and encoded as 0 in all indicator columns.
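The following sketch illustrates the three schemes with pandas on a hypothetical "stage" feature (the feature name and its levels are illustrative only):

```python
import pandas as pd

stage = pd.Series(["I", "II", "III", "I"], name="stage")  # hypothetical feature

# Ordinal CE: levels mapped to 1..N in their natural order.
ordinal = stage.map({"I": 1, "II": 2, "III": 3})

# One-hot CE: N binary indicator columns, active category = 1.
one_hot = pd.get_dummies(stage, prefix="stage")

# Dummy CE: N-1 columns; the dropped level "I" becomes the
# all-zeros reference.
dummy = pd.get_dummies(stage, prefix="stage", drop_first=True)
```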
2.2 Neural Networks
Black-box models are widely used in many domains owing to their excellent performance. Their ability to model nonlinear relationships and discover patterns in databases that elude white-box models has put them in the spotlight.