From Interpretability to Clinically Relevant Linguistic Explanations:
The Case of Spinal Surgery Decision-Support
Alexander Berman¹, Eleni Gregoromichelaki¹ and Catharina Parai²,³
¹Centre for Linguistic Theory and Studies in Probability, Dept. of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Sweden
²Department of Orthopedics, Sahlgrenska University Hospital, Gothenburg, Sweden
³Sahlgrenska Academy, University of Gothenburg, Sweden
{alexander.berman, eleni.gregoromichelaki}@gu.se, catharina.parai@vgregion.se
Keywords:
Interpretable AI, Machine Learning, Explanations, Linguistics, Argumentation Theory, Decision-Support
Systems, Spinal Surgery.
Abstract:
Interpretable models are advantageous when compared to black-box models in the sense that their predictions
can be explained in ways that are faithful to the actual reasoning steps performed by the model. However,
interpretability does not automatically make AI systems aligned with how explanations are typically com-
municated in human language. This paper explores the relationship between interpretability and linguistic
explanation needs of human users for a particular class of interpretable AI, namely generalized linear models
(GLMs). First, a linguistic corpus study of patient-doctor dialogues is performed, resulting in insights that can
inform the design of clinically relevant explanations of model predictions. A method for generating natural-
language explanations for GLM predictions in the context of spinal surgery decision-support is then proposed,
informed by the results of the corpus analysis. Findings from evaluating the proposed approach through a
design workshop with orthopaedic surgeons are also presented.
1 INTRODUCTION
In research concerning how to explain outputs from
AI systems, two main paradigms have evolved. Post-
hoc explanations methods such as LIME (Ribeiro
et al., 2016) and SHAP (Lundberg and Lee, 2017)
give some insight into how deep neural networks and
other black-box models make their inferences. In con-
trast, predictions from interpretable models (or so-
called “glass-box AI”) operate according to reasoning
steps that are, at least in principle, comprehensible for
humans (Rudin, 2019; Rudin et al., 2022).
It is sometimes argued that interpretable models
are superior to black-box models in the sense that pre-
dictions can be explained in ways that are inherently
faithful to the actual reasoning steps executed by the
model, making interpretable models more adequate
in high-stakes applications (Rudin, 2019). Neverthe-
less, research concerning interpretable models largely
focuses on efforts to develop well-performing mod-
els (see, e.g. (Rudin et al., 2022)), leaving the re-
lationship between interpretability and users’ expla-
nation needs in AI-assisted decision-making largely
unexplored. This gap in previous research concerns
both how model interpretability can be leveraged to
obtain linguistic explanations that meet users’ needs,
and how the fulfilment of such needs depends on
formal properties of interpretable models (sparsity,
monotonicity, etc.).
This paper takes a step towards bridging this gap.
Specifically, the paper focuses on a particular class
of interpretable AI, namely generalized linear mod-
els (GLMs), in the context of spinal surgery decision-
support. As its main contribution, the paper pro-
poses a method for generating concise and clinically
relevant linguistic explanations for predictions from
GLMs, informed by communicative strategies ob-
served in doctor-patient conversations.
The proposed method is applied in the context of
a web-based instrument used by spine clinics in Swe-
den. The purpose of the instrument is to assist doctors
and patients during medical consultations where de-
cisions concerning choice of treatment (usually sur-
gical or non-surgical treatment) are made. Based
on GLMs, the tool predicts two patient-reported out-
comes of hypothetical surgery, as well as length of in-
hospital stay, for patients with degenerative spinal dis-
orders. The present study explores how the currently
deployed instrument, which does not offer patient-
specific explanations, can be modified and extended
to meet doctors’ and patients’ clinical needs related
to explainability.
The rest of the paper is organized as follows. Sec-
tion 2 situates the work in relation to previous ap-
proaches to generating linguistic explanations for in-
terpretable models. Section 3 is devoted to a lin-
guistic corpus study, where explanations for medi-
cal judgements are collected from existing corpora
and analysed in terms of communicative explanatory
strategies. Implications of the analysis for the de-
sign of clinically relevant linguistic explanations are
also discussed. In Section 4, a method for generat-
ing linguistic explanations of predictions from GLMs
is proposed, informed by the findings from the cor-
pus study. The section presents technical details con-
cerning the proposed method, as well as a preliminary
evaluation of the proposed method through a design
workshop with orthopaedic surgeons. Finally, Section
5 offers conclusions and discusses future work.
2 RELATED WORK
Perhaps the earliest example of natural-language
explanations in the context of interpretable AI is
SHRDLU (Winograd, 1971), a system which can ex-
plain its rule-based reasoning. For example, when the
user asks why the system picked up a certain object,
it may respond: “To get rid of it”; when asked why
it got rid of it, it may respond: “To clean off the red
cube”, etc. In a similar vein, the more recent sys-
tem DAISY (Wahde and Virgolin, 2023) can (exhaus-
tively) explain how its use of hand-crafted or interac-
tively learned procedures yields specific results. For
example, when explaining how it concluded that the
largest city in France is Paris, it states: “I retrieved all
items in the city category”, “Then I found all items
belonging to France”, etc. Methods for producing
enthymematic (logically incomplete) explanations for
answers inferred on the basis of facts and rules are
proposed by (Xydis et al., 2020; Breitholtz, 2020;
Maraev et al., 2021). For example, if the user asks
why the system described by (Maraev et al., 2021)
recommends a particular route, it responds: “Because
the route is the shortest”, thereby stating a fact whose
relevance with respect to the explanandum hinges on
the implicit premise that short routes are better than
long ones. In contrast to these approaches, this paper
targets explanations for predictions from statistical
models. Various such approaches have been proposed
for black-box models, based on post-hoc explanation
techniques (see, e.g. (Forrest et al., 2018; Kaczmarek-
Majer et al., 2022; Slack et al., 2023)). One popular
explanation strategy is to rank the most important fea-
tures. For example, the system presented by (Slack
et al., 2023) generates explanations of the form "For [this prediction], the importance of the features have the following ranking, where 1 is the most important feature: 1: glucose, 2: bmi, 3: age ...". Presumably, the lack of easily identifiable warrants makes such explanations difficult to understand, or causes a false
sense of understanding when explainees identify war-
rants that do not reflect the actual inner workings of
the model (Berman, 2024b).
Post-hoc explanation methods can also be used
for interpretable models, by treating them as black
boxes (see, e.g. (Ahmed et al., 2024)). As for ap-
proaches that instead leverage interpretability, (Baaj,
2022) shows how explanations can be generated for
possibilistic and fuzzy rule-based systems. For exam-
ple, a justification for the judgement that a patient’s
blood sugar level will not be low can be generated
as: “This is mainly due to the fact that it is quite
certain that the activity consists of drinking coffee,
lunch or dinner and that the current blood sugar level
is medium or high." A method for explaining predic-
tions by decision trees and fuzzy rules is presented by
(Alonso and Bugarín, 2019), enabling explanations
such as “Beer is type Porter because its strength is
standard and its color is brown”.
The method presented in this paper builds on work
by (Berman, 2024a), who proposes an interactive
method for generating explanations for predictions by
linear additive models, based on Toulmin’s theory of
argumentation (Toulmin, 2003). The method sup-
ports generation of both “data” (case-specific facts)
and “warrants” (general statistical patterns). For ex-
ample, if the model predicts that a person is intro-
verted (based on her music preferences), the most im-
portant datum can be expressed as “The person likes
high-energy music”, with the corresponding warrant
“Statistically, people that like high-energy music are
more likely to be introverted." As detailed in Sec-
tion 4, the present work extends this approach to han-
dle a broader range of feature types (rather than only
continuous features). Furthermore, the present paper
shows how feature encoding can be jointly optimized
for performance and linguistic intelligibility.
Almost none of the previous approaches have
been empirically validated with end users. The only
exception is (Slack et al., 2023) who let participants
solve explainability-related tasks using two different
tools. However, the purpose of this validation was
to compare the authors’ conversational tool with a
graphical interface; the extent to which the tasks or
generated explanations were deemed clinically rele-
vant by participants was not studied.
In contrast to previous approaches, the present
work grounds the proposed natural-language genera-
tion method in a linguistic analysis of human explana-
tory interaction, and evaluates the method with end
users in a clinically relevant scenario.
3 LINGUISTIC CORPUS STUDY
To inform the design of linguistic explanations, a
qualitative linguistic corpus study is conducted by
collecting and analysing examples of how doctors
and patients explain (or support, or argue for/against)
judgements (e.g. certain treatments) in clinical set-
tings. Two empirical sources of clinical dialogues
were chosen: the Norwegian Corpus of Doctor-
Patient Consultations from Ahus (Gulbrandsen et al.,
2013) (henceforth abbreviated Ahus), and a Swedish
textbook in medicine focusing on the encounter be-
tween patient and doctor (Lindgren and Aspegren,
2004) (henceforth abbreviated L&A). The choice of
empirical material is primarily motivated by the topics
and types of situations that it encompasses. Further-
more, while one of the corpora (Ahus) is descriptive
and contains transcripts of actual consultations, the
other (L&A) is prescriptive and conveys communica-
tive norms. Both types of linguistic data were deemed
relevant for the purposes of the research.
The linguistic analysis builds on Toulmin’s the-
ory of argumentation (Toulmin, 2003). Toulmin iden-
tifies elements of arguments, including the claim
(corresponding to the explanandum), data (specific
facts that support the claim), and warrants (general
norms or rules of thumb that justify how facts support
claims). For example, the claim “you have a cold”
can be supported (explained) by the datum “you have
a runny nose”, which in turn rests on warrants such as
“runny nose is a symptom of a common cold”.
Specifically, the corpus study aims to answer the
following research questions:
1. To what extent do interlocutors (doctors and pa-
tients) explicitly convey argumentative elements
(claims, data, and warrants)?
2. To what extent do interlocutors implicitly convey
argumentative elements?
3. How do interlocutors linguistically indicate the re-
lationship between argumentative elements?
The RQs were purposely formulated in relation to the
intended downstream application of the results, i.e.,
design of linguistic explanations of predictive mod-
els. Specifically, claims are conceived to be poten-
tially analogous with statistical predictions, data with
feature values, and warrants with statistical patterns
learned by predictive models. In other words, it is assumed to be in principle conceivable that doctors reason in ways that are analogous (to some extent) to how machine-learning models make predictions.
3.1 Corpus Data
Occurrences of explanations (or related phenomena
such as arguments or justifications) pertaining to med-
ical judgements were identified using a search proce-
dure. In the case of Ahus, which contains transcrip-
tions of 220 consultations, this was done by search-
ing for the word “why” (“hvorfor” in Norwegian); for
L&A, the corpus was small enough to permit a man-
ual search of the entire empirical material.
The topic of interest (medical judgements) pri-
marily encompasses diagnosis (judging that a patient
has a certain condition) and recommendations (judg-
ing that a particular action or intervention is ade-
quate). The selection procedure resulted in two dia-
logues from each corpus, spanning a total of 88 utter-
ances.
3.2 Annotation
To address the RQs, the data was annotated with the
following labels (hypothetical examples in parenthe-
ses):
claim (“I think that you have a cold”)
datum (“since you have a runny nose”)
warrant (“since runny nose is a symptom of a com-
mon cold”)
Note that arguments are not assumed a priori to
be marked with particular syntactic constructions or
particles such as “since”, “because”, or “therefore”
(Sbisà, 1987).
A complete annotation was first done by one of
the authors (with expertise in cognitive science, lin-
guistics, and machine learning), and then reviewed by
the two other authors (with expertise in linguistics and
medicine respectively). During the reviews, annota-
tions were open for collaborative amendments.
3.3 Analysis
The analysis reveals that in the empirical material,
claims are explicitly supported by up to three pieces
of data, whereas warrants are rarely communicated
explicitly. For example, in one of the dialogues, the
doctor expresses an intent to have the patient’s lungs
x-rayed (claim), which is justified with reference to
the patient’s low levels of oxygen in the blood (da-
tum). There is no explicit mention of how a person’s
levels of oxygen explain the recommendation to per-
form lung x-ray (warrant).
In cases where warrants are made explicit, they are
sometimes causal in nature. For example, in one di-
alogue, the patient expresses a wish to have her heart
checked-up since she gets very dizzy and wonders if
this is due to low blood pressure. In response, the doc-
tor explains that the patient’s dizziness can be caused
by her diabetes. However, in many cases, arguments
seem compatible with either causal or statistical war-
rants. For example, when a doctor judges that the
patient has no respiratory illness partly on the basis
that the "chest X-ray was completely normal" (cited excerpts from the empirical material have been translated to English by the authors), this
is compatible with either a causal warrant (e.g. res-
piratory illnesses cause abnormalities that can be de-
tected in a chest X-ray) or a statistical one (e.g. a nor-
mal chest X-ray correlates with absence of respiratory
illness).
Although warrants are rarely verbalized, data are
very frequently conveyed in ways that indicate what
kind of warrant the speaker might have in mind. In
one excerpt, the doctor says that the patient’s oxygen
levels are “a bit lower than one would expect”. Al-
though no warrant is explicitly conveyed, the words
"lower" and "expect" both allude to a warrant such as
“unexpectedly low levels of oxygen in the blood can
indicate lung disease”. In another example, the doctor
explains a recommended change of medication as fol-
lows: “since you have had [the medication for] over
two months and have increased the dosage and not
had any effect”; here, the lexical choices “over”, “in-
creased”, and “not ... any” trigger a warrant such as
“having used a medication for a long time without any
effects motivates trying another medication”.
Generally, two types of warrant triggers can
be observed in the data: scalar/gradable and
norm/expectation related. Examples of scalar trig-
gers include lower (levels of oxygen in the blood than
one would expect), over (two months of medication
use), increased (dosage), and no (effect of medica-
tion). Examples of norm-related warrant triggers in-
clude expected (levels of oxygen in the blood), nor-
mal (lung x-ray), abnormal (nothing abnormal in pa-
tient’s lungs), should (nothing observed that shouldn’t
be there), and good (cholesterol levels).
Although linguistic triggers help explainees to
identify potentially relevant warrants, a certain
amount of argumentative underspecification (ambigu-
ity) can be observed. For example, conveying oxygen
level in the blood as lower than expected is compat-
ible with a warrant that posits a monotonically de-
creasing relation between oxygen level in the blood
and the risk of lung disease. However, it is also
compatible with a non-monotonic relationship, i.e.
that too high oxygen levels in the blood also indi-
cate a higher risk of disease. Similarly, when mul-
tiple pieces of data are presented in support of a claim
(such as having used a medication for a long time with
a high dosage), potential interactions between data re-
main unstated. This kind of underspecification can
potentially be understood as serving the purposes of
relevance and brevity, i.e. only presenting informa-
tion that is deemed relevant for the patient, and not
providing more information than needed in the con-
text (cf. Grice’s maxims of relation/relevance and
quantity (Grice, 1975)).
As an additional finding, we observe that in-
terlocutors sometimes discuss mutually opposing
claims. For example, in one dialogue, the patient ex-
presses a wish to have her heart checked-up, while
the doctor argues that the patient’s dizziness can be
caused by her diabetes and that her blood pressure is
fine, thereby constructing a counter-claim that a heart
check-up is not needed.
3.4 Implications for Linguistic Design
of Model Explanations
When using the results of the corpus study to in-
form the design of clinically relevant explanations for
model predictions, several guiding principles can be
distilled. First, given the limited amount of conveyed
data per claim in human-human dialogues, it is ad-
visable to focus specific (local) explanations only on
those features that are most important to the predicted
outcome.
Second, given the consistent use of warrant trig-
gers in the presentation of data, and the explanatory
function that these triggers can be assumed to carry,
it is advisable to formulate data in ways that allude to
the corresponding statistical patterns learned by the
model. For scalar triggers, this can be done by choos-
ing a suitable modifier. For example, if the model
has learned that older age is associated with a lower
probability of being satisfied with surgery, and the pa-
tient’s age is statistically low, the patient’s age can be
presented as “relatively young”. Norm/expectation
triggers can also be conceived, e.g. "unexpectedly high pain levels in the arm". However, in the present
clinical context, this was not deemed applicable.
Third, given that interlocutors sometimes ex-
change claims and counter-claims, and give reasons
both for and against judgements, it is advisable to
convey circumstances that both support and contra-
dict a certain prediction (cf. (Miller, 2023)).
Furthermore, to provide more detailed informa-
tion about the statistical patterns learned by the model
and thereby help resolve potential ambiguities, ex-
plicit warrants could be offered. However, given that,
in the studied corpora, warrants were primarily con-
veyed implicitly, warrants should only be presented
on demand.
Finally, it is worth noting that the observed usage
of causal warrants in human-human dialogues may in-
dicate that statistical explanations may not always be
satisfactory or sufficient for users (as previously ar-
gued by e.g. (Miller, 2019)). For example, a corre-
lation between low disability and high probability of
successful surgery might be difficult to comprehend
without resorting to causal reasoning of some kind
(such as disability causing depression or other psy-
chological states that in turn influence pain percep-
tion). In principle, generated explanations for model
predictions could include such causal links, poten-
tially collected from domain experts and built into
the tool. However, such explanations could invite
false inferences concerning how the model actually
reasons. For this reason, we argue that generated
warrants should only convey actual statistical patterns
learned by the model, leaving causal matters open for
interpretation.
4 GENERATING CLINICALLY
RELEVANT EXPLANATIONS
Swedish spine clinics have access to an AI-based
“Dialogue Support” tool whose purpose is to assist
patients and doctors in their decision-making con-
cerning treatment options for four different types of
degenerative spinal disorders: disc herniation and spinal stenosis in the lumbar spine, chronic low back pain, and cervical radiculopathy (Fritzell et al., 2022). The tool is used by the
doctor and patient together during brief (approx. 20
minutes) medical consultations. Based on sociode-
mographic information (age, gender, etc.) and other
answers provided by the patient in a questionnaire, the
tool presents predictions for three types of outcomes
of a hypothetical surgery:
Satisfaction: the probability of responding to the question "What is your attitude towards the result of the surgery?" with one of the first two options in the Likert scale (satisfied, hesitant, dissatisfied) one year after surgery.
Global Assessment (GA): the probability of responding to the question "How is your pain today as compared to before the surgery?" with one of the first two options in the Likert scale (completely pain-free, much better, somewhat better, unchanged, worse) one year after surgery.
Length of stay: the number of days of hospitalization in connection with the surgery.
The predictions are based on three different types of
GLMs trained on patient data from the national qual-
ity registry Swespine. Currently, the Dialogue Sup-
port tool explains its predictions “globally”, with in-
formation about features, sample size, etc. No case-
specific (“local”) explanations are presented.
In the subsequent sections, a method for gen-
erating linguistic explanations for predictions from
GLMs will be proposed, with the purpose of enabling
an alternative Dialogue Support tool based on the
same patient data and types of models. The proposed
method builds on previous work by (Berman, 2024a),
which is extended to support needs and desiderata
identified in the corpus study, as well as feedback
collected in a design workshop with orthopaedic sur-
geons (see Section 4.7). (While the prototype currently supports Swedish, this paper uses English translations.)
4.1 Model Specification
GLMs estimate an outcome on the basis of a lin-
ear combination of predictors (independent variables)
and a link function that transforms the linear combi-
nation to an outcome:
$$E[Y \mid X] = g^{-1}\!\left(\beta_0 + \sum_i X_i \beta_i\right)$$
where $E[Y \mid X]$ is the expected value of the outcome $Y$ given the intercept (bias) term $\beta_0$, the predictors $X_i$, the coefficients $\beta_i$, and the link function $g$. For
the purposes at hand, g is assumed to be monotonic,
which is typically the case. Specifically, in the study
at hand, g is either
logit (i.e. logistic regression), for estimating sat-
isfaction,
threshold function for the cumulative distribution
function (i.e. ordered probit), for estimating GA,
or
rounding to non-negative integer (linear (Ridge)
regression adapted to counts), for predicting
length of stay.
One model is used for each combination of diagnosis
and task (type of outcome). Since there are 4 diag-
noses and 3 tasks, a total of 12 models is used.
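As a minimal illustration of how such a model turns a feature vector into a prediction, the Python sketch below applies the inverse logit link (corresponding to the satisfaction outcome); all coefficient and feature values are hypothetical and not taken from the deployed models.

```python
import numpy as np

# Minimal sketch of a GLM prediction with a logit link (logistic regression),
# corresponding to the satisfaction outcome. All values below are hypothetical.
def predict_probability(x, beta0, beta):
    """Apply the inverse logit link to the linear combination beta0 + x . beta."""
    eta = beta0 + np.dot(x, beta)
    return 1.0 / (1.0 + np.exp(-eta))

x = np.array([-0.5, -1.2, 0.0])                     # encoded features for one patient
beta0, beta = 0.7, np.array([-0.10, -0.25, -0.40])  # hypothetical intercept and coefficients
print(f"P(satisfied) = {predict_probability(x, beta0, beta):.2f}")
```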
4.2 Datasets
Historical patient data was obtained from the Swe-
spine registry in the form of one dataset per diagno-
sis. The 4 datasets together encompass 37 features.
Table 1 presents the feature types, with examples of
features.
For the purposes of the study, features were se-
lected to jointly optimize for performance and spar-
sity. This was done using backward elimination, with
area under the ROC curve (AUC) as performance
metric for satisfaction and GA, and mean absolute
error (MAE) for length of stay. Among the best-
performing feature sets (i.e. performance not worse
than the best performance minus a tolerance thresh-
old), the feature set with the smallest number of fea-
tures was selected.
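A possible realization of this selection procedure for a binary outcome is sketched below; the tolerance value, the cross-validation setup, and the greedy elimination order are assumptions on our part, since the paper does not fix these details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_select(X, y, feature_names, tolerance=0.01):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    hurts cross-validated AUC the least, then return the smallest feature set
    whose AUC is within `tolerance` of the best AUC seen."""
    def auc(cols):
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[:, cols], y, cv=5, scoring="roc_auc").mean()

    current = list(range(X.shape[1]))
    visited = [(auc(current), list(current))]
    while len(current) > 1:
        candidates = [(auc([c for c in current if c != i]),
                       [c for c in current if c != i]) for i in current]
        score, current = max(candidates, key=lambda t: t[0])
        visited.append((score, list(current)))

    best = max(score for score, _ in visited)
    admissible = [cols for score, cols in visited if score >= best - tolerance]
    return [feature_names[i] for i in min(admissible, key=len)]
```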
4.3 Feature Encoding
Different feature types/encodings afford distinct lin-
guistic expressions of data and warrants. Numeric
features enable the use of scalar triggers such as "young age" or "no pain in the back". For example,
if the patient’s young age is presented as a positive
factor, this invites the inference that the model gen-
erally associates a lower age with a more favourable
outcome.
Binary features enable self-evident warrant-
triggering contrasts. For example, if the patient being
female is presented as a positive factor, this implies
that female patients are generally estimated to have a
more favourable outcome than male patients.
In the case of multinomial categorical features,
warrants are not as straightforwardly triggered. For
example, if the fact that the patient is “treated at a uni-
versity clinic” (rather than a public or private clinic) is
presented as indicative of a negative outcome, a corre-
sponding warrant cannot easily be identified. Specif-
ically, the formulation does not convey whether the
prediction would be more positive if treated in a pub-
lic or private clinic. Since not many multinomial cat-
egoricals were among the most predictive features in
the studied case, this problem was not addressed.
As for ordinals, the picture is more nuanced. At
least two approaches can be conceived: to treat them
either as numeric or as multinomial categorical fea-
tures. In line with the reasoning above, numeric en-
coding is more favourable from the perspective of
linguistic intelligibility. However, a categorical en-
coding may yield better performance. To this end,
the choice of encoding for ordinals was made in a
flexible way. Specifically, high-cardinality ordinals
(> 5 values) were encoded numerically, since it was
deemed highly unintuitive to treat each value on a
10-item pain scale as its own category. As for low-
cardinality ordinals, a data-driven approach was em-
ployed to jointly optimize for linguistic intelligibility
and performance. Among the best-performing encod-
ing candidates (i.e. performance not worse than the
best performance minus a tolerance threshold), the
feature set with the largest number of numeric encodings was selected. (Predictive performance after these interpretability optimizations was AUC 0.65-0.69 for satisfaction, AUC 0.62-0.69 for GA, and mean absolute error 0.27-1.06 for length of stay.)
The proposed strategy for choosing feature encod-
ings is summarized in Table 1.
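The sketch below outlines one way such a data-driven choice could be implemented; the exhaustive search over encoding assignments and the `evaluate` callback (returning, e.g., a cross-validated AUC for a given assignment) are implementation assumptions rather than details given in the paper.

```python
from itertools import product

def choose_ordinal_encodings(ordinal_features, evaluate, tolerance=0.01):
    """Try 'numeric' vs 'one-hot' for each low-cardinality ordinal feature and,
    among encoding assignments whose score is within `tolerance` of the best,
    return the assignment with the largest number of numeric encodings."""
    candidates = []
    for assignment in product(["numeric", "one-hot"], repeat=len(ordinal_features)):
        encoding = dict(zip(ordinal_features, assignment))
        candidates.append((evaluate(encoding), encoding))  # evaluate returns e.g. cross-validated AUC

    best = max(score for score, _ in candidates)
    admissible = [enc for score, enc in candidates if score >= best - tolerance]
    return max(admissible, key=lambda enc: sum(v == "numeric" for v in enc.values()))
```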
4.4 Interface Design
Local explanations are presented in a waterfall chart,
where the estimated outcome is visualized in terms
of the outcome for an average patient, and the cu-
mulative effect of data (see Figure 1), grouped into
positive and negative, and ordered by decreasing im-
portance. For example, a moderate probability of a
successful outcome can be explained by the fact that
the patient has no other illnesses (positive factor) and
relatively severe back pain (negative factor). A max-
imum of three positive and negative factors respec-
tively is shown by default; additional data can be re-
vealed by clicking “Show more”.
Outcomes are visualized along a probability scale
(for satisfaction and GA) or integer scale (for length
of stay), while data (feature contributions) are visual-
ized without any explicit scale. In other words, mathematically, the waterfall chart informally conveys
$$E[Y \mid \bar{X}] + \sum_i \beta_i (X_i - \bar{X}_i) \approx E[Y \mid X]$$
(A completely faithful visualization would need to account for the non-linearity of the link function; this degree of faithfulness was not deemed motivated for the purposes at hand.)
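The quantities shown in the chart can be computed as in the following sketch, which decomposes a prediction into the average-patient baseline and per-feature contributions; the feature values and coefficients used in the example are hypothetical.

```python
import numpy as np

def waterfall_terms(x, x_mean, beta0, beta, inv_link, top_k=3):
    """Decompose a prediction into the average-patient baseline E[Y | x_mean]
    and per-feature contributions beta_i * (x_i - x_mean_i), returning the
    top positive and negative factors sorted by decreasing magnitude."""
    baseline = inv_link(beta0 + np.dot(x_mean, beta))   # outcome for an average patient
    prediction = inv_link(beta0 + np.dot(x, beta))      # outcome for this patient
    contrib = beta * (x - x_mean)                       # contributions on the linear scale
    order = np.argsort(-np.abs(contrib))
    positive = [int(i) for i in order if contrib[i] > 0][:top_k]
    negative = [int(i) for i in order if contrib[i] < 0][:top_k]
    return baseline, prediction, positive, negative, contrib

inv_logit = lambda eta: 1.0 / (1.0 + np.exp(-eta))      # inverse logit link
baseline, pred, pos, neg, contrib = waterfall_terms(
    x=np.array([0.8, -1.0, 1.0]), x_mean=np.array([0.0, 0.0, 0.3]),
    beta0=0.2, beta=np.array([-0.3, -0.5, 0.4]), inv_link=inv_logit)
```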
Warrants conveying information about the statistical
patterns learned by the model, e.g. that lower disabil-
ity is associated with a higher estimated probability
of a successful outcome, are presented inside a wid-
get titled “More information”. The widget also con-
tains information about the sample size and training
data, which argumentatively can be said to back the
warrants. The content of the widget is collapsed by
default, but can be expanded as needed.
Table 1: Feature types among the datasets used to train the predictive models. The reasoning behind the choice of encodings is described in Section 4.3.

Feature type | Encoding | Example(s)
numeric | standardization (continuous value) | age, BMI
binary | binary | gender, employment status
multinomial categorical | one-hot (dummy variables) | clinic type (public, private, or university hospital)
high-cardinality ordinal | standardization (continuous value) | pain levels (10-item scale from 0 (no pain) to 10 (worst conceivable pain))
low-cardinality ordinal | one-hot (dummy variables) or standardization (continuous value) | walking distance (5-item scale from 0-100 meters to more than 2 years)
Figure 1: Screenshot of the prototype, with a hypothetical patient profile. (The screenshot shows a patient profile form with basic, back-specific, sociodemographic, and health information; the predicted probability of being satisfied, probability of successful outcome, and expected length of stay; an explanation panel contrasting the average patient with positive factors (e.g. no previous spine surgery, relatively low disability, no other illness) and negative factors (e.g. treated at university hospital, relatively long duration of leg pain) for the chosen profile; and a pie chart showing the probability of different pain levels after a hypothetical surgery, where a successful outcome means pain completely gone or much improved after 1 year.)
4.5 Choice of Reference
As elaborated above, data are visually and linguis-
tically presented in relation to a reference. In the
proposed prototype, the reference is chosen as the
mean historical patient with the same diagnosis ($R = \bar{X}$), although such a notion may be perceived as ab-
stract. The main reason for not settling with the in-
tercept/bias as reference is that this would introduce
an undesired bias for binary features; e.g. the gender
encoded as 0 (in this case being male) would never be
highlighted as a factor.
The frequently observed comparison with expec-
tation/norm in the corpus (see Section 3.3) may sug-
gest using “healthy patient” as a reference in medical
contexts where it can be clearly established a priori
what constitutes a “healthy” feature value. Allowing
reference to be chosen interactively may also be an
option.
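The sketch below illustrates the point about binary features with a hypothetical gender coefficient: with the intercept (an all-zero reference), a patient encoded as 0 contributes nothing and would never surface as a factor, whereas the mean reference yields a non-zero, displayable contribution. All numbers are assumed for illustration.

```python
import numpy as np

# Hypothetical coefficient for Gender (1 = female) and a patient encoded as male.
beta = np.array([0.6])
x_male = np.array([0.0])
x_mean = np.array([0.45])   # assumed share of female patients in the historical sample

contrib_vs_intercept = beta * (x_male - 0.0)   # 0.0: male gender never highlighted as a factor
contrib_vs_mean = beta * (x_male - x_mean)     # -0.27: shown as a negative factor
print(contrib_vs_intercept, contrib_vs_mean)
```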
4.6 Generation of Data and Warrants
The proposed method for generating linguistic ex-
planations for GLMs consists of general (domain-
and language-independent) functions for generating
data and warrants, which depend on a domain- and
language-specific grammar containing functions that
produce linguistic surface realizations. The overall
architecture of the proposed method is illustrated in
Figure 2.
Figure 2: Overall architecture of the proposed method for generating linguistic explanations for GLMs, with output examples for the feature BackPain: via the grammar, the datum generator produces "Relatively mild back pain" and the warrant generator produces "The less back pain, the higher the calculated probability of a successful outcome. The difference can be up to 13 percentage points." Input to the datum and warrant generators consists of information such as feature type, feature value, mean feature value, and coefficient.
4.6.1 Generation of Data
In the visualization of local explanations, data are ex-
pressed either with scalar triggers (e.g. “Relatively
young/old age”) or a bare linguistic label (e.g. “Fe-
male”), depending on the type of feature. Formally,
we define $\text{DatumPhrase}(t, f, x, \bar{x})$ as a general function that, given information about a feature, returns a datum phrase. Specifically,
$$\text{DatumPhrase}(t, f, x, \bar{x}) = \begin{cases} \text{FeatureLevelPhrase}(f, \text{Level}(x, \bar{x})) & \text{if } t = \text{numeric}, \\ \text{Label}(f, x) & \text{if } t = \text{categorical} \end{cases}$$
where $f$ is the feature symbol, $x = X_i$ is the feature value, and $\bar{x}$ is the mean value of the feature for the entire sample. $\text{FeatureLevelPhrase}(f, l)$ is a grammar function that generates a feature-level phrase for feature $f$ and level $l$ (e.g. "Relatively young age" for $f = \text{Age}$ and $l = \text{low}$), while the grammar function $\text{Label}(f, x)$ returns a bare linguistic label (e.g. "Female" for $f = \text{Gender}$ and $x = 1$). The general function $\text{Level}(x, r)$ returns the level of the feature value $x$ in relation to a reference value $r = R_i$. Specifically,
$$\text{Level}(x, r) = \begin{cases} \text{zero} & \text{if } x = 0, \\ \text{low} & \text{if } 0 < x < r, \\ \text{high} & \text{if } x \geq r \end{cases}$$
For the grammar functions FeatureLevelPhrase
and Label, the present work uses simple mappings
(see Figure 3).
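A minimal Python rendering of these functions is given below, with a toy grammar covering the BackPain and Gender examples; the dictionary-based grammar representation is an implementation assumption on our part.

```python
# Toy grammar covering the BackPain and Gender examples (illustrative only).
FEATURE_LEVEL_PHRASES = {
    "BackPain": {"zero": "No back pain",
                 "low": "Relatively mild back pain",
                 "high": "Relatively severe back pain"},
}
LABELS = {"Gender": {0: "Male", 1: "Female"}}

def level(x, r):
    """Level of feature value x relative to reference value r."""
    if x == 0:
        return "zero"
    if 0 < x < r:
        return "low"
    return "high"

def datum_phrase(feature_type, feature, x, x_mean):
    """Return a datum phrase given the feature type, value and reference."""
    if feature_type == "numeric":
        return FEATURE_LEVEL_PHRASES[feature][level(x, x_mean)]
    return LABELS[feature][x]

print(datum_phrase("numeric", "BackPain", 2.0, 5.5))   # Relatively mild back pain
print(datum_phrase("categorical", "Gender", 1, None))  # Female
```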
Results of applying the proposed method on his-
torical patients (i.e. instances in the datasets) are
shown in Table 2 and Table 3, presenting the 10 most
frequently occurring phrases for positive and negative
factors respectively.
Table 2: The 10 most frequently occurring phrases describing positive factors for historical patients, as generated by the proposed method and aggregated across diagnoses and outcomes. "Occurring" here means that the phrase is included among the top three positive factors shown in the interface. "Positive" refers to higher estimated probability of satisfaction or successful outcome, or shorter duration of stay. Frequencies are relative to the total number of patients.

Data presented as positive | Freq.
Treated in private clinic | 0.46
No other illnesses | 0.35
No previous spine surgery | 0.32
Relatively low disability | 0.18
Relatively few operated levels | 0.17
Relatively short duration of back pain | 0.15
Has university education | 0.13
Relatively mild back pain | 0.13
Can walk relatively far in normal pace | 0.12
No university education | 0.12
Inferences that are invited by presenting data in
the manner described above are mathematically guar-
anteed to correctly reflect the actual reasoning steps
performed by the model. For example, if “relatively
mild back pain” (compared to an average patient)
is presented as a positive factor, this invites the in-
ferences that the patient has some amount of back
pain, but lower than an average patient, and that mild
back pain is associated with a more favourable pre-
dicted outcome than severe back pain. Mathematically, this corresponds to conveying $X_i < \bar{X}_i$ and that $\beta_i (X_i - \bar{X}_i) > 0$ contributes positively to the outcome (assuming a monotonically increasing inverse link function), and linguistically inviting the factual inference $0 < X_i < \bar{X}_i$ and the warrant inference $\beta_i < 0$. The correctness of the factual inference follows from the condition for selecting the word "mild" ($\text{Level}(X_i, \bar{X}_i) = \text{low}$ iff $0 < X_i < \bar{X}_i$). As for the warrant inference,
FeatureLevelPhrase(f, l) =
  if f = BackPain: "No back pain" if l = zero; "Relatively mild back pain" if l = low; "Relatively severe back pain" if l = high
  if f = ...

Label(f, x) =
  if f = Gender: "Male" if x = 0; "Female" if x = 1
  if f = ...

Figure 3: Examples of domain- and language-specific linguistic mappings belonging to an English grammar in the domain of spinal disorders.
Table 3: The 10 most frequently occurring phrases describing negative factors for historical patients, as generated by the proposed method and aggregated across diagnoses and outcomes. "Occurring" here means that the phrase is included among the top three negative factors shown in the interface. "Negative" refers to lower estimated probability of satisfaction or successful outcome, or longer duration of stay. Frequencies are relative to the total number of patients.

Data presented as negative | Freq.
Relatively high disability | 0.19
Treated in public clinic | 0.18
Relatively severe back pain | 0.18
Has other illnesses | 0.16
No university education | 0.16
Relatively long duration of back pain | 0.15
Relatively short height | 0.13
Relatively many operated levels | 0.13
Relatively old age | 0.13
Relatively long duration of leg pain | 0.10
its correctness can be verified as follows. Since $\beta_i (X_i - \bar{X}_i) > 0$ and $X_i - \bar{X}_i < 0$ (because $X_i < \bar{X}_i$), the product $\beta_i (X_i - \bar{X}_i)$ can only be positive if $\beta_i < 0$.
In other words, the mathematical guarantee hinges on
the monotonicity of the link function and on the linear
additive treatment of features.
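The sign argument can also be checked numerically with arbitrary values, as in the small sketch below; the numbers are hypothetical.

```python
# Numeric check of the sign argument above, with arbitrary (hypothetical) values.
beta_i, x_i, x_mean_i = -0.4, 2.0, 5.5        # "relatively mild back pain"
contribution = beta_i * (x_i - x_mean_i)       # -0.4 * -3.5 = 1.4 > 0 (positive factor)
assert contribution > 0 and x_i < x_mean_i and beta_i < 0
```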
4.6.2 Generation of Warrants
Warrants conveying correlations learned by the mod-
els are expressed in ways that communicate both ef-
fect size and, when relevant, polarity. For example,
the warrant for a numeric feature can be formulated
as: "The less back pain, the higher the calculated probability of a successful outcome. The difference can be up to 24 percentage points." Effect sizes are calculated with respect to the magnitude of the coefficient and the absolute maximum slope of the inverse link function ($|\max_x (g^{-1})'(x)|$ is $\frac{1}{4}$ for logistic regression, $\frac{1}{\sqrt{2\pi}}$ for ordered probit, and 1 for linear regression).
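One way such an effect-size computation could be realized is sketched below: the coefficient magnitude is scaled by the feature's span and by the maximum slope of the inverse link (1/4 for the logit link). Using the observed feature range as the span is our assumption; the paper only states that effect sizes depend on the coefficient and the maximum slope, and all numbers are hypothetical.

```python
def max_effect_percentage_points(beta_i, feature_min, feature_max, max_slope=0.25):
    """Upper bound on the predicted-probability difference attributable to one
    feature, in percentage points (max_slope = 1/4 for the logit link)."""
    return abs(beta_i) * (feature_max - feature_min) * max_slope * 100

# Hypothetical standardized back-pain coefficient spanning roughly 4 units.
print(f"The difference can be up to "
      f"{max_effect_percentage_points(-0.13, -2.0, 2.0):.0f} percentage points.")
```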
Table 4 shows generated warrants for a particular
diagnosis and outcome.
4.7 Evaluation
An early prototype of the proposed solution was eval-
uated through a design workshop with orthopaedic
surgeons. In the invitation, potential participants were
informed that they would test a new alternative inter-
face to the Dialogue Support tool, which they all had
experience of using. 3 out of 5 invited respondents
participated in the workshop on site, while one tested
the prototype individually and then gave written feed-
back via email. (One of the authors participated in the workshop in the role of orthopaedic surgeon; this author had not been involved in the development of the prototype or the organization of the workshop.)
In the first part of the workshop, participants indi-
vidually accessed the prototype, where a randomized
patient profile was shown. (The randomization was done individually for each participant; feature values were uniformly sampled from predefined ranges.) They were then asked to
imagine having a dialogue with a patient who wants
to know why the computer estimates an X% probabil-
ity of successful outcome, and to answer the patient’s
question very briefly. Participants noted their answers
and were then asked to voluntarily read them aloud.
In the second part, participants were asked: “Is
there anything in the explanations that can be im-
proved?" (Participants were also asked if there is anything else with the alternative interface that can be improved; results from this part of the workshop are not reported here.)
The discussion was moderated by the or-
ganizer of the workshop (one of the authors).
The version of the prototype tested by partici-
pants only supported one diagnosis (disc herniation)
and two outcomes (satisfaction and GA). Instruc-
tions were conveyed to participants verbally and via a
beamer presentation. Feedback was recorded in notes
and later organized according to themes.
Table 4: Generated warrants conveying correlations for spinal stenosis and pain assessment (GA). Warrants are shown to users when they click "More information" in the interface.

Disability | The lower the disability, the higher the calculated probability of a successful outcome. The difference can be up to 33 percentage points.
Leg pain duration | The shorter the leg pain duration, the higher the calculated probability of a successful outcome. The difference can be up to 19 percentage points.
Previous spine surgery | Patients who have not undergone previous spine surgery are calculated to have a higher probability of a successful outcome. The difference can be up to 14 percentage points.
Back pain | The less back pain, the higher the calculated probability of a successful outcome. The difference can be up to 13 percentage points.
Back pain duration | The shorter the back pain duration, the higher the calculated probability of a successful outcome. The difference can be up to 11 percentage points.
Comorbidity | Patients with no other illnesses are calculated to have a higher probability of a successful outcome. The difference can be up to 8 percentage points.
Unemployment | Patients with employment are calculated to have a higher probability of a successful outcome. The difference can be up to 8 percentage points.
Type of clinic | Patients treated at private clinics are calculated to have the highest probability of a successful outcome. Patients treated at university hospitals are calculated to have the lowest probability of a successful outcome. The difference can be up to 7 percentage points.
4.7.1 Results
In the first part, one of the participants gave the fol-
lowing answer: “There is a 76% probability of be-
ing satisfied with surgery which is 10% worse than
an average pat[ient] who is operated for disc herni-
ation. The reason that it looks this way is that you
have had back pain for a longer time and furthermore
xxx diseases. Furthermore, your age gives you a sta-
tistically somewhat lower probability of a successful
result." (our translation)
As for the second part, one participant com-
mented that the textual explanations for the model
and the specific outcomes were good and informa-
tive. Three suggestions related to global explanations
were raised. First, it was recommended to use pos-
itive instead of negative wording; e.g. substituting
expressions like "The older the age, the lower the..."
with "The younger the age, the higher the...". Second,
the wording "is estimated" was advised to be replaced
with alternatives such as "is calculated", "results have
previously shown", or "based on previous patients' re-
sults of surgery”. Third, it was suggested that features
should be sorted by descending effect size. All these
suggestions were subsequently accommodated.
A question was raised as to whether the correla-
tion between age and satisfaction is in reality “a curve
rather than linear”. The question highlights a discrep-
ancy between the participant’s domain knowledge and
monotonicity assumptions built into the model. The
fact that this discrepancy surfaced can be seen as a
positive finding, in the sense that the tool’s explana-
tion enabled the user to form a correct mental model
of how the AI reasons (and to contrast this model with
his/her own reasoning).
Two participants wondered if it would be adequate
to tell the patient that the model makes its prediction
based on data concerning other patients with similar
characteristics. In Toulmin’s framework, this relates
to backing, i.e. arguing for the general acceptability
of a warrant. To enable more accurate backing, infor-
mation was added in the tool to clarify that predictions
are made on the basis of the entire available sample of
patients with the diagnosis at hand.
One participant observed that some factors were
missing from the global explanations; this was later
attributed to a bug, which has since been resolved.
As an indirect finding, it was observed that no
questions or critical comments were raised regarding
the phrasing of data. Furthermore, when participants
exemplified how they would respond to the patient’s
request for an explanation, the participants seemed to
have interpreted the data as intended. This can be
taken as an indication that the generation of datum
phrases is serving its intended purpose.
In summary, the design workshop resulted in gen-
erally positive findings regarding the proposed lin-
guistic explanations, as well as suggested improve-
ments that have been accommodated into the proto-
type.
5 CONCLUSIONS AND FUTURE
WORK
Informed by a linguistic and argumentative analysis
of doctor-patient dialogues, this paper has proposed
a method for generating linguistic explanations for
predictions from GLMs in the context of a statisti-
cal instrument aiming to assist treatment decisions
concerning degenerative spinal disorders. Unlike
previous approaches to generating natural-language
explanations for predictions from statistical models,
the method proposed here is grounded in empirical
findings concerning analogous human communicative
strategies. Specifically, the proposed method linguis-
tically formulates salient case-specific facts in ways
that concisely indicate the underlying statistical pat-
terns used by the model, without overloading the user
with an exhaustive account of the model’s entire rea-
soning logic. This way, the approach balances pur-
poseful brevity with the possibility to get more de-
tailed information that reduces potential ambiguity re-
garding the model’s reasoning.
The simplicity of the concrete linguistic explana-
tions obtained with the proposed method (“relatively
young age”, etc.) reflects a desirable alignment with
how humans typically formulate explanations. No-
tably, this simplicity is not obtained at the cost of
reduced faithfulness, unlike with popular post-hoc
explanation methods for black-box models (such as
LIME and SHAP).
While the interface design and linguistic surface
realization presented in this paper are adapted to a
specific medical use case, the method as such is
simple in nature and straightforwardly generalizes to
other clinical use cases with similar needs, and po-
tentially also to other domains. Despite recent de-
velopments in deep neural networks and generative
AI, GLMs are still commonly used in high-stakes do-
mains such as healthcare, in part due to their inter-
pretability (Pantanowitz et al., 2024). From this per-
spective, one of the main implications of the pre-
sented work is to demonstrate how interpretability,
conceived as pertaining to certain abstract or for-
mal properties, can be leveraged in practice to obtain
downstream value in the form of linguistic explana-
tions that are aligned with how humans typically ex-
plain decisions and judgements. This can be poten-
tially valuable not only for AI-based decision support,
but also in situations where linguistically intelligible
explanations for AI-based decisions are required for
ethical or legal reasons (see, e.g. (Berman, 2024b)).
On a theoretical level, the work also shows how prac-
tical applications of interpretable AI depend on spe-
cific formal properties of interpretable models, in this case monotonicity and linear additivity.
The proposed method for generating linguistic ex-
planations has been tested and refined through a de-
sign workshop with orthopaedic surgeons. Overall,
the findings from the workshop yielded generally pos-
itive feedback. Nevertheless, the extent to which the
explanations meet needs of doctors and patients in
real-world clinical settings has not been studied. In
a planned next step, the approach will therefore be
tested in a clinical study, where interactions between
patient, doctor, and predictive tool will be recorded
and analysed. Doctors’ and patients’ experiences of
using the tool will also be investigated through inter-
views and questionnaires.
In future work, it might also be useful to conduct
a more fine-grained argumentation analysis with all
Toulmin’s (Toulmin, 2003) argumentative elements
(e.g. backing and qualifier). Moreover, in the cor-
pus analysis it was observed that claims and their ac-
companying data/warrants might be complex in that
they are structured with one argument element em-
bedded in another, having a coordinative arrangement
or other potential arrangements (see, e.g. (Eemeren
et al., 2021)). More elaborate annotation will be re-
quired to bring out these forms of structuring and de-
termine their potential relevance in relation to model
interpretability and AI explanations.
6 ETHICS STATEMENT
Approval from the Swedish Ethical Review Author-
ity was obtained (case number 2024-00839-01) for
handling (de-identified) patient data from Swespine’s
registry. The Ahus data has previously been collected
with informed consent and stored in anonymized form
in accordance with approval from the Regional Ethics
Committee for Medical Research (Gulbrandsen et al.,
2013); since the data is anonymized, no ethical ap-
proval in Sweden was needed for using the data for
the present study.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous re-
viewers for useful comments and suggestions. This
work was supported by the Swedish Research Coun-
cil (VR) grant 2014-39 for the establishment of the
Centre for Linguistic Theory and Studies in Probabil-
ity (CLASP) at the University of Gothenburg.
REFERENCES
Ahmed, S., Shamim Kaiser, M., Hossain, M. S., and An-
dersson, K. (2024). A Comparative Analysis of LIME
and SHAP Interpreters with Explainable ML-Based
Diabetes Predictions. IEEE Access, pages 1–1.
Alonso, J. M. and Bugarín, A. (2019). ExpliClas: Au-
tomatic Generation of Explanations in Natural Lan-
guage for Weka Classifiers. In 2019 IEEE Interna-
tional Conference on Fuzzy Systems (FUZZ-IEEE),
pages 1–6. IEEE.
Baaj, I. (2022). Explainability of possibilistic and fuzzy
rule-based systems. Theses, Sorbonne Université.
Berman, A. (2024a). Argumentative Dialogue As Basis For
Human-AI Collaboration. In Proceedings of HHAI
2024 Workshops.
Berman, A. (2024b). Too Far Away from the Job Mar-
ket – Says Who? Linguistically Analyzing Rationales
for AI-based Decisions Concerning Employment Sup-
port. Weizenbaum Journal of the Digital Society, 4(3).
Breitholtz, E. (2020). Enthymemes and Topoi in Dialogue:
the use of common sense reasoning in conversation.
Brill.
Eemeren, F. H., Garssen, B., and Labrie, N. (2021). Argu-
mentation between Doctors and Patients. Understand-
ing clinical argumentative discourse. John Benjamins
Publishing Company.
Forrest, J., Sripada, S., Pang, W., and Coghill, G. (2018).
Towards making NLG a voice for interpretable ma-
chine learning. In Proceedings of The 11th Interna-
tional Natural Language Generation Conference. As-
sociation for Computational Linguistics (ACL).
Fritzell, P., Mesterton, J., and Hagg, O. (2022). Prediction
of outcome after spinal surgery—using The Dialogue
Support based on the Swedish national quality regis-
ter. European Spine Journal, pages 1–12.
Grice, H. P. (1975). Logic and conversation. Syntax and
semantics, 3:43–58.
Gulbrandsen, P., Finset, A., and Jensen, B. (2013). Lege-
pasient-korpus fra Ahus.
Kaczmarek-Majer, K., Casalino, G., Castellano, G., Dominiak, M., Hryniewicz, O., Kamińska, O., Vessio, G., and Díaz-Rodríguez, N. (2022). Plenary: Explain-
ing black-box models in natural language through
fuzzy linguistic summaries. Information Sciences,
614:374–399.
Lindgren, S. and Aspegren, K. (2004). Kliniska färdigheter: informationsutbytet mellan patient och läkare [Clinical skills: the exchange of information between patient and doctor]. Studentlitteratur AB.
Lundberg, S. M. and Lee, S.-I. (2017). A unified ap-
proach to interpreting model predictions. In Guyon, I.,
Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R.,
Vishwanathan, S., and Garnett, R., editors, Advances
in Neural Information Processing Systems 30, pages
4765–4774. Curran Associates, Inc.
Maraev, V., Breitholtz, E., Howes, C., and Bernardy, J.-
P. (2021). Why should I turn left? Towards active
explainability for spoken dialogue systems. In Pro-
ceedings of the Reasoning and Interaction Conference
(ReInAct 2021), pages 58–64.
Miller, T. (2019). Explanation in artificial intelligence: In-
sights from the social sciences. Artificial Intelligence,
267:1–38.
Miller, T. (2023). Explainable AI is Dead, Long Live Ex-
plainable AI! Hypothesis-driven Decision Support us-
ing Evaluative AI. In Proceedings of the 2023 ACM
Conference on Fairness, Accountability, and Trans-
parency, FAccT ’23, page 333–342, New York, NY,
USA. Association for Computing Machinery.
Pantanowitz, L., Pearce, T., Abukhiran, I., Hanna, M.,
Wheeler, S., Soong, T. R., Tafti, A. P., Pantanowitz,
J., Lu, M. Y., Mahmood, F., Gu, Q., and Rashidi,
H. H. (2024). Nongenerative Artificial Intelligence in
Medicine: Advancements and Applications in Super-
vised and Unsupervised Machine Learning. Modern
Pathology, page 100680.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”Why
Should I Trust You?”: Explaining the Predictions
of Any Classifier. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pages 1135–1144.
Rudin, C. (2019). Stop explaining black box machine learn-
ing models for high stakes decisions and use inter-
pretable models instead. Nature Machine Intelligence,
1(5):206–215.
Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L.,
and Zhong, C. (2022). Interpretable machine learn-
ing: Fundamental principles and 10 grand challenges.
Statistics Surveys, 16(none):1 – 85.
Sbisà, M. (1987). Acts of explanation: A speech act anal-
ysis. Argumentation: Perspectives and approaches,
pages 7–17.
Slack, D., Krishna, S., Lakkaraju, H., and Singh, S. (2023).
Explaining machine learning models with interactive
natural language conversations using TalkToModel.
Nature Machine Intelligence, 5(8):873–883.
Toulmin, S. E. (2003). The uses of argument. Cambridge
university press.
Wahde, M. and Virgolin, M. (2023). DAISY: An Implemen-
tation of Five Core Principles for Transparent and Ac-
countable Conversational AI. International Journal of
Human–Computer Interaction, 39(9):1856–1873.
Winograd, T. (1971). Procedures as a representation for
data in a computer program for understanding natu-
ral language. PhD thesis, Massachusetts Institute of
Technology.
Xydis, A., Hampson, C., Modgil, S., and Black, E. (2020).
Enthymemes in dialogues. In Computational Models
of Argument, pages 395–402. IOS Press.