Doğan, R. I., Leaman, R., and Lu, Z. (2014). NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.
Dragulinescu, S. (2016). Inference to the best explanation and mechanisms in medicine. Theoretical Medicine and Bioethics, 37:211–232.
Eyre, H., Chapman, A. B., Peterson, K. S., Shi, J., Alba, P. R., Jones, M. M., Box, T. L., DuVall, S. L., and Patterson, O. V. (2021). Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. In AMIA Annual Symposium Proceedings, volume 2021, page 438. American Medical Informatics Association.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama,
N., Liu, X., Naumann, T., Gao, J., and Poon, H.
(2020). Domain-specific language model pretraining
for biomedical natural language processing.
Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A. (2020). Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of ACL.
Honnibal, M. and Montani, I. (2017). spaCy 2: Natural lan-
guage understanding with Bloom embeddings, convo-
lutional neural networks and incremental parsing. To
appear.
Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
Johnson, R. H. (2000). Manifest Rationality: A Pragmatic Theory of Argument. Lawrence Erlbaum Associates.
Josephson, J. R. and Josephson, S. G. (1994). Abductive Inference: Computation, Philosophy, Technology. Cambridge University Press.
Kim, J.-D., Ohta, T., Tsuruoka, Y., Tateisi, Y., and Collier, N. (2004). Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 70–75. Citeseer.
Köhler, S., Gargano, M., Matentzoglu, N., Carmody, L. C., Lewis-Smith, D., Vasilevsky, N. A., Danis, D., Balagura, G., Baynam, G., Brower, A. M., et al. (2021). The Human Phenotype Ontology in 2021. Nucleic Acids Research, 49(D1):D1207–D1217.
Krallinger, M., Rabal, O., Leitner, F., Vazquez, M., Salgado, D., Lu, Z., Leaman, R., Lu, Y., Ji, D., Lowe, D. M., et al. (2015). The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7(1):1–17.
Kumar, S. and Talukdar, P. (2020). NILE: Natural language inference with faithful natural language explanations. arXiv preprint arXiv:2005.12116.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Leaman, R., Davis, A. P., Mattingly, C. J., Wiegers, T. C., and Lu, Z. (2016). BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Manzini, E., Garrido-Aguirre, J., Fonollosa, J., and Perera-Lluna, A. (2022). Mapping layperson medical terminology into the Human Phenotype Ontology using neural machine translation models. Expert Systems with Applications, 204:117446.
Michalopoulos, G., Wang, Y., Kaka, H., Chen, H., and Wong, A. (2020). UmlsBERT: Clinical domain knowledge augmentation of contextual embeddings using the Unified Medical Language System Metathesaurus. arXiv preprint arXiv:2010.10391.
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
Mohan, S. and Li, D. (2019). MedMentions: A large biomedical corpus annotated with UMLS concepts. arXiv preprint arXiv:1902.09476.
Narang, S., Raffel, C., Lee, K., Roberts, A., Fiedel, N., and Malkan, K. (2020). WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546.
Naseem, U., Khushi, M., Reddy, V. B., Rajendran, S., Razzak, I., and Kim, J. (2021). BioALBERT: A simple and effective pre-trained language model for biomedical named entity recognition. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–7.
Ngai, H. and Rudzicz, F. (2022). Doctor XAvIer: Explain-
able diagnosis on physician-patient dialogues and
XAI evaluation. In Proceedings of the 21st Workshop
on Biomedical Language Processing, pages 337–344,
Dublin, Ireland. Association for Computational Lin-
guistics.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. (2019). Language models are un-
supervised multitask learners. OpenAI blog, 1(8):9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang,
S., Matena, M., Zhou, Y., Li, W., and Liu, P. J.
(2019). Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv preprint
arXiv:1910.10683.
raj Kanakarajan, K., Kundumani, B., and Sankarasubbu, M. (2021). BioELECTRA: Pretrained biomedical text encoder using discriminators. In Proceedings of the 20th Workshop on Biomedical Language Processing.