
(ICD), notably for the task of predicting future diag-
noses (Nerella et al., 2023). Behrt (Li et al., 2020),
an adaptation of BERT for EHR data, is pre-trained
using a masked language model before being trained
on sequences of ICD codes and age data to predict
future diagnoses. Hi-BEHRT (Li et al., 2022), an ex-
tension of Behrt, uses a hierarchical structure to pro-
cess long sequences of medical data more efficiently.
Furthermore, Med-BERT (Rasmy et al., 2021) mod-
ifies the pre-training task to include the prediction
of length of stay and uses a combination of ICD-9
and ICD-10 codes to predict diabetes and heart fail-
ure. ICD-9 and ICD-10 are two successive revisions
of the International Classification of Diseases. Pro-
posed in 1979, ICD-9 comprises 14,000 mainly nu-
merical and fairly general codes covering diagnoses
and procedures. Adopted in 1990 and implemented
in many countries in the early 2000s, ICD-10 is much
more detailed, with around 70,000 diagnostic codes,
providing a much more precise description of diseases and
their symptoms. HiTANet (Hierarchical Time-aware
Attention Network) (Luo et al., 2020) incorporates a
temporal vector to represent the time elapsed between
consecutive visits, combined with the embedding of
the original visit to predict future diagnoses on three
disease-specific databases. Finally, RAPT (Represen-
tAtion by Pre-training time-aware Transformer) (Ren
et al., 2021) integrates an explicit duration vector with
additional pre-training tasks such as similarity predic-
tion and reasonableness checking to address issues
of insufficient data, incompleteness, and the typical
short sequences of EHR data. RAPT is evaluated
for predicting pregnancy outcomes, risk periods, as
well as diagnoses of diabetes and hypertension dur-
ing pregnancy.
2.2 Validation of Self-Attention Links
Among the studies that use Transformer-type archi-
tectures on electronic health record (EHR) data, those
that evaluate model performance by validating the
self-attention links learned by the model fall into two
groups. The first category includes works that assess
the relevance of self-attention weights through a few
selected examples. Among these works, the authors
of LSAN (Ye et al., 2020), using a hierarchical at-
tention module, randomly select samples to analyze
which symptoms receive the most attention during
each visit for risk prediction. Others, such as the au-
thors of Behrt (Li et al., 2020), Med-BERT (Rasmy
et al., 2021) and (Meng et al., 2021), use the bertviz
(Vig, 2019) tool to visualize interactions between di-
agnoses with significant self-attention weights. This
tool makes it possible to visualize self-attention links
between pairs of elements in a sentence, for a chosen
attention head and layer of the model.
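As a minimal, self-contained illustration of what such pairwise weights are, the following sketch computes scaled dot-product attention for a toy sequence with random vectors; it is not tied to any of the cited models, and the sequence length and dimension are arbitrary placeholders.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax: row i gives the attention that token i
    # pays to every token in the sequence, summing to 1.
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 4, 8                  # e.g. four diagnosis codes in one visit
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
A = attention_weights(Q, K)        # A[i, j]: attention of token i to token j
print(A.shape)                     # (4, 4)
```

In a trained Transformer, one such matrix exists per head and per layer, which is why visualization tools like bertviz let the user select both.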
The second category includes work that modi-
fies the representation of input data, making self-
attention weights more interpretable. For example,
(Dong et al., 2021) represent data as graphs linking
domain concepts. This modification of the data rep-
resentation improves the explainability of the atten-
tion mechanism, as it relies on the attention weights
assigned to each graph instance and not just on the
direct relationships between inputs and outputs. Sim-
ilarly, (Peng et al., 2021) introduce ontologies as input
data, demonstrating that more interpretable links be-
tween medical codes can be obtained.
These works aim to interpret and validate model
learning through self-attention, but experiments in the
EHR field are often limited to validating performance
through manually evaluated visual examples. In this
work, we propose a method that evaluates the learn-
ing of self-attention links by representing them as a
graph and comparing them to a ground truth also rep-
resented as a graph. To represent self-attention links
as a graph, we first extract these weights (Section 3.1)
during inference of a Behrt (Li et al., 2020) model,
by choosing a specific layer. These weights show the
attention that each token gives to every other token in
the same sequence. In parallel with this collection of
data for all sequences, we identify the most influen-
tial tokens for prediction, by analyzing their gradient
(Section 3.2). We then use this information to con-
struct a directed graph (Section 3.3). In this graph, the
tokens of importance are the source nodes, and they
are connected to the other tokens to which they are
linked in the sequences. We also add self-attention
links between tokens that are linked to those identi-
fied as important, illustrating the interactions and self-
attention weights between the different tokens playing
a primary or secondary role in the prediction made by
the model. Finally, we evaluate the graph by measur-
ing the weight of directed edges common to those of
a graph representing ground truth (Section 3.4).
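The pipeline above can be sketched on a toy attention matrix as follows. This is a simplified illustration, not the exact procedure of Sections 3.1–3.4: the ICD-10-style token names, the attention threshold, and the max-aggregation of repeated edges are hypothetical choices, and gradient-based token importance is taken as given.

```python
import numpy as np

def attention_graph(A, tokens, important, threshold=0.3):
    """Build a directed graph from a self-attention matrix.

    A[i, j] is the attention token i pays to token j. Important tokens
    (e.g. identified by gradient magnitude) are the source nodes; tokens
    they attend to above `threshold` become targets, and those targets'
    own strong links are added as secondary edges.
    """
    edges, secondary = {}, set()

    def add_edges_from(i):
        for j in np.flatnonzero(A[i] >= threshold):
            if j != i:
                key = (tokens[i], tokens[j])
                edges[key] = max(edges.get(key, 0.0), float(A[i, j]))
                secondary.add(int(j))

    for i in important:              # primary (important) source nodes
        add_edges_from(i)
    for i in secondary - set(important):   # secondary source nodes
        add_edges_from(i)
    return edges

def shared_edge_weight(pred_edges, truth_edges):
    """Fraction of predicted edge weight lying on ground-truth edges."""
    total = sum(pred_edges.values())
    shared = sum(w for e, w in pred_edges.items() if e in truth_edges)
    return shared / total if total else 0.0

# Toy example: 4 tokens, token 0 deemed important by its gradient.
A = np.array([[0.10, 0.60, 0.20, 0.10],
              [0.30, 0.10, 0.50, 0.10],
              [0.25, 0.25, 0.25, 0.25],
              [0.10, 0.10, 0.10, 0.70]])
tokens = ["E11", "I10", "N18", "Z00"]          # hypothetical ICD-10 codes
g = attention_graph(A, tokens, important=[0])
truth = {("E11", "I10"), ("I10", "N18")}       # hypothetical ground truth
score = shared_edge_weight(g, truth)
print(round(score, 3))                         # 0.786
```

Here token 0 ("E11") attends strongly to "I10", which in turn attends to "E11" and "N18"; two of the three resulting edges lie in the ground-truth graph, giving a weighted overlap of 1.1 / 1.4.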
3 METHODOLOGY
Figure 2 illustrates the different steps of our method-
ology.
3.1 Creation of the Global Attention
Matrix
Let T = {t_1, t_2, . . . , t_V} be a set (or vocabulary) of V
distinct tokens. We consider a labeled dataset X =
(X_i, y_i)_{i ∈ ⟦1; N⟧} consisting of N sequences X_i and their