tokenization or model initialization. We also plan to carry out longer domain-adaptive pretraining sessions and study how this affects the respective approaches. Finally, we plan to investigate extending and augmenting the general-domain vocabulary with clinical terms, rather than completely replacing it with a clinical vocabulary as was done here.
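To illustrate this vocabulary-extension direction, the following minimal sketch shows how clinical terms could be added to a general-domain tokenizer and how the embedding matrix could be grown to match. It assumes the Hugging Face transformers library and the KB-BERT checkpoint identifier KB/bert-base-swedish-cased; the clinical term list is purely hypothetical. This is an illustrative sketch, not the procedure used in this work.

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load a general-domain Swedish checkpoint (assumed model identifier).
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModelForMaskedLM.from_pretrained("KB/bert-base-swedish-cased")

# Hypothetical clinical terms mined from in-domain text, e.g. words that the
# general-domain vocabulary splits into many subword pieces.
clinical_terms = ["sepsis", "antibiotikabehandling", "lungemboli"]

# add_tokens() skips terms already present in the vocabulary and returns the
# number of tokens actually added.
num_added = tokenizer.add_tokens(clinical_terms)

# Grow the embedding matrix to match the extended vocabulary; the new rows are
# randomly initialized and would be learned during continued domain-adaptive
# pretraining on clinical text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} clinical tokens; new vocabulary size: {len(tokenizer)}")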
ACKNOWLEDGEMENTS
This work was partially funded by Region Stockholm
through the project Improving Prediction Models for
Diagnosis and Prognosis of COVID-19 and Sepsis
with Natural Language Processing of Clinical Text.
REFERENCES
Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi,
D., Naumann, T., and McDermott, M. (2019). Pub-
licly available clinical BERT embeddings. In Pro-
ceedings of the 2nd Clinical Natural Language Pro-
cessing Workshop, pages 72–78, Minneapolis, Min-
nesota, USA. Association for Computational Linguis-
tics.
Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A Pre-
trained Language Model for Scientific Text. In Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3606–3611.
Dalianis, H., Henriksson, A., Kvist, M., Velupillai, S.,
and Weegar, R. (2015). HEALTH BANK – A Work-
bench for Data Science Applications in Healthcare.
CEUR Workshop Proceedings Industry Track Work-
shop, pages 1–18.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama,
N., Liu, X., Naumann, T., Gao, J., and Poon,
H. (2021). Domain-specific language model pre-
training for biomedical natural language process-
ing. ACM Transactions on Computing for Healthcare
(HEALTH), 3(1):1–23.
Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K.,
Beltagy, I., Downey, D., and Smith, N. A. (2020).
Don’t Stop Pretraining: Adapt Language Models to
Domains and Tasks. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 8342–8360, Online. Association
for Computational Linguistics.
Koto, F., Lau, J. H., and Baldwin, T. (2021). IndoBER-
Tweet: A Pretrained Language Model for Indone-
sian Twitter with Effective Domain-Specific Vocabu-
lary Initialization. In Proceedings of the 2021 Con-
ference on Empirical Methods in Natural Language
Processing, pages 10660–10668.
Kudo, T. and Richardson, J. (2018). SentencePiece: A sim-
ple and language independent subword tokenizer and
detokenizer for Neural Text Processing. In Proceed-
ings of the 2018 Conference on Empirical Methods
in Natural Language Processing: System Demonstra-
tions, pages 66–71.
Lamproudis, A., Henriksson, A., and Dalianis, H. (2021).
Developing a Clinical Language Model for Swedish:
Continued Pretraining of Generic BERT with In-
Domain Data. In Proceedings of RANLP 2021: Re-
cent Advances in Natural Language Processing, 1-3
Sept 2021, Varna, Bulgaria, pages 790–797.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,
and Kang, J. (2020). BioBERT: a pre-trained biomed-
ical language representation model for biomedical text
mining. Bioinformatics, 36(4):1234–1240.
Lewis, P., Ott, M., Du, J., and Stoyanov, V. (2020). Pre-
trained Language Models for Biomedical and Clinical
Tasks: Understanding and Extending the State-of-the-
Art. In Proceedings of the 3rd Clinical Natural Lan-
guage Processing Workshop, pages 146–157.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen,
D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-
anov, V. (2019). RoBERTa: A Robustly Opti-
mized BERT Pretraining Approach. arXiv preprint
arXiv:1907.11692.
Malmsten, M., Börjeson, L., and Haffenden, C. (2020).
Playing with Words at the National Library of
Sweden–Making a Swedish BERT. arXiv preprint
arXiv:2007.01658.
Remmer, S., Lamproudis, A., and Dalianis, H. (2021).
Multi-label Diagnosis Classification of Swedish Dis-
charge Summaries – ICD-10 Code Assignment Using
KB-BERT. In Proceedings of RANLP 2021: Recent
Advances in Natural Language Processing, RANLP
2021, 1-3 Sept 2021, Varna, Bulgaria, pages 1158–
1166.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neu-
ral Machine Translation of Rare Words with Subword
Units. In Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 1715–1725, Berlin, Ger-
many. Association for Computational Linguistics.
Shin, H.-C., Zhang, Y., Bakhturina, E., Puri, R., Patwary,
M., Shoeybi, M., and Mani, R. (2020). BioMegatron:
Larger Biomedical Domain Language Model. In Pro-
ceedings of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP), pages
4700–4706.
Tai, W., Kung, H., Dong, X. L., Comiter, M., and Kuo,
C.-F. (2020). exBERT: Extending Pre-trained Models
with Domain-specific Vocabulary Under Constrained
Training Resources. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings.