tokenization or model initialization. We also plan to carry out longer domain-adaptive pretraining sessions and study how this affects the respective approaches. Finally, we plan to investigate extending and augmenting the general-domain vocabulary with clinical terms, rather than completely replacing it with a clinical vocabulary as was done here.
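To illustrate this vocabulary-extension direction, the following minimal sketch shows how clinical terms could be added to a general-domain tokenizer and how the embedding matrix could be grown to match. It assumes the Hugging Face transformers library and the KB-BERT checkpoint identifier KB/bert-base-swedish-cased; the clinical term list is purely hypothetical. This is an illustrative sketch, not the procedure used in this work.

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load a general-domain Swedish checkpoint (assumed model identifier).
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModelForMaskedLM.from_pretrained("KB/bert-base-swedish-cased")

# Hypothetical clinical terms mined from in-domain text, e.g. words that the
# general-domain vocabulary splits into many subword pieces.
clinical_terms = ["sepsis", "antibiotikabehandling", "lungemboli"]

# add_tokens() skips terms already present in the vocabulary and returns the
# number of tokens actually added.
num_added = tokenizer.add_tokens(clinical_terms)

# Grow the embedding matrix to match the extended vocabulary; the new rows are
# randomly initialized and would be learned during continued domain-adaptive
# pretraining on clinical text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} clinical tokens; new vocabulary size: {len(tokenizer)}")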
ACKNOWLEDGEMENTS
This work was partially funded by Region Stockholm
through the project Improving Prediction Models for
Diagnosis and Prognosis of COVID-19 and Sepsis
with Natural Language Processing of Clinical Text.
REFERENCES
Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi,
D., Naumann, T., and McDermott, M. (2019). Pub-
licly available clinical BERT embeddings. In Pro-
ceedings of the 2nd Clinical Natural Language Pro-
cessing Workshop, pages 72–78, Minneapolis, Min-
nesota, USA. Association for Computational Linguis-
tics.
Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A Pre-
trained Language Model for Scientific Text. In Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3606–3611.
Dalianis, H., Henriksson, A., Kvist, M., Velupillai, S.,
and Weegar, R. (2015). HEALTH BANK – A Work-
bench for Data Science Applications in Healthcare.
CEUR Workshop Proceedings Industry Track Work-
shop, pages 1–18.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama,
N., Liu, X., Naumann, T., Gao, J., and Poon,
H. (2021). Domain-specific language model pre-
training for biomedical natural language process-
ing. ACM Transactions on Computing for Healthcare
(HEALTH), 3(1):1–23.
Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K.,
Beltagy, I., Downey, D., and Smith, N. A. (2020).
Don’t Stop Pretraining: Adapt Language Models to
Domains and Tasks. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 8342–8360, Online. Association
for Computational Linguistics.
Koto, F., Lau, J. H., and Baldwin, T. (2021). IndoBER-
Tweet: A Pretrained Language Model for Indone-
sian Twitter with Effective Domain-Specific Vocabu-
lary Initialization. In Proceedings of the 2021 Con-
ference on Empirical Methods in Natural Language
Processing, pages 10660–10668.
Kudo, T. and Richardson, J. (2018). SentencePiece: A sim-
ple and language independent subword tokenizer and
detokenizer for Neural Text Processing. In Proceed-
ings of the 2018 Conference on Empirical Methods
in Natural Language Processing: System Demonstra-
tions, pages 66–71.
Lamproudis, A., Henriksson, A., and Dalianis, H. (2021).
Developing a Clinical Language Model for Swedish:
Continued Pretraining of Generic BERT with In-
Domain Data. In Proceedings of RANLP 2021: Re-
cent Advances in Natural Language Processing, 1-3
Sept 2021, Varna, Bulgaria, pages 790–797.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,
and Kang, J. (2020). BioBERT: a pre-trained biomed-
ical language representation model for biomedical text
mining. Bioinformatics, 36(4):1234–1240.
Lewis, P., Ott, M., Du, J., and Stoyanov, V. (2020). Pre-
trained Language Models for Biomedical and Clinical
Tasks: Understanding and Extending the State-of-the-
Art. In Proceedings of the 3rd Clinical Natural Lan-
guage Processing Workshop, pages 146–157.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen,
D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-
anov, V. (2019). RoBERTa: A Robustly Opti-
mized BERT Pretraining Approach. arXiv preprint
arXiv:1907.11692.
Malmsten, M., Börjeson, L., and Haffenden, C. (2020).
Playing with Words at the National Library of
Sweden–Making a Swedish BERT. arXiv preprint
arXiv:2007.01658.
Remmer, S., Lamproudis, A., and Dalianis, H. (2021).
Multi-label Diagnosis Classification of Swedish Dis-
charge Summaries – ICD-10 Code Assignment Using
KB-BERT. In Proceedings of RANLP 2021: Recent
Advances in Natural Language Processing, RANLP
2021, 1-3 Sept 2021, Varna, Bulgaria, pages 1158–
1166.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neu-
ral Machine Translation of Rare Words with Subword
Units. In Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 1715–1725, Berlin, Ger-
many. Association for Computational Linguistics.
Shin, H.-C., Zhang, Y., Bakhturina, E., Puri, R., Patwary,
M., Shoeybi, M., and Mani, R. (2020). BioMegatron:
Larger Biomedical Domain Language Model. In Pro-
ceedings of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP), pages
4700–4706.
Tai, W., Kung, H., Dong, X. L., Comiter, M., and Kuo,
C.-F. (2020). exBERT: Extending Pre-trained Models
with Domain-specific Vocabulary Under Constrained
Training Resources. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings.