Authors:
Mahanazuddin Syed 1; Kevin Sexton 1; Melody Greer 1; Shorabuddin Syed 1; Joseph VanScoy 2; Farhan Kawsar 2; Erica Olson 2; Karan Patel 2; Jake Erwin 2; Sudeepa Bhattacharyya 3; Meredith Zozus 4 and Fred Prior 1
Affiliations:
1 Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, U.S.A.
2 College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, U.S.A.
3 Department of Biological Sciences and Arkansas Biosciences Institute, Arkansas State University, Jonesboro, AR, U.S.A.
4 Department of Population Health Sciences, University of Texas Health Science Center at San Antonio, San Antonio, TX, U.S.A.
Keyword(s):
Clinical Named Entity Recognition, Deep Learning, De-identification, Word Embeddings, and Natural Language Processing.
Abstract:
Clinical named entity recognition (NER) is an essential building block for many downstream natural language processing (NLP) applications such as information extraction and de-identification. Recently, deep learning (DL) methods that utilize word embeddings have become popular in clinical NLP tasks. However, there has been little work on evaluating and combining word embeddings trained on different domains. The goal of this study is to improve the performance of NER in clinical discharge summaries by developing a DL model that combines different embeddings, and to investigate combinations of standard and contextual embeddings from the general and clinical domains. We developed: 1) a human-annotated, high-quality internal corpus of discharge summaries, and 2) a NER model whose input embedding layer combines standard word embeddings, context-based word embeddings, a character-level word embedding built with a convolutional neural network (CNN), and external knowledge sources along with word features as one-hot vectors. The embedding layer is followed by bidirectional long short-term memory (Bi-LSTM) and conditional random field (CRF) layers. The proposed model matches or exceeds state-of-the-art performance on two publicly available data sets and achieves an F1 score of 94.31% on the internal corpus. After incorporating mixed-domain, clinically pre-trained contextual embeddings, the F1 score further improved to 95.36% on the internal corpus. This study demonstrates an efficient way of combining different embeddings that improves recognition performance, aiding the downstream de-identification of clinical notes.
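The CRF layer on top of the Bi-LSTM scores whole tag sequences rather than individual tokens, and the best sequence is recovered at inference time with Viterbi decoding. A minimal sketch of that decoding step in plain Python (the tag names, scores, and the `viterbi_decode` helper are illustrative, not taken from the paper; in the model itself, emission scores would come from the Bi-LSTM and transition scores would be learned CRF parameters):

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for one sentence.

    emissions:   per-token score dicts {tag: score} (e.g. Bi-LSTM outputs)
    transitions: dict {(prev_tag, tag): score} from the CRF layer;
                 missing pairs default to a transition score of 0.0
    """
    tags = list(emissions[0].keys())
    # best[t] = score of the best path ending in tag t at the current token
    best = {t: emissions[0][t] for t in tags}
    backpointers = []
    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            # pick the previous tag that maximizes path score into t
            prev, score = max(
                ((p, best[p] + transitions.get((p, t), 0.0)) for p in tags),
                key=lambda x: x[1],
            )
            new_best[t] = score + em[t]
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)
    # trace back from the best final tag to recover the full sequence
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    return path[::-1]
```

Because the CRF scores transitions between adjacent tags, decoding can prefer a globally coherent sequence (e.g. penalizing an I- tag that does not follow a B- tag) even when per-token emission scores alone would choose otherwise.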