Authors:
Mahbub Ul Alam (1); Aron Henriksson (1); Hideyuki Tanushi (2); Emil Thiman (3, 2); Pontus Naucler (3, 2) and Hercules Dalianis (1)
Affiliations:
(1) Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
(2) Division of Infectious Disease, Department of Medicine, Karolinska Institutet, Stockholm, Sweden
(3) Department of Infectious Diseases, Karolinska University Hospital, Stockholm, Sweden
Keyword(s):
Natural Language Processing, Terminologies, Synonym Extraction, Word Embeddings, Clinical Text.
Abstract:
Many natural language processing applications rely on domain-specific terminologies that contain synonyms. Semi-automatic methods for extracting additional synonyms of a given concept from corpora are therefore useful, especially in low-resource domains and noisy genres such as clinical text, where nonstandard language use and misspellings are prevalent. In this study, prototype embeddings based on seed words were used to create representations for (i) specific urinary tract infection (UTI) symptoms and (ii) UTI symptoms in general. Four word embedding methods and two phrase detection methods were evaluated using clinical data from Karolinska University Hospital. The results show that prototype embeddings can effectively capture semantic information related to UTI symptoms. Prototype embeddings for specific UTI symptoms led to the extraction of more symptom terms than prototype embeddings for UTI symptoms in general. Overall, 142 additional UTI symptom terms were identified, more than doubling the initial seed set. The mean average precision across all UTI symptoms was 0.51, and as high as 0.86 for one specific UTI symptom. The approach offers an effective, low-cost way to expand terminologies using only small amounts of labeled data.
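As a rough illustration of the idea summarized in the abstract, the sketch below builds a prototype embedding from seed words, ranks the remaining vocabulary by similarity to it, and scores a ranked list with average precision. It is not the authors' implementation. Averaging the seed vectors, using cosine similarity, and normalizing average precision by the number of relevant terms retrieved are all assumptions made here for illustration. The function names and the three-dimensional toy vectors are hypothetical.

```python
import math


def prototype(vectors):
    """Average equal-length vectors component-wise (assumed prototype definition)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(embeddings, seeds, top_k=10):
    """Rank non-seed words by cosine similarity to the seed prototype."""
    proto = prototype([embeddings[s] for s in seeds])
    scored = [
        (word, cosine(proto, vec))
        for word, vec in embeddings.items()
        if word not in seeds
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_k]]


def average_precision(ranked, relevant):
    """Average precision over a ranked list, normalized by relevant items retrieved."""
    hits = 0
    precision_sum = 0.0
    for rank, word in enumerate(ranked, start=1):
        if word in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy vectors invented for this example; real embeddings would come from
# a model trained on clinical text.
toy_embeddings = {
    "dysuria": [0.9, 0.1, 0.0],
    "burning": [0.85, 0.15, 0.05],
    "stinging": [0.8, 0.2, 0.1],
    "headache": [0.1, 0.9, 0.2],
}

ranked_terms = rank_candidates(toy_embeddings, ["dysuria", "burning"], top_k=2)
ap = average_precision(ranked_terms, {"stinging"})
```

In a semi-automatic workflow of this kind, a domain expert would review the ranked candidates and mark which ones are valid symptom terms. Those judgments supply the relevant set used for scoring, and averaging the per-symptom scores gives a mean average precision.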