also closer to the original user query are ranked higher than those that are further away.

Table 1: Search metrics.

Metric                    Without expansion   With expansion   % improvement
Mean Average Precision          0.63               0.71              12
Precision @ Rank 10             0.49               0.53               8
Recall @ Rank 10                0.62               0.67               8
NDCG                            0.79               0.83               5
3 RESULTS AND DISCUSSION
The proposed technique was tested in IBM Policy
Insights, an application that helps Medicaid Fraud,
Waste and Abuse (FWA) investigators find policy in-
formation relevant to their investigations.
For ground truth, three subject matter expert (SME) investigators identified a corpus of 345 Medicaid policy documents. They also defined 80 queries representative of those typically used in their investigations, ranging from single words to phrase queries. Every document in the corpus was labelled as either relevant or not relevant for each query. The corpus, queries, and relevance labels were used as ground truth for measuring document retrieval performance. Performance was evaluated using the following metrics: Precision, Recall, Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR) (Büttcher et al., 2016).
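For illustration, the minimal sketch below computes these metrics for a single query, given a ranked result list and the set of relevant document IDs; MAP and MRR are then the means of average precision and reciprocal rank over all 80 queries. The sketch assumes binary relevance labels, and the function names are ours, not from the evaluation code used in the experiments.

import math

def precision_at_k(ranked_ids, relevant, k=10):
    # Fraction of the top-k results that are relevant.
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def recall_at_k(ranked_ids, relevant, k=10):
    # Fraction of all relevant documents found in the top k.
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def average_precision(ranked_ids, relevant):
    # Precision averaged over the ranks of the relevant documents.
    hits, total = 0, 0.0
    for i, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked_ids, relevant):
    # Binary-relevance NDCG: gain 1 for relevant documents, 0 otherwise.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked_ids, start=1) if d in relevant)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, len(relevant) + 1))
    return dcg / idcg if idcg else 0.0

def reciprocal_rank(ranked_ids, relevant):
    # Inverse rank of the first relevant result (0 if none is found).
    return next((1.0 / i for i, d in enumerate(ranked_ids, start=1)
                 if d in relevant), 0.0)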
The underlying search engine used was Watson Discovery Service (WDS) (Turner, 2016), using only its out-of-the-box Elasticsearch-based capabilities. A WDS collection was created for the corpus of 345 policy documents, and UMLS was used as the thesaurus. A word embeddings model was trained on this corpus using word2vec (Le and Mikolov, 2014). To establish baseline document retrieval performance, all 80 queries were run against WDS' out-of-the-box Natural Language Query (NLQ) API, and the above performance metrics were calculated for the returned documents.
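Such an embedding model can be trained with gensim's word2vec implementation, as in the minimal sketch below; the hyperparameter values shown are illustrative assumptions, since the settings used in the experiments are not reported here.

from gensim.models import Word2Vec

# `tokenized_docs` is assumed to hold the 345 policy documents,
# each tokenized into a list of lowercased word tokens.
model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=100,  # embedding dimensionality (assumed)
    window=5,         # context window size (assumed)
    min_count=2,      # ignore very rare terms (assumed)
    workers=4,
)
model.save("policy_word2vec.model")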
For our experiment, the same setup was used, with the 80 queries passed through our proposed query expansion pipeline before being sent to the NLQ search engine. The same performance metrics were again calculated for the returned documents and compared against the baseline.
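A hedged sketch of such an expansion step is given below: candidate terms are gathered from a thesaurus lookup and from embedding nearest neighbours, then ranked by similarity to the original query term, echoing the ranking described above. The `umls_synonyms` mapping stands in for a hypothetical local UMLS index, not a real UMLS API, and the pipeline details are our assumptions rather than the exact implementation.

def expand_query(query_terms, model, umls_synonyms, top_n=3):
    expanded = list(query_terms)
    for term in query_terms:
        # Candidates from the thesaurus and from embedding neighbours.
        candidates = set(umls_synonyms.get(term, []))
        if term in model.wv:
            candidates.update(
                w for w, _ in model.wv.most_similar(term, topn=top_n))
        # Rank candidates by closeness to the original term; candidates
        # absent from the embedding vocabulary are dropped.
        scored = sorted(((model.wv.similarity(term, c), c)
                         for c in candidates if c in model.wv and c != term),
                        reverse=True)
        expanded.extend(c for _, c in scored[:top_n])
    return " ".join(expanded)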
Table 1 summarizes the comparison of key metrics. Averaged over the 80 queries, both recall and precision are improved by our proposed query expansion pipeline. We also observed that queries containing healthcare concepts matched by our thesaurus (as described in Section 2.2) showed larger improvements than those that did not. This is expected given the nature of our corpus and the available thesauri: UMLS contains high-quality, human-curated knowledge. We conclude that combining domain knowledge from human-curated thesauri with automatically learned word embeddings enhances document retrieval performance on a corpus containing a mix of terminology from several domains.
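The per-query breakdown behind this observation can be computed as in the short sketch below; here `per_query_ap` (mapping each query to its baseline and expanded average precision) and `matched` (the set of queries with at least one thesaurus hit) are assumed data structures, not artifacts of the experiments.

def mean_gain(queries, per_query_ap):
    # Mean improvement in average precision over a set of queries.
    gains = [per_query_ap[q][1] - per_query_ap[q][0] for q in queries]
    return sum(gains) / len(gains) if gains else 0.0

matched_gain = mean_gain([q for q in per_query_ap if q in matched],
                         per_query_ap)
unmatched_gain = mean_gain([q for q in per_query_ap if q not in matched],
                           per_query_ap)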
4 FUTURE WORK
The proposed technique was tested on a corpus of healthcare policy documents. Further testing is needed on other domains to assess how well the technique generalizes. In particular, the approach should be evaluated on domains with less detailed human-curated thesauri, such as the insurance sector. Furthermore, metadata of detected concepts, such as semantic groups, will be exploited in future iterations for concept disambiguation.
REFERENCES

Bodenreider, O. (2004). The unified medical language system: integrating biomedical terminology. In Nucleic acids research, volume 32, pages D267–D270. Oxford University Press.

Büttcher, S., Clarke, C. L., and Cormack, G. V. (2016). Information retrieval: Implementing and evaluating search engines. MIT Press.

Carpineto, C. and Romano, G. (2012). A survey of automatic query expansion in information retrieval. In ACM Comput. Surv., volume 44, pages 1:1–1:50, New York, NY, USA. ACM.

Crimp, R. and Trotman, A. (2018). Refining query expansion terms using query context. In Proceedings of the 23rd Australasian Document Computing Symposium, ADCS '18, pages 12:1–12:4, New York, NY, USA. ACM.

Kuzi, S., Shtok, A., and Kurland, O. (2016). Query expansion using word embeddings. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, pages 1929–1932, New York, NY, USA. ACM.
Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML '14, pages 1188–1196.