Mitigating Vocabulary Mismatch on Multi-domain Corpus using Word Embeddings and Thesaurus

Nagesh Yadav, Alessandro Dibari, Miao Wei, John Segrave-Daly, Conor Cullen, Denisa Moga, Jillian Scalvini, Ciaran Hennessy, Morten Kristiansen, Omar O’Sullivan

2020

Abstract

Query expansion is an extensively researched topic in information retrieval that helps mitigate the vocabulary mismatch problem, i.e., the way users express concepts differs from the way those concepts appear in the corpus. In this paper, we propose a query-expansion technique for searching a corpus that contains a mix of terminology from several domains, some of which have well-curated thesauri and some of which do not. An iterative fusion technique is proposed that exploits thesauri for the domains that have them and word embeddings for the domains that do not. For our experiments, we used a corpus of Medicaid healthcare policies that contains a mix of terminology from the medical and insurance domains. The Unified Medical Language System (UMLS) thesaurus was used to expand medical concepts, and a word embeddings model was used to expand non-medical concepts. The technique was evaluated against a baseline of Elasticsearch with no query expansion. The results show an 8% improvement in recall and a 12% improvement in mean average precision.
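The routing idea in the abstract (thesaurus where one exists, embeddings otherwise) can be illustrated with a minimal sketch. This is not the authors' implementation and it omits the iterative fusion step: it is a single-pass illustration under stated assumptions. The names `THESAURUS`, `EMBEDDINGS`, `embedding_neighbours`, and `expand_query`, the toy entries, the 2-d vectors, and the 0.95 similarity threshold are all hypothetical stand-ins for a curated resource such as UMLS and a trained embedding model.

```python
import math

# Hypothetical toy thesaurus standing in for a curated resource such as UMLS.
THESAURUS = {
    "myocardial infarction": ["heart attack"],
    "hypertension": ["high blood pressure"],
}

# Hypothetical 2-d vectors standing in for a trained word-embedding model.
EMBEDDINGS = {
    "copay": (0.9, 0.1),
    "copayment": (0.88, 0.12),
    "deductible": (0.6, 0.5),
    "claim": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_neighbours(term, threshold=0.95):
    """Return vocabulary terms whose cosine similarity to `term` meets the threshold."""
    if term not in EMBEDDINGS:
        return []
    target = EMBEDDINGS[term]
    scored = [
        (cosine(target, vec), other)
        for other, vec in EMBEDDINGS.items()
        if other != term
    ]
    # Most similar first; keep only terms at or above the threshold.
    return [other for score, other in sorted(scored, reverse=True) if score >= threshold]


def expand_query(concepts):
    """Expand each concept via the thesaurus if it is covered, else via embeddings."""
    expanded = []
    for concept in concepts:
        expanded.append(concept)
        if concept in THESAURUS:
            expanded.extend(THESAURUS[concept])
        else:
            expanded.extend(embedding_neighbours(concept))
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(expanded))


print(expand_query(["hypertension", "copay"]))
```

In this toy setup the medical concept is expanded from the thesaurus and the insurance term from its nearest embedding neighbour; a real system would also need concept detection in the query and a way to weight expanded terms, neither of which is shown here.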

Paper Citation


in Harvard Style

Yadav N., Dibari A., Wei M., Segrave-Daly J., Cullen C., Moga D., Scalvini J., Hennessy C., Kristiansen M. and O’Sullivan O. (2020). Mitigating Vocabulary Mismatch on Multi-domain Corpus using Word Embeddings and Thesaurus. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: NLPinAI, ISBN 978-989-758-395-7, pages 441-445. DOI: 10.5220/0009090804410445


in Bibtex Style

@conference{nlpinai20,
author={Nagesh Yadav and Alessandro Dibari and Miao Wei and John Segrave-Daly and Conor Cullen and Denisa Moga and Jillian Scalvini and Ciaran Hennessy and Morten Kristiansen and Omar O’Sullivan},
title={Mitigating Vocabulary Mismatch on Multi-domain Corpus using Word Embeddings and Thesaurus},
booktitle={Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: NLPinAI},
year={2020},
pages={441-445},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009090804410445},
isbn={978-989-758-395-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: NLPinAI
TI - Mitigating Vocabulary Mismatch on Multi-domain Corpus using Word Embeddings and Thesaurus
SN - 978-989-758-395-7
AU - Yadav N.
AU - Dibari A.
AU - Wei M.
AU - Segrave-Daly J.
AU - Cullen C.
AU - Moga D.
AU - Scalvini J.
AU - Hennessy C.
AU - Kristiansen M.
AU - O’Sullivan O.
PY - 2020
SP - 441
EP - 445
DO - 10.5220/0009090804410445