could be applied to multiple NER tasks, such as
gene, chemical and disease mention recognition.
Recently, Semi-Supervised Learning (SSL) techniques have been applied to NER. SSL is an ML approach that typically uses a large amount of unlabeled data together with a small amount of labeled data to build a more accurate classification model than one trained on the labeled data alone. SSL has
received significant attention for two reasons. First,
preparing a large amount of data for training re-
quires a lot of time and effort. Second, since SSL
exploits unlabeled data, the accuracy of classifiers is
generally improved. SSL methods have developed in two different directions: semi-supervised model induction, the traditional approach, which incorporates domain knowledge from unlabeled data into the classification model during the training phase (Munkhdalai 2012, Munkhdalai 2010); and supervised model induction with unsupervised, possibly semi-supervised, feature learning. The approaches in the second research
direction induce a better feature representation by
learning over a large unlabeled corpus. Recently, studies applying word representation features induced from large text corpora have reported improvements over baseline systems in many natural language processing (NLProc) tasks (Turian 2010, Huang 2012, Socher 2011).
In this study, we take a step towards a unified NER system for the biomedical, chemical and medical domains. We evaluate generally applicable word representation features learnt automatically from a large unlabeled corpus for disease NER. The word representation features include Brown cluster labels and Word Vector Classes (WVC) built by applying k-means clustering to the continuous-valued word vectors of a Neural Language Model (NLM). An experimental evaluation on the Arizona Disease Corpus (AZDC) showed that these word representation features boost system performance as significantly as a manually tuned domain dictionary does. BANNER-CHEMDNER (Munkhdalai 2013), a chemical and biomedical NER system, has been extended with a disease mention recognition model that achieves a 77.84% F-measure on AZDC under 10-fold cross-validation.
The rest of this paper is organized as follows.
Section 2 introduces the proposed methodology and
the stages in the disease NER pipeline. Section 3
reports the performance evaluation of the system with combinations of word representation features, together with a comparison against existing systems. Finally, we summarize the main conclusions and outline directions for future work.
2 DISEASE NAMED ENTITY
RECOGNITION
This section describes the proposed disease NER pipeline in detail. First, we perform preprocessing on the MEDLINE and PMC document collections and then extract two different feature sets, a base feature set and a word representation feature set, in the feature processing phase. The unlabeled portion of the collection is fed to the unsupervised learning step of the feature processing phase to build word classes. Finally, we apply the CRF sequence-labeling method to the extracted feature vectors to train the NER model. These steps are described in the following subsections.
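As a concrete illustration of the sequence-labeling setup, the gold disease mentions must be converted to per-token labels before CRF training. The label scheme is not specified in the text, so the sketch below assumes the common IOB encoding with a single "Disease" entity type:

```python
# Sketch of preparing per-token labels for CRF training; the IOB
# encoding with a single "Disease" type is an assumption, not stated
# in the paper.
def spans_to_iob(tokens, spans):
    """spans: (start, end) token-index pairs marking disease mentions,
    with end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-Disease"
        for i in range(start + 1, end):
            labels[i] = "I-Disease"
    return labels

tokens = ["Patients", "with", "breast", "cancer", "were", "enrolled"]
print(spans_to_iob(tokens, [(2, 4)]))
# → ['O', 'O', 'B-Disease', 'I-Disease', 'O', 'O']
```

The resulting label sequence is paired with the per-token feature vectors when training the CRF.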
2.1 Preprocessing
First, the text data is cleansed by removing non-informative characters and replacing special characters with their corresponding spellings. The text is then tokenized with the BANNER simple tokenizer, which breaks the text into tokens that are either a contiguous block of letters and/or digits or a single punctuation mark. Finally, lemma and part-of-speech (POS) information is obtained for further use in the feature extraction phase. In
BANNER-CHEMDNER, BioLemmatizer (Liu
2012) was used for lemma extraction, which resulted
in a significant improvement in overall system per-
formance for biomedical and chemical NER.
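A minimal approximation of the tokenization rule described above (the exact BANNER implementation may differ in edge cases):

```python
import re

# Approximation of the BANNER simple tokenizer: a token is a contiguous
# run of letters, a contiguous run of digits, or a single punctuation
# character.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Type-2 diabetes (p53)."))
# → ['Type', '-', '2', 'diabetes', '(', 'p', '53', ')', '.']
```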
In addition to these preprocessing steps, special
care is taken to parse the PMC XML documents to
get the full text for the unlabeled data collection.
2.2 Feature Processing
We extract features from the preprocessed text to
represent each token as a feature vector, and then an
ML algorithm is employed to build a model for
NER.
The proposed method extracts the baseline and the word representation feature sets. The word representation features are learnt over a large amount of text and may introduce domain background knowledge into the NER model.
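A sketch of how the Word Vector Classes might be built, using toy stand-ins for the NLM word vectors; in the actual system the vectors would be learned over the unlabeled MEDLINE/PMC text, and the vocabulary, vector dimensionality, and cluster count below are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for NLM word vectors; dimensionality (50) and cluster
# count (3) are illustrative, not the settings used in the paper.
vocab = ["carcinoma", "tumor", "lymphoma", "aspirin", "ibuprofen", "the", "of"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(vocab), 50))

# k-means turns the continuous word vectors into discrete Word Vector
# Classes, usable as categorical features for each token.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
wvc = {word: int(label) for word, label in zip(vocab, kmeans.labels_)}
```

At feature-extraction time, a token's WVC label is looked up in `wvc` and added to its feature vector, analogously to a Brown cluster label.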
The feature set for each token is expanded to include the features of surrounding tokens within a sliding window of length two. The word, word n-grams, character n-grams, and traditional orthographic information are extracted as the baseline feature set. The orthographic information is obtained by matching regular expressions against the tokens.
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms