could be applied to multiple NER tasks, such as
gene, chemical and disease mention recognition.
Recently, Semi-Supervised Learning (SSL) techniques have been applied to NER. SSL is an ML approach that typically uses a large amount of unlabeled data together with a small amount of labeled data to build a more accurate classification model than one trained on the labeled data alone. SSL has
received significant attention for two reasons. First,
preparing a large amount of data for training re-
quires a lot of time and effort. Second, since SSL
exploits unlabeled data, the accuracy of classifiers is
generally improved. SSL methods have developed in two different directions: semi-supervised model induction, the traditional approach, which incorporates domain knowledge from unlabeled data into the classification model during the training phase (Munkhdalai 2012, Munkhdalai 2010); and supervised model induction with unsupervised, possibly semi-supervised, feature learning. The approaches in the second research
direction induce a better feature representation by
learning over a large unlabeled corpus. Recently, studies applying word representation features induced from large text corpora have reported improvements over baseline systems in many natural language processing (NLProc) tasks (Turian 2010, Huang 2012, Socher 2011).
In this study, we take a step towards a unified NER system for the biomedical, chemical and medical domains. We evaluate generally applicable word representation features learnt automatically from a large unlabeled corpus for disease NER. The word representation features include Brown cluster labels and Word Vector Classes (WVC) built by applying k-means clustering to the continuous-valued word vectors of a Neural Language Model (NLM). An experimental evaluation on the Arizona Disease Corpus (AZDC) showed that these word representation features boost system performance as significantly as a manually tuned domain dictionary does. BANNER-CHEMDNER (Munkhdalai 2013), a chemical and biomedical NER system, has been extended with a disease mention recognition model that achieves a 77.84% F-measure on AZDC under 10-fold cross-validation.
The rest of this paper is organized as follows.
Section 2 introduces the proposed methodology and
the stages in the disease NER pipeline. Section 3
reports the performance evaluation of the system with combinations of word representation features, together with a comparison against existing systems. Finally, we summarize the main conclusions and outline directions for future work.
2 DISEASE NAMED ENTITY
RECOGNITION
This section describes the proposed disease NER pipeline in detail. First, we perform preprocessing on the MEDLINE and PMC document collections and then extract two different feature sets, a base feature set and a word representation feature set, in the feature processing phase. The unlabeled portion of the collection is fed to the unsupervised learning step of the feature processing phase to build word classes. Finally, we apply the CRF sequence-labeling method to the extracted feature vectors to train the NER model. These steps are described in the following subsections.
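As a concrete illustration of the sequence-labeling setup, the gold disease mentions must be converted to per-token labels before CRF training. The label scheme is not specified in the text, so the sketch below assumes the common IOB encoding with a single "Disease" entity type:

```python
# Sketch of preparing per-token labels for CRF training; the IOB
# encoding with a single "Disease" type is an assumption, not stated
# in the paper.
def spans_to_iob(tokens, spans):
    """spans: (start, end) token-index pairs marking disease mentions,
    with end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-Disease"
        for i in range(start + 1, end):
            labels[i] = "I-Disease"
    return labels

tokens = ["Patients", "with", "breast", "cancer", "were", "enrolled"]
print(spans_to_iob(tokens, [(2, 4)]))
# → ['O', 'O', 'B-Disease', 'I-Disease', 'O', 'O']
```

The resulting label sequence is paired with the per-token feature vectors when training the CRF.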
2.1 Preprocessing
First, the text data is cleansed by removing non-informative characters and replacing special characters with their corresponding spellings. The text is then tokenized with the BANNER simple tokenizer, which breaks the text into tokens that are either a contiguous block of letters and/or digits or a single punctuation mark. Finally, lemma and part-of-speech (POS) information is obtained for further use in the feature extraction phase. In
BANNER-CHEMDNER, BioLemmatizer (Liu
2012) was used for lemma extraction, which resulted
in a significant improvement in overall system per-
formance for biomedical and chemical NER.
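A minimal approximation of the tokenization rule described above (the exact BANNER implementation may differ in edge cases):

```python
import re

# Approximation of the BANNER simple tokenizer: a token is a contiguous
# run of letters, a contiguous run of digits, or a single punctuation
# character.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Type-2 diabetes (p53)."))
# → ['Type', '-', '2', 'diabetes', '(', 'p', '53', ')', '.']
```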
In addition to these preprocessing steps, special
care is taken to parse the PMC XML documents to
get the full text for the unlabeled data collection.
2.2 Feature Processing
We extract features from the preprocessed text to
represent each token as a feature vector, and then an
ML algorithm is employed to build a model for
NER.
The proposed method extracts the baseline and the word representation feature sets. The word representation features are learnt over a large amount of text and may introduce domain background knowledge into the NER model.
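A sketch of how the Word Vector Classes might be built, using toy stand-ins for the NLM word vectors; in the actual system the vectors would be learned over the unlabeled MEDLINE/PMC text, and the vocabulary, vector dimensionality, and cluster count below are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for NLM word vectors; dimensionality (50) and cluster
# count (3) are illustrative, not the settings used in the paper.
vocab = ["carcinoma", "tumor", "lymphoma", "aspirin", "ibuprofen", "the", "of"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(vocab), 50))

# k-means turns the continuous word vectors into discrete Word Vector
# Classes, usable as categorical features for each token.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
wvc = {word: int(label) for word, label in zip(vocab, kmeans.labels_)}
```

At feature-extraction time, a token's WVC label is looked up in `wvc` and added to its feature vector, analogously to a Brown cluster label.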
The feature set for each token is expanded to include the features of surrounding tokens within a sliding window of length two. The word, word n-grams, character n-grams, and traditional orthographic information are extracted as the baseline feature set. The orthographic information is obtained by matching regular expressions against the tokens.
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms