Although UMLS mainly covers biomedical
concepts, we can also find concepts related to
procedures, devices, time, geography, people,
organizations and so on. These concepts play a secondary role in UMLS, but they are used to describe biomedical concepts. This fact makes UMLS a very heterogeneous source of knowledge and, of course, it also complicates the task of annotating and classifying the words contained in it.
Another issue that complicates the semantic analysis is the fact that UMLS contains many multi-word concepts whose components are not described in the KR itself. Indeed, the results of this work can be seen as a first approximation to the problem of semantically decomposing these complex concepts.
In the next section, we present the CRF algorithm in order to understand its mathematical basis. In Section 3, we explain what UMLS is and the modifications made to it for our experiments. In Section 4, we present the proposed method to obtain the seed tagged words that feed the CRF algorithm. Then, in Section 5, a complete evaluation of the process is carried out and, finally, the conclusions are presented in the last section.
2 CONDITIONAL RANDOM FIELDS
As biomedical NER can be thought of as a sequence segmentation problem, where each word is a token in a sequence to be assigned a label, the CRF method was chosen as a good option for annotating the UMLS concept descriptions. CRF is a structured prediction
method, which is essentially a combination of
classification and graphical modeling, combining the
ability of graphical models to compactly model
multivariate data with the ability of classification
methods to perform prediction using large sets of
input features (Sutton and McCallum, 2012).
CRFs are undirected statistical graphical models,
a special case of which is a linear chain that
corresponds to a conditionally trained finite-state
machine. As explained in (Settles, 2004), such
models are well suited to sequence analysis, and
CRFs in particular have been shown to be useful in
part-of-speech tagging (Lafferty, McCallum and
Pereira, 2001), shallow parsing (Sha and Pereira,
2003) and named entity recognition for newswire
data (McCallum and Li, 2003). They have also been applied to the more limited task of finding gene and protein mentions (McDonald and Pereira, 2005), with promising results.
As explained in (Settles, 2004), CRFs are
probabilistic tagging models that give the
conditional probability of a possible tag sequence
given the input token sequence. Let $o = \{o_1, o_2, \ldots, o_n\}$ be a sequence of observed words of length $n$; this is the input token sequence. Let $S$ be a set of states in a finite state machine, each corresponding to a label $l \in L$. Let $s = \{s_1, s_2, \ldots, s_n\}$ be the sequence of states in $S$ that correspond to the labels assigned to the words in the input sequence $o$. Linear-chain CRFs define the conditional probability of a state sequence given an input sequence to be:
$$P(s \mid o) = \frac{1}{Z_o} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j\, f_j(s_{i-1}, s_i, o, i) \right) \qquad (1)$$
where $Z_o$ is a normalization factor over all state sequences and is constant for the given input, $f_j(s_{i-1}, s_i, o, i)$ is one of $m$ feature functions that describes a feature and specifies an association between the predicates that hold at a position and the state for that position, and $\lambda_j$ is a learnt weight for each feature function that specifies whether that association should be favored or disfavored. We assume that the $i$-th input token is represented by a set $o_i$ of predicates that hold of the token or its neighborhood in the input sequence (McDonald and Pereira, 2005).
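To make Eq. (1) concrete, the following sketch computes $P(s \mid o)$ for a toy example by brute-force enumeration of $Z_o$ over all state sequences; the label set, feature functions and weights below are illustrative assumptions and not part of the original work.

```python
import itertools
import math

LABELS = ["O", "B-ENT", "I-ENT"]  # hypothetical label set

def feature_functions():
    """Return a list of (weight, function) pairs; f(prev_state, state, obs, i) -> 0/1."""
    def f_cap_entity(prev, cur, obs, i):
        # Predicate: capitalized token tagged as the beginning of an entity
        return 1.0 if obs[i][0].isupper() and cur == "B-ENT" else 0.0
    def f_inside_after_begin(prev, cur, obs, i):
        # Predicate: I-ENT directly following B-ENT
        return 1.0 if prev == "B-ENT" and cur == "I-ENT" else 0.0
    return [(1.5, f_cap_entity), (0.8, f_inside_after_begin)]

def score(states, obs, weighted_fs):
    """Unnormalized log-score: sum_i sum_j lambda_j * f_j(s_{i-1}, s_i, o, i)."""
    total = 0.0
    prev = "START"  # conventional start state
    for i, cur in enumerate(states):
        total += sum(w * f(prev, cur, obs, i) for w, f in weighted_fs)
        prev = cur
    return total

def probability(states, obs, weighted_fs):
    """P(s | o) = exp(score(s, o)) / Z_o, with Z_o computed by enumeration."""
    z = sum(math.exp(score(seq, obs, weighted_fs))
            for seq in itertools.product(LABELS, repeat=len(obs)))
    return math.exp(score(states, obs, weighted_fs)) / z

obs = ["Myocardial", "infarction", "of", "anterior", "wall"]
print(probability(["B-ENT", "I-ENT", "O", "O", "O"], obs, feature_functions()))
```

In practice $Z_o$ is computed with the forward algorithm rather than by enumeration; the brute-force version is used here only to mirror the formula directly.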
The learnt feature weight $\lambda_j$ for each feature $f_j$ should be highly positive for features that are correlated with the target label, highly negative for features that are anti-correlated with the label, and around zero for relatively uninformative features. These weights are set to maximize the conditional log likelihood of the labeled sequences in a training set $D = \{\langle o, l \rangle^{(1)}, \ldots, \langle o, l \rangle^{(n)}\}$:
$$LL(D) = \sum_{i=1}^{n} \log P\!\left(l^{(i)} \mid o^{(i)}\right) - \sum_{j=1}^{m} \frac{\lambda_j^{2}}{2\sigma^{2}} \qquad (2)$$
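Assuming the definitions from the previous sketch (LABELS, feature_functions, probability and the math import), the regularized objective in Eq. (2) can be evaluated on a toy training set as follows; the training pairs and σ are hypothetical.

```python
def log_likelihood(data, weighted_fs, sigma=1.0):
    """Regularized conditional log-likelihood LL(D) from Eq. (2).

    data: list of (observation sequence, label sequence) pairs.
    weighted_fs: list of (lambda_j, f_j) pairs, as in the previous sketch.
    """
    ll = sum(math.log(probability(labels, obs, weighted_fs))
             for obs, labels in data)
    penalty = sum(w ** 2 for w, _ in weighted_fs) / (2.0 * sigma ** 2)
    return ll - penalty

# Hypothetical two-sentence training set
train = [
    (["Myocardial", "infarction"], ["B-ENT", "I-ENT"]),
    (["no", "acute", "distress"], ["O", "O", "O"]),
]
print(log_likelihood(train, feature_functions()))
```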
The second term is a zero-mean Gaussian prior over the feature weights that penalizes large weights and helps to avoid overfitting (Settles, 2004). When the training state sequences are fully labeled and unambiguous, the objective function is convex, so the model is guaranteed to find the optimal weight settings in terms of LL(D) (Settles, 2004). Once these settings are found, the most probable tag sequence for a given unlabeled input sequence o can be obtained by applying a Viterbi-style algorithm to the maximization (Lafferty, McCallum and Pereira, 2001).
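A minimal sketch of such Viterbi-style decoding for the linear-chain model, again reusing the assumed label set and feature functions from the earlier sketches, could look as follows.

```python
def viterbi(obs, weighted_fs):
    """Return the most probable label sequence under the linear-chain model.

    Dynamic programming over positions: delta[i][s] is the best log-score of
    any partial label sequence ending in state s at position i.
    """
    delta = [{}]   # best partial scores
    back = [{}]    # back-pointers for path recovery
    for s in LABELS:
        delta[0][s] = sum(w * f("START", s, obs, 0) for w, f in weighted_fs)
        back[0][s] = None
    for i in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in LABELS:
            best_prev, best_score = max(
                ((p, delta[i - 1][p] +
                  sum(w * f(p, s, obs, i) for w, f in weighted_fs))
                 for p in LABELS),
                key=lambda t: t[1])
            delta[i][s] = best_score
            back[i][s] = best_prev
    # Follow back-pointers from the best final state to recover the path
    state = max(delta[-1], key=delta[-1].get)
    path = [state]
    for i in range(len(obs) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return list(reversed(path))

print(viterbi(["Myocardial", "infarction", "of", "anterior", "wall"],
              feature_functions()))
```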
Typical features considered in the literature fall mainly into two groups: orthographic features (capitalization, affixes, alphanumeric text, etc.) and semantic features (using, for example, external lexicons) (Settles, 2004).
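As an illustration of the orthographic group, the sketch below derives a few such predicates for a single token; the concrete predicates are assumptions rather than the feature set used in this work.

```python
import re

def orthographic_predicates(token):
    """Return a set of simple orthographic predicates for one token."""
    preds = set()
    if token[0].isupper():
        preds.add("INIT_CAP")
    if token.isupper():
        preds.add("ALL_CAPS")
    if re.search(r"\d", token):
        preds.add("HAS_DIGIT")
    # Affix predicates: lowercased 3-character prefix and suffix
    preds.add("PREFIX3=" + token[:3].lower())
    preds.add("SUFFIX3=" + token[-3:].lower())
    return preds

print(orthographic_predicates("CYP2D6"))
# e.g. INIT_CAP, ALL_CAPS, HAS_DIGIT, PREFIX3=cyp, SUFFIX3=2d6
```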