Table 2: Comparison of our system with selected systems from the BioCreative II Gene Mention Task.
System                 Precision  Recall  F-measure  Characteristics
(Grover et al., 2007)  0.8697     0.8255  0.8470     CRF + Abbreviation Detection + Chunker
(Vlachos, 2007)        0.8628     0.7966  0.8284     CRF + Syntactic Parsing - Domain Concepts
Our                    0.8796     0.7227  0.7936     CRF
(Tsai et al., 2006)    0.9267     0.6891  0.7905     CRF + Post Processing
(Sun et al., 2007)     0.8046     0.7361  0.7688     CRF - Domain Concepts
tablishing relations between the several tokens in a
sentence. In our system, the relations are limited to a
{-1,1} window, because it achieved better results than
larger windows. However, this means that we extract
less contextual information, which has been shown to
be crucial for recognizing multi-token gene/protein
names. This observation underlines the importance of
using as much contextual information as possible.
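The {-1,1} window can be illustrated with a minimal sketch: for each token, the feature set contains the token itself plus its immediate left and right neighbours. The function name `window_features`, the feature keys, the lowercasing, and the `<PAD>` marker for sentence boundaries are assumptions made for illustration; they are not taken from the implemented system.

```python
def window_features(tokens, i, window=1):
    """Build a feature dict for tokens[i] using neighbours within +/- window.

    Hypothetical helper: the feature names and the "<PAD>" boundary marker
    are illustrative assumptions, not the system's actual feature set.
    """
    features = {"w[0]": tokens[i].lower()}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"w[{offset:+d}]"  # e.g. "w[-1]" or "w[+1]"
        # Positions outside the sentence get a padding marker.
        features[key] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return features


sentence = ["The", "BRCA1", "gene", "is", "mutated"]
feats = [window_features(sentence, i) for i in range(len(sentence))]
print(feats[1])
```

Calling the helper with `window=2` would yield a {-2,2} window, which is the kind of larger context the comparison above refers to.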
Among the systems that do not relate all the tokens
of the sentence, our system outperforms the others,
even those that use post-processing methods.
Overall, our system ranks third among the systems
that use only one CRF model.
This comparison shows that contextual features
have more impact than post-processing methods and
domain-specific concepts. Our system achieves better
results than the system presented by Tsai et al. (2006),
which uses a post-processing technique to refine the
recognized names. On the other hand, the system
presented by Vlachos (2007) achieves better results
than our system without using domain concepts.
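As a consistency check on Table 2, the F-measure column can be recomputed from the precision and recall columns, assuming the usual balanced F-measure (the harmonic mean of precision and recall). The function and variable names below are illustrative; the figures are copied from the table.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as listed in Table 2.
table2 = {
    "Grover et al., 2007": (0.8697, 0.8255, 0.8470),
    "Vlachos, 2007": (0.8628, 0.7966, 0.8284),
    "Our": (0.8796, 0.7227, 0.7936),
    "Tsai et al., 2006": (0.9267, 0.6891, 0.7905),
    "Sun et al., 2007": (0.8046, 0.7361, 0.7688),
}

for system, (p, r, reported) in table2.items():
    computed = f_measure(p, r)
    print(f"{system}: computed={computed:.4f} reported={reported:.4f}")
```

Because the published precision and recall are themselves rounded, the recomputed values can differ from the reported ones in the last decimal place.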
4 CONCLUSIONS
In this paper we presented a system to recognize
gene/protein names in natural language texts, using
Conditional Random Fields as the machine learning
technique. A large set of orthographic and
morphological features is used to extract precise and
complete knowledge about word shape. Dictionary
matching and domain-specific concepts are also used
as features, in order to improve the system's overall
recall. Compared to other systems that use weak
contextual information, our system achieved the best
results, with an F-measure of 0.7936.
From the analysis of our results and the comparison
to other similar systems, it seems that exploring
more gene/protein name databases, in order to
match more names correctly and consequently
increase the impact of the dictionary matching feature,
could be beneficial. Another important point is the
introduction of more domain-specific concepts. For
instance, UMLS terminology could be used to help
with gene/protein name recognition. Moreover, the
integration of more features could also be explored,
in order to extract more morphological and
orthographic information (e.g., word length). We also
intend to explore techniques to collect more contextual
information, which was shown to contribute strongly
to performance, in both recall and precision. Finally, in
order to increase the performance of the implemented
system, distinct models may be combined, taking
advantage of the different predictions provided by each
model on the same chunk of text.
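One simple way to combine the predictions of distinct models is token-level majority voting. The sketch below is only an illustration of that idea: the text does not specify a combination strategy, and the B/I/O label scheme, the function name `majority_vote`, and the tie-breaking rule (prefer the earliest-listed model) are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label sequences from several models by per-token majority vote.

    predictions: list of label sequences, one per model, all of equal length.
    Ties are broken in favour of the model listed first (an assumption).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same tokens")

    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(label for label in labels if counts[label] == top))
    return combined


# Hypothetical outputs of three models on the same four-token chunk.
model_a = ["B", "I", "O", "O"]
model_b = ["B", "O", "O", "O"]
model_c = ["B", "I", "O", "B"]
combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

Voting on each token independently can produce label sequences that are not well formed (for example, an "I" without a preceding "B"), so a practical combination would likely need a repair step or a vote over whole name spans.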
ACKNOWLEDGEMENTS
D. Campos is funded by Fundação para a Ciência
e Tecnologia (FCT) under the project PTDC/EIA-
CCO/100541/2008. S. Matos is funded by FCT under
the Ciência2007 programme.
REFERENCES
Ando, R. (2007). BioCreative II gene mention tagging
system at IBM Watson. In Proceedings of the Second
BioCreative Challenge Evaluation Workshop, pages
101–103. Citeseer.
Baldridge, J., Morton, T., and Bierner, G. (2010). openNLP
Package.
Browne, A. C., McCray, A. T., and Srinivasan, S. (2000).
The SPECIALIST LEXICON. Technical report,
Lister Hill National Center for Biomedical
Communications, National Library of Medicine.
Chen, Y., Liu, F., and Manderick, B. (2007). Gene
mention recognition using lexicon match based two-layer
support vector machines. In Proceedings of the
Second BioCreative Challenge Evaluation Workshop; 23
to 25 April 2007; Madrid, Spain.
Franzén, K., Eriksson, G., Olsson, F., Asker, L., Lidén, P.,
and Cöster, J. (2002). Protein names and how to find
them. Int J Med Inform, 67(1-3):49–61.
Grover, C., Haddow, B., Klein, E., Matthews, M., Nielsen,
L., Tobin, R., and Wang, X. (2007). Adapting a
relation extraction pipeline for the BioCreAtIvE II
task. In Proceedings of the Second BioCreative
Challenge Evaluation Workshop, volume 23, pages
273–286. Citeseer.
RECOGNITION OF GENE/PROTEIN NAMES USING CONDITIONAL RANDOM FIELDS