Bacterium Model. The logistic regression model
achieves significantly better results than the baseline
model and all contest submissions. However, many
of the features used have very little influence.
We remark that almost comparable results can be
achieved by a model that always predicts true unless
the bacterium name starts with ‘bacteri’. This sort of
model is of course not generic and largely overfits the
data. It works well because it succeeds in excluding a
significant number of false relations. Labeled entities
occur in surface forms such as ‘bacterium’ and ‘bacterial
infections’. These forms occur relatively often in texts,
but they rarely appear in Localization relations. The
reason for this is that when the word ‘bacterium’ ap-
pears in a text, it usually does not refer to the general
concept but to a specific bacterium discussed previously
in the text. However, to avoid overfitting, it is
preferable to exploit such patterns by including
relevant features rather than implementing strict
decision rules based on them. For the pattern described
above, the name of the specific bacterium entity is
added as a feature in our system.
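The contrast between a strict decision rule and a feature can be sketched as follows. This is a minimal illustration, not the system's code: the function names, the feature keys, and the prefix indicator are hypothetical, and only the ‘bacteri’ prefix rule and the use of the bacterium name as a feature come from the text.

```python
def rule_baseline(bacterium_name: str) -> bool:
    """Hard rule from the text: predict a relation unless the name starts with 'bacteri'."""
    return not bacterium_name.lower().startswith("bacteri")


def bacterium_features(bacterium_name: str) -> dict:
    """Expose the same pattern as features instead of a hard decision (hypothetical keys)."""
    name = bacterium_name.lower()
    return {
        # The entity name itself, as a sparse indicator feature.
        "bacterium_name=" + name: 1.0,
        # Assumed extra indicator for generic surface forms such as 'bacterium'.
        "name_starts_with_bacteri": 1.0 if name.startswith("bacteri") else 0.0,
    }


print(rule_baseline("Bacterial infections"))
print(bacterium_features("Escherichia coli"))
```

A classifier trained on such features can learn to down-weight generic names while still letting other evidence override them, which a hard rule cannot do.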
5 CONCLUSION
In this paper we discussed an approach for the
first two subtasks of the Bacteria Biotopes task of
BioNLP-ST 2013. For the first subtask (entity detec-
tion and ontology mapping) we implemented a model
based on Conditional Random Fields. In this sys-
tem, candidates are generated from the text and thor-
oughly inspected to find matches within the ontology.
We also devised several improvements for the bound-
ary detection of entities. Our model achieved signif-
icantly better results than all official submissions to
BioNLP-ST 2013.
For the second subtask (relation extraction) we
generated candidates with multiple generation rules
(e.g. all bacteria and locations that occur in the same
sentence). To select a candidate we used a logistic
regression model. Because we used a combination of
generation rules we achieved a much higher recall and
therefore a much better score than all official submis-
sions to BioNLP-ST 2013.
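As an illustration of one generation rule, the same-sentence rule mentioned above could be sketched as follows. The entity tuple layout and the type labels 'Bacterium' and 'Location' are assumptions made for this example; the actual system combines several rules and scores each candidate with a logistic regression model.

```python
from collections import defaultdict
from itertools import product


def same_sentence_candidates(entities):
    """Pair every bacterium with every location found in the same sentence.

    `entities` is a list of (entity_id, entity_type, sentence_index) tuples;
    this layout is hypothetical and chosen only for the sketch.
    """
    bacteria = defaultdict(list)
    locations = defaultdict(list)
    for entity_id, entity_type, sentence_index in entities:
        if entity_type == "Bacterium":
            bacteria[sentence_index].append(entity_id)
        elif entity_type == "Location":
            locations[sentence_index].append(entity_id)

    candidates = []
    for sentence_index in sorted(bacteria):
        # Cartesian product of bacteria and locations within one sentence.
        candidates.extend(
            product(bacteria[sentence_index], locations.get(sentence_index, []))
        )
    return candidates


example = [
    ("T1", "Bacterium", 0),
    ("T2", "Location", 0),
    ("T3", "Location", 1),
    ("T4", "Bacterium", 1),
    ("T5", "Location", 1),
]
print(same_sentence_candidates(example))
```

Adding further rules enlarges the candidate set, which is what raises recall; the classifier is then responsible for filtering out the false candidates.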
In spite of these pronounced gains, we think there
is still room for improvement, especially for the second
subtask. One potential improvement of our model would be
to consider long-distance dependencies between the
bacterium and the location, more contextual features, and
additional background knowledge from external resources.
In this direction, structured output prediction and joint
learning frameworks would help us incorporate this kind of
knowledge into an end-to-end entity and relation extraction
model (Kordjamshidi and Moens, 2013; Kordjamshidi and
Moens, 2014).
ACKNOWLEDGEMENTS
The authors would like to thank the Research Founda-
tion Flanders (FWO) for funding this research (grant
G.0356.12), as well as the bilateral project of KU
Leuven and Tsinghua University DISK (DIScovery of
Knowledge on Chinese Medicinal Plants in Biomedical
Texts, grant BIL/012/008). We would also like to
thank the reviewers for their insightful comments and
remarks.
REFERENCES
Bannour, S., Audibert, L., and Soldano, H. (2013).
Ontology-based semantic annotation: an automatic
hybrid rule-based method. In Proceedings of the
BioNLP Shared Task 2013 Workshop, pages 139–143,
Sofia, Bulgaria. ACL.
Björne, J. and Salakoski, T. (2011). Generalizing biomedical
event extraction. In Proceedings of BioNLP Shared
Task 2011 Workshop. ACL.
Björne, J. and Salakoski, T. (2013). TEES 2.1: Automated
Annotation Scheme Learning in the BioNLP 2013
Shared Task. In Proceedings of the BioNLP Shared
Task 2013 Workshop, pages 16–25, Sofia, Bulgaria.
ACL.
Bossy, R., Golik, W., Ratkovic, Z., Bessières, P., and
Nédellec, C. (2013). BioNLP shared Task 2013 –
An Overview of the Bacteria Biotope Task. In Pro-
ceedings of the BioNLP Shared Task 2013 Workshop,
pages 161–169, Sofia, Bulgaria. ACL.
Bossy, R., Jourde, J., Bessières, P., van de Guchte, M., and
Nédellec, C. (2011). BioNLP shared task 2011 - Bacteria
Biotope. In Proceedings of BioNLP Shared Task
2011 Workshop. ACL, pages 56–64.
Claveau, V. (2013). IRISA participation to BioNLP-ST
2013: lazy-learning and information retrieval for in-
formation extraction tasks. In Proceedings of the
BioNLP Shared Task 2013 Workshop, pages 188–196,
Sofia, Bulgaria. ACL.
Grouin, C. (2013). Building a contrasting taxa extractor for
relation identification from assertions: Biological tax-
onomy & ontology phrase extraction system. In Pro-
ceedings of the BioNLP Shared Task 2013 Workshop,
pages 144–152, Sofia, Bulgaria. ACL.
Karadeniz, I. and Özgür, A. (2013). Bacteria biotope
detection, ontology-based normalization, and relation
extraction using syntactic rules. In Proceedings of the
BioNLP Shared Task 2013 Workshop, pages 170–177,
Sofia, Bulgaria. ACL.
Klein, D. and Manning, C. D. (2003). Fast exact inference
with a factored model for natural language parsing. In