Machine Reading of Biological Texts - Bacteria-Biotope Extraction

Wouter Massa, Parisa Kordjamshidi, Thomas Provoost, Marie-Francine Moens

2015

Abstract

The tremendous amount of scientific literature available about bacteria and their biotopes underlines the need for efficient mechanisms to extract this information automatically. This paper presents a system that extracts bacteria and their habitats, as well as the relations between them. We investigate to what extent current techniques are suited for this task and test a variety of models in this regard. To detect entities in a biological text, we use a linear-chain Conditional Random Field (CRF). To predict relations between the entities, we build a model based on logistic regression. Building a system on these techniques, we explore several improvements for both the generation and the selection of good candidates. One contribution lies in the extended flexibility of our ontology mapper, which allows for more advanced boundary detection. Furthermore, we find value in combining several distinct candidate generation rules. Using these techniques, we show results that significantly improve upon the state of the art for the BioNLP Bacteria Biotopes task.
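The two-stage idea in the abstract (entities first, then relation candidates scored by logistic regression) can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' implementation: the BIO tags are taken as given instead of being predicted by a CRF, the tag names `BAC` and `HAB`, the three features, the toy training data, and all function names are hypothetical, and the classifier is trained with plain stochastic gradient ascent.

```python
import math
from itertools import product

def extract_entities(tagged):
    """Collect (type, start, end, text) spans from a BIO-tagged sentence."""
    spans, current = [], None
    for i, (tok, tag) in enumerate(tagged):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], i, i + 1, [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = i + 1          # extend the open span
            current[3].append(tok)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, s, e, " ".join(w)) for t, s, e, w in spans]

def candidate_pairs(entities):
    """One simple generation rule: pair every bacterium with every habitat."""
    bacteria = [e for e in entities if e[0] == "BAC"]
    habitats = [e for e in entities if e[0] == "HAB"]
    return list(product(bacteria, habitats))

def features(bac, hab):
    """Hypothetical features: bias, scaled token gap, bacterium-first flag."""
    gap = max(hab[1] - bac[2], bac[1] - hab[2], 0)
    return [1.0, gap / 10.0, 1.0 if bac[1] < hab[1] else 0.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability that the candidate pair is a true relation."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights by stochastic gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = predict(w, x)
            w = [wi + lr * (label - p) * xi for wi, xi in zip(w, x)]
    return w

# Toy training set (made up): small gaps are positives, large gaps negatives.
X_train = [[1.0, 0.1, 1.0], [1.0, 0.2, 1.0], [1.0, 1.5, 1.0], [1.0, 2.0, 1.0]]
y_train = [1, 1, 0, 0]
weights = train(X_train, y_train)

# Toy sentence with gold BIO tags standing in for the entity detector output.
sentence = [("Bifidobacterium", "B-BAC"), ("longum", "I-BAC"), ("is", "O"),
            ("found", "O"), ("in", "O"), ("the", "O"),
            ("human", "B-HAB"), ("gut", "I-HAB"), (".", "O")]
entities = extract_entities(sentence)
pairs = candidate_pairs(entities)
for bac, hab in pairs:
    print(bac[3], "->", hab[3], round(predict(weights, features(bac, hab)), 3))
```

A real system would replace the gold tags with the output of a trained sequence model and use far richer features; the sketch only shows how candidate generation and candidate selection are separate steps that can be improved independently.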


Paper Citation


in Harvard Style

Massa W., Kordjamshidi P., Provoost T. and Moens M. (2015). Machine Reading of Biological Texts - Bacteria-Biotope Extraction. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015), ISBN 978-989-758-070-3, pages 55-64. DOI: 10.5220/0005214700550064


in Bibtex Style

@conference{bioinformatics15,
author={Wouter Massa and Parisa Kordjamshidi and Thomas Provoost and Marie-Francine Moens},
title={Machine Reading of Biological Texts - Bacteria-Biotope Extraction},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},
year={2015},
pages={55-64},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005214700550064},
isbn={978-989-758-070-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - Machine Reading of Biological Texts - Bacteria-Biotope Extraction
SN - 978-989-758-070-3
AU - Massa W.
AU - Kordjamshidi P.
AU - Provoost T.
AU - Moens M.
PY - 2015
SP - 55
EP - 64
DO - 10.5220/0005214700550064