RECOGNITION OF GENE/PROTEIN NAMES USING CONDITIONAL RANDOM FIELDS

David Campos, Sérgio Matos, José Luis Oliveira

2010

Abstract

With the overwhelming amount of publicly available data in the biomedical field, traditional tasks performed by expert database annotators have rapidly become difficult and very expensive. This situation has led to the development of computerized systems that extract information in a structured manner. The first step of such systems is the identification of named entities (e.g. gene/protein names), a task called Named Entity Recognition (NER). Much of the current research on this problem is based on Machine Learning (ML) techniques, which demand careful definition of the various methods used. This article presents a NER system that uses Conditional Random Fields (CRFs) as the machine learning technique and combines the best techniques recently described in the literature. The proposed system uses biomedical knowledge and a large set of orthographic and morphological features. An F-measure of 0.7936 was obtained on the BioCreative II Gene Mention corpus, a significantly better performance than that of similar baseline systems.
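
To make "orthographic and morphological features" concrete, the following is a minimal Python sketch of the kind of per-token features commonly fed to a CRF tagger. It is an illustration only: the abstract does not list the actual feature set, so every feature name here, and the helper names `word_shape`, `token_features` and `sentence_features`, are assumptions rather than the authors' implementation.

```python
import re


def word_shape(token: str) -> str:
    """Map characters to classes: uppercase -> 'A', lowercase -> 'a', digit -> '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def token_features(token: str) -> dict:
    """Hypothetical orthographic and morphological features for one token."""
    return {
        "lower": token.lower(),
        # Orthographic cues: capitalization, digits, hyphens, overall shape.
        "init_cap": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "shape": word_shape(token),
        # Morphological cues: short prefixes and suffixes.
        "prefix3": token[:3],
        "suffix3": token[-3:],
    }


def sentence_features(tokens: list) -> list:
    """Build one feature dict per token, adding the neighbouring tokens as context."""
    rows = []
    for i, token in enumerate(tokens):
        row = token_features(token)
        row["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
        row["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in sentence_features(["BRCA1", "encodes", "a", "tumor", "suppressor"]):
        print(row)
```

Each sentence becomes a sequence of feature dictionaries, one per token, which is the input shape that sequence labelers of this kind typically expect. The context features (`prev_lower`, `next_lower`) are likewise an assumed design choice, not something stated in the abstract.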

Paper Citation


in Harvard Style

Campos D., Matos S. and Oliveira J. (2010). RECOGNITION OF GENE/PROTEIN NAMES USING CONDITIONAL RANDOM FIELDS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 275-280. DOI: 10.5220/0003096902750280


in Bibtex Style

@conference{kdir10,
author={David Campos and Sérgio Matos and José Luis Oliveira},
title={RECOGNITION OF GENE/PROTEIN NAMES USING CONDITIONAL RANDOM FIELDS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={275-280},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003096902750280},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - RECOGNITION OF GENE/PROTEIN NAMES USING CONDITIONAL RANDOM FIELDS
SN - 978-989-8425-28-7
AU - Campos D.
AU - Matos S.
AU - Oliveira J.
PY - 2010
SP - 275
EP - 280
DO - 10.5220/0003096902750280