PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS

Kirill Lassounski, Sahudy Montenegro González, Annabell del Real Tamariz, Gabriel Lima de Oliveira

Abstract

The NCBI (National Center for Biotechnology Information) provides information about genes, proteins, scientific literature, and molecular structures, among other resources related to biomedicine. Its PubMed database stores about 21 million scientific articles. Many research projects in the information retrieval field need to obtain useful data from PubMed automatically for evaluation and testing. This work describes a Java library that constructs datasets so that such projects can evaluate their results easily and quickly. Users must set input and output parameters, such as the article attributes to include (title, abstract, keywords, etc.), and the resulting dataset is produced as a serializable file. PubMed Dataset was created because the authors needed to build their own datasets to evaluate the results of their system. This article also presents the BioSearch Refinement system as a case study: the system uses the library to construct the datasets used to evaluate its algorithm for automatic keyphrase extraction. We also discuss the benefits obtained from using PubMed Dataset.
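
To make the idea of a dataset "produced as a serializable file" concrete, here is a minimal sketch using standard Java object serialization. It is not the PubMed Dataset API: the class names `DatasetSketch`, `Article`, and `Dataset`, the chosen fields, and the helper methods are all hypothetical, and the abstract does not say which serialization mechanism the library actually uses. The sketch keeps the selected article attributes (title, abstract, keywords) in a serializable object and round-trips the dataset through a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout: this is NOT the PubMed Dataset API.
class DatasetSketch {

    /** One article, reduced to the attributes selected for the dataset. */
    static class Article implements Serializable {
        private static final long serialVersionUID = 1L;
        final String title;
        final String abstractText;
        final List<String> keywords;

        Article(String title, String abstractText, List<String> keywords) {
            this.title = title;
            this.abstractText = abstractText;
            this.keywords = new ArrayList<>(keywords); // defensive copy
        }
    }

    /** A dataset modeled as a serializable list of articles (an assumption). */
    static class Dataset implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<Article> articles = new ArrayList<>();
    }

    /** Writes the dataset to a byte array with Java object serialization. */
    static byte[] serialize(Dataset dataset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(dataset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads a dataset back from bytes produced by serialize(). */
    static Dataset deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Dataset) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Dataset dataset = new Dataset();
        // Placeholder values, not real PubMed records.
        dataset.articles.add(new Article(
            "Placeholder title",
            "Placeholder abstract text.",
            Arrays.asList("keyphrase extraction", "information retrieval")));

        Dataset restored = deserialize(serialize(dataset));
        System.out.println(restored.articles.get(0).title);
    }
}
```

Writing to a file instead of a byte array would only require wrapping a `FileOutputStream` in the `ObjectOutputStream`. The in-memory version is used here so the sketch runs without touching the file system.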


Paper Citation


in Harvard Style

Lassounski K., Montenegro González S., del Real Tamariz A. and Lima de Oliveira G. (2012). PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012) ISBN 978-989-8425-90-4, pages 343-346. DOI: 10.5220/0003797203430346


in Bibtex Style

@conference{bioinformatics12,
author={Kirill Lassounski and Sahudy Montenegro González and Annabell del Real Tamariz and Gabriel Lima de Oliveira},
title={PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)},
year={2012},
pages={343-346},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003797203430346},
isbn={978-989-8425-90-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)
TI - PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS
SN - 978-989-8425-90-4
AU - Lassounski K.
AU - Montenegro González S.
AU - del Real Tamariz A.
AU - Lima de Oliveira G.
PY - 2012
SP - 343
EP - 346
DO - 10.5220/0003797203430346