Authors:
Kirill Lassounski¹; Sahudy Montenegro González²; Annabell del Real Tamariz¹ and Gabriel Lima de Oliveira¹
Affiliations:
¹ State University of Norte Fluminense, Brazil; ² Federal University of São Carlos, Brazil
Keyword(s):
Information Retrieval, Text Mining, PubMed, Evaluation Dataset, Java.
Related Ontology Subjects/Areas/Topics:
Algorithms and Software Tools; Bioinformatics; Biomedical Engineering
Abstract:
The NCBI (National Center for Biotechnology Information) provides information about genes, proteins, scientific literature, and molecular structures, among other resources related to biomedicine. The NCBI maintains a database called PubMed that stores about 21 million scientific articles. Many research projects in the information retrieval field need to automatically obtain useful data from PubMed for evaluation and testing. This work describes PubMed Dataset, a Java library for constructing datasets, so that researchers can evaluate their results easily and quickly. Users set input and output parameters, such as the article attributes to include (title, abstract, keywords, etc.), and the resulting dataset is produced as a serializable file. PubMed Dataset was created because the authors needed to build their own datasets to evaluate their system's results. This article also presents the BioSearch Refinement system as a case study. The system uses the library to construct the datasets used to evaluate its algorithm for automatic keyphrase extraction. We also discuss the benefits obtained from using PubMed Dataset.