4.2 Discussion about PubMed Dataset Usage
Creating datasets with the library was quick and easy:
only the initial parameters had to be specified. Dataset
construction is flexible because the data and parameters
can be modified and the build run again. The number of
retrieved articles varies with the attributes specified
by the user. The more attributes are required, the fewer
articles are retrieved, since an article is less likely
to have all of them. The quality of a dataset depends on
what is available on PubMed, because the dataset reflects
the data contained in the source. The performance results
are shown in Table 2.
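The inverse relation between the number of required attributes and the number of retrieved articles can be illustrated with a minimal Java sketch. It is not the library's API: the class `AttributeFilterSketch`, the map-based article representation, and the attribute names are assumptions made only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: articles are modelled as attribute maps,
// which is an assumption and not the library's actual data model.
class AttributeFilterSketch {

    // Keeps only the articles that have a non-empty value for every required attribute.
    static List<Map<String, String>> retainComplete(
            List<Map<String, String>> articles, List<String> required) {
        return articles.stream()
                .filter(article -> required.stream().allMatch(key -> {
                    String value = article.get(key);
                    return value != null && !value.isEmpty();
                }))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical records; the attribute names are placeholders.
        List<Map<String, String>> articles = List.of(
                Map.of("title", "A", "abstract", "text", "keywords", "k1"),
                Map.of("title", "B", "abstract", "text"),
                Map.of("title", "C"));

        // Each additional required attribute can only keep or shrink the result.
        System.out.println(retainComplete(articles, List.of("title")).size());
        System.out.println(retainComplete(articles, List.of("title", "abstract")).size());
        System.out.println(
                retainComplete(articles, List.of("title", "abstract", "keywords")).size());
    }
}
```

With these three hypothetical records, the result shrinks from three articles to two and then to one as attributes are added.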
Table 2: Performance times of PubMed Dataset.
Dataset  Step             Runtime
1        Downloading      7,45 minutes
         Serialization    2 seconds
         Unserialization  3 seconds
         Total            9,45 minutes
2        Downloading      3,11 minutes
         Serialization    859 milliseconds
         Unserialization  1 second
         Total            3,13 minutes
The download phase is executed only once and always
depends on network traffic. For Dataset 1, the time
needed to download 10,000 articles was considerably
high. Once the dataset is written to disk, loading it
takes very little time, which makes it convenient for
testing.
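The download-once, load-many pattern described above can be sketched as follows. This is a minimal illustration that assumes standard Java object serialization (`java.io`); the text does not show the library's actual file format or class names, so `DatasetCacheSketch`, `loadOrBuild`, and the stand-in builder are hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.function.Supplier;

// Illustrative sketch only: standard Java serialization is assumed here;
// the library's real storage format is not described in the text.
class DatasetCacheSketch {

    // Loads the dataset from disk if the file exists; otherwise builds it once
    // (the slow step, e.g. a download) and serializes it for later runs.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T loadOrBuild(Path file, Supplier<T> builder)
            throws IOException, ClassNotFoundException {
        if (Files.exists(file)) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                return (T) in.readObject();
            }
        }
        T dataset = builder.get();
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(dataset);
        }
        return dataset;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempDirectory("dataset-sketch").resolve("dataset.ser");

        // Stand-in for the download phase; a real builder would query PubMed.
        Supplier<ArrayList<String>> slowBuilder = () -> {
            ArrayList<String> titles = new ArrayList<>();
            titles.add("hypothetical article title");
            return titles;
        };

        ArrayList<String> first = loadOrBuild(file, slowBuilder);   // builds and writes the file
        ArrayList<String> second = loadOrBuild(file, slowBuilder);  // reads the file, no rebuild
        System.out.println(first.equals(second));
    }
}
```

Only the first call pays the cost of building the dataset. Every later test run reads the serialized file, which matches the short unserialization times reported in Table 2.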
5 CONCLUSIONS
In the team's experience, the time spent building and
managing evaluation datasets is significant compared
with the time spent on the research itself, which weighs
in favor of using this library. The PubMed Dataset
proved very useful for the fast, easy, and efficient
construction of datasets. The three design principles
(portability, flexibility, and open source) were
achieved. Flexibility deserves to be highlighted, since
any article attribute can be included in the dataset.
A case study was presented in which two datasets were
created using the library. The initial configuration was
specified directly in Java code. Each dataset is stored
as a serialized file, and the execution times of the
complete process are not an issue. We believe that
future use of the library by other users will bring
suggestions on how to make it even more flexible and
adaptable. The API's source code and documentation are
available for download at
https://github.com/lassounski/PubMed-Dataset.
In an evaluation environment, tests are created
gradually and more specific test cases are often added
later. In this scenario the tests will clearly be run
many times. Without this library, the time spent
obtaining test data and running the tests would be
considerable. Note that dataset integrity and
consistency are not guaranteed, since the data retrieved
from PubMed may have missing values and may vary over
time.
ACKNOWLEDGEMENTS
This research was supported by CNPq (National Council
for Scientific and Technological Development), Brazil.
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms