the 4 classes using only the ones containing at least one of the following MeSH
headings: “Erythrocytes”, “Escherichia coli”, “Protein Binding” and “Blood Pressure”. In
this sample we started with more than 1300 attributes, which can be represented using a
fact table and several dimension tables in our snowflake-schema data warehouse. The
pruning pre-processing was performed using only SQL instructions on the data warehouse.
Unlike [6], we also used the abstracts of the papers. Like [8], we used a
bag-of-words representation and implemented the standard term frequency-inverse
document frequency (TFIDF) function to assign a weight to each term. We generated
several data sets by combining the different pre-processing techniques. We compared,
in a table, the accuracy obtained with the different pre-processing techniques
and classification algorithms, and we reported the computation time required
to run the algorithms. The best accuracy achieved (87.42%) is promising.
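As an illustration of the TFIDF weighting mentioned above, the following is a minimal sketch in Python. It assumes one common variant, raw term frequency multiplied by log(N/df); the exact variant, the tokenisation and the function name `tfidf` are assumptions made for this example and are not taken from our implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TFIDF weights for a list of tokenised documents.

    Assumed variant: weight(t, d) = tf(t, d) * log(N / df(t)),
    where N is the number of documents and df(t) is the number
    of documents that contain term t.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency in this document
        weights.append(
            {term: count * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return weights

# Toy bag-of-words documents (hypothetical, for illustration only).
docs = [
    "erythrocytes protein binding".split(),
    "escherichia coli protein binding".split(),
    "blood pressure erythrocytes".split(),
]
weights = tfidf(docs)
```

Under this variant, a term that occurs in every document receives weight zero, while a term confined to a single document receives the largest weight for a given frequency.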
As future work, and to achieve better classification, we may: 1) incorporate more
information; 2) optimise the MeSH term selection for each document; 3) test with other
MEDLINE sources, such as the 2010 MEDLINE version; and 4) use the full
MEDLINE database.
References
1. Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church,
D. M., Dicuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Helmberg, W., Kapustin, Y., Ken-
ton, D. L., Khovayko, O., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Pruitt,
K. D., Schuler, G. D., Schriml, L. M., Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov,
A., Starchenko, G., Suzek, T. O., Tatusov, R., Tatusova, T. A., Wagner, L., Yaschenko, E.:
Database resources of the National Center for Biotechnology Information. Nucleic Acids Res
34 (2006)
2. Zhou, W., Smalheiser, N.R., Yu, C.: A tutorial on information retrieval: basic terms and
concepts. Journal of Biomedical Discovery and Collaboration 1 (2006)
3. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings
in Bioinformatics 6 (2005) 57–71
4. Chapelle, O., et al.: Semi-supervised learning. MIT Press, Cambridge (2006)
5. Zhu, X.: Semi-supervised learning literature survey (2006)
6. Yetisgen-Yildiz, M., Pratt, W.: The effect of feature representation on MEDLINE document
classification. AMIA Annu Symp Proc. (2005) 849–853
7. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1)
(2002) 1–47
8. Lan, M., Tan, C.L., Su, J., Low, H.B.: Text representations for text categorization: a
case study in biomedical domain. In: IJCNN'07: International Joint Conference on Neural
Networks. (2007)
9. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38
(1995) 39–41
10. Porter, M.F.: An algorithm for suffix stripping (1997)
11. Hosford medical terms dictionary v3.0 (2010)
12. Inmon, W.: Building the Data Warehouse. (2002)
13. Schonbach, C., Kowalski-Saunders, P., Brusic, V.: Data warehousing in molecular biology.
Briefings in Bioinformatics 1 (2000) 190–198
14. Chituc, C.M.: Data warehousing - lecture notes (2009)