The Impact of Pre-processing on the Classification of MEDLINE Documents

Carlos Adriano Gonçalves, Célia Talma Gonçalves, Rui Camacho, Eugénio Oliveira

2010

Abstract

The amount of information available in the MEDLINE database makes it very hard for a researcher to retrieve a reasonable number of relevant documents using a simple query language interface. Automatic classification of documents may be a valuable technology to help reduce the number of documents retrieved for each query. To accomplish this, it is crucial to use appropriate pre-processing techniques on the data. The main goal of this study is to analyse the impact of pre-processing techniques on the text classification of MEDLINE documents. We assessed the effect of combining different pre-processing techniques with several classification algorithms available in the WEKA tool. Our experiments show that applying pruning, stemming and WordNet significantly reduces the number of attributes and improves the accuracy of the results.
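As a rough illustration of how stemming and pruning shrink the attribute set, the following minimal sketch builds a vocabulary from three invented sentences. Everything in it is an assumption made for illustration: the stop-word list, the `crude_stem` suffix stripper (a stand-in, not the Porter algorithm), the `min_df` pruning threshold, and the sample documents. It is not the authors' pipeline, which used WEKA, and the WordNet step is omitted because it needs an external lexical database.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "for", "to", "with"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocabulary(docs, stem=False, min_df=1):
    """Return the sorted attributes kept after optional stemming and pruning.

    Pruning here means dropping terms that occur in fewer than min_df documents.
    """
    df = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        if stem:
            tokens = [crude_stem(t) for t in tokens]
        df.update(set(tokens))  # count each term once per document
    return sorted(term for term, n in df.items() if n >= min_df)


# Invented sample documents, for illustration only.
docs = [
    "Binding of proteins in infected cells",
    "Protein binds the infected cell",
    "Sequencing of cells infected with the protein",
]

raw = build_vocabulary(docs)
stemmed = build_vocabulary(docs, stem=True)
pruned = build_vocabulary(docs, stem=True, min_df=2)
print(len(raw), len(stemmed), len(pruned))  # prints "8 5 4"
```

On this toy input, stemming merges inflected forms such as "binding" and "binds", and pruning then removes the one term that appears in a single document. The sketch only shows the effect on attribute count; it says nothing about classification accuracy.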



Paper Citation


in Harvard Style

Adriano Gonçalves C., Talma Gonçalves C., Camacho R. and Oliveira E. (2010). The Impact of Pre-processing on the Classification of MEDLINE Documents. In Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010) ISBN 978-989-8425-14-0, pages 53-61. DOI: 10.5220/0003028700530061


in Bibtex Style

@conference{pris10,
author={Carlos Adriano Gonçalves and Célia Talma Gonçalves and Rui Camacho and Eugénio Oliveira},
title={The Impact of Pre-processing on the Classification of MEDLINE Documents},
booktitle={Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)},
year={2010},
pages={53-61},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003028700530061},
isbn={978-989-8425-14-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)
TI - The Impact of Pre-processing on the Classification of MEDLINE Documents
SN - 978-989-8425-14-0
AU - Adriano Gonçalves C.
AU - Talma Gonçalves C.
AU - Camacho R.
AU - Oliveira E.
PY - 2010
SP - 53
EP - 61
DO - 10.5220/0003028700530061