Information Retrieval in Medicine - An Extensive Experimental Study

Roberto Gatta, Mauro Vallati, Berardino De Bari, Nadia Pasinetti, Carlo Cappelli, Ilenia Pirola, Massimo Salvetti, Michela Buglione, Maria L. Muiesan, Stefano M. Magrini, Maurizio Castellano


Clinical documents stored as unstructured text are a valuable source of information that can be exploited through Information Retrieval techniques. Classification algorithms, and their combination through Ensemble Methods, can be used to organize this large volume of data, but they are usually tested on standardized corpora, which differ significantly from the clinical documents actually found in a modern hospital. In this paper we present the results of a large experimental analysis conducted on 36,000 clinical documents generated by three different medical departments. For this investigation we propose a new classifier, based on the idea of entropy, and test four single algorithms and four ensemble methods. The experimental results show the performance of the selected approaches in a real-world environment and highlight the impact of obsolescence on classification.



Paper Citation

in Harvard Style

Gatta R., Vallati M., De Bari B., Pasinetti N., Cappelli C., Pirola I., Salvetti M., Buglione M., Muiesan M., Magrini S. and Castellano M. (2014). Information Retrieval in Medicine - An Extensive Experimental Study. In Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2014) ISBN 978-989-758-010-9, pages 447-452. DOI: 10.5220/0004909904470452

in Bibtex Style

author={Roberto Gatta and Mauro Vallati and Berardino De Bari and Nadia Pasinetti and Carlo Cappelli and Ilenia Pirola and Massimo Salvetti and Michela Buglione and Maria L. Muiesan and Stefano M. Magrini and Maurizio Castellano},
title={Information Retrieval in Medicine - An Extensive Experimental Study},
booktitle={Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2014)},

in EndNote Style

JO - Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2014)
TI - Information Retrieval in Medicine - An Extensive Experimental Study
SN - 978-989-758-010-9
AU - Gatta R.
AU - Vallati M.
AU - De Bari B.
AU - Pasinetti N.
AU - Cappelli C.
AU - Pirola I.
AU - Salvetti M.
AU - Buglione M.
AU - Muiesan M.
AU - Magrini S.
AU - Castellano M.
PY - 2014
SP - 447
EP - 452
DO - 10.5220/0004909904470452