Combining Syntactic and Semantic Vector Space Models in the Health Domain by using a Clustering Ensemble

Flora Amato, Francesco Gargiulo, Antonino Mazzeo, Sara Romano, Carlo Sansone

Abstract

The adoption of services for automatic information management is one of the most interesting open problems in various professional and social fields. We focus on the health domain, which is characterized by the production of huge volumes of documents and in which innovative information management systems can significantly improve both the tasks performed by the actors involved and the quality of the health services offered. In this work we propose a methodology for automatic document categorization based on unsupervised learning techniques. We extract both semantic and syntactic features to define the vector space models, and we propose a clustering ensemble to increase the discriminative power of the approach. Results on real medical records, digitized by means of a state-of-the-art OCR technique, demonstrate the effectiveness of the proposed approach.
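
As a rough illustration of the idea of a clustering ensemble over two vector space models, the sketch below combines one partition obtained from syntactic features with one obtained from semantic features. The abstract does not state which combination rule the authors use, so this assumes a simple co-association (evidence accumulation) scheme: two documents end up in the same consensus cluster when more than a given fraction of the base partitions put them together. The function name `consensus`, the `threshold` parameter, and the toy label lists are hypothetical and not taken from the paper.

```python
from itertools import combinations


def consensus(partitions, threshold=0.5):
    """Merge base partitions by co-association voting (illustrative only).

    partitions: list of label lists, one per base clustering, all of the
                same length (one label per document).
    threshold:  two documents are linked when the fraction of partitions
                that co-cluster them is strictly greater than this value.
    Returns consensus labels numbered by order of first appearance.
    """
    n = len(partitions[0])
    m = len(partitions)
    parent = list(range(n))  # union-find forest over documents

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        votes = sum(1 for labels in partitions if labels[i] == labels[j])
        if votes / m > threshold:
            parent[find(j)] = find(i)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy labels for six documents (hypothetical, not from the paper).
syntactic_labels = [0, 0, 0, 1, 1, 1]  # e.g. clustering on term-based vectors
semantic_labels = [0, 0, 1, 1, 2, 2]   # e.g. clustering on concept-based vectors

combined = consensus([syntactic_labels, semantic_labels])
# combined == [0, 0, 1, 2, 3, 3]
```

With only two base partitions and the default threshold, documents are grouped only when both views agree, which yields a finer partition than either view alone; lowering the threshold makes the consensus more permissive.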



Paper Citation


in Harvard Style

Amato F., Gargiulo F., Mazzeo A., Romano S. and Sansone C. (2013). Combining Syntactic and Semantic Vector Space Models in the Health Domain by using a Clustering Ensemble. In Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2013) ISBN 978-989-8565-37-2, pages 382-385. DOI: 10.5220/0004250403820385


in Bibtex Style

@conference{healthinf13,
author={Flora Amato and Francesco Gargiulo and Antonino Mazzeo and Sara Romano and Carlo Sansone},
title={Combining Syntactic and Semantic Vector Space Models in the Health Domain by using a Clustering Ensemble},
booktitle={Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2013)},
year={2013},
pages={382-385},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004250403820385},
isbn={978-989-8565-37-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2013)
TI - Combining Syntactic and Semantic Vector Space Models in the Health Domain by using a Clustering Ensemble
SN - 978-989-8565-37-2
AU - Amato F.
AU - Gargiulo F.
AU - Mazzeo A.
AU - Romano S.
AU - Sansone C.
PY - 2013
SP - 382
EP - 385
DO - 10.5220/0004250403820385