A Method of Topic Detection for Great Volume of Data

Flora Amato, Francesco Gargiulo, Antonino Mazzeo, Carlo Sansone

2014

Abstract

Topic extraction has become increasingly important because of its effectiveness in many tasks, including information filtering, information retrieval, and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore the adoption of a feature reduction methodology to highlight the most significant topics within a document corpus. We use an approach based on a clustering algorithm (X-means) applied to the tf-idf matrix computed from the corpus, which describes the frequency of the terms, represented by the columns, that occur in each document, represented by a row. To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering approach, and we perform a supervised feature selection on each of them, taking the top-ranked features as the topic descriptors. We show the results obtained on two different corpora, both in Italian: the first consists of documents of the University of Naples Federico II, and the second is a collection of medical records.
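The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions: a plain k-means with a fixed number of clusters stands in for X-means (which would estimate that number itself), the tf-idf rows are L2-normalized, and the supervised feature selection is replaced by a simple score, the difference between a term's mean weight inside a cluster and outside it. The function names (`tfidf_matrix`, `kmeans`, `topic_descriptors`) and the four-document English toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows): one L2-normalized tf-idf row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [(counts[t] / len(toks)) * math.log(len(docs) / df[t]) for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard all-zero rows
        rows.append([v / norm for v in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Plain k-means with fixed k (assumed stand-in for X-means)."""
    # Deterministic farthest-point initialization, starting from the first row.
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def topic_descriptors(vocab, rows, labels, top_n=2):
    """One binary problem per cluster: rank terms for 'cluster vs. rest'."""
    topics = {}
    for c in sorted(set(labels)):
        pos = [r for r, lab in zip(rows, labels) if lab == c]
        neg = [r for r, lab in zip(rows, labels) if lab != c]
        scores = []
        for j, term in enumerate(vocab):
            pos_mean = sum(r[j] for r in pos) / len(pos)
            neg_mean = sum(r[j] for r in neg) / len(neg) if neg else 0.0
            # Assumed score: how much more typical the term is inside the cluster.
            scores.append((pos_mean - neg_mean, term))
        scores.sort(reverse=True)
        topics[c] = [term for _, term in scores[:top_n]]
    return topics


# Hypothetical toy corpus: two university-like and two medical-like documents.
docs = [
    "exam course student exam",
    "student course lecture",
    "patient therapy diagnosis patient",
    "diagnosis patient hospital",
]
vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, k=2)
topics = topic_descriptors(vocab, rows, labels)
for cluster, terms in topics.items():
    print(cluster, terms)
```

On a real corpus, the scoring step would presumably be replaced by an established supervised criterion (for example information gain or chi-square), and the clustering step by an X-means implementation.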



Paper Citation


in Harvard Style

Amato F., Gargiulo F., Mazzeo A. and Sansone C. (2014). A Method of Topic Detection for Great Volume of Data. In Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: KomIS, (DATA 2014) ISBN 978-989-758-035-2, pages 434-439. DOI: 10.5220/0005145504340439


in Bibtex Style

@conference{komis14,
author={Flora Amato and Francesco Gargiulo and Antonino Mazzeo and Carlo Sansone},
title={A Method of Topic Detection for Great Volume of Data},
booktitle={Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: KomIS, (DATA 2014)},
year={2014},
pages={434-439},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005145504340439},
isbn={978-989-758-035-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: KomIS, (DATA 2014)
TI - A Method of Topic Detection for Great Volume of Data
SN - 978-989-758-035-2
AU - Amato F.
AU - Gargiulo F.
AU - Mazzeo A.
AU - Sansone C.
PY - 2014
SP - 434
EP - 439
DO - 10.5220/0005145504340439