A Tensor-based Clustering Approach for Multiple Document Classifications

Salvatore Romeo, Andrea Tagarelli, Francesco Gullo, Sergio Greco

Abstract

We propose a novel approach to document clustering for the setting in which multiple organizations (classifications) of the input documents are available. In addition to the textual content of the documents, our approach exploits frequent associations among documents that share groups across the existing classifications, in order to capture how documents tend to be grouped together independently of the individual views. A third-order tensor for the document collection is built over both the space of terms and the space of the discovered frequent document associations, and is then decomposed to obtain a single, encompassing clustering of the documents. Preliminary experiments on a document clustering benchmark show the potential of the approach to capture the multi-view structure of the existing organizations of a given document collection.
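The first two steps described in the abstract (mining frequent document associations, then building a third-order tensor over terms and associations) can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that an association is a pair of documents that share a group in at least `min_support` groups across the classifications, and that a tensor entry holds a term weight when the document belongs to the association and zero otherwise. The function names `frequent_doc_pairs` and `build_tensor`, the toy data, and the weighting scheme are all hypothetical, and the decomposition step is omitted.

```python
from collections import Counter
from itertools import combinations


def frequent_doc_pairs(classifications, min_support):
    """Return document pairs that share a group in at least `min_support`
    groups, counted over all classifications (illustrative simplification
    of frequent document associations)."""
    counts = Counter()
    for groups in classifications:
        for group in groups:
            for pair in combinations(sorted(group), 2):
                counts[pair] += 1
    return sorted(p for p, c in counts.items() if c >= min_support)


def build_tensor(term_weights, terms, associations):
    """Return X[d][t][a]: the weight of term t in document d if d belongs
    to association a, else 0.0 (an assumed weighting, not the paper's)."""
    return {
        d: [
            [weights.get(t, 0.0) if d in a else 0.0 for a in associations]
            for t in terms
        ]
        for d, weights in sorted(term_weights.items())
    }


# Toy input: two existing classifications of four documents.
classifications = [
    [{"d1", "d2"}, {"d3", "d4"}],   # first view
    [{"d1", "d2", "d3"}, {"d4"}],   # second view
]
term_weights = {
    "d1": {"tensor": 2.0, "cluster": 1.0},
    "d2": {"tensor": 1.0},
    "d3": {"cluster": 3.0},
    "d4": {"text": 1.0},
}
terms = ["cluster", "tensor", "text"]

associations = frequent_doc_pairs(classifications, min_support=2)
tensor = build_tensor(term_weights, terms, associations)
```

In this toy example only the pair (d1, d2) is grouped together in both views, so it is the single association that spans the third mode of the tensor. A full pipeline would then factorize the documents × terms × associations tensor and cluster the documents on the resulting document factors.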



Paper Citation


in Harvard Style

Romeo S., Tagarelli A., Gullo F. and Greco S. (2013). A Tensor-based Clustering Approach for Multiple Document Classifications. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8565-41-9, pages 200-205. DOI: 10.5220/0004269102000205


in Bibtex Style

@conference{icpram13,
author={Salvatore Romeo and Andrea Tagarelli and Francesco Gullo and Sergio Greco},
title={A Tensor-based Clustering Approach for Multiple Document Classifications},
booktitle={Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2013},
pages={200-205},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004269102000205},
isbn={978-989-8565-41-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A Tensor-based Clustering Approach for Multiple Document Classifications
SN - 978-989-8565-41-9
AU - Romeo S.
AU - Tagarelli A.
AU - Gullo F.
AU - Greco S.
PY - 2013
SP - 200
EP - 205
DO - 10.5220/0004269102000205