CLUSTERING DOCUMENTS WITH LARGE OVERLAP OF TERMS INTO DIFFERENT CLUSTERS BASED ON SIMILARITY ROUGH SET MODEL

Nguyen Chi Thanh; Koichi Yamada; Muneyuki Unehara

doi:10.5220/0003068803960399

2010

Abstract

The similarity rough set model for document clustering (SRSM) uses a generalized rough set model based on a similarity relation and term co-occurrence to group the documents of a collection into clusters. The model extends the tolerance rough set model (TRSM) (Ho and Funakoshi, 1997). SRSM methods have been evaluated, and the results showed that they perform better than TRSM. However, in document collections where different document classes share many of the same words, the benefit of SRSM is rather small. In this paper we propose a method to improve the performance of SRSM on such document collections.
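
The abstract does not give formulas, so the following is only a minimal sketch of the TRSM baseline as it is commonly formulated: two terms are tolerant of each other when they co-occur in at least a threshold number of documents, and a document's upper approximation adds every term whose tolerance class overlaps the document. The function names, the threshold parameter `theta`, and the toy corpus are assumptions made for illustration. SRSM replaces the symmetric tolerance relation with a similarity relation, whose definition is not stated here and is not reproduced in this sketch.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to itself plus the terms it co-occurs with in >= theta documents."""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered pair once per document, stored in both directions.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    classes = {t: {t} for t in vocab}  # every term is tolerant of itself
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
    return classes


def upper_approximation(doc, classes):
    """Return all terms whose tolerance class shares at least one term with the document."""
    doc_terms = set(doc)
    return {t for t, cls in classes.items() if cls & doc_terms}


# Hypothetical toy corpus: each document is a list of terms.
docs = [
    ["rough", "set", "cluster"],
    ["rough", "set", "model"],
    ["cluster", "document"],
    ["document", "term"],
]
classes = tolerance_classes(docs, theta=2)
enriched = upper_approximation(["rough", "cluster"], classes)
```

In this toy corpus only "rough" and "set" co-occur in two documents, so a document containing "rough" gains "set" in its upper approximation. When many terms are shared across classes, such enrichment pulls in terms from several classes at once, which is the situation the abstract describes as limiting the benefit of the model.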

References

Dhillon, I. S. and Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42 (1-2), pp. 143-175.
Ho, T. B. and Funakoshi, K. (1997). Information retrieval using rough sets. Journal of Japanese Society for Artificial Intelligence, 13 (3), pp. 424-433.
Ho, T. B. and Nguyen, N. B. (2002). Nonhierarchical document clustering based on a tolerance rough set model. International Journal of Intelligent Systems, 17 (2), pp. 199-212.
Li, Y., Chung, S. M. and Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64 (1), pp. 381-404.
Luce, R. D. (1956). Semiorders and a Theory of Utility Discrimination. Econometrica, 24 (2), pp. 178-191.
Mahdavi, M. and Abolhassani, H. (2008). Harmony K-means algorithm for document clustering. Data Mining and Knowledge Discovery, pp. 1-22.
Meng, X.-J., Chen, Q.-C. and Wang, X.-L. (2009). A tolerance rough set based semantic clustering method for web search results. Information Technology Journal, 8 (4), pp. 453-464.
Nguyen, C. T., Yamada, K. and Unehara, M. (2010). A similarity rough set model for document representation and document clustering. IEICE Transactions on Information and Systems (submitted).
Pawlak, Z. (1982). Rough sets. Int. J. of Information and Computer Sciences, 11 (5), pp. 341-356.
Salton, G. and McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill Book Company.
Stefanowski, J. and Tsoukias, A. (2001). Incomplete Information Tables and Rough Classification. Computational Intelligence, 17 (3), pp. 545-566.
Steinbach, M., Karypis, G. and Kumar, V. (2000). A comparison of document clustering techniques. Proceedings of the KDD Workshop on Text Mining.
Strehl, A., Ghosh, J. and Mooney, R. (2000). Impact of similarity measures on web-page clustering. Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000), Austin, TX, pp. 58-64.
Yao, Y. Y., Wong, S. K. M. and Lin, T. Y. (1997). A review of rough set models. Rough Sets and Data Mining: Analysis for Imprecise Data, pp. 47-73.
Zhao, Y. and Karypis, G. (2005). Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10 (2), pp. 141-168.

Paper Citation

in Harvard Style

Chi Thanh N., Yamada K. and Unehara M. (2010). CLUSTERING DOCUMENTS WITH LARGE OVERLAP OF TERMS INTO DIFFERENT CLUSTERS BASED ON SIMILARITY ROUGH SET MODEL. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 396-399. DOI: 10.5220/0003068803960399

in Bibtex Style

@conference{kdir10,
author={Nguyen Chi Thanh and Koichi Yamada and Muneyuki Unehara},
title={CLUSTERING DOCUMENTS WITH LARGE OVERLAP OF TERMS INTO DIFFERENT CLUSTERS BASED ON SIMILARITY ROUGH SET MODEL},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={396-399},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003068803960399},
isbn={978-989-8425-28-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - CLUSTERING DOCUMENTS WITH LARGE OVERLAP OF TERMS INTO DIFFERENT CLUSTERS BASED ON SIMILARITY ROUGH SET MODEL
SN - 978-989-8425-28-7
AU - Chi Thanh N.
AU - Yamada K.
AU - Unehara M.
PY - 2010
SP - 396
EP - 399
DO - 10.5220/0003068803960399