Next, we run the experiment with the two-pass
approach for four clusters, in which step (2) removes
all words that appear in all four clusters simultaneously.
Table 4 shows the result of this experiment. The result is
better than the results in Table 2 and Table 3 but still
worse than the result in Table 1.
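The word-removal step described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: the function names `overlapped_words` and `remove_words`, the toy documents, and the first-pass cluster labels are all hypothetical, and documents are assumed to be plain lists of words.

```python
from collections import defaultdict

def overlapped_words(docs, labels, min_clusters):
    """Return the words that occur in at least `min_clusters` distinct clusters."""
    clusters_of = defaultdict(set)
    for words, label in zip(docs, labels):
        for w in set(words):
            clusters_of[w].add(label)
    return {w for w, cs in clusters_of.items() if len(cs) >= min_clusters}

def remove_words(docs, to_remove):
    """Drop every word in `to_remove` from each document."""
    return [[w for w in words if w not in to_remove] for words in docs]

# Toy data (hypothetical): four documents, one per first-pass cluster.
docs = [
    ["rough", "set", "model"],
    ["cluster", "model", "term"],
    ["term", "model", "weight"],
    ["model", "query", "index"],
]
labels = [0, 1, 2, 3]  # assumed output of the first clustering pass

# Words that appear in all four clusters are removed before the second pass.
common = overlapped_words(docs, labels, min_clusters=4)
filtered = remove_words(docs, common)
```

With `min_clusters=2` the same helper would give the more aggressive two-cluster variant, which removes more words from each document vector.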
The proposed method does not perform as well as the original
SRSM method. With the two-pass approach for two
clusters, many words appear in two clusters,
so removing them leaves many document vectors
with only a few words, and the result becomes worse. When
only words appearing in three or four clusters are
removed, the results are better but still worse
than those of the original approach. A possible reason is that the
words removed are not the “right” words, that is,
the words we intend to remove. We intend to remove the
words that overlap among the “actual classes” of the documents,
but here we removed the words that overlap among the
clusters produced by the algorithm. These two sets of words may not
be the same, because the clustering results do not exactly match
the “actual classes” of the documents.
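The mismatch discussed above can be illustrated with a small sketch. The documents, the class labels, and the imperfect cluster assignment below are invented for illustration only, and `words_in_all_groups` is a hypothetical helper; the point is simply that the overlap computed from imperfect clusters can include a word that actually discriminates between the true classes.

```python
def words_in_all_groups(docs, labels):
    """Return the words that occur in every group defined by `labels`."""
    groups = {}
    for words, label in zip(docs, labels):
        groups.setdefault(label, set()).update(words)
    return set.intersection(*groups.values())

# Toy data (hypothetical): two true classes, two documents each.
docs = [
    ["rough", "set", "term"],
    ["rough", "fuzzy", "term"],
    ["graph", "node", "term"],
    ["graph", "edge", "rough"],
]
true_classes = [0, 0, 1, 1]
predicted = [0, 1, 1, 0]  # an assumed, imperfect first-pass clustering

by_class = words_in_all_groups(docs, true_classes)   # words we intend to remove
by_cluster = words_in_all_groups(docs, predicted)    # words actually removed
wrongly_removed = by_cluster - by_class
```

In this toy case "graph" occurs only in the second true class, yet it is removed because the imperfect clustering spreads it over both clusters.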
4 CONCLUSIONS
A Similarity Rough Set Model (SRSM) for document
clustering has been introduced and used to
represent documents in the clustering process.
Experimental results (Nguyen et al., 2010) show that
the clustering quality with SRSM is better
than that with TRSM. However, it has also been
shown that the advantage of our approach is
small when the terms used in different classes
overlap.
To improve the performance of SRSM clustering
when the terms used in different classes overlap,
a two-pass approach has been proposed. However, it
does not work as well as expected. The reason is that the
proposed method does not remove the words that overlap
among the “actual classes” of documents; it only
removes words that overlap among the clusters generated
by the clustering method. In future work, we will
continue to improve the algorithm
by trying different approaches to removing the
overlapped words. This is a challenging problem, and
a solution could be applied to real applications where
documents have a large overlap of terms among
classes, such as research papers in closely related fields, or
research papers and instructional papers in the same
field.
REFERENCES
Dhillon, I. S. and Modha, D. S. (2001). Concept
decompositions for large sparse text data using
clustering. Machine Learning, 42 (1-2), pp. 143-175.
Ho, T. B. and Funakoshi, K. (1997). Information retrieval
using rough sets. Journal of Japanese Society for
Artificial Intelligence, 13 (3), pp. 424-433.
Ho, T. B. and Nguyen, N. B. (2002). Nonhierarchical
document clustering based on a tolerance rough set
model. International Journal of Intelligent Systems, 17
(2), pp. 199-212.
Li, Y., Chung, S. M. and Holt, J. D. (2008). Text
document clustering based on frequent word meaning
sequences. Data and Knowledge Engineering, 64 (1),
pp. 381-404.
Luce, R. D. (1956). Semiorders and a theory of utility
discrimination. Econometrica, 24 (2), pp.
178-191.
Mahdavi, M. and Abolhassani, H. (2008). Harmony
K-means algorithm for document clustering. Data
Mining and Knowledge Discovery, pp. 1-22.
Meng, X.-J., Chen, Q.-C. and Wang, X.-L. (2009). A
tolerance rough set based semantic clustering method
for web search results. Information Technology
Journal, 8 (4), pp. 453-464.
Nguyen, C. T., Yamada, K. and Unehara, M. (2010). A
similarity rough set model for document
representation and document clustering. IEICE
Transactions on Information and Systems (submitted).
Pawlak, Z. (1982). Rough sets. Int. J. of Information and
Computer Sciences, 11 (5), pp. 341-356.
Salton, G. and McGill, M. J. (1983). Introduction to
modern information retrieval. McGraw-Hill Book
Company.
Stefanowski, J. and Tsoukias, A. (2001). Incomplete
information tables and rough classification.
Computational Intelligence, 17 (3), pp. 545-566.
Steinbach, M., Karypis, G. and Kumar, V. (2000). A
comparison of document clustering techniques.
Proceedings of the KDD Workshop on Text Mining.
Strehl, A., Ghosh, J. and Mooney, R. (2000). Impact of
similarity measures on web-page clustering.
Proceedings of the 17th National Conference on
Artificial Intelligence: Workshop of Artificial
Intelligence for Web Search (AAAI 2000), Austin, TX,
pp. 58-64.
Yao, Y. Y., Wong, S. K. M. and Lin, T. Y. (1997). A
review of rough set models. Rough Sets and Data
Mining: Analysis for Imprecise Data, pp. 47-73.
Zhao, Y. and Karypis, G. (2005). Hierarchical clustering
algorithms for document datasets. Data Mining and
Knowledge Discovery, 10 (2), pp. 141-168.
CLUSTERING DOCUMENTS WITH LARGE OVERLAP OF TERMS INTO DIFFERENT CLUSTERS BASED ON
SIMILARITY ROUGH SET MODEL