(Tan et al., 2006; Kaufman and Rousseeuw, 1999)
like K-means and Bi-Secting K-means, and (2)
GAAC (Tan et al., 2006). Because K-means and
Bi-Secting K-means are randomly initialized and may
produce different clustering results on each run, we
repeat the experiment five times on every dataset
and report the average result.
Evaluation Metrics: Cluster quality is evaluated
using two standard metrics: purity (Zhao and
Karypis, 2001) and F-measure (Steinbach et al.,
2000). The standard deviation of the F-measure and
purity scores is also calculated, in order to measure
the variation in results.
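As an illustration of these metrics, the following is a minimal sketch that assumes the usual definitions: purity as the fraction of documents belonging to the majority class of their cluster, and F-measure as the class-size-weighted best F-score of each class over all clusters. The function names (`purity`, `f_measure`, `summarize`) and the toy labels are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which variant is used.

```python
from collections import Counter
from statistics import mean, stdev


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for cls, clu in zip(labels_true, labels_pred):
        per_cluster.setdefault(clu, Counter())[cls] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels_true)


def f_measure(labels_true, labels_pred):
    """Class-size-weighted best F-score of each class over all clusters."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue  # no shared documents, so F is zero for this pair
            precision = n_ij / n_clu
            recall = n_ij / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


def summarize(scores):
    """Mean and sample standard deviation over repeated runs (assumed variant)."""
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Hypothetical toy labels: two classes, two clusters.
    true = ["a", "a", "a", "b", "b", "b"]
    pred = [0, 0, 1, 1, 1, 1]
    print(purity(true, pred), f_measure(true, pred))
    # Hypothetical scores from five randomly initialized runs.
    print(summarize([0.5, 0.6, 0.7, 0.8, 0.9]))
```

Under this reading, `summarize` would be applied to the five per-run scores of K-means and Bi-Secting K-means on each dataset.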
Results: In Table 1, we present (1) the cluster
purity score and (2) the F-measure score with its
standard deviation, for all four systems. Bold
values indicate the higher score. Since K-means has
a higher accuracy than Bi-Secting K-means and GAAC
when using a bag-of-words based approach, we
compare our results with K-means. We also observed
that GAAC (with a bag-of-words based approach)
achieves a better cluster purity score than the two
partitioning algorithms, Bi-Secting K-means and
K-means.
From the results obtained on the 20-Newsgroup
dataset (see Table 1), it is clear that:
• Our devised scheme performs better than the
bag-of-words (BOW) based approach, showing an
average improvement of 9% in F-measure score
over BOW-based K-means.
• It also shows improvements of more than 10%
in F-measure score over K-means, based on the
BOW approach, for corpus_set_ID: C2, C7, C13,
C14 and C15.
• Our approach shows an average improvement
of 16% in purity over the K-means based
approach.
5 CONCLUSIONS
In this paper, we introduced a new similarity measure
based on the weighted importance of N-grams in
documents, used in addition to other similarity
measures that are based on common N-grams in
documents. This approach improves the topical
similarity within each cluster, which in turn improves
the purity and accuracy of the clusters. We reduce the
information gap between the identified N-grams and
remove noisy N-grams by exploiting Wikipedia
anchor texts and their well-organized link structures,
before applying a GAAC based clustering scheme
with our similarity measure to cluster the documents.
Our experimental results on the publicly available
20-Newsgroup dataset show that our devised system
performs significantly better than bag-of-words
based state-of-the-art systems in this area.
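The clustering step summarized above can be sketched as follows. This is a minimal illustration of group-average agglomerative clustering, not the system evaluated in this paper: plain cosine similarity over hypothetical weighted N-gram vectors stands in for the proposed similarity measure, which is not restated in this section, and the names `cosine`, `group_average` and `gaac` are assumptions. Here the group average is taken over cross-cluster document pairs only; other formulations also include within-cluster pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_average(a, b, sim):
    """Average pairwise similarity between members of clusters a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def gaac(vectors, k, similarity=cosine):
    """Merge the most similar pair of clusters until only k clusters remain."""
    n = len(vectors)
    # Precompute the document-to-document similarity matrix once.
    sim = [[similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one document per cluster
    while len(clusters) > k:
        best_pair, best_score = None, -math.inf
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = group_average(clusters[x], clusters[y], sim)
                if score > best_score:
                    best_pair, best_score = (x, y), score
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


if __name__ == "__main__":
    # Hypothetical weighted N-gram vectors for four short documents.
    docs = [
        {"document clustering": 1.0, "k-means": 1.0},
        {"document clustering": 1.0, "k-means": 0.5},
        {"wikipedia": 1.0, "anchor text": 1.0},
        {"wikipedia": 0.5, "anchor text": 1.0},
    ]
    print(gaac(docs, k=2))
```

The `similarity` parameter is left pluggable so that a different measure, such as one weighting N-grams by importance, could be substituted for the cosine stand-in.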
REFERENCES
Banerjee, S., Ramanathan, K., Gupta, A., 2007. Clustering
Short Texts using Wikipedia. In SIGIR’07, July 23–27,
Amsterdam, The Netherlands.
Clauset, A., Newman, M., Moore, C., 2004. Finding
community structure in very large networks. Physical
Review E, 70:066111.
Hammouda, K., Matute, D., Kamel, M., 2005.
CorePhrase: Keyphrase Extraction for Document
Clustering. In IAPR: 4th International Conference on
Machine Learning and Data Mining.
Han, J., Kim, T., Choi, J., 2007. Web Document
Clustering by Using Automatic Keyphrase Extraction.
In Proceedings of the 2007 IEEE/WIC/ACM
International Conferences on Web Intelligence and
Intelligent Agent Technology - Workshops.
Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X., 2009.
Exploiting Wikipedia as External Knowledge for
Document Clustering. In KDD’09.
Huang, A., Milne, D., Frank, E., Witten, I., 2008.
Clustering Documents with Active Learning Using
Wikipedia. In ICDM 2008.
Huang, A., Milne, D., Frank, E., Witten, I., 2009.
Clustering documents using a Wikipedia-based concept
representation. In Proc. 13th Pacific-Asia Conference
on Knowledge Discovery and Data Mining.
Kaufman, L., Rousseeuw, P., 1999. Finding Groups in
Data: An Introduction to Cluster Analysis. John
Wiley & Sons.
Kumar, N., Srinathan, K., 2008. Automatic Keyphrase
Extraction from Scientific Documents Using N-gram
Filtration Technique. In Proceedings of ACM
DocEng.
Newman, M., Girvan, M., 2004. Finding and evaluating
community structure in networks. Physical Review E,
69:026113.
Steinbach, M., Karypis, G., Kumar, V., 2000. A
comparison of document clustering techniques.
Technical Report. Department of Computer Science
and Engineering, University of Minnesota.
Tan, P., Steinbach, M., Kumar, V., 2006. Introduction to
Data Mining. Addison-Wesley. ISBN-10:
0321321367.
Zhao, Y., Karypis, G., 2001. Criterion functions for
document clustering: experiments and analysis.
Technical Report. Department of Computer Science,
University of Minnesota.
EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR
IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING