Document Clustering based on Genetic Algorithm
using D-Individual
Lim Cheon Choi and Soon Cheol Park
Division of Electronics and Information Engineering
Chonbuk National University, Jeonju, South Korea
Abstract. Document clustering using a genetic algorithm shows good performance.
However, the genetic algorithm suffers from performance degradation caused by
premature convergence. In this paper, we propose document clustering based on
a Genetic Algorithm using D-Individual (DIGA) to solve this problem. A genetic
algorithm relies on the diversity of its population and on its capability to
converge, and its success depends on these two factors. If these factors are
used efficiently, a better solution can be obtained in reduced execution time.
We apply DIGA to the Reuters-21578 text collection and demonstrate the effect
of our clustering algorithm. The results show that DIGA performs better than
traditional clustering algorithms (K-means, Group Average and genetic
algorithm) in various experiments.
1 Introduction
Research on document clustering is actively progressing for the analysis and
application of large volumes of information. Document clustering groups similar
documents within a set of documents without prior information.
Clustering algorithms are classified into hierarchical and non-hierarchical
algorithms according to the grouping method [1],[2]. K-means is a
non-hierarchical clustering algorithm [3], and Group Average is a hierarchical
clustering algorithm [2]. These two algorithms are fast in execution and easy to
understand but do not achieve the best results [1],[2]. To address this, the
genetic algorithm, an artificial intelligence algorithm, has been applied to
document clustering and shows good results [4],[5].
The genetic algorithm is an optimization algorithm that uses natural selection
and evolutionary operations. It depends on the diversity of individuals and the
capability to converge [4],[5],[6]. The capability to converge has a weakness,
namely premature convergence, which degrades the performance of the genetic
algorithm [7].
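To make the notion of population diversity concrete, the following minimal sketch measures it as the normalised mean pairwise Hamming distance between individuals. The encoding (one cluster label per document) and the diversity measure itself are assumptions chosen for illustration; they are not taken from this paper, and the measure ignores the fact that two individuals may describe the same clustering under permuted labels.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length individuals differ."""
    return sum(x != y for x, y in zip(a, b))


def population_diversity(population):
    """Mean pairwise Hamming distance, normalised to the range [0, 1].

    Each individual is assumed (for illustration only) to be a list that
    assigns a cluster label to every document.
    """
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    length = len(population[0])
    return sum(hamming(a, b) for a, b in pairs) / (len(pairs) * length)


# Hypothetical populations: 3 individuals, 4 documents, 3 clusters.
diverse = [[0, 1, 2, 0], [1, 0, 0, 2], [2, 2, 1, 1]]
converged = [[0, 1, 2, 0], [0, 1, 2, 0], [0, 1, 2, 1]]

print(population_diversity(diverse))    # every pair differs everywhere
print(population_diversity(converged))  # nearly identical individuals
```

A value close to zero in early generations would indicate that the population has collapsed onto a few similar individuals, which is the situation described above as premature convergence.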
In this paper we propose document clustering based on a genetic algorithm using
individuals that carry the feature information of document clusters. The
algorithm achieves better performance and shorter execution time than the
general genetic algorithm.
Choi L. and Park S.
Document Clustering based on Genetic Algorithm using D-Individual.
DOI: 10.5220/0003351601110118
In Proceedings of the International Workshop on Semantic Interoperability (IWSI-2011), pages 111-118
ISBN: 978-989-8425-43-0
Copyright © 2011 SCITEPRESS (Science and Technology Publications, Lda.)