problem. Each particle is evaluated using the
following equation:

J_e = \frac{1}{N_c} \sum_{j=1}^{N_c} \left[ \frac{\sum_{\forall Z_p \in C_{ij}} d(Z_p, m_{ij})}{|C_{ij}|} \right]    (F1)
where Z_p denotes the p-th data vector, |C_{ij}| is the
number of data vectors belonging to the cluster C_{ij},
and d is the Euclidean distance between Z_p and m_{ij}.
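A minimal sketch of equation F1 in Python may make the computation concrete. The function and variable names are our own, and skipping empty clusters is an assumption not stated in the text:

```python
import numpy as np

def f1_fitness(data, centroids, labels):
    """Equation F1: the average, over all clusters, of the mean Euclidean
    distance between each cluster's data vectors and its centroid.
    Lower values indicate a better particle."""
    n_c = len(centroids)                     # N_c, number of clusters
    total = 0.0
    for j in range(n_c):
        members = data[labels == j]          # the Z_p belonging to C_ij
        if len(members) == 0:                # assumption: empty clusters add 0
            continue
        dists = np.linalg.norm(members - centroids[j], axis=1)
        total += dists.sum() / len(members)  # average distance to m_ij
    return total / n_c

# Toy example: two tight, well-separated clusters
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels = np.array([0, 0, 1, 1])
print(f1_fitness(data, centroids, labels))  # 0.5
```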
3.1 The Evaluation Function
The Evaluation function plays a fundamental role in
any evolutionary algorithm; it tells how good a
solution is.
Analyzing equation F1, we can see that it first
takes each cluster C_{ij} and calculates the average
distance from the data vectors belonging to the cluster
to its centroid m_{ij}. It then takes the average
distances of all clusters C_{ij} and calculates another
average, which is the result of the equation.
It can be seen that a cluster C_{ij} with just one data
vector influences the final result (the quality) as
much as a cluster C_{ik} with many data vectors.
As a consequence, a particle that does not represent a
good solution can be evaluated as if it did.
For instance, suppose that one of the particle's clusters
has a single data vector that is very close to its centroid,
while another cluster has many data vectors that are
not so close to their centroid. This is not a very good
solution, but giving the same weight to the cluster
with one data vector as to the cluster with many data
vectors can make it seem to be one. Furthermore, this
equation does not reward homogeneous
solutions, that is, solutions where the data vectors
are well distributed among the clusters.
To solve this problem we propose the following
new equations, where the number of data vectors
belonging to each cluster is taken into account:

F = \sum_{j=1}^{N_c} \left[ \left( \frac{\sum_{\forall Z_p \in C_{ij}} d(Z_p, m_{ij})}{|C_{ij}|} \right) \times \frac{|C_{ij}|}{N_o} \right]    (F2)

where N_o is the number of data vectors to be
clustered.
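A sketch of F2 under the same assumptions as before (our own names; empty clusters contribute zero). The example shows the behavior discussed above: a singleton cluster no longer hides the spread of a larger, looser cluster, because each cluster's average distance is weighted by the fraction of the data it holds:

```python
import numpy as np

def f2_fitness(data, centroids, labels):
    """Equation F2: each cluster's average distance to its centroid,
    weighted by the fraction |C_ij| / N_o of the data it contains."""
    n_o = len(data)                          # N_o, total number of vectors
    total = 0.0
    for j in range(len(centroids)):
        members = data[labels == j]
        if len(members) == 0:                # assumption: empty clusters add 0
            continue
        avg = np.linalg.norm(members - centroids[j], axis=1).sum() / len(members)
        total += avg * (len(members) / n_o)  # weight by cluster size
    return total

# A tight singleton cluster plus a loose 3-vector cluster: F2 weights the
# loose cluster by 3/4, so its spread dominates the score.
data = np.array([[0.0, 0.0], [10.0, 10.0], [10.0, 12.0], [10.0, 8.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 1, 1, 1])
print(f2_fitness(data, centroids, labels))  # 1.0
```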
To take into account the distribution of the data
among the clusters, the equation can be changed to:

F' = F \times (|C_{ik}| - |C_{il}| + 1)    (F3)

such that

|C_{ik}| = \max_{\forall j = 1, \ldots, N_c} \{ |C_{ij}| \}

and

|C_{il}| = \min_{\forall j = 1, \ldots, N_c} \{ |C_{ij}| \}
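A self-contained sketch of F3, taking F to be the F2 value (which the text suggests, since F3 is presented as a change to F2); names and the empty-cluster handling are our own:

```python
import numpy as np

def f3_fitness(data, centroids, labels):
    """Equation F3: F' = F * (|C_ik| - |C_il| + 1), where |C_ik| and |C_il|
    are the largest and smallest cluster sizes. A perfectly balanced
    partition keeps the multiplier at 1, so F3 favors uniform clusters."""
    n_o = len(data)
    f, sizes = 0.0, []
    for j in range(len(centroids)):
        members = data[labels == j]
        sizes.append(len(members))
        if len(members) == 0:                # assumption: empty clusters add 0
            continue
        avg = np.linalg.norm(members - centroids[j], axis=1).sum() / len(members)
        f += avg * (len(members) / n_o)      # F2 term
    return f * (max(sizes) - min(sizes) + 1)

# Balanced 2+2 partition: the multiplier is (2 - 2 + 1) = 1, so F3 = F2.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels = np.array([0, 0, 1, 1])
print(f3_fitness(data, centroids, labels))  # 0.5
```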
The next section shows the test results with these
different equations.
4 RESULTS
Table 1 shows the three benchmarks that were used: Iris,
Wine and Glass, taken from the UCI Repository of
Machine Learning Databases (Assuncion, 2007).
Table 1: Benchmark features.

Benchmark   Number of Objects   Number of Attributes   Number of Classes
Iris        150                 4                      3
Wine        178                 13                     3
Glass       214                 9                      7
For each data set, three implementations, using
equations F1, F2 and F3, were run 30 times, with
200 function evaluations and 10 particles, w = 0.72,
c1 = 1.49 and c2 = 1.49 (Merwe, 2003).
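For reference, one step of the standard PSO update with the parameters quoted above can be sketched as follows (the array shapes and random-number handling are our own assumptions, not taken from the paper):

```python
import numpy as np

def pso_step(x, v, pbest, gbest, rng, w=0.72, c1=1.49, c2=1.49):
    """One standard PSO velocity/position update.
    x, v, pbest: (n_particles, dims) arrays; gbest: (dims,) array."""
    r1 = rng.random(x.shape)                 # fresh random factors per particle
    r2 = rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# Tiny usage example: 10 particles in 3 dimensions
rng = np.random.default_rng(42)
x = rng.random((10, 3))
v = np.zeros_like(x)
x, v = pso_step(x, v, x.copy(), x[0], rng)
```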
Each benchmark class is represented by the
particle-created cluster containing the largest number
of data vectors of that class; data vectors of other
classes within this cluster are considered misclassified.
The hit rate of the algorithm can thus be easily calculated.
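The hit-rate rule just described can be sketched as follows (names are our own; note that under this simple rule two classes may be represented by the same cluster, which the text does not address):

```python
import numpy as np

def hit_rate(true_labels, cluster_labels, n_clusters):
    """For each true class, find the cluster holding the most of its
    vectors; those vectors are hits, everything else in that cluster is
    considered misclassified."""
    correct = 0
    for c in np.unique(true_labels):
        # vectors of class c in each cluster
        counts = [np.sum((cluster_labels == j) & (true_labels == c))
                  for j in range(n_clusters)]
        correct += max(counts)               # the representative cluster's share
    return correct / len(true_labels)

true_labels = np.array([0, 0, 0, 1, 1, 1])
cluster_labels = np.array([0, 0, 1, 1, 1, 1])  # one class-0 vector strayed
print(hit_rate(true_labels, cluster_labels, 2))  # 5/6
```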
The average hit rate t over the 30 simulations ±
the standard deviation σ of each implementation is
presented in Table 2.
As can be seen in Table 2, the changes to the
fitness function brought good improvements to the
results on the evaluated benchmarks. It is important
to notice that equation F3 pushes the particles
towards clusters with more uniformly distributed
data, so it should be used on problems in which the
clusters are known beforehand to have uniform
sizes; otherwise, equation F2 should be
used. On Iris, in which the clusters have uniform sizes,
equation F3 produced very good results, although
equation F2 produced good results too. The
improvements on the other benchmarks are also
satisfactory.
Figure 1 shows the convergence of the three
functions. As is characteristic of PSO, they all
converge quickly.
Figures 2, 3 and 4 show some examples of the
clusterings found. Figure 2 shows an example of
clustering for the Iris benchmark in which the
algorithm using function F1 found the correct
group for 71.9% of the data; in Figure 3, F2 found
the correct group for 88.6%; and in Figure 4, F3
found the correct group for 85.3%. It can be seen
that F2 and F3 completely separated the class setosa
(squares) from the other classes.
NEW APPROACHES TO CLUSTERING DATA - Using the Particle Swarm Optimization Algorithm