2 K-MEANS ALGORITHM
The k-means algorithm is a partitioning clustering algorithm. It is simple, widely used, and based on a squared-error criterion.
The k-means algorithm was introduced by MacQueen (MacQueen, 1967). Its aim is to divide the dataset into disjoint clusters by optimizing the objective function given below:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i) \qquad (1)$$
Here $m_i$ is the center of cluster $C_i$, while $d(x, m_i)$ is the Euclidean distance between a point $x$ and the cluster center $m_i$. In the k-means algorithm, the objective function $E$ attempts to minimize the distance of each point from the center of the cluster to which the point belongs.
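For concreteness, a minimal sketch of Eq. (1) in Python (assuming the data points and cluster centers are NumPy arrays and that $d$ is the Euclidean distance, as stated above; the function name and array layout are illustrative choices) is:

```python
import numpy as np

def kmeans_objective(points, centers, labels):
    """Objective E of Eq. (1): the Euclidean distance of each point to the
    center of the cluster it belongs to, summed over all points."""
    # points: (n, dim) array, centers: (k, dim) array,
    # labels[j]: index of the cluster that point j is assigned to.
    return sum(np.linalg.norm(x - centers[c]) for x, c in zip(points, labels))
```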
Consider a data set with $n$ objects, i.e., $S = \{x_i : 1 \le i \le n\}$.
1) Initialize $k$ partitions randomly or based on some prior knowledge, i.e., $\{C_1, C_2, C_3, \ldots, C_k\}$.
2) Calculate the cluster prototype matrix $M$ (the matrix of distances between the $k$ clusters and the data objects), $M = \{m_1, m_2, m_3, \ldots, m_k\}$, where $m_i$ is a $1 \times n$ column matrix.
3) Assign each object in the data set to the nearest cluster $C_m$, i.e., $x_j \in C_m$ if $d(x_j, C_m) \le d(x_j, C_i)$ for all $1 \le i \le k$, $i \ne m$, where $j = 1, 2, 3, \ldots, n$.
4) Calculate the average of the elements of each cluster and replace the $k$ cluster centers by these averages.
5) Recalculate the cluster prototype matrix $M$.
6) Repeat steps 3, 4, and 5 until there is no change in any cluster.
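A minimal sketch of steps 1-6 in Python follows (the function name, the use of NumPy, the random initialization from the data objects, and the convergence test on the labels are assumptions for illustration, not part of the original description):

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-6: random initialization,
    nearest-center assignment, and center updates until the assignment
    no longer changes. `points` is an (n, dim) NumPy array."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k centers with k randomly chosen data objects.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Steps 2-3: distances from every object to every center, then
        # assign each object to the nearest cluster.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # Step 6: stop when nothing changes.
            break
        labels = new_labels
        # Steps 4-5: replace each center by the average of its cluster members.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = points[labels == i].mean(axis=0)
    return centers, labels
```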
3 PAM ALGORITHM
The purpose of partitioning a data set into k separate clusters is to find groups whose members show a high degree of similarity among themselves but dissimilarity with the members of other groups. The objective of PAM (Partitioning Around Medoids) (Kaufman, 1990) is to determine a representative object (medoid) for each cluster, that is, to find the most centrally located objects within the clusters. Initially, a set of k items is taken to be the set of medoids. Then, at each step, all objects from the input dataset that are not currently medoids are examined one by one to determine whether they should be medoids. That is, the algorithm determines whether there is an object that should replace one of the existing medoids. Swapping a medoid with a non-selected object is based on the value of the total cost of impact $T_{ih}$. Since PAM represents each cluster by a medoid, it is also known as the k-medoids algorithm.
The PAM algorithm consists of two phases. The first, the build phase, proceeds as follows:
Phase-1:
Consider an object i as a candidate. Consider another object j that has not been selected as a prior candidate. Obtain its dissimilarity $d_j$ with the most similar previously selected candidate, and its dissimilarity $d(j, i)$ with the new candidate i. Take the difference of these two dissimilarities.
1) If the difference is positive, then object j contributes to the possible selection of i. Calculate $C_{ji} = \max(d_j - d(j, i),\ 0)$, where $d_j$ is the Euclidean distance between the j-th object and the most similar previously selected candidate, and $d(j, i)$ is the Euclidean distance between the j-th and i-th objects.
2) Sum $C_{ji}$ over all possible j.
3) Choose the object i that maximizes the sum of $C_{ji}$ over all possible j.
4) Repeat the process until k objects have been found.
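A sketch of this build phase in Python is given below. It assumes a precomputed pairwise dissimilarity matrix; the choice of the first medoid as the object with the smallest total dissimilarity is the usual PAM convention and is not spelled out in the text above:

```python
import numpy as np

def pam_build(dist, k):
    """PAM build phase: greedily select k medoids from a precomputed
    n x n dissimilarity matrix `dist`, using C_ji = max(d_j - d(j, i), 0)."""
    n = len(dist)
    # First medoid: the object with the smallest total dissimilarity to all
    # others (assumed starting choice, as noted in the lead-in).
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        # d_j: dissimilarity of each object j to its most similar selected medoid.
        d_j = dist[:, medoids].min(axis=1)
        best_i, best_total = None, -1.0
        for i in range(n):
            if i in medoids:
                continue
            # Contribution of every non-selected j to candidate i (steps 1-2).
            c_ji = np.maximum(d_j - dist[:, i], 0.0)
            c_ji[medoids + [i]] = 0.0
            total = c_ji.sum()
            # Step 3: keep the candidate that maximizes the summed contribution.
            if total > best_total:
                best_total, best_i = total, i
        medoids.append(best_i)  # Step 4: repeat until k medoids are chosen.
    return medoids
```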
Phase-2:
The second phase attempts to improve the set of representative objects. It does so by considering all pairs of objects (i, h) in which i has been chosen but h has not been chosen as a representative. Next, it is determined whether the clustering results improve if objects i and h are exchanged. To determine the effect of a possible swap between i and h, we use the following algorithm:
Consider an object j that has not been previously selected. We calculate its swap contribution $C_{jih}$:
1) If j is further from both i and h than from one of the other representatives, set $C_{jih}$ to zero.
2) If j is not further from i than from any other representative ($d(j, i) = d_j$), consider one of two situations:
a) j is closer to h than to the second closest representative, i.e., $d(j, h) < E_j$, where $E_j$ is the Euclidean distance between the j-th object and the second most similar representative. Then $C_{jih} = d(j, h) - d(j, i)$.
Note: $C_{jih}$ can be either negative or positive depending on the positions of j, i and h. Here only if