Distributed K-Median Clustering with Application to

Image Clustering

Aiyesha Ma and Ishwar K. Sethi

Department of Computer Science and Engineering

Oakland University

Rochester, Michigan

Abstract. Developing algorithms suitable for distributed environments is important as data becomes more distributed. This paper proposes a distributed K-Median clustering algorithm for use in a distributed environment with a centralized server, such as the Napster model in a peer-to-peer environment. Several

approximate methods for computing the median in a distributed environment are

proposed and analyzed in the context of the iterative K-Median algorithm.

The proposed algorithm allows the clustering of multivariate data while ensuring

that each cluster representative remains an item in the collection. This facilitates

exploratory analysis where retaining a representative in the collection is impor-

tant, such as imaging applications.

1 Introduction

The K-Means clustering algorithm is a well known and popular clustering technique,

with many applications. Its limitation is that the mean vector is a new vector that need not belong to the collection, so it is meaningful only in some applications. For example, if color is an attribute in the vector, the mean of the colors red and yellow would not make any sense. Using centroids rather than means is one variation that forces the resulting cluster representative to be an instance in the collection, thus avoiding nonsensical intermediate steps and results.

The centroid is also the $L_1$ multivariate median, sometimes referred to as the spatial

median. The centroid is deﬁned as the object for which the cost function, or sum of the

distances to all other objects in the cluster, is minimized. While the idea of K-median,

or centroid based clustering, is not novel in a non-distributed environment, this paper

presents this concept in the distributed environment. Unlike the mean, the median can-

not be computed in a distributed environment without extensive communication cost.

Thus, this paper explores and analyzes several approximate median computations that

can be used to perform K-median clustering in a distributed environment. The proposed

clustering algorithm is applied to an image collection to further demonstrate the results.
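The centroid definition above (the object that minimizes the sum of distances to all other objects in the cluster) can be sketched in a few lines. This is only an illustrative, non-distributed sketch: Euclidean distance is an assumption (the text does not fix a distance function), and the name `centroid` and the sample points are hypothetical.

```python
import math

def centroid(items):
    """Return the item whose summed distance to all items is smallest.

    Assumes Euclidean distance; any other metric could be substituted.
    """
    return min(items, key=lambda c: sum(math.dist(c, x) for x in items))

# Hypothetical one-cluster example with an outlier at x = 10.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
center = centroid(points)          # an actual member of `points`
mean_x = sum(p[0] for p in points) / len(points)  # 3.2, not a member
```

Unlike the mean, the returned center is always an element of the collection, which is the property the paper relies on.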

A brief background on distributed clustering is presented in Section 2. The proposed

distributed K-median clustering algorithm is presented in Section 3 along with approx-

imate median computation schemes and analysis. Section 4 describes and presents the

application to image clustering. Section 5 concludes the paper.

Ma A. and K. Sethi I. (2007).

Distributed K-Median Clustering with Application to Image Clustering.

In Proceedings of the 7th International Workshop on Pattern Recognition in Information Systems, pages 215-220

DOI: 10.5220/0002425402150220

 SciTePress

2 Background

While some effort has been made to parallelize a couple of popular clustering methods

(SOM [1], K-Means [2]), some parallelization methods are not applicable to distributed

environments with high communication costs. Furthermore, other clustering methods

may be more relevant to a particular application.

The k-means algorithm was first parallelized for an SPMD architecture using a message passing interface by Dhillon and Modha [2]. The algorithm is a two-part iterative

process that continues until either a steady state is reached or the maximum number of

iterations has occurred. The ﬁrst part of the algorithm computes the local centers or

cluster means at each processor, in parallel. The second part of the algorithm consists

of a round of communication in which the local means are communicated and global

cluster centers are computed. There are variations on this basic algorithm for loosely

coupled systems or systems with uneven distribution across nodes, where the round of

communication after each local computation may not be efﬁcient [3].
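The reason a single round of communication suffices for the mean can be made concrete: each processor only needs to send a per-cluster sum and count, and the combined result is exact. The sketch below illustrates that idea only; it is not the message-passing implementation of [2], and the function names and sample data are hypothetical.

```python
def local_summary(points):
    """Per-peer summary for one cluster: (coordinate-wise sum, count)."""
    dims = len(points[0])
    sums = [sum(p[d] for p in points) for d in range(dims)]
    return sums, len(points)

def global_mean(summaries):
    """Combine per-peer (sum, count) pairs into the exact global mean."""
    total = sum(count for _, count in summaries)
    dims = len(summaries[0][0])
    return [sum(s[d] for s, _ in summaries) / total for d in range(dims)]

# Two hypothetical peers holding members of the same cluster.
peer_a = [(0.0, 0.0), (2.0, 2.0)]
peer_b = [(4.0, 4.0)]
combined = global_mean([local_summary(peer_a), local_summary(peer_b)])
```

No comparably small summary yields an exact median, which is why the remainder of the paper works with approximations.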

Unlike the target environment of this paper, research on content-based retrieval in

P2P systems seems to focus on the decentralized arena. These systems use vector based

indexing systems for the retrieval, but in addition they organize the network topology so

that peers containing semantically similar documents are located near each other. Their

aim is thus to improve query efﬁciency by ﬁrst sending queries to those peers that are

likely to contain that type of semantic content [4–8]. Though much research has gone

into information retrieval in P2P environments, little research has gone into developing

methods for interactive browsing in P2P environments. In [9], the authors propose a

browsing method for documents.

3 Distributed K-median Clustering

The distributed k-median algorithm presented here follows the general distributed k-

means algorithm ﬁrst developed by Dhillon and Modha in [2]. Instead of computing

the mean vector as the cluster center, however, the cluster center is computed as an

approximate global median.

The approximate global median for each cluster is computed as the weighted median of the local representatives for that cluster. Let $X_P(C)$ be the set consisting of representatives and their weights, for peer $P$ for cluster $C$,

$$X_P(C) = \{(x_i, w_i) \mid x_i \text{ is a representative in the collection, } w_i \text{ is the number of items in the collection that } x_i \text{ represents}\}$$

Then, the approximate global median of all elements in cluster $C$,

$$\mathrm{Median}(C_{\mathrm{Global}}) = \mathrm{Median}(\{X_P(C) \mid \forall P\})$$

is computed by replicating each $x_i$, $w_i$ times, and computing the median. For example, if $X_{P=1}(C=1) = \{(P1x_1, 3), (P1x_2, 1)\}$ and $X_{P=2}(C=1) = \{(P2x_1, 2)\}$, then the global median would be $\mathrm{Median}(P1x_1, P1x_1, P1x_1, P1x_2, P2x_1, P2x_1)$.
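The replication step can be sketched without physically copying items: replicating a representative $w$ times is equivalent to multiplying its distance contribution by $w$. The sketch below assumes Euclidean distance and takes the median to be the representative with the smallest weighted cost; the name `weighted_median` and the coordinates are hypothetical, while the weights 3, 1, 2 mirror the example above.

```python
import math

def weighted_median(representatives):
    """representatives: (vector, weight) pairs gathered from all peers.

    Returns the representative minimizing the weighted sum of distances,
    which equals the median of the multiset in which each vector is
    replicated `weight` times.
    """
    def cost(candidate):
        return sum(w * math.dist(candidate, x) for x, w in representatives)
    return min((x for x, _ in representatives), key=cost)

# Hypothetical coordinates for P1x1, P1x2 (peer 1) and P2x1 (peer 2).
x_p1 = [((0.0, 0.0), 3), ((0.0, 4.0), 1)]
x_p2 = [((3.0, 0.0), 2)]
global_median = weighted_median(x_p1 + x_p2)
```

Working with weights rather than copies also accommodates fractional weights, should a representative selection scheme produce them.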

The distributed clustering algorithm is shown in Algorithm 1. Here, $D(x_i, C)$ is the distance between $x_i$ and cluster center $C$.


Algorithm 1 Distributed k-Median Clustering Algorithm.

    Select initial cluster centers, $C_k\ \forall k$
    repeat
        In Parallel do:
            for all $x_i \in P$ do
                for all $C$ do: Compute $D(x_i, C)$. end for
                Assign $x_i$ to cluster $C$, where $D(x_i, C)$ is minimized $\forall C$
            end for
            for all $C$ do: Compute $X_P(C)$. end for
            Communicate $X_P(C)\ \forall C$
        End In Parallel
        for all $C$ do: Compute $\mathrm{Median}(C_{\mathrm{Global}})$. end for
    until centers are stabilized
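A single-process simulation of Algorithm 1 may help make the control flow concrete. It is a sketch under several assumptions: peers are plain lists evaluated in a loop rather than in parallel, distance is Euclidean, each peer reports only its local median with the local cluster size as weight, and an empty cluster keeps its previous center. All function names are hypothetical.

```python
import math

def medoid(items, weights=None):
    """Item minimizing the (optionally weighted) sum of distances."""
    weights = weights or [1] * len(items)
    return min(items, key=lambda c: sum(
        w * math.dist(c, x) for x, w in zip(items, weights)))

def peer_step(local_items, centers):
    """Assign local items to the nearest center, then build X_P(C)."""
    clusters = {k: [] for k in range(len(centers))}
    for x in local_items:
        k = min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
        clusters[k].append(x)
    # Local-median variant: one (representative, weight) pair per cluster.
    return {k: [(medoid(m), len(m))] for k, m in clusters.items() if m}

def distributed_k_median(peers, centers, max_iter=10):
    for _ in range(max_iter):
        # Each peer_step would run on its own peer in a real deployment.
        summaries = [peer_step(items, centers) for items in peers]
        new_centers = []
        for k, old in enumerate(centers):
            reps = [pair for s in summaries for pair in s.get(k, [])]
            if not reps:                 # assumption: keep the old center
                new_centers.append(old)
                continue
            xs, ws = zip(*reps)
            new_centers.append(medoid(list(xs), list(ws)))
        if new_centers == centers:       # centers are stabilized
            break
        centers = new_centers
    return centers

# Two hypothetical peers and two initial centers.
peers = [[(0, 0), (0, 1), (10, 10)], [(1, 0), (10, 11), (11, 10)]]
final_centers = distributed_k_median(peers, [(0, 0), (10, 10)])
```

Only the `(representative, weight)` pairs cross the peer boundary, so communication per iteration is proportional to the number of clusters rather than the number of items.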

3.1 Selecting Local Representatives

Presented here are four methods for selecting the local representatives $X_P(C)$ for cluster $C$. In the following, $C_P$ refers to all items in cluster $C$ at peer $P$, and Size is the number of items in the set.

Local Median: The representative of the cluster is selected as the local median. While this approach works for accurately computing the mean value in a distributed environment, it is only an approximate method when computing the median.

$$X_P(C) = \{(\mathrm{Median}(C_P), \mathrm{Size}(C_P))\}$$

Random Sampling: $n$ representatives are randomly selected from each cluster.

$$X_P(C) = \{(R_i, \mathrm{Size}(C_P)/n) \mid R_i = \mathrm{Random}(C_P),\ i \in [1, n]\}$$

Semi-Structured Sampling: The local median and randomly chosen points are selected as representatives in such a way that all items in the cluster are within an associated hypersphere. Letting neighbors be defined as,

$$\mathrm{Neighbors}(x_i) = \{x_j \mid \forall x_j \in C_P,\ i \neq j,\ D(x_i, x_j) < \mathit{RepDist}\}$$

where $\mathit{RepDist}$ is a parameter setting, specifying the maximum radius of the spherical volume. Then the representatives are selected as,

$$X_P(C) = \{(\mathrm{Median}(C_P), \mathrm{Size}(\mathrm{Neighbors}(\mathrm{Median}(C_P))) + 1),\ (R_i, \mathrm{Size}(\mathrm{Neighbors}(R_i)) + 1)\}$$

$$R_i = \mathrm{Random}\Big(C_P \cap \overline{\mathrm{Median}(C_P) \cup \mathrm{Neighbors}(\mathrm{Median}(C_P))} \cap \overline{R_1 \cup \mathrm{Neighbors}(R_1)} \cap \overline{R_2 \cup \mathrm{Neighbors}(R_2)} \cap \ldots \cap \overline{R_{i-1} \cup \mathrm{Neighbors}(R_{i-1})}\Big)$$

The additional representatives, the $R_i$'s, are chosen until all items in the local cluster are represented. Thus $i \in [0, \mathrm{Size}(C_P)]$, meaning the maximum number of representatives is the size of the local cluster, and the minimum number of representatives is 1 (just the local median).


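Last Median: This approach uses two representatives: the local item nearest the last calculated global median, and the local median.

$$X_P(C) = \{(\mathrm{Nearest}(\mathrm{Median}(C_{\mathrm{Global}})), \mathrm{Closer}(\mathrm{Nearest}(\mathrm{Median}(C_{\mathrm{Global}})))),\ (\mathrm{Median}(C_P), \mathrm{Closer}(\mathrm{Median}(C_P)))\}$$

where $\mathrm{Nearest}(\mathrm{Median}(C_{\mathrm{Global}})) = x_i$ such that $\forall x_i, x_j \in C_P,\ i \neq j,\ D(x_i, \mathrm{Median}(C_{\mathrm{Global}})) < D(x_j, \mathrm{Median}(C_{\mathrm{Global}}))$. Closer indicates all items in the local cluster that are closer to one representative than the other. Thus, this is defined for $\mathrm{Nearest}(\mathrm{Median}(C_{\mathrm{Global}}))$ as the set $\{x_i \mid \forall x_i \in C_P,\ D(x_i, \mathrm{Nearest}(\mathrm{Median}(C_{\mathrm{Global}}))) < D(x_i, \mathrm{Median}(C_P))\}$, and for $\mathrm{Median}(C_P)$ this is defined as the set $\{x_i \mid \forall x_i \in C_P,\ D(x_i, \mathrm{Median}(C_P)) < D(x_i, \mathrm{Nearest}(\mathrm{Median}(C_{\mathrm{Global}})))\}$.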
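The four selection schemes above can be sketched as functions that each return a list of `(representative, weight)` pairs for one local cluster. This is an interpretation rather than the authors' code. Assumptions: Euclidean distance; random sampling draws with replacement; the Last Median weight is the size of the corresponding Closer set; and when the nearest item coincides with the local median, a single representative with the full cluster size is returned. All function and parameter names are hypothetical.

```python
import math
import random

def local_median(items):
    return min(items, key=lambda c: sum(math.dist(c, x) for x in items))

def rep_local_median(items):
    return [(local_median(items), len(items))]

def rep_random(items, n, rng=random):
    # Assumption: independent draws, so a point may be picked twice.
    return [(rng.choice(items), len(items) / n) for _ in range(n)]

def rep_semi_structured(items, rep_dist, rng=random):
    def neighbors(i):
        return {j for j in range(len(items))
                if j != i and math.dist(items[i], items[j]) < rep_dist}
    uncovered = set(range(len(items)))
    i = items.index(local_median(items))   # first representative
    reps = []
    while True:
        near = neighbors(i)
        reps.append((items[i], len(near) + 1))  # Size(Neighbors) + 1
        uncovered -= near | {i}
        if not uncovered:
            return reps
        i = rng.choice(sorted(uncovered))  # next R_i from unrepresented items

def rep_last_median(items, last_global_median):
    nearest = min(items, key=lambda x: math.dist(x, last_global_median))
    median = local_median(items)
    if nearest == median:                  # assumption: degenerate case
        return [(median, len(items))]
    closer_nearest = sum(
        1 for x in items if math.dist(x, nearest) < math.dist(x, median))
    closer_median = sum(
        1 for x in items if math.dist(x, median) < math.dist(x, nearest))
    return [(nearest, closer_nearest), (median, closer_median)]

# Hypothetical local cluster C_P.
items = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
semi = rep_semi_structured(items, rep_dist=1.5, rng=random.Random(0))
last = rep_last_median(items, last_global_median=(9.0, 0.0))
```

In the semi-structured sketch the weights can sum to more than the cluster size, because an item within `rep_dist` of two representatives is counted for both; this follows the Size(Neighbors) + 1 weighting as written above.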

3.2 Analysis of Approximate Global Medians

A test set of 80 vectors, $V$, of length 2 was generated, with 4 groups of 20 points. Each group was normally distributed ($\sigma = 1$) within a quadrant in the x-y plane. The vectors were assigned to 2 peers in three possible ways: Even Random ($\mathrm{Size}(V_P) = \mathrm{Size}(V)/P$), Uneven Expert ($\mathrm{Size}(V_P) \sim \mathrm{Normal}(\mu = \mathrm{Size}(V)/P, \sigma = 5)$, and peers were experts), and Mixed (uneven number of vectors, with 3/4 expert and 1/4 random). Each K-median algorithm was run with the same initial centers for a maximum of 10 iterations; convergence was usually around 3 or 4 iterations.

To compare the performance, the vectors were ranked by distance to the non-distri-

buted medians, and the performance was the average rank of the distributed medians.

The average performance over 100 test runs is shown in Figure 1(a). The Semi-

structured and Last-Median approaches tend to choose as medians points that are on

average only two or three points away from the median obtained by non-distributed

K-median clustering.
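The rank-based measure described above can be sketched as follows. The sketch assumes Euclidean distance and a rank of 0 when the distributed median coincides with the non-distributed one; `rank_of`, `average_rank`, and the sample vectors are hypothetical.

```python
import math

def rank_of(distributed_median, reference_median, vectors):
    """Position of the distributed median when all vectors are ordered
    by distance to the non-distributed (reference) median."""
    ordered = sorted(vectors, key=lambda v: math.dist(v, reference_median))
    return ordered.index(distributed_median)

def average_rank(median_pairs, vectors):
    """median_pairs: one (distributed, reference) median pair per cluster."""
    ranks = [rank_of(d, r, vectors) for d, r in median_pairs]
    return sum(ranks) / len(ranks)

vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
score = average_rank(
    [((2.0, 0.0), (2.0, 0.0)),    # identical medians: rank 0
     ((0.0, 0.0), (2.0, 0.0))],   # three positions away: rank 3
    vectors)
```

A lower average rank means the approximate medians sit closer, in terms of intervening points, to those of the non-distributed algorithm.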

The semi-structured sampling approach tends to use more representatives as the number of dimensions or the complexity of the data increases. Because of this phenomenon, the Last-Median approach may offer a better trade-off between communication (and overhead) costs and performance.

Fig. 1. Performance of proposed K-Median Clustering. (a) Test Points: Comparison of Approximate Median Approaches for Various Data Assignments. (b) Image Clustering: Comparison of distributed to non-distributed k-median clustering, with uneven expert distribution, K = 15, P = 40.


Fig. 2. Leftmost is cluster center, Spring-Flowers upper row, Cambridge lower row.

4 Application to Image Clustering

The experimental data set consisted of 7100 color images drawn from a CD photo collection, Benchathlon [http://www.benchathlon.net], and the University of Washington [http://www.cs.washington.edu/research/imagedatabase/groundtruth/]. Each feature vector consisted of a global histogram with 256 bins in the HSV (Hue, Saturation, Value) color space, with 16 bins in hue, 4 bins in saturation, and 4 bins in value. This feature vector was chosen to comply with the MPEG-7 specification of a color descriptor [10, 11], and to provide a base estimate of the performance.
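A 16 x 4 x 4 HSV histogram of the kind described above can be sketched with the standard library alone. This is not an implementation of the MPEG-7 descriptor: uniform bin boundaries, the bin ordering, and normalization to unit sum are all assumptions, and `hsv_histogram` is a hypothetical name.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 16, 4, 4   # 16 * 4 * 4 = 256 bins

def hsv_histogram(pixels):
    """pixels: iterable of (r, g, b) tuples with 0-255 channel values.

    Returns a 256-element histogram normalized to sum to 1.
    """
    hist = [0.0] * (H_BINS * S_BINS * V_BINS)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a value of exactly 1.0 falls in the last bin.
        hi = min(int(h * H_BINS), H_BINS - 1)
        si = min(int(s * S_BINS), S_BINS - 1)
        vi = min(int(v * V_BINS), V_BINS - 1)
        hist[(hi * S_BINS + si) * V_BINS + vi] += 1
        count += 1
    return [c / count for c in hist] if count else hist

# Hypothetical four-pixel image: two red, one blue, one black pixel.
hist = hsv_histogram([(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 0)])
```

Two such histograms can then be compared with whichever vector distance the clustering uses.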

Images were assigned to 40 peers by Uneven Random (not shown for brevity) and

Uneven Expert. The latter is more likely to be the case in practice, where a peer will

have a set of favorite topics. Clustering was performed using the semi-structured and

last median approaches, with k ∈ {9, 15, 40}. Clustering generally took between 4 and

7 iterations for all approaches.

The non-distributed clustering results were used as a baseline to analyze the distributed approaches. Let $I$ be an image in the collection, and suppose $I$ is in cluster $C_{\mathrm{ND}}$ of the non-distributed approach. Then we consider $\{\mathrm{Near}(I) \in C_{\mathrm{ND}}\}$ to be the relevant images to retrieve, where $\mathrm{Near}(I)$ are the closest $n$ images. If $I$ is in cluster $C_{\mathrm{D}}$ of the distributed approach, then $\{\mathrm{Near}(I) \in C_{\mathrm{D}}\}$ are the closest $m$ images retrieved. Thus, we define,

$$\mathrm{Precision} = \mathrm{Size}(\{\mathrm{Near}(I) \in C_{\mathrm{ND}}\} \cap \{\mathrm{Near}(I) \in C_{\mathrm{D}}\})/n$$

$$\mathrm{Recall} = \mathrm{Size}(\{\mathrm{Near}(I) \in C_{\mathrm{ND}}\} \cap \{\mathrm{Near}(I) \in C_{\mathrm{D}}\})/m$$

Tests were conducted with $n \in \{5, 10, 15, 20\}$ and $m \in \{5, 10, 15, 20, 30, \mathrm{ClusterSize}\}$.

Each image in the collection was used as a query image, and the results were averaged

over the entire collection for each test. Results for 15 clusters are shown in Figure 1(b).
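The evaluation for a single query image can be sketched as below. The denominators follow the definitions as printed above ($n$ for Precision, $m$ for Recall); note that the more common convention divides precision by the number of retrieved items. Excluding the query image from its own neighbor list, the use of Euclidean distance, and all names and data are assumptions.

```python
import math

def near(query, cluster, count, features):
    """Ids of the `count` images in `cluster` closest to `query`."""
    others = [i for i in cluster if i != query]
    others.sort(key=lambda i: math.dist(features[i], features[query]))
    return others[:count]

def precision_recall(query, cluster_nd, cluster_d, n, m, features):
    relevant = set(near(query, cluster_nd, n, features))   # baseline
    retrieved = set(near(query, cluster_d, m, features))   # distributed
    overlap = len(relevant & retrieved)
    return overlap / n, overlap / m

# Hypothetical feature vectors keyed by image id.
features = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0),
            3: (3.0, 0.0), 4: (9.0, 0.0)}
p, r = precision_recall(query=0, cluster_nd=[0, 1, 2, 3],
                        cluster_d=[0, 1, 2, 4], n=2, m=3,
                        features=features)
```

Averaging these two values over every image in the collection, for each $(n, m)$ setting, would give curves of the kind summarized in Figure 1(b).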

The two approaches performed similarly; however, the Last-Median approach favored larger k and faster peer computation times, while the semi-structured approach favored smaller k and faster centralized server computation times. The maximum distance

parameter for the semi-structured approach signiﬁcantly affects the performance results,

calculation times, and communication overhead. Thus the Last-Median approach may

perform better when less information is available.

As mentioned earlier, the target of this application is to facilitate indexing and

browsing of the image collection over a distributed network. Figure 2 depicts a cluster

center (left) and nearest ﬁve images within the cluster for one cluster when the image set

is unevenly distributed across 40 peers, and clustered with the Last-Median approach

and k = 40. Other results will be available at http://iielab-secs.secs.oakland.edu.


5 Conclusion

This paper presents a k-median clustering approach for use in a distributed environment,

such as a peer-to-peer system. While the presented approach uses the Napster model of

a centralized coordinator and index, the clustering method could be extended to de-

centralized models by deciding on a communication scheme.

This paper compared several methods for computing an approximate median using

only summary data for each peer and the approaches were analyzed within the con-

text of the k-median clustering algorithm. It was noted that variations in data distribu-

tion (such as random versus expert) affected the performance of the proposed methods.

Overall, two approaches (Semi-Structured Sampling and Last Median) performed well regardless of the data distribution, but each involved other trade-offs.

The results of image clustering showed that enough similarities exist between the

clusters produced with the non-distributed clustering and those produced with dis-

tributed clustering to ensure that browsing and indexing methods using the approximate

approaches in the distributed environment are possible. Furthermore, the clustering algorithm worked well given the limitations of the feature vector used.

References

1. Lawrence, R.D., Almasi, G.S., Rushmeier, H.E.: A scalable parallel algorithm for self-

organizing maps with applications to sparse data mining problems. Data Mining and Knowl-

edge Discovery 3 (1999) 171–195

2. Dhillon, I.S., Modha, D.S.: A data clustering algorithm on distributed memory multiproces-

sors. Large-Scale Parallel Data Mining, Lecture Notes in Artiﬁcial Intelligence 1759 (2000)

245–260

3. Jin, R., Goswami, A., Agrawal, G.: Fast and exact out-of-core and distributed k-means

clustering. Knowledge and Information System Journal (2005) Online ﬁrst.

4. Müller, W., Henrich, A.: Fast retrieval of high-dimensional feature vectors in P2P networks

using compact peer data summaries. In: MIR ’03: Proceedings of the 5th ACM SIGMM

international workshop on Multimedia information retrieval, New York, NY, USA, ACM

Press (2003) 79–86

5. Müller, W., Eisenhardt, M., Henrich, A.: Scalable summary based retrieval in P2P networks.

In: CIKM ’05: Proceedings of the 14th ACM international conference on Information and

knowledge management, New York, NY, USA, ACM Press (2005) 586–593

6. Blanquer, I., Hernández, V., Mas, F.: A P2P platform for sharing radiological images and diagnoses. In: Proc. Medical Image Computing and Computer Assisted Intervention (MICCAI).

(2004)

7. King, I., Ng, C.H., Sia, K.C.: Distributed content-based visual information retrieval system

on peer-to-peer networks. ACM Trans. Inf. Syst. 22 (2004) 477–501

8. Yang, Z.: Interactive content-based image retrieval in the peer-to-peer network using self-

organizing maps. In: HUT T-110.551 Seminar on Internetworking. (2005)

9. Fischer, G., Nurzenski, A.: Towards scatter/gather browsing in a hierarchical peer-to-peer

network. In: P2PIR’05: Proceedings of the 2005 ACM workshop on Information retrieval in

peer-to-peer networks, New York, NY, USA, ACM Press (2005) 25–32

10. Manjunath, B.S., Ohm, J.R., Vasudevan, V.V., Yamada, A.: Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology 11 (2001)

11. Sikora, T.: The MPEG-7 visual standard for content description – an overview. IEEE Transactions on Circuits and Systems for Video Technology 11 (2001)
