PRIVACY PRESERVING k-MEANS CLUSTERING IN MULTI-PARTY ENVIRONMENT

Saeed Samet, Ali Miri

School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5

Luis Orozco-Barbosa

Instituto de Investigacion en Informatica, Universidad de Castilla-La Mancha, 02071 Albacete, Spain

Keywords:

Data mining, Clustering, classiﬁcation, and association rules, Mining methods and algorithms, Security and

Privacy Protection, Distributed data structures.

Abstract:

Extracting meaningful and valuable knowledge from databases is often done by various data mining algorithms. Nowadays, databases are often distributed among two or more parties for reasons such as physical and geographical restrictions, and the most important concern in this setting is privacy. Related data is normally maintained by more than one organization, each of which wants to keep its individual information private. Thus, privacy-preserving techniques and protocols are designed to perform data mining in distributed environments when privacy is a major concern. Cluster analysis is a data mining technique by which data can be divided into meaningful clusters, and it plays an important role in fields such as bio-informatics, marketing, machine learning, climate and medicine. k-means clustering is a prominent algorithm in this category which creates a one-level clustering of data. In this paper we introduce privacy-preserving protocols for this algorithm, along with a protocol for secure comparison, known as the Millionaires' Problem, as a sub-protocol, to handle the clustering of horizontally or vertically partitioned data among two or more parties.

1 INTRODUCTION

Clustering algorithms have been widely applied in

several applications, such as bio-informatics, market-

ing and medicine. In many of these applications se-

cure data is retrieved and stored by different organi-

zations, and thus privacy cannot be compromised in

most cases. Distribution of data could be horizontal,

i.e. each party owns some tuples of data, or vertical,

i.e. each party owns some attributes of data. Privacy-

preserving protocols are needed in these situations.

The k-means Clustering algorithm is a simple and rel-

atively efﬁcient way to cluster data using artiﬁcial at-

tributes. The standard algorithm for this technique has

to be modiﬁed such that involved parties can jointly

and securely produce k clusters and assign each data

entity to the closest one. This paper makes the fol-

lowing contributions in this area of research:

1. A protocol for k-means Clustering when data is

horizontally partitioned among two or more par-

ties, maintaining the privacy of each party.

2. A new technique for secure comparison.

3. A new protocol for the vertically partitioned case.

The rest of this paper is organized as follows: Section 2 is dedicated to a definition of k-means clustering and some related work. In Section 3, a protocol for horizontally partitioned data among multiple parties is introduced. In Section 4, a simple and efficient protocol for secure comparison is presented, which is used in the protocol for the vertically partitioned case. A protocol for the vertically partitioned case is described in Section 5, followed by conclusions and future work in Section 6.

2 CLUSTERING AND RELATED

WORK

Privacy issues in data mining techniques have been

widely studied and examined. Different protocols

have been presented for standard algorithms such as

decision trees, association rules, and clustering. In

this paper, we focus on the latter. Therefore, we ﬁrst


Samet S., Miri A. and Orozco-Barbosa L. (2007).

PRIVACY PRESERVING k-MEANS CLUSTERING IN MULTI-PARTY ENVIRONMENT.

In Proceedings of the Second International Conference on Security and Cryptography, pages 381-385

DOI: 10.5220/0002121703810385

 SciTePress

explain the clustering problem and its standard algo-

rithm for k-means. Different algorithms exist in clus-

tering for use according to the underlying application

and type of data. Each has strengths and weaknesses.

Partitional, hierarchical (nested), and fuzzy are exam-

ples of existing algorithms in clustering. This paper

deals with k-means clustering in the partitional case.

In this technique, at ﬁrst k artiﬁcial entities are pro-

duced as the initial means. Then, each data entity

(record or row) is assigned to the closest mean. In the

next step, based on the entities in each cluster, cen-

troids are updated. The last two steps are repeated

until the means remain unchanged or the difference

between any new center and its corresponding previous value is less than a specific threshold. Algorithm 1 (Duda et al., 2000) shows the complete algorithm for k-means clustering.

Algorithm 1 k-means Clustering Algorithm.

1. Determine k entities as the initial means

2. repeat

3. Assign each data entity to the closest mean

4. Reconstruct the mean of each cluster

5. until means do not change

The distance function in the k-means clustering algorithm could be a common distance metric such as Euclidean, Manhattan or Minkowski. Here we compute the distance between two m-dimensional vectors X and Y as the sum of squared component differences:

$$\sum_{i=1}^{m} (x_i - y_i)^2$$

where $x_i$ and $y_i$ are the i-th elements of the vectors X and Y respectively. Also, the centroid, µ, of a cluster containing $\{X_1, \cdots, X_n\}$ is

$$\mu = \frac{X_1 + \cdots + X_n}{n}$$
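The steps of Algorithm 1 can be sketched in Python as follows. This is a plain, non-private version for a single data holder, given only to fix the notation used later. The function names, the `max_iter` cap, and the seeded random choice of initial means are assumptions of this sketch and do not come from the paper.

```python
import random

def squared_distance(x, y):
    """Sum of squared component differences between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))

def k_means(data, k, max_iter=100, seed=0):
    """Plain k-means following Algorithm 1; data is a list of equal-length tuples."""
    rng = random.Random(seed)
    means = rng.sample(data, k)                  # step 1: initial means
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                    # step 2: repeat
        clusters = [[] for _ in range(k)]
        for point in data:                       # step 3: assign to the closest mean
            j = min(range(k), key=lambda c: squared_distance(point, means[c]))
            clusters[j].append(point)
        # step 4: recompute each mean; an empty cluster keeps its previous mean
        new_means = [centroid(c) if c else means[j] for j, c in enumerate(clusters)]
        if new_means == means:                   # step 5: stop when means do not change
            break
        means = new_means
    return means, clusters
```

A threshold on the change of each mean, as mentioned in the text, could replace the exact equality test in step 5.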

There are two main approaches to maintaining privacy. The first uses data transformation and perturbation, while the second applies Secure Multi-party Computation (SMC) techniques. Some protocols have been presented for the former, such as (Oliveira and Zaiane, 2003; Merugu and Ghosh, 2003), but in this paper we consider the second approach. Jha et al. (Jha et al., 2005) present a protocol for horizontally partitioned data between two parties. They introduce two secure techniques for this case: one uses the Oblivious Polynomial Evaluation (OPE) protocol (Naor and Pinkas, 2001), and the second uses homomorphic encryption but does not provide a strong proof of security. In both techniques, one party selects and uses a random private number. However, the second party, by using two values received from the first party and computing their common divisors, is able to considerably reduce the number of possible private shares of the first party. Also, these techniques apply only to the two-party case. Vaidya and Clifton (Vaidya and Clifton, 2003) worked on the vertically partitioned case in the multi-party environment. They use Yao's Secure Circuit Evaluation protocol (Yao, 1986) to secure the add-and-compare function, and the permutation algorithm developed by Du and Atallah (Du and Atallah, 2001) using homomorphic encryption. However, their protocol requires three non-colluding sites and is not applicable to two parties. The use of k-means clustering over arbitrarily partitioned data was introduced by Jagannathan and Wright (Jagannathan and Wright, 2005), but it only works for two parties and cannot be extended to multiple parties. Jagannathan et al. (Jagannathan et al., 2006) present another algorithm for horizontally partitioned data between two parties. This technique does not reveal intermediate information and is I/O efficient. They use a "Divide, Conquer and Combine" model and recursively create k cluster centers for each half of the current data and merge them into k means.

3 PRIVACY-PRESERVING

ALGORITHM FOR

HORIZONTALLY

PARTITIONED DATA

In this section, we present a protocol for k-means clustering in horizontally distributed data where the privacy of each party is preserved. For a database D, suppose each party $P_i$ ($1 \le i \le n$) owns a subset $D_i$ of D containing some entities such that $D_i \cap D_j = \emptyset$ for any $1 \le i, j \le n$ with $i \ne j$, and $\bigcup_{1 \le i \le n} D_i = D$. Now, these parties want to jointly cluster their records without revealing their individual information. After selecting the initial k means, each party computes the distance from its entities to the centroids and assigns each entity to the closest one. This step can be done separately, because each entity belongs entirely to one party. The next step in each iteration is recomputing the k means based on the new clusters. This computation should be done jointly by all parties. To find the j-th mean, $\mu_j$ ($1 \le j \le k$), all vectors in the j-th cluster are involved. Suppose $l_{ij}$ is the summation of all vectors in party $P_i$ which belong to the j-th cluster, and $r_{ij}$ is the number of these vectors. Therefore, the new $\mu_j$ would be:

SECRYPT 2007 - International Conference on Security and Cryptography


$$\mu_j = \frac{\sum_{i=1}^{n} l_{ij}}{\sum_{i=1}^{n} r_{ij}}$$

However, they cannot simply send this information to each other or to a third party because of privacy concerns. We present a multi-party protocol P for computing each $\mu_j$.
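To make the target of the protocol concrete, the sketch below computes the same quantity in the clear: each party's local vector sum and count for one cluster, then the ratio of the pooled sums to the pooled counts. It is deliberately not private, since it pools every party's aggregates in one place, which is exactly what the secure protocol avoids. The function names and the use of exact fractions over integer inputs are assumptions of this sketch.

```python
from fractions import Fraction

def local_aggregates(vectors, dim):
    """One party's contribution for one cluster: (vector sum l, count r)."""
    l = tuple(sum(v[d] for v in vectors) for d in range(dim))
    return l, len(vectors)

def new_mean_in_the_clear(aggregates):
    """NOT private: combine all parties' (l, r) pairs into the new mean.

    Raises ZeroDivisionError if no party holds a vector in this cluster.
    """
    dim = len(aggregates[0][0])
    total_count = sum(r for _, r in aggregates)
    return tuple(
        Fraction(sum(l[d] for l, _ in aggregates), total_count)
        for d in range(dim)
    )
```

For example, if one party holds (1, 2) and (3, 4) in the cluster and another holds (5, 6), the pooled sum is (9, 12), the pooled count is 3, and the new mean is (3, 4).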

3.1 Secure Multi-party Division

There are n parties, each of which has two values $x_i$ and $y_i$, and they want to securely compute:

$$\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i} \qquad (1)$$

First, by using secure multi-party addition they separately compute $r_i$'s and $s_i$'s such that:

$$\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} r_i, \qquad \sum_{i=1}^{n} y_i = \prod_{i=1}^{n} s_i$$

Each party $P_i$ then locally computes the ratio $t_i = r_i / s_i$ of its two shares. One party, say $P_1$, receives $t_i$ ($2 \le i \le n$) from the other parties, computes $\prod_{i=1}^{n} t_i$, which is equal to expression (1), and sends the result to the other parties. The authors of this paper present a solution for secure multi-party addition in (Samet and Miri, 2006), and a generalization of two-party addition to the multi-party case is introduced in (Xiao et al., 2005). Here, we briefly explain these two techniques.
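The arithmetic behind the division step can be checked with a small sketch. The helper `multiplicative_shares` is a hypothetical stand-in for the output of secure multi-party addition: it sees all inputs and is therefore not secure, and it only produces shares whose product equals the sum. Exact rational arithmetic and the random ranges are further assumptions made here.

```python
import random
from fractions import Fraction
from math import prod

def multiplicative_shares(values, rng):
    """Stand-in for secure multi-party addition (NOT secure, sees all inputs).

    Returns one share per party such that the product of the shares
    equals sum(values).
    """
    shares = [Fraction(rng.randint(1, 1000), rng.randint(1, 1000))
              for _ in values[:-1]]
    shares.append(Fraction(sum(values)) / prod(shares, start=Fraction(1)))
    return shares

def secure_division(xs, ys, seed=0):
    """Return sum(xs) / sum(ys) by multiplying the per-party ratios t_i."""
    rng = random.Random(seed)
    r = multiplicative_shares(xs, rng)       # product of r_i equals sum of x_i
    s = multiplicative_shares(ys, rng)       # product of s_i equals sum of y_i
    t = [ri / si for ri, si in zip(r, s)]    # each party computes t_i locally
    return prod(t, start=Fraction(1))        # P_1 multiplies the received t_i
```

The sketch assumes the sum of the $y_i$ is nonzero, which holds in the clustering setting whenever the cluster is non-empty.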

3.1.1 Secure Multi-party Addition

Suppose n parties, each of which has a value $x_i$, want to run a protocol and at the end, each party obtains its own private output share $r_i$ such that:

$$\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} r_i \qquad (2)$$

without revealing the $x_i$'s and $r_i$'s to each other. The base algorithm is applied to two parties. Therefore, we first present the protocol for $x_1 + x_2 = r_1 * r_2$:

• $P_1$ randomly selects $r_1 \ne 0$ and creates the vector $X_1 = (\frac{x_1}{r_1}, \frac{1}{r_1})$

• $P_2$ creates the vector $X_2 = (1, x_2)$

• $P_1$ and $P_2$ run the Secure Dot Product (SDP), and $P_2$ obtains the result of the dot product, $r_2 = X_1 \cdot X_2 = (\frac{x_1}{r_1}, \frac{1}{r_1}) \cdot (1, x_2) = \frac{x_1 + x_2}{r_1} \Rightarrow x_1 + x_2 = r_1 * r_2$
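The two-party steps above can be traced with the following sketch. The dot product is computed in the clear by a hypothetical `dot_product_stand_in`; in an actual run it would be replaced by an SDP protocol so that neither vector is revealed. The random range for $r_1$ and the use of exact fractions are assumptions of this sketch.

```python
import random
from fractions import Fraction

def dot_product_stand_in(u, v):
    """Plain dot product; a placeholder for a Secure Dot Product protocol."""
    return sum(a * b for a, b in zip(u, v))

def two_party_addition(x1, x2, rng):
    """Return shares (r1, r2) with x1 + x2 == r1 * r2."""
    # P_1: pick a random nonzero r1 and build X1 = (x1 / r1, 1 / r1)
    r1 = Fraction(rng.choice([-1, 1]) * rng.randint(1, 1000), rng.randint(1, 1000))
    vec1 = (Fraction(x1) / r1, 1 / r1)
    # P_2: build X2 = (1, x2)
    vec2 = (Fraction(1), Fraction(x2))
    # P_2's output share is the dot product (x1 + x2) / r1
    r2 = dot_product_stand_in(vec1, vec2)
    return r1, r2
```

With inputs 7 and 5, any nonzero $r_1$ yields $r_2 = 12 / r_1$, so the product of the two shares is always 12.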

Now suppose there are three parties $P_1$, $P_2$, and $P_3$:

• $P_1$ randomly divides its value, $x_1$, into $x'_1$ and $x''_1$ such that $x_1 = x'_1 + x''_1$, and selects a random value $r_1$

• $P_1$ and $P_2$ run the previous protocol for their inputs $x'_1$ and $x_2$ respectively. $P_2$ obtains $s_2$ such that $x'_1 + x_2 = r_1 * s_2$

• $P_1$ and $P_3$ do the same for their inputs $x''_1$ and $x_3$. $P_3$ obtains $s_3$ such that $x''_1 + x_3 = r_1 * s_3$

• $P_2$ and $P_3$ run the previous protocol for their inputs $s_2$ and $s_3$ respectively, and obtain $r_2$ and $r_3$ such that $s_2 + s_3 = r_2 * r_3$. Now we have:

$$x_1 + x_2 + x_3 = (s_2 + s_3) * r_1 = r_1 * r_2 * r_3$$
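The three-party construction can be checked numerically with the sketch below. It assumes that $P_1$ is the party that splits its input and reuses the same $r_1$ in both two-party runs. As before, the dot product is computed in the clear instead of through SDP, and the helper names and random ranges are hypothetical.

```python
import random
from fractions import Fraction

def random_nonzero(rng):
    """Random nonzero rational; the range is an arbitrary choice for this sketch."""
    return Fraction(rng.choice([-1, 1]) * rng.randint(1, 1000), rng.randint(1, 1000))

def two_party_share(x_a, x_b, r_a):
    """Two-party addition where the first party has already chosen r_a.

    Returns the second party's share r_b with x_a + x_b == r_a * r_b.
    The dot product is computed in the clear; a real run would use SDP.
    """
    vec_a = (Fraction(x_a) / r_a, 1 / r_a)
    vec_b = (Fraction(1), Fraction(x_b))
    return sum(a * b for a, b in zip(vec_a, vec_b))

def three_party_addition(x1, x2, x3, rng):
    """Return (r1, r2, r3) with x1 + x2 + x3 == r1 * r2 * r3."""
    r1 = random_nonzero(rng)                    # P_1's random value
    x1_a = Fraction(rng.randint(-1000, 1000))   # random split x1 = x1_a + x1_b
    x1_b = x1 - x1_a
    s2 = two_party_share(x1_a, x2, r1)          # x1_a + x2 = r1 * s2
    s3 = two_party_share(x1_b, x3, r1)          # x1_b + x3 = r1 * s3
    r2 = random_nonzero(rng)                    # P_2 picks its share
    r3 = two_party_share(s2, s3, r2)            # s2 + s3 = r2 * r3
    return r1, r2, r3
```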

Therefore, $r_1$, $r_2$, and $r_3$ as the final output shares satisfy the protocol. This algorithm can be extended to the multi-party case to generate output $r_i$'s from inputs $x_i$'s such that equation (2) is satisfied. Checking the loop condition of the k-means clustering algorithm, which is comparing previous and new means, can be performed publicly because all the parties have the value of the centroids. To show the security of the protocol we have to check the secure multi-party division. Due to limited space, we consider two parties. The proof of the multi-party case is the same.

Theorem 3.1 The protocol P for jointly computing $\frac{x+y}{m+n}$, such that (x, m) belongs to $P_1$ and (y, n) belongs to $P_2$, is secure, i.e. the privacy of the input pair of each party is preserved.

Proof 1 At the end of the protocol P, $P_1$ and $P_2$ have the following information:

$$I_1(x, m) = (x, m, r_1, s_1), \qquad I_2(y, n) = (y, n, r_2, s_2)$$

such that $\frac{r_1 * r_2}{s_1 * s_2} = \frac{x+y}{m+n}$. As we see, both parties are in the same situation at the end of the protocol with regard to the information they obtain. Thus, it is enough to prove the security of one party, say $P_2$. First of all, there is no dependency between the values of $r_2$ and $s_2$, because $r_2$ is $P_2$'s output share for the secure addition of x and y, and $s_2$ is $P_2$'s output share for the secure addition of m and n. Also, the only information that $P_1$ receives from $P_2$ is the ratio of $r_2$ to $s_2$. For any given value $t_2$ from party $P_2$, there exist several possible pairs $(r_2, s_2)$ with the same value of $t_2$ that lead to the same final result of $\frac{x+y}{m+n}$. Therefore, $P_2$ is information-theoretically secure (and the same situation happens for $P_1$). In addition, the advantage of an adversary in finding $P_2$'s private shares $r_2$ and $s_2$ is the same as randomly guessing among all the possible pairs $(r'_2, s'_2)$ such that $\frac{r'_2}{s'_2} = t_2$.

A security analysis of SDP can be found in (Malek and Miri, 2006).


4 A PROTOCOL FOR SECURE

COMPARISON

In the case of vertically partitioned data, we need to securely compare the values owned by two parties while the individual value of each party has to be kept private. In this section we present a new, simple and efficient solution for this problem. Suppose two parties $P_1$ and $P_2$, each of which has an input number, $x_1$ for $P_1$ and $x_2$ for $P_2$, want to compare these numbers in such a way that neither learns the other's input. The only information they will obtain at the end of the protocol is which has the greater value. Yao (Yao, 1982) presents the problem and a solution for it, but it uses a boolean circuit of the comparison operation, which needs a large number of communication rounds and oblivious transfers. Other protocols for secure comparison are presented in (Peng et al., 2004) and (Ioannidis and Grama, 2003). We present a simple solution for this problem by using the Secure Two-party Addition protocol. $P_1$ and $P_2$ perform the following steps:

• $P_1$ randomly selects a nonzero number $l_1$ and sets its vector $X_1 = (\frac{1}{l_1}, \frac{x_1}{l_1})$, and $P_2$ sets its vector $X_2 = (-x_2, 1)$.

• They run SDP and $P_2$ obtains its output $l_2$ such that $x_1 + (-x_2) = x_1 - x_2 = l_1 * l_2$

• $P_2$ sends the sign of $l_2$ to $P_1$. If $l_2 = 0$, i.e. $x_1 = x_2$, $P_2$ sends a flag indicating that the inputs are equal.

• $P_1$ checks the following comparisons:

– If $P_1$ receives the flag then $x_1 = x_2$

– If $Sign(l_1) = Sign(l_2)$ then $x_1 > x_2$

– If $Sign(l_1) \ne Sign(l_2)$ then $x_1 < x_2$

• $P_1$ sends the result of the comparison to $P_2$
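The comparison steps can be traced with the following sketch. It assumes the reading in which $P_2$ sends the sign of its output $l_2$ to $P_1$, and it computes the dot product in the clear as a placeholder for SDP. The function name, the string results, and the random range for $l_1$ are assumptions of this sketch.

```python
import random
from fractions import Fraction

def secure_compare(x1, x2, rng):
    """Return 'greater', 'less' or 'equal' for x1 relative to x2."""
    # P_1: random nonzero l1 (either sign) and vector X1 = (1 / l1, x1 / l1)
    l1 = Fraction(rng.choice([-1, 1]) * rng.randint(1, 1000), rng.randint(1, 1000))
    vec1 = (1 / l1, Fraction(x1) / l1)
    # P_2: vector X2 = (-x2, 1)
    vec2 = (Fraction(-x2), Fraction(1))
    # Placeholder for SDP: P_2 obtains l2 = (x1 - x2) / l1
    l2 = sum(a * b for a, b in zip(vec1, vec2))
    if l2 == 0:
        return "equal"            # P_2 sends the equality flag
    # P_2 sends only the sign of l2; P_1 compares it with the sign of l1
    same_sign = (l1 > 0) == (l2 > 0)
    return "greater" if same_sign else "less"
```

Because $l_1 * l_2 = x_1 - x_2$, equal signs mean the difference is positive and different signs mean it is negative, whatever value of $l_1$ was drawn.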

This protocol is very simple and efficient because of the use of secure addition and SDP, which have linear communication overhead. Also, the parties exchange the sign of an output only once. This protocol is secure because at first it uses SDP to produce private outputs for the two parties, and in the next step, $P_1$, by receiving the sign of $P_2$'s output, has no information about $P_2$'s input and output. Also, $P_2$ only receives the final result of the comparison.

5 PRIVACY-PRESERVING

ALGORITHM FOR

VERTICALLY PARTITIONED

DATA

A database is vertically distributed among n parties when each party $P_i$ has the information of some attributes (columns) of all entities in the database. Therefore, in contrast to the horizontal case, finding the means at each iteration of the algorithm can be done separately, because the information for each attribute is maintained by one party and this party can compute the mean value of the corresponding components. The problem is in the step where entities have to be assigned to the closest cluster. Each party has only the information of some attributes, and thus they have to jointly and securely compute the distance of each entity to the current centroids. Suppose there are n parties $P_1$ to $P_n$, each of which has a set of attributes. We denote the set of attributes owned by $P_i$ as $A_i = \{a_{i1}, a_{i2}, \cdots, a_{im_i}\}$. For each mean vector $\mu_j$, $P_i$ has the value of the components corresponding to these attributes, $\{\mu_{j,i1}, \mu_{j,i2}, \cdots, \mu_{j,im_i}\}$. To compute the distance from one entity to a centroid $\mu_j$, each party can compute its portion first. For instance, $P_i$'s portion is:

$$(a_{i1} - \mu_{j,i1})^2 + (a_{i2} - \mu_{j,i2})^2 + \cdots + (a_{im_i} - \mu_{j,im_i})^2$$

We denote this value as $d_{ij}$. Thus the distance from an entity to the centroid $\mu_j$ is:

$$d_{1j} + d_{2j} + \cdots + d_{nj}$$

For another centroid $\mu_h$ we have the same formula:

$$d_{1h} + d_{2h} + \cdots + d_{nh}$$

We have to compare these two values to know which mean is closer to the entity. First, each party $P_i$ computes $d_{ij} - d_{ih}$ and denotes it as $d_i$. Then, they use Secure Sum (Clifton et al., 2002) to compute $\sum_{i=1}^{n} d_i$. If the result is negative, $\mu_j$ is closer to that entity; otherwise $\mu_h$ is closer. This step is repeated for the selected mean with the next one until the closest mean is found. In secure sum, if no two parties $P_i$ and $P_{i+2}$ collude with each other, no individual value will be revealed. To prevent this type of attack, parties can do the secure sum in more than one round with random order. The only possible issue in the use of the secure sum can happen in the case of only two parties, because the sum together with a party's own value reveals the other party's value. Suppose $P_1$ and $P_2$ vertically share a database and for an entity e, $P_1$ has $d_{11}$ and $d_{12}$ and $P_2$ has $d_{21}$ and $d_{22}$ for $\mu_1$ and $\mu_2$ respectively. They have to compare $d_{11} + d_{21}$ with $d_{12} + d_{22}$. If $(d_{11} - d_{12}) + (d_{21} - d_{22}) < 0$ then e is closer to $\mu_1$, otherwise it is closer to $\mu_2$. Thus, $P_1$ and $P_2$ can run the secure comparison protocol presented in Section 4. Their inputs are $d_{11} - d_{12}$ for $P_1$, and $d_{22} - d_{21}$ for $P_2$. Therefore, they can jointly decide which mean is closer to the entity e.
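The multi-party assignment step can be sketched with a ring-based secure sum, one common formulation of the technique attributed to Clifton et al. (2002): the initiating party masks its value with a random number, each party adds its own value modulo a public modulus, and the initiator removes the mask at the end. The modulus, the centered decoding of negative sums, the integer inputs, and all names below are assumptions of this sketch. It simulates the message passing in a single process and does not address collusion.

```python
import random

def secure_sum(values, modulus, rng):
    """Ring-based secure sum sketch.

    values[i] is party i's private integer. The true sum must lie strictly
    between -modulus / 2 and modulus / 2 so that it can be decoded.
    """
    mask = rng.randrange(modulus)             # known only to the initiator P_1
    running = (mask + values[0]) % modulus    # P_1 sends its masked value onward
    for v in values[1:]:                      # every later party adds its value
        running = (running + v) % modulus
    total = (running - mask) % modulus        # P_1 removes the mask
    # Decode residues above modulus / 2 as negative numbers
    return total - modulus if total > modulus // 2 else total

def closer_centroid(portions_j, portions_h, modulus=2**61 - 1, seed=0):
    """Return 'j' if the entity is closer to mu_j, otherwise 'h'.

    portions_j[i] and portions_h[i] are party i's local distance portions.
    """
    rng = random.Random(seed)
    diffs = [dj - dh for dj, dh in zip(portions_j, portions_h)]  # local d_i
    return "j" if secure_sum(diffs, modulus, rng) < 0 else "h"
```

For two parties, the text instead runs the secure comparison of Section 4 on the inputs $d_{11} - d_{12}$ and $d_{22} - d_{21}$, since a two-party sum would expose each party's difference to the other.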

6 CONCLUSIONS AND FUTURE

WORK

Clustering is a method to categorize information into

meaningful partitions to make data analysis simpler

and more accurate. This technique has a wide range of

applications in the real world and also as a utility for

data summarization and compression. In many cases,

privacy is crucial and secure protocols are needed to

perform clustering in order to preserve the privacy of

shareholders. Two multi-party protocols for privacy-

preserving k-means clustering are presented for hor-

izontally and vertically partitioned data, along with

a protocol for secure two-party comparison. These

SMC techniques are based on secure multi-party ad-

dition and division sub-protocols. There are many

different clustering algorithms such as k-means, k-

medoid, and Agglomerative Hierarchical clustering.

Most existing work in privacy-preserving clustering

uses k-means. One possible extension of this work is

to design protocols for other algorithms, particularly

hierarchical clustering.

REFERENCES

Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., and Zhu,

M. Y. (2002). Tools for privacy preserving data min-

ing. SIGKDD Explorations, 4(2):28–34.

Du, W. and Atallah, M. (2001). Privacy-preserving co-

operative statistical analysis. In Proc. of the 17th

Annual Computer Security Applications Conference,

pages 102–110.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern

Classiﬁcation (2nd ed). John Wiley.

Ioannidis, I. and Grama, A. (2003). An efficient protocol for Yao's millionaires' problem. In Proc. of the 36th Annual Hawaii International Conference on System Science, pages 205–211.

Jagannathan, G., Pillaipakkamnatt, K., and Wright, R. N.

(2006). A new privacy-preserving distributed k-

clustering algorithm. In Proc. of the 2006 SIAM In-

ternational Conference on Data Mining.

Jagannathan, G. and Wright, R. N. (2005). Privacy-

preserving distributed k-means clustering over arbi-

trarily partitioned data. In Proceeding of the 11th

ACM SIGKDD international conference on Knowl-

edge discovery in data mining, pages 593–599.

Jha, S., Kruger, L., and McDaniel, P. (2005). Privacy pre-

serving clustering. In Proc. of the 10th European Sym-

posium on Research in Computer Security, pages 397–

417.

Malek, B. and Miri, A. (2006). Secure dot-product protocol

using trace functions. 2006 IEEE International Sym-

posium on Information Theory.

Merugu, S. and Ghosh, J. (2003). Privacy-preserving dis-

tributed clustering using generative models. In Proc.

of the 3rd IEEE International Conference on Data

Mining, pages 211–218.

Naor, M. and Pinkas, B. (2001). Efﬁcient oblivious trans-

fer protocols. In Proc. of the 12th annual ACM-SIAM

symposium on Discrete algorithms, pages 448–457.

Oliveira, S. R. M. and Zaiane, O. R. (2003). Privacy preserving clustering by data transformation. In Proc. of the 18th Brazilian Symposium on Databases, pages 304–318.

Peng, K., Boyd, C., Dawson, E., and Lee, B. (2004). An ef-

ﬁcient and veriﬁable solution to the millionaire prob-

lem. In Proc. of the 7th International Conference on

Information Security and Cryptology, pages 51–66.

Samet, S. and Miri, A. (2006). Privacy preserving ID3 using

Gini Index over horizontally partitioned data. Submit-

ted.

Vaidya, J. and Clifton, C. (2003). Privacy-preserving k-

means clustering over vertically partitioned data. In

Proc. of the 9th ACM SIGKDD international confer-

ence on Knowledge discovery and data mining, pages

206–215.

Xiao, M.-J., Huang, L.-S., Luo, Y.-L., and Shen, H. (2005).

Privacy preserving ID3 algorithm over horizontally

partitioned data. In Parallel and Distributed Comput-

ing, Applications and Technologies, pages 239–243.

Yao, A. C. (1982). Protocols for secure computations. In Proc. of the 23rd Symposium on Foundations of Computer Science, pages 160–164.

Yao, A. C. (1986). How to generate and exchange secrets. In Proc. of the 27th Symposium on Foundations of Computer Science, pages 162–167.
