PRIVACY PRESERVING k-MEANS CLUSTERING IN
MULTI-PARTY ENVIRONMENT
Saeed Samet, Ali Miri
School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5
Luis Orozco-Barbosa
Instituto de Investigacion en Informatica, Universidad de Castilla-La Mancha, 02071 Albacete, Spain
Keywords:
Data mining, Clustering, classification, and association rules, Mining methods and algorithms, Security and
Privacy Protection, Distributed data structures.
Abstract:
Extracting meaningful and valuable knowledge from databases is often done by various data mining algorithms. Databases are now frequently distributed among two or more parties for reasons such as physical and geographical restrictions, and in this setting privacy is the most important issue. Related data is normally maintained by more than one organization, each of which wants to keep its individual information private. Privacy-preserving techniques and protocols are therefore designed to perform data mining in distributed environments where privacy is a major concern. Cluster analysis is a data mining technique by which data can be divided into meaningful clusters, and it has an important role in fields such as bio-informatics, marketing, machine learning, climate and medicine. k-means clustering is a prominent algorithm in this category, which creates a one-level clustering of data. In this paper we introduce privacy-preserving protocols for this algorithm, along with a protocol for secure comparison, known as the Millionaires' Problem, as a sub-protocol, to handle the clustering of horizontally or vertically partitioned data among two or more parties.
1 INTRODUCTION
Clustering algorithms have been widely applied in several areas, such as bio-informatics, marketing and medicine. In many of these applications sensitive data is retrieved and stored by different organizations, and in most cases privacy cannot be compromised. Distribution of data could be horizontal, i.e. each party owns some tuples of data, or vertical, i.e. each party owns some attributes of data. Privacy-preserving protocols are needed in these situations. The k-means clustering algorithm is a simple and relatively efficient way to cluster data using artificial attributes. The standard algorithm for this technique has to be modified such that the involved parties can jointly and securely produce k clusters and assign each data entity to the closest one. This paper makes the following contributions in this area of research:
1. A protocol for k-means clustering when data is horizontally partitioned among two or more parties, maintaining the privacy of each party.
2. A new technique for secure comparison.
3. A new protocol for the vertically partitioned case.
The rest of this paper is organized as follows: Section 2 is dedicated to a definition of k-means clustering and some related work. In Section 3, a protocol for horizontally partitioned data among multiple parties is introduced. In Section 4, a simple and efficient protocol for secure comparison is presented, which is used in the protocol for the vertically partitioned case. A protocol for the vertically partitioned case is described in Section 5, followed by conclusions and future work in Section 6.
2 CLUSTERING AND RELATED
WORK
Privacy issues in data mining techniques have been
widely studied and examined. Different protocols
have been presented for standard algorithms such as
decision trees, association rules, and clustering. In
this paper, we focus on the last of these. Therefore, we first
Samet S., Miri A. and Orozco-Barbosa L. (2007).
PRIVACY PRESERVING k-MEANS CLUSTERING IN MULTI-PARTY ENVIRONMENT.
In Proceedings of the Second International Conference on Security and Cryptography, pages 381-385
DOI: 10.5220/0002121703810385
Copyright © SciTePress
explain the clustering problem and the standard k-means algorithm. Different clustering algorithms exist, chosen according to the underlying application and type of data. Each has strengths and weaknesses. Partitional, hierarchical (nested), and fuzzy clustering are examples of existing approaches. This paper deals with k-means clustering in the partitional case. In this technique, k artificial entities are first produced as the initial means. Then, each data entity (record or row) is assigned to the closest mean. In the next step, the centroids are updated based on the entities in each cluster. The last two steps are repeated until the means remain unchanged or the difference between every new center and its corresponding previous value is less than a specific threshold. Algorithm 1 (Duda et al., 2000) shows the complete algorithm for k-means clustering.

Algorithm 1 k-means Clustering Algorithm.
1. Determine k entities as the initial means
2. repeat
3.    Assign each data entity to the closest mean
4.    Reconstruct the mean of each cluster
5. until means do not change

The distance function in the k-means clustering algorithm could be a common distance metric such as Euclidean, Manhattan or Minkowski. Here we compute the distance between two m-dimensional vectors $X$ and $Y$ by:
$$\sum_{i=1}^{m} (x_i - y_i)^2$$
where $x_i$ and $y_i$ are the i-th elements of the vectors $X$ and $Y$ respectively. Also, the centroid $\mu$ of a cluster containing $\{X_1, \cdots, X_m\}$ is
$$\mu = \frac{X_1 + \cdots + X_m}{m}.$$
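As a point of reference, Algorithm 1 and the two formulas above can be sketched in plain, non-private Python. This is a minimal illustration rather than the authors' implementation: the function names, the `seed` and `max_iter` parameters, and the choice to draw the initial means from the data itself (the text speaks of artificial entities) are assumptions made for the example.

```python
import random


def squared_distance(x, y):
    """Sum of squared component differences between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(column) / count for column in zip(*vectors))


def k_means(data, k, max_iter=100, seed=0):
    """Plain (non-private) k-means following the steps of Algorithm 1."""
    rng = random.Random(seed)
    # Step 1 (assumption): pick k data points as the initial means.
    means = [tuple(p) for p in rng.sample(data, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign each entity to the closest mean.
        clusters = [[] for _ in range(k)]
        for point in data:
            j = min(range(k), key=lambda c: squared_distance(point, means[c]))
            clusters[j].append(point)
        # Step 4: recompute each mean; an empty cluster keeps its old mean.
        new_means = [centroid(c) if c else means[j] for j, c in enumerate(clusters)]
        # Step 5: stop when the means do not change.
        if new_means == means:
            break
        means = new_means
    return means, clusters
```

The privacy-preserving protocols in the following sections keep these steps and replace only the parts that would otherwise require one party to see another party's data.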
There are two main approaches to maintaining privacy. The first uses data transformation and perturbation, while the second applies Secure Multi-party Computation (SMC) techniques. Some protocols have been presented for the former, such as (Oliveira and Zaiane, 2003; Merugu and Ghosh, 2003), but in this paper we consider the second approach. Jha et al. (Jha et al., 2005) present a protocol for horizontally partitioned data between two parties. They introduce two secure techniques for this case: one uses the Oblivious Polynomial Evaluation (OPE) protocol (Naor and Pinkas, 2001), and the second uses homomorphic encryption, but no strong proof of security is provided. In both techniques, one party selects and uses a random private number. However, the second party, by using two values received from the first party and computing their common divisors, is able to considerably reduce the number of possible private shares of the first party. Also, these techniques apply only to the two-party case. Vaidya and Clifton (Vaidya and Clifton, 2003) worked on the vertically partitioned case in the multi-party environment. They use Yao's Secure Circuit Evaluation protocol (Yao, 1986) to secure the add-and-compare function, and the permutation algorithm developed by Du and Atallah (Du and Atallah, 2001) using homomorphic encryption. However, their protocol requires three non-colluding sites and is not applicable to two parties. The use of k-means clustering over arbitrarily partitioned data was introduced by Jagannathan and Wright (Jagannathan and Wright, 2005), but it only works for two parties and cannot be extended to multiple parties. Jagannathan et al. (Jagannathan et al., 2006) present another algorithm for horizontally partitioned data between two parties. This technique does not reveal intermediate information and is I/O efficient. They use a "Divide, Conquer and Combine" model and recursively create k cluster centers for each half of the current data and merge them into k means.
3 PRIVACY-PRESERVING
ALGORITHM FOR
HORIZONTALLY
PARTITIONED DATA
In this section, we present a protocol for k-means clustering over horizontally distributed data in which the privacy of each party is preserved. For a database $D$, suppose each party $P_i$ ($1 \le i \le n$) owns a subset $D_i$ of $D$ containing some entities, such that $D_i \cap D_j = \emptyset$ for any $1 \le i, j \le n$ with $i \ne j$, and $\bigcup_{1 \le i \le n} D_i = D$. Now, these parties want to jointly cluster their records without revealing their individual information. After selecting the initial k means, each party computes the distance from its entities to the centroids and assigns each entity to the closest one. This step can be done separately, because each entity belongs entirely to one party. The next step in each iteration is recomputing the k means based on the new clusters. This computation should be done jointly by all parties. To find the j-th mean, $\mu_j$ ($1 \le j \le k$), all vectors in the j-th cluster are involved. Suppose $l_{ji}$ is the sum of all vectors held by party $P_i$ that belong to the j-th cluster, and $r_{ji}$ is the number of these vectors. Then the new $\mu_j$ is:
SECRYPT 2007 - International Conference on Security and Cryptography
$$\mu_j = \frac{\sum_{i=1}^{n} l_{ji}}{\sum_{i=1}^{n} r_{ji}}.$$
However, they cannot simply send this information to each other or to a third party because of privacy concerns. We present a multi-party protocol $P$ for computing each $\mu_j$.
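To make the quantities $l_{ji}$ and $r_{ji}$ concrete, the sketch below computes them locally for each party and then combines them in the clear. It is a non-private reference only: the combining step is exactly what the protocol $P$ is meant to replace, and the function names and data layout are assumptions made for this illustration.

```python
def local_sums_and_counts(entities, means):
    """One party's view: per-cluster vector sums (l_j) and counts (r_j)."""
    k, dim = len(means), len(means[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for e in entities:
        # Local assignment needs no cooperation: the party owns the whole row.
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(e, means[c])))
        counts[j] += 1
        sums[j] = [s + a for s, a in zip(sums[j], e)]
    return sums, counts


def combine_in_the_clear(per_party, k):
    """Non-private reference: mu_j = (sum_i l_ji) / (sum_i r_ji).

    Assumes every cluster received at least one entity overall.
    """
    new_means = []
    for j in range(k):
        total_count = sum(counts[j] for _, counts in per_party)
        dim = len(per_party[0][0][j])
        total = [sum(sums[j][d] for sums, _ in per_party) for d in range(dim)]
        new_means.append(tuple(t / total_count for t in total))
    return new_means
```

For example, with current means `[(0.0, 0.0), (10.0, 10.0)]`, one party holding `(0.0, 0.0)` and `(10.0, 10.0)` and another holding `(0.0, 2.0)` and `(10.0, 12.0)` jointly obtain the new means `(0.0, 1.0)` and `(10.0, 11.0)`.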
3.1 Secure Multi-party Division
There are n parties, each of which has two values $x_i$ and $y_i$, and they want to securely compute:
$$\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i} \qquad (1)$$
First, by using secure multi-party addition, they separately compute multiplicative shares $r_i$ and $s_i$ such that:
$$\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} r_i, \qquad \sum_{i=1}^{n} y_i = \prod_{i=1}^{n} s_i$$
Then one party, say $P_1$, receives $t_i = \frac{r_i}{s_i}$ ($2 \le i \le n$) from the other parties, computes $\prod_{i=1}^{n} t_i$, which is equal to expression (1), and sends the result to the other parties. The authors of this paper present a solution for secure multi-party addition in (Samet and Miri, 2006), and a generalization of two-party addition to the multi-party case is introduced in (Xiao et al., 2005). Here, we briefly explain these two techniques.
3.1.1 Secure Multi-party Addition
Suppose n parties, each of which has a value $x_i$, want to run a protocol at the end of which each party obtains its own private output share $r_i$ such that:
$$\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} r_i \qquad (2)$$
without revealing the values $x_i$ and $r_i$ to each other. The base algorithm is applied to two parties. Therefore, we first present the protocol for $x_1 + x_2 = r_1 r_2$.
1. $P_1$ randomly selects $r_1 \ne 0$ and creates the vector $X_1 = \left(\frac{x_1}{r_1}, \frac{1}{r_1}\right)$.
2. $P_2$ creates the vector $X_2 = (1, x_2)$.
3. $P_1$ and $P_2$ run the Secure Dot Product (SDP) protocol, and $P_2$ obtains the result of the dot product, $r_2$:
$$r_2 = X_1 \cdot X_2 = \left(\frac{x_1}{r_1}, \frac{1}{r_1}\right) \cdot (1, x_2) = \frac{x_1 + x_2}{r_1} \;\Rightarrow\; x_1 + x_2 = r_1 r_2$$
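The arithmetic of the two-party step can be checked with a short sketch. The function `secure_dot_product_stub` is a hypothetical stand-in that simply computes the dot product in the clear, so the example is not secure; it only shows that the two output shares multiply to $x_1 + x_2$. Exact rationals are used to avoid rounding, and the range from which $r_1$ is drawn is an arbitrary assumption.

```python
from fractions import Fraction
import random


def secure_dot_product_stub(vec_p1, vec_p2):
    """Stand-in for SDP (NOT secure): returns the value P2 would learn."""
    return sum(a * b for a, b in zip(vec_p1, vec_p2))


def two_party_addition(x1, x2, rng):
    """Return shares (r1, r2) with x1 + x2 == r1 * r2."""
    # P1 picks a random nonzero r1 (the range is an assumption).
    r1 = Fraction(rng.randint(1, 50) * rng.choice([-1, 1]))
    vec_p1 = (Fraction(x1) / r1, 1 / r1)   # X1 = (x1/r1, 1/r1)
    vec_p2 = (Fraction(1), Fraction(x2))   # X2 = (1, x2)
    r2 = secure_dot_product_stub(vec_p1, vec_p2)  # P2's output share
    return r1, r2
```

Calling `two_party_addition(7, 5, random.Random(1))` yields a pair whose product is 12, whatever value of $r_1$ happens to be drawn.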
Now suppose there are three parties $P_1$, $P_2$, and $P_3$.
1. $P_3$ randomly divides its value $x_3$ into $x_{31}$ and $x_{32}$ such that $x_3 = x_{31} + x_{32}$, and selects a random value $r_3$.
2. $P_3$ and $P_1$ run the previous protocol for their inputs $x_{31}$ and $x_1$ respectively. $P_1$ obtains $s_1$ such that $x_{31} + x_1 = r_3 s_1$.
3. $P_3$ and $P_2$ do the same for their inputs $x_{32}$ and $x_2$. $P_2$ obtains $s_2$ such that $x_{32} + x_2 = r_3 s_2$.
4. $P_1$ and $P_2$ run the previous protocol for their inputs $s_1$ and $s_2$ respectively, and obtain $r_1$ and $r_2$ such that $s_1 + s_2 = r_1 r_2$. Now we have:
$$x_1 + x_2 + x_3 = (s_1 + s_2)\, r_3 = r_1 r_2 r_3.$$
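The three-party construction above can be traced in the same style. As before, the dot product is computed in the clear as a placeholder for SDP, so this is a check of the algebra and not a secure implementation; the helper name `share_sum`, its `first_share` parameter, and the random ranges are assumptions made for the example.

```python
from fractions import Fraction
import random


def share_sum(a, b, rng, first_share=None):
    """Two-party step: return (u, v) with a + b == u * v.

    The first party holds a and picks u (or reuses first_share); the second
    party holds b and learns v. The dot product stands in for SDP.
    """
    if first_share is None:
        u = Fraction(rng.randint(1, 50) * rng.choice([-1, 1]))
    else:
        u = Fraction(first_share)
    v = (Fraction(a) / u) * 1 + (1 / u) * Fraction(b)  # (a/u, 1/u) . (1, b)
    return u, v


def three_party_addition(x1, x2, x3, rng):
    """Return (r1, r2, r3) with x1 + x2 + x3 == r1 * r2 * r3."""
    # P3 splits x3 at random and fixes a single nonzero r3 for both runs.
    x31 = Fraction(rng.randint(-100, 100))
    x32 = Fraction(x3) - x31
    r3 = Fraction(rng.randint(1, 50))
    _, s1 = share_sum(x31, x1, rng, first_share=r3)  # x31 + x1 = r3 * s1
    _, s2 = share_sum(x32, x2, rng, first_share=r3)  # x32 + x2 = r3 * s2
    r1, r2 = share_sum(s1, s2, rng)                  # s1 + s2 = r1 * r2
    return r1, r2, r3
```

For inputs 4, 9 and 2 the three returned shares multiply to 15, independently of the random choices.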
Therefore, $r_1$, $r_2$, and $r_3$ as the final output shares satisfy the protocol. This algorithm can be extended in the same way to the multi-party case to generate output shares $r_i$ from inputs $x_i$ such that equation (2) is satisfied. Checking the loop condition of the k-means clustering algorithm, which compares the previous and new means, can be performed publicly because all the parties have the values of the centroids. To show the security of the protocol we have to check the secure multi-party division. Due to limited space, we consider two parties. The proof of the multi-party case is the same.
Theorem 3.1 The protocol $P$ for jointly computing $\frac{x+y}{m+n}$, such that $(x, m)$ belongs to $P_1$ and $(y, n)$ belongs to $P_2$, is secure, i.e. the privacy of the input pair for each party is preserved.
Proof 1 At the end of the protocol $P$, $P_1$ and $P_2$ have the following information:
$$I_{P_1}(x, m) = \left(x, m, r_1, s_1, \frac{r_2}{s_2}\right), \qquad I_{P_2}(y, n) = \left(y, n, r_2, s_2, \frac{r_1}{s_1}\right)$$
such that $\frac{r_1 r_2}{s_1 s_2} = \frac{x+y}{m+n}$. As we see, both parties are in the same situation at the end of the protocol with regard to the information they obtain. Thus, it is enough to prove the security of one party, say $P_2$. First of all, there is no dependency between the values of $r_2$ and $s_2$, because $r_2$ is $P_2$'s output share for the secure addition of $x$ and $y$, and $s_2$ is $P_2$'s output share for the secure addition of $m$ and $n$. Also, the only information that $P_1$ receives from $P_2$ is the ratio of $r_2$ to $s_2$, $\frac{r_2}{s_2}$. For any given value $t_2 = \frac{r_2}{s_2}$ from party $P_2$, there exist several possible pairs $(r_2, s_2)$ with the same value of $t_2$ that lead to the same final result of $\frac{x+y}{m+n}$. Therefore, $P_2$ is information-theoretically secure (and the same situation happens for $P_1$). In addition, the advantage of an adversary in finding $P_2$'s private shares $r_2$ and $s_2$ is the same as randomly guessing among all the possible pairs $(\acute{r}_2, \acute{s}_2)$ such that $\frac{\acute{r}_2}{\acute{s}_2} = \frac{r_2}{s_2}$.
A security analysis of SDP can be found in (Malek and Miri, 2006).
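The identity used in the proof can be checked numerically for the two-party case of Theorem 3.1. In this sketch the shares are produced directly from the relation $a + b = u v$ instead of through SDP, and the fixed values chosen for $r_1$ and $s_1$ are arbitrary assumptions; the point is only that $P_1$ recovers $\frac{x+y}{m+n}$ from its own shares and the single ratio $t_2$.

```python
from fractions import Fraction


def multiplicative_shares(a, b, first_share):
    """Return (u, v) with a + b == u * v, where u is the first party's choice.

    v is the value the second party would learn from SDP (simulated here).
    """
    u = Fraction(first_share)
    return u, (Fraction(a) + Fraction(b)) / u


def two_party_division(x, m, y, n, r1=3, s1=4):
    """P1 holds (x, m), P2 holds (y, n); returns ((x+y)/(m+n), t2)."""
    r1, r2 = multiplicative_shares(x, y, r1)  # x + y = r1 * r2
    s1, s2 = multiplicative_shares(m, n, s1)  # m + n = s1 * s2
    t2 = r2 / s2                              # the only value P2 sends to P1
    result = (r1 / s1) * t2                   # P1 computes t1 * t2
    return result, t2
```

With `x=10, m=2` for $P_1$ and `y=20, n=3` for $P_2$, the result is 6 and $t_2$ is 8. Many pairs share that ratio, for instance $(10, \frac{5}{4})$ and $(16, 2)$, which is the ambiguity the proof relies on.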
4 A PROTOCOL FOR SECURE
COMPARISON
In the case of vertically partitioned data, we need to securely compare the values owned by two parties while the individual value of each party has to be kept private. In this section we present a new, simple and efficient solution for this problem. Suppose two parties $P_1$ and $P_2$, each of which has an input number, $x_1$ for $P_1$ and $x_2$ for $P_2$, want to compare these numbers in such a way that neither learns the other's input. The only information they will obtain at the end of the protocol is which has the greater value. Yao (Yao, 1982) presents the problem and a solution for it, but it uses a boolean circuit for the comparison operation, which needs a large number of communication rounds and oblivious transfers. There are also other protocols for secure comparison, presented in (Peng et al., 2004) and (Ioannidis and Grama, 2003). We present a simple solution for this problem by using the Secure Two-party Addition protocol. $P_1$ and $P_2$ perform the following steps:
1. $P_1$ randomly selects a nonzero number $l_1$ and sets its vector $X_1 = \left(\frac{1}{l_1}, \frac{x_1}{l_1}\right)$, and $P_2$ sets its vector $X_2 = (-x_2, 1)$.
2. They run SDP and $P_2$ obtains its output $l_2$ such that $x_1 + (-x_2) = x_1 - x_2 = l_1 l_2$.
3. $P_2$ sends the sign of $l_2$ to $P_1$. If $l_2 = 0$, i.e. $x_1 = x_2$, $P_2$ sends a flag indicating that the inputs are equal.
4. $P_1$ checks the following comparisons:
   If $P_1$ receives the flag then $x_1 = x_2$
   If $\mathrm{Sign}(l_1) = \mathrm{Sign}(l_2)$ then $x_1 > x_2$
   If $\mathrm{Sign}(l_1) \ne \mathrm{Sign}(l_2)$ then $x_1 < x_2$
5. $P_1$ sends the result of the comparison to $P_2$.
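The steps above can be sketched as follows. The dot product is again computed in the clear as a placeholder for SDP, so the example only demonstrates the sign logic and offers no privacy. It assumes that $P_2$'s vector is $(-x_2, 1)$, which is what the stated identity $x_1 - x_2 = l_1 l_2$ requires; the function names, return labels and random range are choices made for this illustration.

```python
from fractions import Fraction
import random


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def secure_compare(x1, x2, rng):
    """Return 'greater', 'less' or 'equal' for x1 relative to x2."""
    # Step 1: P1 picks a nonzero secret l1; both parties build their vectors.
    l1 = Fraction(rng.randint(1, 50) * rng.choice([-1, 1]))
    vec_p1 = (1 / l1, Fraction(x1) / l1)       # X1 = (1/l1, x1/l1)
    vec_p2 = (Fraction(-x2), Fraction(1))      # X2 = (-x2, 1)
    # Step 2: SDP stand-in; P2 learns l2 = (x1 - x2) / l1.
    l2 = sum(a * b for a, b in zip(vec_p1, vec_p2))
    # Step 3: P2 sends either the equality flag or only the sign of l2.
    if l2 == 0:
        return "equal"
    sign_l2 = sign(l2)
    # Step 4: equal signs mean l1 * l2 = x1 - x2 is positive.
    return "greater" if sign(l1) == sign_l2 else "less"
```

Because $l_1 l_2 = x_1 - x_2$, the product is positive exactly when the two signs agree, which is why $P_1$ needs nothing more than the sign of $l_2$.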
This protocol is very simple and efficient because of the use of secure addition and SDP, which have linear communication overhead. Also, the parties only exchange the sign of their outputs once. This protocol is secure because at first it uses SDP to produce private outputs for the two parties, and in the next step, $P_1$, by receiving the sign of $P_2$'s output, has no information about $P_2$'s input and output. Also, $P_2$ only receives the final result of the comparison.
5 PRIVACY-PRESERVING
ALGORITHM FOR
VERTICALLY PARTITIONED
DATA
A database is vertically distributed among n parties when each party $P_i$ has the information of some attributes (columns) of all entities in the database. Therefore, in contrast to the horizontal case, finding the means at each iteration of the algorithm can be done separately, because the information for each attribute is maintained by one party and this party can compute the mean value of the corresponding components. The problem is in the step where entities have to be assigned to the closest cluster. Each party has only the information of some attributes, and thus they have to jointly and securely compute the distance of each entity to the current centroids. Suppose there are n parties $P_1$ to $P_n$, each of which has a set of attributes. We denote the set of attributes owned by $P_i$ as $A_i = \{a_{i_1}, a_{i_2}, \cdots, a_{i_m}\}$. For each mean vector $\mu_j$, $P_i$ has the values of the components corresponding to these attributes, $\{\mu_{j_1}, \mu_{j_2}, \cdots, \mu_{j_m}\}$. To compute the distance from one entity to a centroid $\mu_j$, each party can compute its portion first. For instance, $P_i$'s portion is:
$$(a_{i_1} - \mu_{j_1})^2 + (a_{i_2} - \mu_{j_2})^2 + \cdots + (a_{i_m} - \mu_{j_m})^2$$
We denote this value as $d_{ji}$. Thus the distance from an entity to the centroid $\mu_j$ is:
$$d_{j1} + d_{j2} + \cdots + d_{jn}$$
For another centroid $\mu_q$ we have the same formula:
$$d_{q1} + d_{q2} + \cdots + d_{qn}$$
We have to compare these two values to know which mean is closer to the entity. First, each party $P_i$ computes $d_{ji} - d_{qi}$ and denotes it as $d_i$. Then, they use Secure Sum (Clifton et al., 2002) to compute $\sum_{i=1}^{n} d_i$. If the result is negative, $\mu_j$ is closer to that entity; otherwise $\mu_q$ is closer. This step is repeated for the selected mean with the next one until the closest mean is found. In secure sum, if no two parties $P_i$ and $P_{i+2}$ collude with each other, no individual value will be revealed. To prevent this type of attack, parties can do the secure sum in more than one round with random order. The only possible issue in the use of the secure sum can happen in the case of only two parties. Suppose $P_1$ and $P_2$ vertically share a database and for an entity $e$, $P_1$ has $d_{j1}$ and $d_{q1}$ and $P_2$ has $d_{j2}$ and $d_{q2}$ for $\mu_j$ and $\mu_q$ respectively. They have to compare $d_{j1} + d_{j2}$ with $d_{q1} + d_{q2}$. If $(d_{j1} - d_{q1}) + (d_{j2} - d_{q2}) < 0$ then $e$ is closer to $\mu_j$,
otherwise it is closer to $\mu_q$. Thus, $P_1$ and $P_2$ can run the secure comparison protocol presented in Section 4. Their inputs are $d_{j1} - d_{q1}$ for $P_1$, and $d_{q2} - d_{j2}$ for $P_2$. Therefore, they can jointly decide which mean is closer to the entity $e$.
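For three or more parties, the assignment step of this section can be sketched with a ring-style secure sum in the spirit of (Clifton et al., 2002): the first party masks its value with a random number, every party adds its own value to the running total, and the first party removes the mask at the end. The details below are assumptions made for the illustration, including the public constant `MODULUS`, the restriction to integer attribute values, and all function names.

```python
import random

# Assumed public modulus; the true sum must satisfy |sum| < MODULUS // 2.
MODULUS = 2 ** 61 - 1


def portion(attrs, mean_components):
    """One party's share d_ji of the squared distance to a centroid."""
    return sum((a - mu) ** 2 for a, mu in zip(attrs, mean_components))


def secure_sum(values, rng):
    """Ring-style secure sum of one integer per party (sketch)."""
    mask = rng.randrange(MODULUS)             # known only to P1
    running = (mask + values[0]) % MODULUS    # P1 sends this to P2
    for v in values[1:]:
        running = (running + v) % MODULUS     # each P_i forwards the total
    total = (running - mask) % MODULUS        # P1 removes its mask
    # Map the residue back to a signed integer.
    return total - MODULUS if total > MODULUS // 2 else total


def closer_mean(entity_parts, mean_j_parts, mean_q_parts, rng):
    """Return 'j' if the entity is closer to mu_j, otherwise 'q'."""
    # Each party computes d_i = d_ji - d_qi from its own attributes only.
    d = [portion(a, mj) - portion(a, mq)
         for a, mj, mq in zip(entity_parts, mean_j_parts, mean_q_parts)]
    return "j" if secure_sum(d, rng) < 0 else "q"
```

With two parties the masked total would reveal the other party's value to $P_1$, which is why the text switches to the secure comparison protocol of Section 4 in that case.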
6 CONCLUSIONS AND FUTURE
WORK
Clustering is a method to categorize information into meaningful partitions to make data analysis simpler and more accurate. This technique has a wide range of applications in the real world and also serves as a utility for data summarization and compression. In many cases, privacy is crucial and secure protocols are needed to perform clustering while preserving the privacy of the data holders. Two multi-party protocols for privacy-preserving k-means clustering are presented for horizontally and vertically partitioned data, along with a protocol for secure two-party comparison. These SMC techniques are based on secure multi-party addition and division sub-protocols. There are many different clustering algorithms, such as k-means, k-medoid, and agglomerative hierarchical clustering. Most existing work in privacy-preserving clustering uses k-means. One possible extension of this work is to design protocols for other algorithms, particularly hierarchical clustering.
REFERENCES
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., and Zhu, M. Y. (2002). Tools for privacy preserving data mining. SIGKDD Explorations, 4(2):28–34.
Du, W. and Atallah, M. (2001). Privacy-preserving cooperative statistical analysis. In Proc. of the 17th Annual Computer Security Applications Conference, pages 102–110.
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification (2nd ed.). John Wiley.
Ioannidis, I. and Grama, A. (2003). An efficient protocol for Yao's millionaires' problem. In Proc. of the 36th Annual Hawaii International Conference on System Science, pages 205–211.
Jagannathan, G., Pillaipakkamnatt, K., and Wright, R. N. (2006). A new privacy-preserving distributed k-clustering algorithm. In Proc. of the 2006 SIAM International Conference on Data Mining.
Jagannathan, G. and Wright, R. N. (2005). Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proc. of the 11th ACM SIGKDD international conference on Knowledge discovery in data mining, pages 593–599.
Jha, S., Kruger, L., and McDaniel, P. (2005). Privacy preserving clustering. In Proc. of the 10th European Symposium on Research in Computer Security, pages 397–417.
Malek, B. and Miri, A. (2006). Secure dot-product protocol using trace functions. In 2006 IEEE International Symposium on Information Theory.
Merugu, S. and Ghosh, J. (2003). Privacy-preserving distributed clustering using generative models. In Proc. of the 3rd IEEE International Conference on Data Mining, pages 211–218.
Naor, M. and Pinkas, B. (2001). Efficient oblivious transfer protocols. In Proc. of the 12th annual ACM-SIAM symposium on Discrete algorithms, pages 448–457.
Oliveira, S. R. M. and Zaiane, O. R. (2003). Privacy preserving clustering by data transformation. In Proc. of the 18th Brazilian Symposium on Databases, pages 304–318.
Peng, K., Boyd, C., Dawson, E., and Lee, B. (2004). An efficient and verifiable solution to the millionaire problem. In Proc. of the 7th International Conference on Information Security and Cryptology, pages 51–66.
Samet, S. and Miri, A. (2006). Privacy preserving ID3 using Gini Index over horizontally partitioned data. Submitted.
Vaidya, J. and Clifton, C. (2003). Privacy-preserving k-means clustering over vertically partitioned data. In Proc. of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 206–215.
Xiao, M.-J., Huang, L.-S., Luo, Y.-L., and Shen, H. (2005). Privacy preserving ID3 algorithm over horizontally partitioned data. In Parallel and Distributed Computing, Applications and Technologies, pages 239–243.
Yao, A. C. (1982). Protocols for secure computations. In Proc. of the 23rd Symposium on Foundations of Computer Science, pages 160–164.
Yao, A. C. (1986). How to generate and exchange secrets. In Proc. of the 27th Symposium on Foundations of Computer Science, pages 162–167.