Cloud based Privacy Preserving Data Mining with Decision Tree

Echo P. Zhang, Yi-Jun He and Lucas C. K. Hui

Department of Computer Science, The University of Hong Kong, Hong Kong, Hong Kong

Keywords:

Data Mining, Privacy Preserving Data Mining, Cloud Computing, Privacy in Cloud Computing.

Abstract:

Privacy Preserving Data Mining (PPDM) aims at performing data mining among multiple parties, and at the

meantime, no single party suffers the threat of releasing private data to any others. Nowadays, cloud service

becomes more and more popular. However, how to deal with privacy issues of cloud service is still developing.

This paper is one of the ﬁrst researches in cloud server based PPDM. We propose a novel protocol that the

cloud server performs data mining in encrypted databases, and our solution can guarantee the privacy of each

client. This scheme can protect client from malicious users. With aid of a hardware box, the scheme can also

protect clients from untrusted cloud server. Another novel feature of this solution is that it works even when

the database from different parties are overlapping.

1 INTRODUCTION

1.1 Motivation

The privacy preserving data mining (PPDM) problem

has draw a lot of attentions in recent years (Verykios

et al., 2004). Multiple participants intend to collabo-

ratively mine the data from their combined database.

Also, no single participant’s private data will be re-

leased to any other party. In this paper, we focus

on solving one PPDM problem with decision tree

technique. Note that although many previous works

(Goethals et al., 2004), (Kantarcioglu and Clifton,

2004), (Kantarcioglu and Kardes, 2009), (Kantar-

cioglu et al., 2009), (Jha et al., 2005), (Zhan et al.,

2005) did the research on PPDM problem through

various directions, they did not give a concrete def-

inition of PPDM. Here, we start with deﬁning the

PPDM problem with decision tree as follows (for con-

venience, we call it PPDM

DT PC):

Deﬁnition 1: PPDM DT PC.

• There are n parties (n ≥ 2). Call them C

, . . . , C

• C

has a private two-dimensional database DB



i, j,k



, where d

i, j,k

means the cell data at the

position of row j, column k in DB

• A subset of {C

, . . . , C

}, denoted by C

sub

, coop-

erate to construct a decision tree based on their

corresponding databases.

• After the decision tree is built up, C

only gets the

result (i.e. the decision tree DT) without knowing

any data of DB

for any j 6= i.

There have been many researches working on

PPDM problem with decision tree, and they gave out

many feasible solutions to different distributions of

databases, such as horizontally distributed database

(Lindell and Pinkas, 2000) and vertically (Wang et al.,

2004), (Fung et al., 2005), (Du and Zhan, 2002),

(Vaidya and Clifton, 2005), (Fang et al., 2010) dis-

tributed database. Since all previous researches pro-

posed solutions based on the private databases stored

in participants’ personal computers, we collectively

call the above deﬁnition as PC-based PPDM prob-

lem (PPDM

DT PC). Up to our best knowledge, all

PPDM DT PC solutions suffer the same two short-

ness: 1) Low efﬁciency. Most of PPDM

DT PC

solutions requires a large amount of communication

messages, in order to achieve the privacy protection

of each participant; 2) Limitation to non-overlapping

database. There is no existing PC-based PPDM solu-

tion able to deal with overlapping database.

Besides the above problems, all PPDM DT PC

solutions cannot be applied to the modern Cloud com-

puting environment. As cloud computing is becoming

more and more popular nowadays, therefore, we want

to design a system of PPDM utilizing cloud comput-

ing (We call it PPDM

DT Cloud). There is an addi-

tional advantage of using cloud environment to solve

PPDM. The cloud service not only can handle over-

lapping databases from different participants, but also

provide an efﬁcient solution due to communication

messages among different parties can be reduced.

P. Zhang E., He Y. and C. K. Hui L..

Cloud based Privacy Preserving Data Mining with Decision Tree.

DOI: 10.5220/0003996200050014

In Proceedings of the International Conference on Data Technologies and Applications (DATA-2012), pages 5-14

ISBN: 978-989-8565-18-1

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Planting this problem into cloud platform is not

easy. We have to notice another signiﬁcant aspect

of cloud computing – privacy issue. Due to the in-

frastructure of the cloud service, the privacy of cloud

computing has been becoming a hot topic attracting

many researchers attention (Pearson, 2009), (Singh

et al., 2010). Cloud service provides clients a con-

venient environment to utilize services supplied by

cloud server instead of PC. Nowadays, cloud server

has been adopted to more and more applications. It

is not avoidable that clients will enter personal data

when utilizing the service provided by cloud server.

Therefore, it is very important to consider the privacy

service when design the cloud server.

Here, we would like to giveout a version of deﬁni-

tion of PPDM on decision tree in cloud environment.

We call it PPDM DT Cloud. This deﬁnition in cloud

scenario involves an extra party: the cloud server -

CS. CS is usually a group of machines. Databases

are stored in encrypted format in the cloud server to

protect the privacy of clients. The deﬁnition is as fol-

lows:

Deﬁnition 2: PPDM DT Cloud

• There is a cloud server - CS and n parties

, . . . , C

where (n ≥ 2).

• C

has a private two-dimensional database DB

i, j,k

}, where d

i, j,k

means the cell data on the row

j and column k in DB

. The database is stored in

the cloud service provided by CS in the encrypted

format.

• A subset of {C

, . . . , C

}, denoted by C

sub

, decide

to construct a decision tree based on their corre-

sponding databases.

• After the decision tree is built up, CS will inform

each party in C

sub

about the decision tree. Doing

so, C

does not know DB

for any j 6= i.

However, the above solution cannot be achieved

directly, because it is not so easy forCS to build up the

decision tree when all DB

are in different encrypted

formats. More speciﬁcally, each DB

is encrypted by

the unique key of the owner C

. This paper will pro-

pose a novel scheme to solve this problem. Even if

the cloud server is untrusted (in cases of using a pub-

lic cloud), we introduce an easy-to-deploy hardware

to protect the system from the untrusted server. If the

server is trusted (in cases of using a private cloud),

a software implementation is sufﬁcient to achieve the

needed security and privacy.

1.2 Potential Applications

Cloud computing is a proper platform for multiple

users to cooperate on some applications. For exam-

ple, a bank performs a collaborated data mining us-

ing its customer database and the database of another

bank, to judge whether it should approve a debt for

some bank client according to his personal informa-

tion and previous credit situation (like Table 1 & 2).

In the meantime, the bank cannot see the database de-

tails of another bank.

Table 1: Practical example (Bank A).

Cus. Age Edu. Salary Loan Overdue

C1 50 - - 12k No

C2 30 High - 50k -

Sch.

C3 - Univer- 30k - -

city

C4 40 - 40k 100k Yes

C5 - - 10k - -

Table 2: Practical example (Bank B).

Cus. Age Edu. Salaty Loan OD

C1 50 H.S. 20k - -

C2 - - 20k 20k Yes

C3 25 - - 50k No

C4 - Uni. - - -

C5 60 H.S. - 10k No

Another application is in the area of digital foren-

sics. Law enforcing authorities may need to look into

information on many different database to mine crim-

inal behaviours, but at the same time should preserve

privacy.

The cloud server platform provides a feasible and

efﬁcient way to solve these problems. However,

the private information of bank clients and suspected

criminals are usually conﬁdential. As a result, we

must propose a novel solution for cloud-based privacy

preserving data mining, which works on encrypted

conﬁdential information. We will elaborate this novel

scheme in the following sections.

1.3 Our Contributions

1. We propose the ﬁrst solution for PPDM on deci-

sion tree utilizing the cloud server. An additional

advantage of planting the PPDM into the cloud

server platform is much more efﬁcient than PC-

based PPDM.

2. PPDM DT Cloud can also handle the overlap-

ping databases case which has not been resolved

before.

3. We modify the traditional ElGamal cryptographic

algorithm in our scheme, which can protect par-

DATA2012-InternationalConferenceonDataTechnologiesandApplications

ticipants’ privacy from the threat of any malicious

parties.

4. Also, we design an easy-to-deploy hardware

Black Box which is an auxiliary device to unify

the data from various sources, to protect the threat

of an untrusted cloud server.

The rest of the paper is organized as follows: In

the next section, we introduce the related work in the

ﬁeld of privacy preserving data mining. In Section 3,

we explain some preliminary knowledge which is re-

quired for understanding our scheme. Section 4 anal-

yse the security requirements of PPDM DT Cloud.

In Section 5, we depict our novel algorithm to solve

the PPDM DT Cloud using Paillier cryptography

and ElGamal algorithm. Sections 6 presents a brief

security analysis. Section 7 concludes our paper.

2 RELATED WORK

Privacy preserving data mining (PPDM) can be traced

back to Yao’s millionaire problem (Yao, 1986).

PPDM helps participators in collaborating their pri-

vate databases to reach an accountable result without

exploring their own secret data. PPDM can be classi-

ﬁed by different criteria.

Since our work is about building decision tree,

here we focus on previous research works in this

topic.

(Yang et al., 2005) developed a privacy-preserving

frequency mining algorithm based on the homomor-

phic cryptography property, which can be utilized in

naive Bayes classiﬁer, ID-3 trees and association rule

mining. However, this algorithm was restricted by the

efﬁciency, e.g. it only could deal with a small number

of attributes.

(Lindell and Pinkas, 2000) discussed the ID-3 de-

cision tree on the horizontally distributed database,

while (Wang et al., 2004); (Du and Zhan, 2002);

(Vaidya and Clifton, 2005) and (Fang et al., 2010) dis-

cussed algorithms for vertically distributed database.

In particular, (Wang et al., 2004) and (Du and Zhan,

2002) only considered the situation that both parties

know the class attributes. Jaideep et al. conquered

this limitation in (Vaidya and Clifton, 2005) by build-

ing an ID-3 decision tree. (Du and Zhan, 2002) could

not deal with the situation if one party lies about its

input.

Since the rapid development of cloud computing,

the privacy issue just begins to draw attentions of

cloud service providers and users. It is unavoidable

that applications involving the private data must be

used in cloud service. As a result, designing a secure

cloud environment which protects privacy becomes

more signiﬁcant. Detailed reasons of why privacy

is so important for cloud computing can be found in

(Pearson, 2009).

In 2010, Singh et al. gave out a cryptography

based PPDM solution for cloud computing (Singh

et al., 2010), which mainly addressed the clustering

problem with k-NN classiﬁer. They extended the Jac-

card measure to test the equality of two encrypted

items. In this protocol, it needed a semi-honest third

party who can access the decrypted data and this as-

sumption allowed the threat to the privacy of users.

Up to our best knowledge there are no previous work

in cloud computing environment that builds decision

tree.

3 PRELIMINARY KNOWLEDGE

3.1 ID-3 Algorithm

Decision tree is one important traditional model for

data mining. ID-3 (Quinlan, 1986) is one data mining

algorithm for building the decision tree.

There are two important issues in such algorithms

based on possibility statistic: one is the best split point

and the other one is stopping criteria (Bhatnagar et al.,

2010).

For the former one, the ID-3 algorithm chooses

the attribute with the maximum Gain value to split.

To calculate the Gain, ﬁrstly the information entropy

(E(S)) of the dataset S is calculated as follows:

E(S) = −

∑

i=1

(i)log

(i),

where n is the number of different values of attributes

in S; f

(i) is the frequency of the value i in the dataset

After that, the Gain (G(S, A)) is calculated as fol-

lows:

G(S, A) = E(S) −

∑

i=1

)E(S

where G(S, A) denotes the gain of the dataset S after a

split over the attribute A and A

is ith possible value of

A. m is the number of different values of the attribute

A in S. f

) is the frequency of the A

in S. S

is a

subset of S containing all items where the value of A

is A

For the latter one, there are three main criteria

for stopping growing the tree: a) The tree has al-

ready reached the maximum depth; b) there has al-

ready been minimum number of data points for one

branch; c) all of the points share the same label.

CloudbasedPrivacyPreservingDataMiningwithDecisionTree

3.2 Homomorphic Encryption and

Paillier Cryptography

In short, homomorphic encryption can satisfy the de-

mand that one kind of algebraic calculation on cipher

text will be reﬂected as another kind of algebraic cal-

culation on plain text (Fontaine and Galand, 2007).

This concept was ﬁrstly proposed by R.L. Rivest et

al. in (Rivest et al., 1978), and in 2009 Craig Gentry

(Gentry, 2009) published the ﬁrst fully homomorphic

encryption scheme using lattice-based cryptography.

The homomorphic encryption property is usually

presented as:

E(a)

E(b) = E(a

b). (1)

In our paper, we choose Paillier cryptography

(Paillier, 1999) which is a probabilistic asymmetric

algorithm for public key cryptography.

The biggest advantage of Paillier cryptography is

that we can get different cipher text from the same

plain text, with different random numbers used in

the encryption process. This advantage can help our

scheme to protect the client from eavesdropping.

4 REQUIREMENTS OF

PPDM DT Cloud

Before we describe the details of PPDM DT Cloud

scheme, we ﬁrstly need to investigate requirements

for solving this problem.

In PPDM DT Cloud, we concern that only autho-

rized user can have access to the data. Even if unau-

thorized user access to the data, he will get the wrong

one.

Here, we try to design a scheme to protect ac-

tive attacks from the cloud server, as well as a ma-

licious client. Now we list the structural components

in PPDM DT Cloud:

• Central Cloud Server (CCS): There is one CCS

which acts as the interface between cloud service

users and other cloud services.

• Distributed Storage Server (DSS): There are

many DSS which provide the storage space.

• Distributed Computation Server (DCS): There

are many DCS which carry out the computing

task.

• Cloud Service Users (Client): the users who

communicate with CCS to indirectly get services

from different DSS and DCS.

Note that CCS, DCS and DSS together

is the Cloud Server (CS) in the deﬁnition

PPDM DT Cloud. Also, DSS and DCS are

conceptual entities. In practice, they can be the same

machine.

We have to design a system which can satisfy the

following criteria:

1. Client Security: In privatecloud, the cloud server

can be trusted but the user may suffer the attack

from other malicious users.

2. Strong Client Security: In a public cloud en-

vironment, there is no guarantee that the cloud

server or other users of cloud service are hon-

est. We have to assume the cloud server is un-

trusted. Therefore, all of data which is stored in

the cloud server has to be encrypted, and only the

data owner can decrypt them. Note that it is s

stronger security requirement than the ﬁrst crite-

ria (Client Security).

3. Service Availability: The system has to make

sure that the owner of the data can access / search /

proceed any function on his own data at any time.

4. Correctness: The system also need to guarantee

the correctness of the data which is stored in the

cloud server.

5. PPDM Feasibility: This requirement involves

two mainly aspects: 1) satisfy the general require-

ment on PPDM, which means that any participant

will have no idea about the other parties’ private

data; 2) the data mining result is correct (i.e. gives

out the same output as PPDM DT PC).

It is not easy to design a scheme to satisfy all

above ﬁve criteria simultaneous. For the ﬁrst and sec-

ond requirements, the data is encrypted by its owner

and the owner keeps the secret key. No one else can

access to this secret key.

As for the ﬁfth requirement, the data mining is

implemented on the combined database from multi-

ple clients, which means that databases from different

owners are encrypted by separate secret keys. How-

ever, the necessary but not sufﬁcient condition for

data mining on the encrypted database is that the same

message has to share the same transformed format in

the database.

For example, in Table 1, Customer C1’s age is

50. This cell is also in Bank-2’s database. In the

cloud server, these two cells are kept as E

Bank−1

(50)

and E

Bank−2

(50) which are encrypted by Bank-1 and

Bank-2’s key separately. However, it is impossible for

the cloud server to process data mining on encrypted

data with two different keys. Therefore, before data

mining, the cloud server has to transform databases

DATA2012-InternationalConferenceonDataTechnologiesandApplications

from different clients, into a uniﬁed format. We as-

sume that a transformation function f(·) can achieve

this purpose. After going through f

key

(·) we always

get f

key

Bank−1

(50)) = f

key

Bank−2

(50)). Other-

wise, the system cannot satisfy the third requirement.

Note that the second requirement is in fact a

stronger version of the ﬁrst requirement. Actually, in

our designed scheme, if all steps are implemented in

software, requirements 1, 3, 4, 5 are satisﬁed but re-

quirement 2 is not (i.e. there is an attack involving

a malicious untrusted cloud server). In order for our

scheme to satisfy requirement 2 as well, we need the

help of a Black Box (BB), which is a tamper-resistant

hardware token.

In this paper, for convenience, we describe our

solution PPDM DT Cloud with the hardware Black

Box. In applications (such as in a private cloud) that

requirement 1 is needed instead of requirement 2,

the solution can be trivially modiﬁed by putting all

BB implementation in software and some simple key

structure modiﬁcations. Details will be described in

the beginning of Section 5.

Black Box: Here, we need a tamper-resistant

hardware Black Box to help the system defending

the eavesdropping threat. Inside the BB, there is pre-

stored key pair (a private key and a public key). The

private key can only be utilized by functions inside

the BB. In other words, the private key will never be

leaked out. Note that as all BB contain the same key

structure. In our system, we need one BB is a Client

(we call it a Client-side BB) and the same kind of

BB is also used in a DSS (we call it a DSS-side BB).

Therefore only one kind of BB is needed to be man-

ufactured, and this is a signiﬁcant advantage in pro-

duction. We will elaborate the detailed usage of this

Black Box in Section 5

5 CLOUD-BASED PPDM WITH

UNTRUSTED CLOUD SERVER

We roughly divide the whole process into ﬁve phases

and the overall system is depicted in Algorithm 1:

We consider the public and private cloud environ-

ment individually. If cloud server is untrusted (i.e.

the public cloud case), there is a threat that the cloud

server can get the plain text of data from clients.

To protect clients’ privacy from the untrusted cloud

server, as stated in Section 4, we need help of an ex-

ternal hardware (the BB) to defend this attack.

On the other hand, if the cloud server is trusted

(i.e. the private cloud case), as mentioned in Section

4, strong client security is not needed. We can have a

simple scheme without using the hardware BB. In this

Algorithm 1: PPDM DT Cloud with ElGamal.

KeyGeneration:

1: CCS generates an efﬁcient description of a multi-

plicative cyclic group G of order q with generator

g, which are published to each client of this cloud

service.

2: C

secretly chooses a random s

from

{0, . . . , q− 1}, which is C

’s private ElGa-

mal key. While h

= g

is C

’s semi-public

ElGamal key PubEG

, which is kept from

knowing by CCS.

3: Meanwhile, there is one pair of Paillier key pre-

stored in Client-side and DSS-side Black Box.

Progress:

Phase 1: C

calls Algorithm 2 to encrypt his pri-

vate database DB

and then submits the encrypted

database to CCS.

Phase 2: CCS assigns DB

to different DSS and

each DSS keeps one part of DB

Phase 3.1: Whenever there is a subset of clients

sub

agreed on PPDM, they negotiate on a Group

Key GPkey and send a PPDM request to CCS.

Phase 3.2: CCS forwards the PPDM request to

each DSS. Each DSS, who keeps any part of

database belonging to the member of C

sub

, calls Al-

gorithm 3 to unify the database.

Phase 3.3: According to Algorithm 4, DSS com-

bines the uniﬁed databases from different sources

into one, and sends it back to CCS.

Phase 4.1: CCS further combines input from dif-

ferent DSS according to Algorithm 4.

Phase 4.2: CCS asks different DCS to build up the

Decision Tree DT, and gets the result E

GPkey

(DT)

from DCS.

Phase 5.1: CCS calls Algorithm 5 and sends

GPkey

(DT) back to each member of C

sub

Phase 5.2: All members in C

sub

collectively de-

crypt E

GPkey

(DT) to get the ﬁnal result and check

the correctness of DT.

case, BB can be substituted by software. Also as BB

is not needed, the private key of BB will be replaced

by a private key shared by CCS and all DSS.

For the sake of ease discussion, in this paper

we only present the solution with BB for PPDM in-

volving an untrusted cloud server. In our scheme,

we choose a modiﬁed ElGamal cryptography model

combining with Paillier cryptography as the encryp-

tion cryptography.

In our scheme, we use Black Box (BB) to imple-

ment three algorithms, namely Algorithm 2 in Phase

1, Algorithm 3 in Phase 3, and Algorithm 5 in Phase

4. Any outside party can only call BB to execute these

CloudbasedPrivacyPreservingDataMiningwithDecisionTree

three algorithms.

5.1 Phase 1: Initialization

Before the client submits his private database to CCS,

he needs to encrypt the data since there is a threat of

unauthorized access. For simplicity, we use the nota-

tion in Section 1. d

i, j,k

is the cell data inC

’s database,

at the position of row j, column k.

As mentioned in Section 4, the necessary but not

sufﬁcient condition for PPDM DT Cloud is that the

same message has to be encrypted to the same ci-

pher text. Therefore, we choose the mechanism of

Group Key in our scheme. However, not any crypto-

graphic algorithm can be used with Group Key. Here,

we choose the modiﬁed ElGamal and Paillier cryp-

tography. The encryption and decryption of the mod-

iﬁed ElGamal algorithm are denoted by EG

key

(·) and

key

(·). While those of the Paillier algorithm are de-

noted by P

key

(·) and Q

key

). Besides the pre-stored

Paillier key pair, there is also a pre-stored hash func-

tion H(·) inside BB.

The core part of Phase 1, Algorithm 2, is shown

below.

Algorithm 2: The ﬁrst function in BB.

1: Input: (d

i, j,k

, PubEG

)

2: Utilizing the pre-stored function H(·) to generate

a random number: r

i, j,k

= H(d

i, j,k

) where r

i, j,k

∈

{0, . . . , q− 1}.

3: Encrypting this random number with public Pail-

lier key as the ﬁrst part of cipher text: c

i, j,k

)

4: Calculating the second part of cipher text with

ElGamal cryptography: c

= EG

PubEG

i, j,k

) =

i, j,k

· h

i, j,k

5: Output: E(d

i, j,k

)=(c

, c

)

There are two signiﬁcant points needed to be em-

phasized:

1. The pre-deﬁned function H(·) for generating the

pseudo random number, must have the uniqueness

property. This means that for any m

= m

, we

must have H(m

) = H(m

). H(·) can be a colli-

sion resistant cryptographic hash function.

The purpose of requiring the uniqueness of the

pseudo random function is to make sure that CCS

can successfully process data mining on the com-

bined database from various clients.

2. The Paillier cryptography P(·) implemented by

BB, has the property of a non-unique map-

ping, which means that for any m

= m

there can be P

key

) 6= P

key

). This non-

unique mapping can protect the private data

from unauthorized access. Once the unautho-

rized party accesses the cipher-text of E(d

i, j,k

) =

i, j,k

), EG

PubEG

(d(i, j, k))), he cannot de-

duce d

i, j,k

from P

i, j,k

) even if he also has the

same cell in his database. The property of non-

unique mapping can eliminate this threat.

5.2 Phase 2: Distributed Storage

Mechanism

One service supplied by the CCS is availability, which

means that the client can access to their data even

though some DSS are down. In another word, the

CCS has to keep the backup of clients’ data in dif-

ferent DSS. CCS can save the whole database from

one client into n (n > 2) separate DSS, or divide

the database into several overlapping parts and assign

them into various DSS. No matter which approach

CCS chooses, any eavesdropper could not recognize

the source according to the cipher-text which are sent

back by DSS. After CCS gets the encrypted database

from client, it will distribute the database to one or

more DSS.

5.3 Phase 3: Unifying the Database

5.3.1 Phase 3.1: Agreement on PPDM

When a group of clients C

sub

want to carry out a

PPDM DT Cloud action, they will issue a PPDM

command. Then each member C

needs to send a

command to CCS with PubEG

. CCS will check

the correctness of received commands, then broad-

casts the identity of members in C

sub

to DSS.

5.3.2 Phase 3.2: Unifying Individual Database

Components

A DSS who stores some part of encrypted database

belong to a member in C

sub

will pass the database

component to a DSS-side BB, and returns the output

to CCS.

Before carrying out data mining, the preparation

work is to unify various formats of encrypted data,

with a Group Key negotiated by C

sub

With public ElGamal keys of all M members in

sub

, DSS can construct one partial Group Keys for

each group member. The partial Group Key for C

is:

GPkey

∏

l=1 ∧ l6=i

PubEG

DATA2012-InternationalConferenceonDataTechnologiesandApplications

∏

l=1 ∧ l6=i

= g

∑

l=1 ∧ l6=i

. (2)

In a DSS-side BB, Algorithm 3 is executed.

Firstly, BB checks the correctness of cipher text of

i, j,k

. Then BB combines the public ElGamal key of

with partial Group Key GPkey

as the whole Group

Key GPkey, to unify the data from different sources.

Algorithm 3: The second function in BB.

INPUT: [P

i, j,k

), EG

i, j,k

)], PubEG

, GPkey

1: Decrypt r

i, j,k

with BB’s secret key.

2: Decrypt d

i, j,k

)

PubEG

3: If r

i, j,k

= H(d

i, j,k

) then go to next step. Other-

wise, return an error message to DSS.

4: Encrypt the data d

i, j,k

with GPkey:

GPKey

i, j,k

) = EG

i, j,k

) · (GPkey

)

i, j,k

· (g

∑

l=1

)

i, j,k

OUTPUT: EG

GPkey

i, j,k

))

5.3.3 Phase 3.3: Combine Different Databases

DSS combines partial uniformed database of differ-

ent C

sub

members together. DSS also can deal with

the special case of overlapped parts among different

members’ databases. As a matter of fact, this kind

of situation is quite common. For example, one bank

custom may choose more than one commercial banks

to borrow money. Once one bank wants to decide

whether it should continue to lend the money to this

custom, the bank prefers to make the decision based

on this custom’s previous loan history of all other

banks rather than this bank only.

However, as stated in Section 1, all previous re-

searches did not give out any solution to this special

situation, because the PC-based solution cannot se-

curely solve this problem in an efﬁcient way. The

cloud computing technology provides a feasible plat-

form to resolve this kind of problem. In our scheme,

we utilize the homomorphic property of ElGamal al-

gorithm to make the calculation on cipher text possi-

ble. The exact work which CCS needs to do is shown

in Algorithm 4:

The rough idea of the combination work is as fol-

lows. DSS judges whether there are any records from

different sources sharing the same primary key. If

yes, DSS checks under the same attribute whether

the value from different sources are more than one.

Again, if yes, DSS carry out the calculation according

to the pre-deﬁned rules. If the rule asks for keeping

Algorithm 4: Check ﬂow.

Let PK be the primary key attribute and the set of

value is PK = {PK

, PK

, . . .}

|PK| is the size of PK

for i = 1 to |PK| do

if there are more than one records whose primary

key = PK

then

let A be the set of all attributes

for j = 1 to |A| do

if the rule for A

is keeping the same value

of those records then

keeping the same value

else if the rule for A

is multiplying differ-

ent values of those records then

multiplying those records

end if

end for

end if

end for

the same value or multiplying different items, DSS

can complete the computation on the cipher text di-

rectly.

After the combination, DSS only sends

GPkey

i, j,k

) to CCS. The reason is to save

the communication ﬂow and for the purpose of

security,

5.4 Phase 4: Cloud Decision Tree

Building

In Phase 4.1, CCS will build the decision tree DT

using the ID-3 algorithm. Before CCS carries out

the decision tree building, it ﬁrstly combines the

databases from different DSS. As for the overlapped

parts, CCS also goes through Algorithm 4.

After that, Phase 4.2 is executed. With an en-

crypted database, CCS can easily follows the standard

ID-3 algorithm to construct a decision tree DT from

this encrypted database. In our design, CCS will not

run the ID-3 algorithm on its own. Instead, CCS takes

advantage of the cloud computing environment, to co-

ordinate different DCS to run the ID-3 algorithm in a

distributed manner. Doing so, the running time of ID-

3 will be reduced.

For a large size database, the workload for statis-

tic computation in ID-3 is heavy for a single CCS and

thus is a suitable computational task to be distributed

to different DCS. Many DCS can share workload for

statistic computation in parallel. Meanwhile, CCS

plays the role as a controller, which means that CCS

assigns the computational work to DCS.

As mentioned in Section 3, the major computa-

CloudbasedPrivacyPreservingDataMiningwithDecisionTree

tional part of ID-3 is the statistic computation of at-

tribute frequency and calculating the Gain value.

Here we describe how the calculation of the fre-

quency and the Gain value can be distributed to dif-

ferent DCS.

CCS ﬁrstly groups the database entries according

to the value of classiﬁed attribute. After that CCS

sends the grouped encrypted data of one column to

one DCS. Then each DCS calculates the Gain value

corresponding to that column, and sends this value to

CCS. After collecting all Gain values corresponding

to different columns from different DCS, CCS will se-

lect the maximum one as the current node. CCS and

DCS repeat the above steps until reaching the stop-

ping point of the ID-3 algorithm.

Let us take the Table 1 & 2 as example. After

grouping the database according to the value of clas-

siﬁed attribute Overdue, the order of records should

be C2, C4 in the Yes Group and C1, C3, C5 in the

No Group. Then taking the attribute of Age for

instants again, CCS sends E(30), E(40) in the Yes

Group and E(50), E(25), E(60) in the No Group to

DCS. So do other attributes. Each DCS returns the

Gain value back to CCS, and CCS picks up the at-

tribute with the maximum Gain as the root node. In

this example, this attribute is Education. CCS sends

this column, E(HighSchool(HS)), E(University(U))

in the Yes Group and E(HS), E(U), E(HS) in the

No Group, to other DCS who has the attribute with

non-maximum Gain value. Similarly, DCS calculate

Gain value and CCS choose the maximum one as the

second-level node. CCS and DCS repeat the above

steps until reaching the stopping point.

5.5 Phase 5: Output

5.5.1 Phase 5.1

After CCS and DCS ﬁnish the collaboration work on

decision tree building, CCS has a whole encrypted de-

cision tree. The last step is to send back the result

to each participant C

. CCS calls Algorithm 5 to get

i, j,k

, and send it back with EG

GPkey

i, j,k

) to eachC

CCS uses Algorithm 5, the third function in BB, to

decrypt r

i, j,k

Algorithm 5: The third function inside the Black Box.

1: INPUT: P

i, j,k

)

2: Decrypt r

i, j,k

= Q

i, j,k

))

3: OUTPUT: g

i, j,k

Each participant C

calculates (g

i, j,k

)

and sends

it to each other member in C

sub

. The plain text of

decision tree (all are d

i, j,k

values) can be decrypted:

i, j,k

= DG

GPkey

(EG

GPkey

i, j,k

)) =

i, j,k

·g

i, j,k

∑

l=1

∏

l=1

i, j,k

)

Then each C

gets the whole decision tree which

is built from the combined database.

5.5.2 Phase 5.2

sub

also need to check the correctness of DT and we

use the method stated in (Du and Zhan, 2002) for this

checking process. For simplicity, here we give a sim-

ple walk-through of the checking as follows:

sub

chooses one entry with the same primary key

value in each C

’s database, d

i, j

= {d

i, j,k

|k = 1, 2, . . .} .

i, j

is one single row in C

’s database. Each C

tra-

verses through the DT based on the value of this entry

to create a vector V

= v

, . . . , v

where p is the num-

ber of leaf nodes in DT. If a node is split using C

attribute, C

traverses to all children of the node. If C

reaches a leaf node q, he changes the qth entry of V

to 1. At the end, vector V

records all the leaf nodes

that C

can reach.

Other members of C

sub

create the vector in the

same way. It should be noted that V = V

∧ . . . ∧V

has one and only one non-zero entry (here ∧ is the

logical-and operation) because there is one and only

one leaf node that all C

sub

can reach. Therefore,

V = V

∧ . . . ∧V

should also have one and only one

non-zero entry. Otherwise, C

sub

verify that the DT is

not correct.

6 DISCUSSION OF SECURITY

PROPERTY

The encryption scheme used in PPDM DT Cloud

can ensure that conﬁdentiality, integrity, and authen-

ticity of all the records in the database. For simplicity,

here we only show how our scheme can defend some

interesting attacks. In particular, Attacks 2 and 3 are

related to the Strong Client Security property.

6.1 Attack 1

Throughout the whole scheme, note that r has the

unique relationship with m. If one malicious party

gets the r

i, j1,k1

belong to C

, meanwhile, in C

’s

database, he also contains the same value of r

l, j2,k2

i, j1,k1

, then there is a very high possibility that m

l, j2,k2

is equal to m

i, j1,k1

. This is an attack that we want to

avoid. We choose Paillier cryptographic algorithm,

which has the property that encrypting the same mes-

sage will give a different cipher text, to encrypt r to

avoid this attack.

DATA2012-InternationalConferenceonDataTechnologiesandApplications

6.2 Attack 2

In Algorithm 3, there are two main points which can

defend the attack from untrusted cloud server. One

is using the partial Group Key and the original cipher

text to unify the data, the other one is checking the

correctness of cipher text and random number. For

the ﬁrst point, if BB directly uses the Group Key as

the input parameter, there is a threat that DSS may

replace GPkey with its own key PubEG

DSS

= g

DSS

The details of this attack is listed as follows:

1. DSS sends: [P

i, j,k

), EG

i, j,k

)], PubEG

PubEG

DSS

to BB.

2. DSS calls Algorithm 3:

• Decrypt r

i, j,k

with BB’s secret key.

• Decrypt d

i, j,k

)

i, j,k

• Encrypt the data with the Group Key:

DSS

i, j,k

) = d

i, j,k

· (g

DSS

)

i, j,k

3. DSS gets EG

DSS

i, j,k

) from BB.

4. DSS gets g

i, j,k

with the Algorithm 5 from BB.

5. Now, DSS can decrypt d

i, j,k

from EG

DSS

i, j,k

))

with his private key s

DSS

To avoid this attack from untrusted DSS, the Al-

gorithm 3 could not take GPkey as the parameter di-

rectly. Since DSS wants to get d

i, j,k

which means

that it cannot change EG

i, j,k

, otherwise, it will

get the wrongly decrypted d

i, j,k

. EG

i, j,k

) is cal-

culated from g

, meanwhile, g

is one of component

of GPkey. Therefore, we utilize this point to unify

the data, EG

GPKey

i, j,k

) = EG

i, j,k

) · (GPkey

)

By doing this, DSS cannot attack the privacy of data

through sending a wrong parameter to BB.

6.3 Attack 3

As mention in the above, the other one signiﬁcant

point of Algorithm 3 is checking the correctness of

i, j,k

) and r

i, j,k

. Without the checking, DSS can

execute the following attack:

1. DSS sends [P

i, j,k

)

], PubEG

2. DSS calls Algorithm 3:

• Decrypt r

i, j,k

with BB’s secret key.

• Decrypt r

i, j,k

with BB’s secret key to get r

i, j,k

• Encrypt the data with the Group Key:

PubEG

(

i, j,k

) =

)

i, j,k

·(PubEG

)

i, j,k

)

i, j,k

·(g

)

i, j,k

3. DSS gets the output of

i, j,k

from BB.

Hence, DSS gets the information of

i, j,k

. This

is also an attack which we want to avoid. There-

fore, before unifying d

i, j,k

, BB has to check the cor-

rectness of parameters by comparing the value of

i, j,k

)) with the hash value of

i, j,k

)

(PubEG

)

i, j,k

By doing this, we can prevent DSS to modify the ci-

pher text or random number.

7 CONCLUSIONS

In this paper, we designed PPDM DT Cloud, a

Cloud based PPDM solution, which can defend

against the malicious participant. With the aid of a

Black Box, our scheme even can protect clients’ pri-

vacy from the untrusted cloud server as well.

Our scheme allows any subset of clients to carry

out PPDM action. We choose the mechanism of

Group Key to unify the cipher text encrypted by vari-

ous clients separately, and through the whole process,

each client share the same rights that no one has privi-

lege to other clients. To make the Group Key feasible,

we utilize the modiﬁed ElGamal and Paillier crypto-

graphic algorithm as our encryption algorithms. El-

Gamal is used to keep the uniqueness of cipher text

which is the necessary condition for PPDM, while

Paillier is used to defend the attack from malicious

participant as well as the untrusted server. This so-

lution works in a private cloud. In the public cloud

environment, we need the help of the hardware Black

Box (BB) to protect the user from untrusted cloud.

BB keeps never leaks out the Paillier key pair. Be-

sides this, we design three functions which are imple-

mented inside BB. The ﬁrst two carry out the encryp-

tion action. As for the third one, BB does not give out

the decrypted data directly, it gives g

i, j,k

instead of

i, j,k

. This design can avoid the malicious participant

or untrusted cloud server unauthorised decrypting the

data.

This PPDM DT Cloud solution is also the ﬁrst

PPDM solution that can take care of overlapping

databases. So this solution is powerful, efﬁcient, and

easy to deploy. We believe that it can be applied to

many practical situations.

REFERENCES

Bhatnagar, V., Zaman, E., Rajpal, Y., and Bhardwaj, M.

(2010). Vistree: Generic decision tree inducer and

visualizer. In DATABASES IN NETWORKED INFOR-

MATION SYSTEMS. Springer-Verlag.

CloudbasedPrivacyPreservingDataMiningwithDecisionTree

Du, W. and Zhan, Z. (2002). Building decision tree classi-

ﬁer on private data. In Proceedings of the IEEE ICDM

Workshop on Privacy.

Fang, W., Yang, B., and Song, D. (2010). Preserving pri-

vate knowledge in decision tree learning. In Journal

of Computers. ACADEMY PUBLISHER.

Fontaine, C. and Galand, F. (2007). A survey of homo-

morphic encryption for nonspecialists. In EURASIP

Journal on Information Security. Hindawi Publishing

Corp. New York, NY, United States.

Fung, B. C. M., Wang, K., and Yu, P. S. (2005). Top-down

specialization for information and privacy preserva-

tion. In ICDE 2005.

Gentry, C. (2009). Fully homomorphic encryption using

ideal lattices. In STOC '09 Proceedings of the 41st an-

nual ACM symposium on Theory of computing. ACM

New York, NY, USA.

Goethals, B., Laur, S., Lipmaa, H., and Mielikainen, T.

(2004). On private scalar product computation for

privacy-preserving data mining. In Information Secu-

rity and Cryptology ICISC 2004, volume 3506/2005.

Springer-Verlag.

Jha, S., Kruger, L., and McDaniel, P. (2005). Privacy pre-

serving clustering. In COMPUTER SECURITY ES-

ORICS 2005. Springer-Verlag.

Kantarcioglu, M. and Clifton, C. (2004). Privacy-

preserving distributed mining of association rules on

horizontally partitioned data. In Knowledge and Data

Engineering.

Kantarcioglu, M. and Kardes, O. (2009). Privacy-

preserving data mining in the malicious model, vol-

ume 2. Springer-Verlag.

Kantarcioglu, M., Nix, R., and Vaidya, J. (2009). An efﬁ-

cient approximate protocol for privacy-preserving as-

sociation rule mining. In ADVANCES IN KNOWL-

EDGE DISCOVERY AND DATA MINING.

Lindell, Y. and Pinkas, B. (2000). Privacy preserving data

mining. In ADVANCES IN CRYPTOLOGY CRYPTO

2000. Springer-Verlag.

Paillier, P. (1999). Public-key cryptosystems based on com-

posite degree residuosity classes. In Prof. of the EU-

ROCRYPT'99. Springer-Verlag.

Pearson, S. (2009). Taking account of privacy when design-

ing cloud computing services. In CLOUD '09. ICSE

Workshop.

Quinlan, J. R. (1986). MACHINE LEARNING. Springer-

Verlag.

Rivest, R. L. L., Adleman, M., and Dertouzos, M. L. (1978).

On data banks and privacy homomorphisms. In

In Foundations of Secure Computation. ACADEMY

PUBLISHER.

Singh, M. D., Krishna, P. R., and Saxena, A. (2010). A cryp-

tography based privacy preserving solution to mine

cloud data. In COMPUTE '10 Proceedings of the

Third Annual ACM Bangalore Conference. ACM New

York, NY, USA.

Vaidya, J. and Clifton, C. (2005). Privacy-preserving deci-

sion trees over vertically partitioned data. In DATA

AND APPLICATIONS SECURITY XIX. Springer-

Verlag.

Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P.,

Saygin, Y., and Theodoridis, Y. (2004). State-of-the-

art in privacy preserving data mining. In ACM SIG-

MOD Record, volume Vol. 33, No. 1. ACM New York,

NY, USA.

Wang, K., Yu, P. S., and Chakraborty, S. (2004). Bottom-

up generalization: a data mining solution to privacy

protection. In ICDM'04.

Yang, Z., Zhong, S., and Wright, R. N. (2005). Privacy-

preserving classiﬁcation of customer data without loss

of accuracy. In Proceedings of the 5th SIAM Interna-

tional Conference on Data Mining.

Yao, A. (1986). How to generate and exchange secrets. In

Proceedings of the IEEE 27th Annual Symposium on

Foundations of Computer Science.

Zhan, J., Matwin, S., and Chang, L. (2005). Privacy-

preserving collaborative association rule mining. In

19th Annual IFIP WG 11.3 Working Conference on

Data and Applications Security University of Con-

necticut. Springer-Verlag.

DATA2012-InternationalConferenceonDataTechnologiesandApplications