TOWARDS EFFICIENT CRYPTOGRAPHY FOR PRIVACY
PRESERVING DATA MINING IN DISTRIBUTED SYSTEMS
Emmanouil Magkos and Vassilis Chrissikopoulos
Department of Informatics, Ionian University, Palaia Anaktora, 49100, Corfu, Greece
Keywords: Data mining, Privacy, Security, Cryptography, Distributed systems.
Abstract: A common fact for both businesses and physical entities is that sensitive, accurate information would be
more easily diffused if adequate measures for protection were in place. This could also lead to higher
quality data mining results, in a privacy preserving manner. Recent research has proved that it is possible to
provide both privacy and accuracy assurances in a distributed computing scenario, where all participants
may be mutually untrusted, without the presence of an unconditionally trusted third party. We believe that
valuable knowledge can be borrowed from the vast body of literature on e-auction and e-voting systems, in
order to be adapted to privacy preserving data mining systems in a distributed environment. These systems
tend to balance well the efficiency and security criteria, because they need to be implementable in medium
to large scale environments.
1 INTRODUCTION
Following the explosion of Information and
Communications Technologies in the last decade,
we are witnessing the advent of the digital era in an
emphatic way: Most businesses and organizations
have transformed their processes and relationships
with their partners and customers into fully
electronic ones, while the convergence of
telecommunication networks has boosted the use of
the Internet for everyday activities. Furthermore,
advances in both software and hardware, combined
with emerging technologies such as Web 2.0, Grid
computing, and Semantic Web, have facilitated the
collection, storage and processing of large volumes
of digital data, at a relatively low cost.
Data Mining (DM) techniques are well known
for extracting valuable, and usually not obvious,
information from large quantities of data (Chen et al,
1996). Data mining has broad applications in areas
related to market research, as well as to financial and
scientific research. Most techniques can be
categorized into two generic classes (Dunham,
2002): In the predictive class, original (i.e., training)
data are processed for handling, describing, or
predicting future data or phenomena. In the
descriptive class, DM techniques aim to
develop a better description for objects in databases.
In this paper we focus on descriptive techniques,
which can also be seen as a necessary layer that
facilitates higher level (e.g., predictive) DM
techniques. For example, classification techniques
aim at classifying objects of a database into a
discrete category, based on the values of some
attributes in a transaction. Association rule mining
statistically processes a set of transactions in order to
extract frequently occurring patterns such
as “customers who buy milk also tend to buy
cookies”. In summarization techniques, the general
idea is to use statistics, in order to better describe
and present a large set of data at an abstraction level,
for further processing (Chen et al, 1996).
Privacy surveys have shown that Web users may
be willing to divulge some personal information, in
exchange for getting something valuable in return
(e.g., personalization and customization services, or
better search results) (Yang et al, 2005).
Additionally, in the corporate environment, while
most organizations normally recognize the
importance of privacy protection (Deloitte, 2007)
(although their stance may be influenced by
factors such as laws and privacy regulations,
conformity with security standards, reputation and
the fear of competition), there are cases where
different organizations may have strong incentives
to share private information in order to extract
valuable knowledge. For example, two medical
institutions would find it useful to selectively pool
Magkos E. and Chrissikopoulos V. (2008).
TOWARDS EFFICIENT CRYPTOGRAPHY FOR PRIVACY PRESERVING DATA MINING IN DISTRIBUTED SYSTEMS.
In Proceedings of the Fourth International Conference on Web Information Systems and Technologies, pages 301-304
DOI: 10.5220/0001531403010304
Copyright © SciTePress
their medical records for doing research in
epidemiology or improving medical diagnosis. Or,
several competing businesses would like to pool
their customer transaction data, and get in return a
research analysis on the market trends.
Unfortunately, this is usually either not possible or
not secure, mainly due to confidentiality concerns
(Lindell and Pinkas, 2000). A common fact for both
businesses and physical entities is that sensitive,
accurate information would be more easily diffused
if adequate measures for protection and security
were in place. This could also lead to higher quality
data mining results, in a privacy preserving manner.
2 SECURITY IN DISTRIBUTED
DATA MINING SYSTEMS
In large intra-organizational environments, data are
usually shared among a number of distributed
databases, for security or practicality reasons, or due
to the organizational structure of the business. Data
can be partitioned either horizontally, where each
database contains a subset of complete transactions
(Lindell and Pinkas, 2000; Kantarcioglu and Clifton,
2004), or vertically, where each database contains
shares of each transaction (Vaidya and Clifton,
2002). The role of a data warehouse is to collect and
transform the dispersed data to an acceptable format,
before they are forwarded to the DM subsystem.
Such a central repository raises privacy concerns,
especially if it is used in an inter-organizational setting
where several entities, mutually untrusted, may
desire to mine their private inputs, both securely and
accurately. Alternatively, data mining can be
performed locally, at each database (or intranet), and
then the sub-results can be combined to extract
knowledge, although this will most likely affect the
quality of the output (Vaidya and Clifton, 2002).
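The two partitioning styles can be illustrated with a small sketch. The toy transaction table, the attribute names and the helper functions `horizontal_partition` and `vertical_partition` are assumptions made for illustration only; they are not taken from the cited works.

```python
# Toy transaction table: each row is one complete transaction (hypothetical data).
transactions = [
    {"id": 1, "milk": 1, "cookies": 1, "bread": 0},
    {"id": 2, "milk": 1, "cookies": 0, "bread": 1},
    {"id": 3, "milk": 0, "cookies": 1, "bread": 1},
    {"id": 4, "milk": 1, "cookies": 1, "bread": 1},
]


def horizontal_partition(rows, n_sites):
    """Each site receives a subset of complete transactions."""
    return [rows[i::n_sites] for i in range(n_sites)]


def vertical_partition(rows, attribute_groups, key="id"):
    """Each site receives some attributes of every transaction, plus the join key."""
    return [
        [{key: r[key], **{a: r[a] for a in group}} for r in rows]
        for group in attribute_groups
    ]


h_sites = horizontal_partition(transactions, 2)
v_sites = vertical_partition(transactions, [["milk"], ["cookies", "bread"]])

print(h_sites[0])     # complete rows, but only some of them
print(v_sites[0][0])  # every row is present, but only with the "milk" attribute
```

In the horizontal case a site can mine its own rows locally; in the vertical case no single site sees a whole transaction, so any pattern spanning attribute groups needs cooperation between sites.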
If a general discussion were to be had about
protecting privacy in distributed databases, we
would point to the literature for access control and
audit policies, authorization and information flow
control (e.g., multilevel and multilateral security
strategies (Anderson, 2001)), security in the
application layer (e.g., database views), and
operating system security, among others. However,
in this paper we assume that appropriate security and
access control exist in the intra-organizational
setting, and we mainly focus on the inter-
organizational setting where a set of mutually
untrusted entities wish to execute a miner on their
private databases. As an alternative layer of
protection, original data can be suitably altered (e.g.
randomized) (Agrawal and Srikant, 2000) or
anonymized before being given as input to a miner, or
queries in statistical databases may be restricted
(Anderson, 2001). The problem with data
perturbation is that in highly distributed
environments, preventing the inference of
unauthorized information by combining authorized
information is not an easy task (Ferrer, 2002).
Furthermore, in most perturbation techniques lies a
tradeoff between protecting privacy of the individual
records and at the same time establishing accuracy
of the DM results (Wang and Zhang, 2007). In the
next section we discuss the important role of
cryptography in privacy preserving data mining.
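The privacy/accuracy trade-off of perturbation can be sketched as follows. This is a minimal illustration using additive Gaussian noise, which is only one possible randomization; the data, the noise level and the helper names `perturb` and `mean_error` are assumptions, not the method of any cited paper.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [rng.randint(18, 90) for _ in range(10_000)]  # hypothetical private attribute


def perturb(values, sigma, rng):
    """Hide each individual value behind zero-mean Gaussian noise."""
    return [v + rng.gauss(0, sigma) for v in values]


def mean_error(sigma):
    """Error that the noise introduces into one aggregate (the mean)."""
    noisy = perturb(ages, sigma, rng)
    return abs(statistics.mean(noisy) - statistics.mean(ages))


noisy_ages = perturb(ages, 10, rng)
per_record_shift = statistics.mean(abs(n - a) for n, a in zip(noisy_ages, ages))

print("average shift of a single record:", round(per_record_shift, 2))
print("error of the aggregate mean:     ", round(mean_error(10), 3))
```

Raising `sigma` hides individual records better but, on average, degrades the mined aggregate; this is the dilemma that the cryptographic approach discussed next tries to avoid.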
3 CRYPTOGRAPHY IN PRIVACY
PRESERVING DATA MINING
At a high abstraction level, the problem of privacy
preserving data mining between mutually untrusted
parties can be reduced to the following problem for a
two-party protocol: Each party owns some private
data and both parties wish to execute a function F on
the union of their data without sacrificing the
privacy of their inputs (Pinkas, 2002). In a DM
environment, for example, the function F could be a
classification function that outputs the class of a set
of transactions with specific attributes, a function
that identifies association rules in partitioned
databases, or a function that outputs aggregate
results over the union of two statistical databases.
In the above distributed computing scenario, an
“ideal” protocol would require a trusted third party
who would accept both inputs and announce the
output. However, the goal of cryptography is to
relax or even eliminate the need for trusted parties.
Contrary to other strategies, crypto mechanisms
usually do not pose dilemmas between the privacy
of the inputs and the accuracy of the output.
In the academic literature for privacy preserving
data mining, following the line of work that began
with Yao (Yao, 1986), most theoretical results are
based on the Secure Multiparty Computation (SMC)
approach (e.g. Lindell and Pinkas, 2000; Vaidya and
Clifton, 2002; Kantarcioglu and Clifton, 2004).
SMC protocols are interactive protocols, run in a
distributed network by a set of entities with private
inputs, who wish to compute a function of their
inputs in a privacy preserving manner. The goal is
that no more information is revealed to an entity in
WEBIST 2008 - International Conference on Web Information Systems and Technologies
the computation than can be inferred from that
participant's input and output (Goldwasser, 1997).
For example, in (Kantarcioglu and Clifton,
2004), there are three sites with sales transactions,
where each transaction contains an itemset, and all
three sites wish to combine their itemsets in order to
find the most popular items (e.g., thus being able to
mine association rules). Each site uses a symmetric
cipher that is also commutative, i.e.,
$E_A(E_B(X)) = E_B(E_A(X))$, encrypts its itemset and passes it to the
next site, until all itemsets have three layers of
encryption. Then, all encrypted itemsets are again
passed around, with each site decrypting, until the
complete set is revealed.
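The pass-around idea can be sketched with modular exponentiation, which commutes because $(x^a)^b = (x^b)^a \bmod p$. This is a simplified illustration under stated assumptions: the modulus `P`, the item encoding, the site names and all helper functions are hypothetical, and the additional safeguards of the published protocol are not modelled.

```python
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized modulus, for illustration only


def keygen():
    """Pick an exponent e coprime to P - 1 and its inverse d, so (x^e)^d = x mod P."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e, pow(e, -1, P - 1)


def encode(item):
    """Map a short item name to an integer in [1, P - 1]."""
    x = int.from_bytes(item.encode(), "big")
    assert 0 < x < P, "item too long for this toy modulus"
    return x


def decode(x):
    return x.to_bytes((x.bit_length() + 7) // 8, "big").decode()


def apply_key(values, k):
    """Add a layer (k = e) or strip a layer (k = d) on every value."""
    return {pow(v, k, P) for v in values}


sites = {
    "A": {"milk", "cookies"},
    "B": {"milk", "bread"},
    "C": {"cookies", "milk", "tea"},
}
keys = {name: keygen() for name in sites}
names = list(sites)

# Round 1: each itemset starts at its owner and collects one layer per site.
fully_encrypted = set()
for owner, items in sites.items():
    start = names.index(owner)
    layer = {encode(i) for i in items}
    for name in names[start:] + names[:start]:
        layer = apply_key(layer, keys[name][0])
    fully_encrypted |= layer  # equal items collide despite different layer orders

# Round 2: the merged set is passed around and each site strips its own layer.
revealed = fully_encrypted
for name in names:
    revealed = apply_key(revealed, keys[name][1])
union = {decode(v) for v in revealed}
print(sorted(union))
```

Because the layers commute, an item held by several sites produces the same triple-encrypted value regardless of which site encrypted it first, so duplicates merge before anything is decrypted.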
Unfortunately, as has been noted in the literature,
SMC protocols require multiple communication
rounds among the participants, and privacy usually
comes at a high performance and communication
cost (Pinkas, 2002).
We believe that research for privacy preserving
DM could borrow knowledge from the vast body of
literature on secure e-auction (Naor et al, 1999) and
e-voting systems (Gritzalis, 2002). These systems
are not strictly related to data mining, but they
exemplify some of the difficulties of the multiparty
case (this was first pointed out by (Pinkas, 2002),
but only with respect to e-auctions, while we
extend it to include e-voting systems as well). Such
systems also tend to balance well the efficiency and
security criteria, in order to be implementable in
medium to large scale environments. Furthermore,
such systems fall within our distributed computing
scenario and have similar architecture and security
requirements, at least at our abstraction level.
In a sealed bid e-auction for example, the
function F, represented by an auctioneer, receives
several encrypted bids and declares the winning bid.
In a secure auction, there is a need to protect the
privacy of the losing bidders, while establishing
accuracy of the auction outcome and verifiability for
all participants. Or, in an Internet election, the
function F, represented by an election authority,
receives several encrypted votes and declares the
winning candidate. Here the goal is to protect the
privacy of the voters (i.e., unlinkability between the
identity of the voter and the vote that has been cast),
while also establishing eligibility of the voters and
verifiability for the election result.
During the last decade, a few cryptographic
schemes for conducting online e-auctions and e-
elections have been proposed in the literature.
Research has shown that it is possible to provide
both privacy and accuracy assurances in a
distributed computing scenario, where all
participants may be mutually untrusted, without the
presence of an unconditionally trusted third party.
The most efficient schemes of both worlds (e.g., Parkes
et al, 2006; Damgard et al, 2003) follow the
homomorphic model (Cramer et al, 1997), originally
proposed in the e-voting setting, where the privacy
of the inputs and the accuracy of the results can be
universally verified, thanks to the algebraic
properties of several randomized encryption
algorithms. With homomorphic encryption there is
an operation $\oplus$ defined on the message space and an
operation $\otimes$ defined on the cipher space, such that
the “product” of the encryptions of any two private
inputs is the encryption of the “sum” of the inputs:
$$E(M_1) \otimes E(M_2) = E(M_1 \oplus M_2) \qquad (1)$$
This property allows, for example, either to tally
votes as aggregates or to combine shares of votes
(e.g., Cramer et al, 1997; Schoenmakers, 1999), without
decrypting single votes.
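Property (1) can be checked concretely with a toy instance of the Paillier cryptosystem, where the ciphertext operation is multiplication modulo $n^2$ and the message operation is addition modulo $n$. Paillier is chosen here only because it is compact to write down; the tiny primes and the function names `encrypt`, `decrypt` and `combine` are assumptions, and the sketch is not secure at this key size.

```python
import math
import secrets

# Toy parameters: two well-known Mersenne primes. Real keys are far larger.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def encrypt(m):
    """Randomized encryption: E(m) = (1 + N)^m * r^N mod N^2."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def decrypt(c):
    u = pow(c, LAM, N2)
    return ((u - 1) // N) * MU % N


def combine(c1, c2):
    """The 'product' on ciphertexts, which corresponds to the 'sum' on messages."""
    return c1 * c2 % N2


c = combine(encrypt(20), encrypt(22))
print(decrypt(c))  # the sum of the two inputs, obtained without decrypting either one
```

Only the combined ciphertext is ever decrypted, which is what allows a tally to be published without opening individual inputs.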
We have to note that in the most practical
schemes of both worlds of e-auctions and e-voting, a
third party is still needed to preclude communication
between participating entities; however, no third
party is blindly trusted for either the privacy of the
inputs or the accuracy of the results. This is achieved
by using robust cryptographic primitives such as
threshold decryption (Desmedt, 1994). For example,
a set of M voting authorities in a (t, M) threshold
public-key encryption system share a private key,
and there is one public key corresponding to the
shared private key. After all encrypted inputs have
been submitted, any subset of t honest and
functioning authorities are able to combine their key
shares and decrypt the encrypted aggregate.
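A (t, M) threshold decryption can be sketched with ElGamal encryption and Shamir sharing of the private key. This is a minimal illustration under several assumptions: the group parameters are toy-sized, the key is split by a single dealer (practical schemes would avoid this), authorities are assumed honest, and all function names are hypothetical.

```python
import secrets

# Toy safe-prime group: P = 2Q + 1, and G generates the subgroup of order Q.
P, Q, G = 2039, 1019, 4


def share_key(x, t, m):
    """Shamir-share x among m authorities; any t shares determine x."""
    coeffs = [x] + [secrets.randbelow(Q) for _ in range(t - 1)]
    return {
        i: sum(c * pow(i, k, Q) for k, c in enumerate(coeffs)) % Q
        for i in range(1, m + 1)
    }


def encrypt(h, msg):
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), msg * pow(h, r, P) % P


def partial_decrypt(a, share):
    """What a single authority publishes; its key share stays private."""
    return pow(a, share, P)


def combine(ciphertext, partials):
    """Recombine t partial decryptions using Lagrange coefficients at zero."""
    a, b = ciphertext
    ax = 1
    for i in partials:
        lam = 1
        for j in partials:
            if j != i:
                lam = lam * j * pow((j - i) % Q, -1, Q) % Q
        ax = ax * pow(partials[i], lam, P) % P
    return b * pow(ax, -1, P) % P


x = secrets.randbelow(Q - 1) + 1  # shared private key
h = pow(G, x, P)                  # single public key
shares = share_key(x, t=3, m=5)

aggregate = pow(G, 42, P)         # stands in for an encrypted aggregate
ct = encrypt(h, aggregate)
partials = {i: partial_decrypt(ct[0], shares[i]) for i in (1, 3, 5)}
print(combine(ct, partials) == aggregate)
```

The private key is never reassembled in one place: each authority only raises the first ciphertext component to its own share, and the interpolation happens in the exponent.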
Selecting the most efficient mechanisms from
both worlds could also be facilitated by the fact that
when considering distributed DM systems, the threat
model seems to be relaxed, thus better balancing the
trade-off in favour of efficiency: for example,
verifiability does not have to be universal (but
atomic), and there seems not to be an obvious need
for receipt-freeness (as in elections).
The first implementation of the homomorphic
model of (Cramer et al, 1997) in the DM setting, to
the best of our knowledge, has recently appeared in
(Yang et al, 2005). In a simplified version of their
scheme, which involves horizontally partitioned
databases, each client holds a record of personal data
(e.g., age, income, marital status, history of
accidents) and a data miner can pose questions such
as “How many people have income > 25 and are
married?”. Each client sends to the data miner the
(homomorphic) encryption of “1” (“yes”) or “0”
(“no”) and the system computes the aggregates.
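The yes/no counting query can be sketched with exponential ElGamal, where a bit is encrypted in the exponent and ciphertexts are multiplied component-wise. This is a simplified illustration rather than the cited scheme itself: the toy group, the client records and the function names are assumptions, and a single key holder stands in for whatever party (or set of parties) would hold the decryption key in practice.

```python
import secrets

# Toy safe-prime group: P = 2Q + 1, and G generates the subgroup of order Q.
P, Q, G = 2039, 1019, 4


def keygen():
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def encrypt_bit(h, bit):
    """Encrypt 1 ('yes') or 0 ('no') as (G^r, G^bit * h^r)."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), pow(G, bit, P) * pow(h, r, P) % P


def aggregate(ciphertexts):
    """Component-wise product: an encryption of the sum of all bits."""
    a, b = 1, 1
    for ca, cb in ciphertexts:
        a, b = a * ca % P, b * cb % P
    return a, b


def decrypt_count(x, ciphertext, max_count):
    """Recover G^count, then search the small range of possible counts."""
    a, b = ciphertext
    g_count = b * pow(pow(a, x, P), -1, P) % P
    for count in range(max_count + 1):
        if pow(G, count, P) == g_count:
            return count
    raise ValueError("count out of range")


clients = [  # hypothetical records, one per client
    {"income": 30, "married": True},
    {"income": 20, "married": True},
    {"income": 40, "married": False},
    {"income": 55, "married": True},
]
x, h = keygen()
answers = [encrypt_bit(h, int(c["income"] > 25 and c["married"])) for c in clients]
count = decrypt_count(x, aggregate(answers), len(clients))
print(count)  # number of clients with income > 25 who are married
```

Only the aggregate ciphertext is decrypted, so the individual answers stay hidden as long as the key holder never decrypts a single client's submission; splitting the key among several parties is one way to enforce this.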
While the scheme in (Yang et al, 2005) is based
on the homomorphic model of (Cramer et al, 1997)
that supports 1-out-of-2 (“yes”/ “no”) selections, we
believe that future research could also look at some
very efficient extensions of the homomorphic model,
where 1-out-of-L or K-out-of-L selections are
allowed (e.g., Baudron et al, 2001; Damgard et al,
2003). In this way, the overall bits of information
that a database sends to the miner could be
increased, leading to new possibilities.
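One way to carry a 1-out-of-L selection inside a single additive value is to encode option j as B^j, with B larger than the number of participants, so that the sum of all encodings can be read back digit by digit in base B. The sketch below shows only this packing on plaintext values; in a homomorphic scheme each encoded value would be encrypted before being summed. The helper names and the sample choices are assumptions, and the exact encodings used in the cited extensions may differ.

```python
def encode_choice(choice, base):
    """Encode 'option number choice' as base**choice."""
    return base ** choice


def decode_tally(total, base, n_options):
    """Read the per-option counts back as base-`base` digits of the sum."""
    counts = []
    for _ in range(n_options):
        total, digit = divmod(total, base)
        counts.append(digit)
    return counts


choices = [0, 2, 1, 2, 2, 0]  # each client picks one of L = 3 categories
BASE = len(choices) + 1       # strictly larger than any possible count
total = sum(encode_choice(c, BASE) for c in choices)
counts = decode_tally(total, BASE, 3)
print(counts)  # [2, 1, 3]
```

A single aggregated value thus yields a full histogram over L categories instead of one yes/no count, which is the sense in which more bits of information reach the miner per submission.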
4 CONCLUSIONS
We believe that valuable knowledge can be
borrowed from the vast cryptographic literature on
e-auction and e-voting systems, in order to be
adapted to the specific requirements for privacy
preserving data mining systems in a distributed
environment. These systems tend to balance well the
efficiency and security criteria, because they need to
be implementable in medium to large scale
environments.
Of course, further research is needed to choose
and then adapt the specific cryptographic techniques
to the DM environment, taking into account the kind
of databases to work with, the kind of knowledge to
be mined, as well as the kind of specific DM
technique to be used.
REFERENCES
Agrawal, R., Srikant, R., 2000. Privacy-preserving data
mining. In ACM SIGMOD Conference on Management
of Data. ACM Press, pp. 439-450.
Anderson, R., 2001. Security engineering – A guide to
building dependable distributed systems. Wiley
Computer Publishing.
Baudron, O., Fouque, P., Pointcheval, D., Poupard, G.,
Stern, J., 2001. Practical Multi-Candidate Election
System. In 20th ACM Symposium on Principles of
Distributed Computing. ACM Press, pp. 274–283.
Chen, M., Han, J., Yu, P., 1996. Data mining: An
overview from a database perspective. In IEEE
Transactions on Knowledge and Data Engineering.
IEEE Press, Vol. 8 (6), pp. 866-883.
Cramer, R., Gennaro, R., Schoenmakers, B., 1997. A
secure and optimally efficient multi-authority election
scheme. In European Transactions on
Telecommunications. Vol. 8 (5), pp. 481-490.
Damgard, I., Jurik, M., Nielsen, J., 2003. A generalization
of Paillier's public-key system with applications to
electronic voting. Manuscript. Available at:
www.daimi.au.dk/~ivan/GenPaillier_finaljour.ps
Deloitte, 2007. Global security survey 2007. Deloitte
Touche Tohmatsu. Available at:
http://www.deloitte.com/dtt/cda/doc/content/arg_cons
_encuesta-global-Seguridad-2007_20071031(2).pdf
Desmedt, Y., 1994. Threshold Cryptography. In European
Transactions on Telecommunications. Vol. 5(4), pp.
449–457.
Dunham, M., 2002. Data mining, introductory and
advanced topics. Prentice Hall.
Ferrer, J. (Ed.), 2002. Inference control in statistical
databases, from theory to practice. Springer, LNCS
Vol. 2316.
Goldwasser, S., 1997. Multi-party computations: Past and
present. In 16th Annual ACM Symposium on
principles of Distributed Computing. ACM, pp. 1-6.
Gritzalis, D. (Ed.), 2002. Secure electronic voting: trends
and perspectives, capabilities and limitations. Kluwer
Academic Publishers.
Kantarcioglu, M., Clifton, C., 2004. Privacy-preserving
distributed mining of association rules on horizontally
partitioned data. In IEEE Transactions on Knowledge
and Data Engineering. IEEE Press, Vol. 16 (9), pp.
1026-1037.
Lindell, Y., Pinkas, B., 2000. Privacy preserving data
mining. In Advances in Cryptology - CRYPTO ’00.
Springer, LNCS Vol. 1880, pp. 36–53.
Naor, M., Pinkas, B., Sumner, R., 1999. Privacy
preserving auctions and mechanism design. In 1st
ACM conference on Electronic commerce. ACM
Press, pp. 129 – 139.
Parkes, D., Rabin, M., Shieber, S., Thorpe, C., 2006.
Practical secrecy-preserving, verifiably correct and
trustworthy auctions. In 8th ACM International
Conference on Electronic Commerce. ACM Press, pp.
70 – 81.
Pinkas, B., 2002. Cryptographic techniques for privacy-
preserving data mining. In SIGKDD Explorations.
ACM Press, Vol. 4(2), pp. 12-19.
Schoenmakers, B., 1999. A Simple Publicly Verifiable
Secret Sharing Scheme and Its Application to
Electronic Voting. In Advances in Cryptology–
CRYPTO’99. Springer LNCS Vol. 1666. pp. 148-164.
Vaidya, J., Clifton, C., 2002. Privacy preserving
association rule mining in vertically partitioned data.
In 8th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM Press,
pp. 639-644.
Wang, J., Zhang, J., 2007. Addressing accuracy issues in
privacy preserving data mining through matrix
factorization. In ISI’07, IEEE International
Conference on Intelligence and Security Informatics.
IEEE Press, pp. 217-220.
Yang, Z., Zhong, S., Wright, R., 2005. Privacy-preserving
classification of customer data without loss of
accuracy. In SDM’05 SIAM Data Mining Conference.
Yao, A., 1986. How to generate and exchange secrets. In
27th Symposium on Foundations of Computer
Science. IEEE Press, pp. 162–167.