On the Connection between t-Closeness and Differential Privacy for Data
Releases
Josep Domingo-Ferrer
Universitat Rovira i Virgili, Dept. of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy,
Av. Països Catalans 26, E-43007 Tarragona, Catalonia, Spain
Keywords:
Differential Privacy, t-Closeness, k-Anonymity, Microaggregation.
Abstract:
t-Closeness was introduced as an improvement of the well-known k-anonymity privacy model for data release.
On the other hand, ε-differential privacy was originally proposed as a privacy property for answers to on-line
database queries, and it has been very well received in academic circles. In spite of their quite diverse origins and
motivations, we show in this paper that t-closeness and ε-differential privacy actually provide related privacy
guarantees when applied to off-line data release. Specifically, k-anonymity for the quasi-identifiers combined
with differential privacy for the confidential attributes yields t-closeness in expectation.
1 INTRODUCTION
There are several privacy models that have been
proposed in the literature. k-Anonymity (Samarati
and Sweeney, 1998; Samarati, 2001) and, more re-
cently, ε-differential privacy (Dwork, 2006) stand
out as probably the two best-known ones. The for-
mer was proposed to anonymize data sets for off-
line release, whereas the latter was proposed to
anonymize answers to interactive queries to on-line
databases (Dwork, 2011). Yet, ε-differential privacy
can also be extended to anonymize data sets.
Assume a data set X from which direct identifiers
have been suppressed, but which contains so-called
quasi-identifier attributes, that is, attributes (e.g. age,
gender, nationality, etc.) which can be used by an in-
truder to link records in X with records in some ex-
ternal database containing direct identifiers. The in-
truder’s goal is to determine the identity of the indi-
viduals to whom the values of confidential attributes
(e.g. health condition, salary, etc.) in records in X
correspond (identity disclosure).
A data set X is said to satisfy k-anonymity if each
combination of values of the quasi-identifier attributes
in it is shared by at least k records. k-Anonymity pro-
tects against identity disclosure: given an anonymized
record in X, an intruder cannot determine the identity
of the individual to whom the record (and hence the
confidential attribute values in it) corresponds. The
reason is that there are at least k records in X sharing
any combination of quasi-identifier attribute values.
The most usual computational procedure to attain k-
anonymity is generalization of the quasi-identifier at-
tributes (Samarati, 2001), but an alternative approach
is based on microaggregation of the quasi-identifier
attributes (Domingo-Ferrer and Torra, 2005).
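To make the microaggregation approach concrete, the following is a minimal sketch of an MDAV-style heuristic for numerical quasi-identifiers (not the exact algorithm of Domingo-Ferrer and Torra, 2005; the function name and interface are illustrative assumptions): records are grouped by proximity into groups of at least k, and each record's quasi-identifier values are replaced by its group centroid.

```python
import numpy as np

def microaggregate_qi(qi, k):
    """Simplified MDAV-style microaggregation sketch for a 2-D array of
    numerical quasi-identifiers (one row per record, assuming at least k
    records): repeatedly seed a group with the record farthest from the
    current centroid, fill it with that record's k-1 nearest neighbours,
    and replace the group's values by the group centroid, so that every
    released quasi-identifier combination appears at least k times."""
    qi = np.asarray(qi, dtype=float)
    out = qi.copy()
    remaining = list(range(len(qi)))
    while len(remaining) >= 2 * k:
        sub = qi[remaining]
        # record farthest from the centroid of the remaining records
        far = remaining[int(np.argmax(np.linalg.norm(sub - sub.mean(axis=0), axis=1)))]
        dists = np.linalg.norm(qi[remaining] - qi[far], axis=1)
        group = [remaining[int(p)] for p in np.argsort(dists)[:k]]
        out[group] = qi[group].mean(axis=0)  # centroid replaces the group
        remaining = [r for r in remaining if r not in group]
    out[remaining] = qi[remaining].mean(axis=0)  # last group has k..2k-1 records
    return out
```

The actual MDAV algorithm builds two groups per iteration, around two mutually distant records; the single-group variant above is only meant to convey the idea.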
While k-anonymity protects against identity dis-
closure as mentioned above, in general it does not
protect against attribute disclosure (Domingo-Ferrer,
2008), that is, disclosure of the value of a confidential
attribute corresponding to an external identified indi-
vidual. Let us assume a target individual T for whom
the intruder knows the identity and the values of the quasi-identifier attributes (but not those of the confidential attributes). Let $G_T$ be a group of at least k anonymized records sharing a combination of quasi-identifier attribute values that is the only one compatible with T's quasi-identifier attribute values. Then the intruder knows that the anonymized record corresponding to T belongs to $G_T$. Now, if the values for one (or several) confidential attribute(s) in all records of $G_T$ are the same, the intruder learns the values of that (those) attribute(s) for the target individual T.
The property of l-diversity (Machanavajjhala et
al., 2006) has been proposed as an extension of k-
anonymity which tries to address the attribute disclo-
sure problem. A data set is said to satisfy l-diversity
if, for each group of records sharing a combination
of quasi-identifier attribute values, there are at least
l “well-represented” values for each confidential at-
tribute. Achieving l-diversity in general implies more
distortion than just achieving k-anonymity. Yet, l-
diversity may fail to protect against attribute disclo-
sure if the l values of a confidential attribute are
very similar or are strongly skewed. p-Sensitive k-
anonymity (Truta and Vinay, 2006) is a property sim-
ilar to l-diversity, which shares similar shortcomings.
See (Domingo-Ferrer, 2008) for a summary of criti-
cisms to l-diversity and p-sensitive k-anonymity.
t-Closeness (Li et al., 2007) is another extension
of k-anonymity which also tries to solve the attribute
disclosure problem. A data set is said to satisfy t-
closeness if, for each group of records sharing a com-
bination of quasi-identifier attribute values, the dis-
tance between the empirical distribution of each con-
fidential attribute within the group and the empirical
distribution of the same confidential attribute in the
whole data set is no more than a threshold t. This
property clearly solves the attribute disclosure vul-
nerability, although the original t-closeness paper did
not propose a computational procedure to achieve this
property and did not mention the large utility loss that
this property is likely to inflict on the original data.
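In (Li et al., 2007), the distance between distributions is instantiated as the Earth Mover's Distance (EMD). As an illustration, a t-closeness check for a single numerical confidential attribute might be sketched as follows; SciPy's wasserstein_distance is the one-dimensional EMD, and the function name and group-label encoding are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D Earth Mover's Distance

def satisfies_t_closeness(conf_attr, group_ids, t):
    """Check t-closeness for one numerical confidential attribute:
    group_ids assigns each record a label identifying its group of
    records sharing a quasi-identifier combination; the EMD between
    each group's empirical distribution of the attribute and the
    whole-data-set distribution must not exceed the threshold t."""
    conf_attr = np.asarray(conf_attr, dtype=float)
    group_ids = np.asarray(group_ids)
    for g in np.unique(group_ids):
        if wasserstein_distance(conf_attr[group_ids == g], conf_attr) > t:
            return False
    return True
```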
Differential privacy, as originally proposed for
interactive databases, assumes that an anonymiza-
tion mechanism mediates between the user submit-
ting queries and the database. In this way, instead of
getting responses to a query function f computed on
the database, the user gets responses to a randomized
query function κ. This randomized κ is said to satisfy ε-differential privacy if, for all data sets $D_1$, $D_2$ such that one can be obtained from the other by modifying a single record, and all subsets S of the range of κ, it holds that

$$\Pr(\kappa(D_1) \in S) \leq \exp(\varepsilon) \times \Pr(\kappa(D_2) \in S). \quad (1)$$
In plain words, Expression (1) means that the influ-
ence of any single record on the returned value of
κ is negligible. The computational procedure origi-
nally proposed to reach ε-differential privacy is to ob-
tain κ by adding Laplace noise to the query function
f (Dwork, 2006).
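For concreteness, here is a minimal sketch of the Laplace mechanism for a numerical query; the interface is illustrative, not part of any standard library.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, rng=None):
    """Perturb the true answer of a query f with zero-mean Laplace
    noise of scale sensitivity/epsilon, which suffices for
    epsilon-differential privacy (Dwork, 2006); sensitivity is the
    global L1 sensitivity of f, i.e. the largest change in f that
    modifying a single record can cause."""
    rng = rng if rng is not None else np.random.default_rng()
    f_value = np.asarray(f_value, dtype=float)
    return f_value + rng.laplace(scale=sensitivity / epsilon,
                                 size=f_value.shape)
```

For instance, the mean of n values known to lie in [0, 1] has L1 sensitivity 1/n, so laplace_mechanism(x.mean(), 1.0/len(x), epsilon) would yield an ε-differentially private mean under that assumption.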
We have recently shown in (Soria-Comas et al.,
2013) that microaggregation-based k-anonymity can
be used as a prior step towards achieving ε-differential
privacy of a data set. The advantage of doing so is
that much less Laplace noise addition is thereafter
needed to attain ε-differential privacy, in such a way
that the utility of the resulting differentially private
data is substantially higher.
1.1 Contribution and Plan of this Paper
In the same spirit of (Soria-Comas et al., 2013)
about finding connections between models based on
k-anonymity and differential privacy, we explore here
how t-closeness and ε-differential privacy are related
to each other regarding anonymization of data sets.
We highlight the formal similarities between t-
closeness and ε-differential privacy in Section 2.
In the same section, we give a lemma showing
that k-anonymity for the quasi-identifiers combined
with differential privacy for the confidential attributes
yields t-closeness in expectation. Section 3 gathers conclusions and sketches future work.
2 FROM DIFFERENTIAL
PRIVACY TO (EXPECTED)
T-CLOSENESS
Let X be a data set with quasi-identifier attributes $Q_1, \cdots, Q_m$ and confidential attributes $C_1, \cdots, C_n$. Let N be the number of records of X. Further, let $I_r(\cdot)$ be the function that returns all the attribute values contained in record $r \in X$; let $IC_r(\cdot)$ be the function that returns the values of the confidential attributes in record $r \in X$.
Consider the multivariate query $(I_1(X), \cdots, I_N(X))$; the answer to that query returns the entire data set X. Further, let $(Y_1(X), \cdots, Y_N(X))$ be the noise that needs to be added to the answer to that query to achieve ε-differential privacy. A differentially private version of the data set X can be obtained as:

$$(I_1(X), \cdots, I_N(X)) + (Y_1(X), \cdots, Y_N(X)).$$
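As an illustration, the sketch below instantiates this construction for a numerical data set; the L1 sensitivity of the identity query is bounded, as an assumption of the sketch, by the sum of the attribute ranges, since modifying one record can move each of its values across the full range of its column. The resulting noise is typically very large, which is the utility problem addressed in (Soria-Comas et al., 2013).

```python
import numpy as np

def dp_identity_release(data, col_ranges, epsilon, rng=None):
    """Sketch of a differentially private data release via the
    multivariate identity query: add Laplace noise to every value,
    with scale derived from an assumed L1 sensitivity bound equal
    to the sum of the per-attribute ranges."""
    rng = rng if rng is not None else np.random.default_rng()
    data = np.asarray(data, dtype=float)
    sensitivity = float(np.sum(col_ranges))  # assumed bound, see lead-in
    return data + rng.laplace(scale=sensitivity / epsilon, size=data.shape)
```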
From the definition of ε-differential privacy (Expression (1)), it holds that

$$\Pr\big((I_1(X_1), \cdots, I_N(X_1)) + (Y_1(X_1), \cdots, Y_N(X_1)) \in S^N\big)$$
$$\leq \exp(\varepsilon) \times \Pr\big((I_1(X_2), \cdots, I_N(X_2)) + (Y_1(X_2), \cdots, Y_N(X_2)) \in S^N\big) \quad (2)$$
for any pair of data sets $X_1$, $X_2$ such that one can be obtained from the other by suppressing/modifying a single record, and all $S \subseteq \mathrm{Range}(I_i(\cdot) + Y_i(\cdot))$, where we assume this range to be the same for all $i = 1, \cdots, N$.
Let us now introduce expected t-closeness, that is, t-closeness in expectation: it holds at the level of the distributions of the noise used to generate the anonymized confidential attributes, both within each group of records sharing a combination of quasi-identifier attribute values and in the overall data set. Actual t-closeness (Li et al., 2007), in contrast, is defined in terms of the actual values obtained for the confidential attributes.
Definition 1 (Expected t-closeness). Let $X'$ be an anonymized data set with N records obtained from an original data set X by k-anonymizing quasi-identifiers and adding random noise to the projection of X on its confidential attributes. Call the latter projection C and the corresponding noise-added projection $C'$. We say that $X'$ satisfies expected t-closeness if

$$\Pr\big((IC_{i_1}(Z'), \cdots, IC_{i_{|Z'|}}(Z')) \in S^{|Z'|}\big) \leq g(t) \times \Pr\big((IC_1(C'), \cdots, IC_N(C')) \in S^N\big) \quad (3)$$

for any subset $Z' \subseteq C'$ of records $i_1, \cdots, i_{|Z'|}$ sharing the same combination of quasi-identifier attribute values and all $S \subseteq \mathrm{Range}(IC_i(\cdot))$, where we assume this range to be the same for all $i = 1, \cdots, N$, and where $g(\cdot)$ is a non-decreasing function such that the expected values of $X'$ satisfy t-closeness in the sense of (Li et al., 2007).
Note that expected t-closeness is defined in terms of the sampling distribution of the noise added to obtain $C'$ from C. In other words, Definition 1 states that the noise added to X is expected to produce a data set $X'$ for which actual t-closeness holds. It may occur, however, that the actual $X'$ obtained does not satisfy t-closeness. Thus, in this respect, expected t-closeness is weaker than t-closeness.
The following lemma connects k-anonymity, ε-
differential privacy and expected t-closeness. It says
that if we k-anonymize the quasi-identifiers of an
original data set and we make its confidential at-
tributes ε-differentially private, then the resulting
anonymized data set is expected to satisfy t-closeness
for t a function of k and ε.
Lemma 1. Let X be an original data set and $X'$ be a corresponding anonymized data set such that its quasi-identifiers are k-anonymous and the projection of $X'$ on the confidential attributes is ε-differentially private. Then $X'$ satisfies expected t-closeness with $t = g^{-1}(\exp((N - k) \times \varepsilon))$.
Proof. The projection $C'$ of $X'$ on its confidential attributes is derived from the corresponding projection C of X as:

$$C' = (I_1(C), \cdots, I_N(C)) + (Y_1(C), \cdots, Y_N(C)).$$

Let $Z \subseteq C$ be a group of k records with indices $i_1, \cdots, i_k$ sharing the same combination of quasi-identifier attribute values. Note that Z can be obtained from C by suppressing $N - k$ records from C. Now, if we iterate Expression (2) $N - k$ times, we get

$$\Pr\big((I_{i_1}(Z), \cdots, I_{i_k}(Z)) + (Y_{i_1}(Z), \cdots, Y_{i_k}(Z)) \in S^k\big)$$
$$\leq \exp((N - k) \times \varepsilon) \times \Pr\big((I_1(C), \cdots, I_N(C)) + (Y_1(C), \cdots, Y_N(C)) \in S^N\big) \quad (4)$$

By comparing with Expression (3), it can be seen that Expression (4) guarantees that $X'$ satisfies expected t-closeness with $t = g^{-1}(\exp((N - k) \times \varepsilon))$.
Note 1. The previous lemma gives a computational
procedure to obtain t-closeness, albeit a greedy one:
just keep generating differentially private versions of
C by random noise addition until a version C
is ob-
tained which satisfies actual t-closeness in the sense
of (Li et al., 2007). Of course, the larger the number
of records, the larger the number of attributes in C and
the larger the variance of the noise distribution used,
the longer it will take to terminate this procedure.
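Under the same illustrative assumptions as the earlier sketches (one numerical confidential attribute, Laplace noise with the attribute range taken as the L1 sensitivity), the greedy procedure of Note 1 might be coded as follows; satisfies_t_closeness is the checker sketched after the t-closeness definition in Section 1.

```python
import numpy as np

def greedy_t_close_release(conf_attr, group_ids, attr_range, t, epsilon,
                           max_tries=10000, rng=None):
    """Greedy procedure of Note 1 for one numerical confidential
    attribute: keep drawing epsilon-differentially private noisy
    versions of the attribute until one draw happens to satisfy
    actual t-closeness, giving up after max_tries draws."""
    rng = rng if rng is not None else np.random.default_rng()
    conf_attr = np.asarray(conf_attr, dtype=float)
    for _ in range(max_tries):
        noisy = conf_attr + rng.laplace(scale=attr_range / epsilon,
                                        size=conf_attr.shape)
        if satisfies_t_closeness(noisy, group_ids, t):
            return noisy  # this draw satisfies actual t-closeness
    return None  # no suitable draw found within max_tries
```

As Note 1 warns, the expected number of draws grows quickly with the number of records and with the noise variance, so this rejection-style loop is only practical for small data sets.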
3 CONCLUSIONS AND FUTURE
WORK
In previous work, we showed how k-anonymity could
be used as a prior step to obtain differentially pri-
vate data releases with higher utility. In the same
line of finding synergies between privacy models, in
this paper we have highlighted the formal similar-
ity between ε-differential privacy and t-closeness for
anonymization of data sets. Furthermore, we have
shown how expected t-closeness can be obtained from
ε-differential privacy.
In future work we plan to build on the ideas in
this paper and leverage differential privacy to achieve
actual t-closeness in a way less greedy than the one
sketched in Note 1 above. This will address one of
the weak points of the original t-closeness proposal,
namely the lack of a computational procedure to reach
that property.
ACKNOWLEDGMENTS AND
DISCLAIMER
This work was partly supported by the Government of
Catalonia under grant 2009 SGR 1135, by the Span-
ish Government through projects TSI2007-65406-
C03-01 “E-AEGIS”, TIN2011-27076-C03-01 “CO-
PRIVACY” and CONSOLIDER INGENIO 2010
CSD2007-00004 "ARES", and by the European Commission under FP7 projects "DwB" and "Inter-
Trust”. The author is partially supported as an ICREA
Acadèmia researcher by the Government of Catalo-
nia. He is with the UNESCO Chair in Data Privacy,
but he is solely responsible for the views expressed in
this paper, which do not necessarily reflect the posi-
tion of UNESCO nor commit that organization.
SECRYPT2013-InternationalConferenceonSecurityandCryptography
480
REFERENCES
Domingo-Ferrer, J. (2008). A critique of k-anonymity
and some of its enhancements. In Proceedings of
ARES/PSAI 2008, IEEE Computer Society, pp. 990-
993.
Domingo-Ferrer, J. and Torra, V. (2005). Ordinal, continu-
ous and heterogeneous k-anonymity through microag-
gregation. Data Mining and Knowledge Discovery,
11(2):195-212.
Dwork, C. (2006). Differential privacy. In Proc. 33rd In-
ternational Colloquium on Automata, Languages and
Programming (ICALP), LNCS 4052, Springer, pp. 1-
12.
Dwork, C. (2011). A firm foundation for private data anal-
ysis. Communications of the ACM, 54(1):86-95.
Li, N., Li, T., and Venkatasubramanian, S. (2007).
t-Closeness: privacy beyond k-anonymity and l-
diversity. In Proceedings of IEEE ICDE 2007.
Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. (2006). l-Diversity: privacy beyond k-anonymity. In Proceedings of IEEE ICDE 2006.
Samarati, P. (2001). Protecting respondents’ identities in
microdata release. IEEE Transactions on Knowledge
and Data Engineering, 13(6):1010–1027.
Samarati, P., and Sweeney, L. (1998). Protecting privacy
when disclosing information: k-anonymity and its
enforcement through generalization and suppression.
SRI International Report.
Soria-Comas, J., Domingo-Ferrer, J., Sánchez, D., and Martínez, S. (2013). Improving the utility of differentially private data releases via k-anonymity. In 12th
entially private data releases via k-anonymity. In 12th
IEEE International Conference on Trust, Security and
Privacy in Computing and Communications -IEEE
TrustCom 2013, Melbourne, Australia, July 16-18,
2013 (to appear).
Truta, T.M. and Vinay, B. (2006). Privacy protection: p-
sensitive k-anonymity property. In 2nd International
Workshop on Privacy Data Management PDM 2006,
IEEE Computer Society, p. 94.
OntheConnectionbetweent-ClosenessandDifferentialPrivacyforDataReleases
481