On the Connection between t-Closeness and Differential Privacy for Data
Releases
Josep Domingo-Ferrer
Universitat Rovira i Virgili, Dept. of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy,
Av. Països Catalans 26, E-43007 Tarragona, Catalonia, Spain
Keywords:
Differential Privacy, t-Closeness, k-Anonymity, Microaggregation.
Abstract:
t-Closeness was introduced as an improvement of the well-known k-anonymity privacy model for data release.
On the other hand, ε-differential privacy was originally proposed as a privacy property for answers to on-line
database queries, and it has been very well received in academic circles. In spite of their quite diverse origins and
motivations, we show in this paper that t-closeness and ε-differential privacy actually provide related privacy
guarantees when applied to off-line data release. Specifically, k-anonymity for the quasi-identifiers combined
with differential privacy for the confidential attributes yields t-closeness in expectation.
1 INTRODUCTION
There are several privacy models that have been
proposed in the literature. k-Anonymity (Samarati
and Sweeney, 1998; Samarati, 2001) and, more re-
cently, ε-differential privacy (Dwork, 2006) stand
out as probably the two best-known ones. The for-
mer was proposed to anonymize data sets for off-
line release, whereas the latter was proposed to
anonymize answers to interactive queries to on-line
databases (Dwork, 2011). Yet, ε-differential privacy
can also be extended to anonymize data sets.
Assume a data set X from which direct identifiers
have been suppressed, but which contains so-called
quasi-identifier attributes, that is, attributes (e.g. age,
gender, nationality, etc.) which can be used by an in-
truder to link records in X with records in some ex-
ternal database containing direct identifiers. The in-
truder’s goal is to determine the identity of the indi-
viduals to whom the values of confidential attributes
(e.g. health condition, salary, etc.) in records in X
correspond (identity disclosure).
A data set X is said to satisfy k-anonymity if each
combination of values of the quasi-identifier attributes
in it is shared by at least k records. k-Anonymity pro-
tects against identity disclosure: given an anonymized
record in X, an intruder cannot determine the identity
of the individual to whom the record (and hence the
confidential attribute values in it) corresponds. The
reason is that there are at least k records in X sharing
any combination of quasi-identifier attribute values.
The most usual computational procedure to attain k-
anonymity is generalization of the quasi-identifier at-
tributes (Samarati, 2001), but an alternative approach
is based on microaggregation of the quasi-identifier
attributes (Domingo-Ferrer and Torra, 2005).
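To make the microaggregation approach concrete, the following is a minimal sketch of an MDAV-style heuristic for numerical quasi-identifiers (not the exact algorithm of Domingo-Ferrer and Torra, 2005; the function name and interface are illustrative assumptions): records are grouped by proximity into groups of at least k, and each record's quasi-identifier values are replaced by its group centroid.

```python
import numpy as np

def microaggregate_qi(qi, k):
    """Simplified MDAV-style microaggregation sketch for a 2-D array of
    numerical quasi-identifiers (one row per record, assuming at least k
    records): repeatedly seed a group with the record farthest from the
    current centroid, fill it with that record's k-1 nearest neighbours,
    and replace the group's values by the group centroid, so that every
    released quasi-identifier combination appears at least k times."""
    qi = np.asarray(qi, dtype=float)
    out = qi.copy()
    remaining = list(range(len(qi)))
    while len(remaining) >= 2 * k:
        sub = qi[remaining]
        # record farthest from the centroid of the remaining records
        far = remaining[int(np.argmax(np.linalg.norm(sub - sub.mean(axis=0), axis=1)))]
        dists = np.linalg.norm(qi[remaining] - qi[far], axis=1)
        group = [remaining[int(p)] for p in np.argsort(dists)[:k]]
        out[group] = qi[group].mean(axis=0)  # centroid replaces the group
        remaining = [r for r in remaining if r not in group]
    out[remaining] = qi[remaining].mean(axis=0)  # last group has k..2k-1 records
    return out
```

The actual MDAV algorithm builds two groups per iteration, around two mutually distant records; the single-group variant above is only meant to convey the idea.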
While k-anonymity protects against identity dis-
closure as mentioned above, in general it does not
protect against attribute disclosure (Domingo-Ferrer,
2008), that is, disclosure of the value of a confidential
attribute corresponding to an external identified indi-
vidual. Let us assume a target individual T for whom
the intruder knows the identity and the values of the quasi-identifier attributes (but not those of the confidential attributes). Let $G_T$ be a group of at least k anonymized records sharing a combination of quasi-identifier attribute values that is the only one compatible with T's quasi-identifier attribute values. Then the intruder knows that the anonymized record corresponding to T belongs to $G_T$. Now, if the values for one (or several) confidential attribute(s) in all records of $G_T$ are the same, the intruder learns the values of that (those) attribute(s) for the target individual T.
The property of l-diversity (Machanavajjhala et
al., 2006) has been proposed as an extension of k-
anonymity which tries to address the attribute disclo-
sure problem. A data set is said to satisfy l-diversity
if, for each group of records sharing a combination
of quasi-identifier attribute values, there are at least
l “well-represented” values for each confidential at-
tribute. Achieving l-diversity in general implies more
distortion than just achieving k-anonymity. Yet, l-
diversity may fail to protect against attribute disclo-
sure if the l values of a confidential attribute are
very similar or are strongly skewed. p-Sensitive k-
anonymity (Truta and Vinay, 2006) is a property sim-
ilar to l-diversity, which shares similar shortcomings.
See (Domingo-Ferrer, 2008) for a summary of criti-
cisms to l-diversity and p-sensitive k-anonymity.
t-Closeness (Li et al., 2007) is another extension
of k-anonymity which also tries to solve the attribute
disclosure problem. A data set is said to satisfy t-
closeness if, for each group of records sharing a com-
bination of quasi-identifier attribute values, the dis-
tance between the empirical distribution of each con-
fidential attribute within the group and the empirical
distribution of the same confidential attribute in the
whole data set is no more than a threshold t. This
property clearly solves the attribute disclosure vul-
nerability, although the original t-closeness paper did
not propose a computational procedure to achieve this
property and did not mention the large utility loss that
this property is likely to inflict on the original data.
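In (Li et al., 2007), the distance between distributions is instantiated as the Earth Mover's Distance (EMD). As an illustration, a t-closeness check for a single numerical confidential attribute might be sketched as follows; SciPy's wasserstein_distance is the one-dimensional EMD, and the function name and group-label encoding are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D Earth Mover's Distance

def satisfies_t_closeness(conf_attr, group_ids, t):
    """Check t-closeness for one numerical confidential attribute:
    group_ids assigns each record a label identifying its group of
    records sharing a quasi-identifier combination; the EMD between
    each group's empirical distribution of the attribute and the
    whole-data-set distribution must not exceed the threshold t."""
    conf_attr = np.asarray(conf_attr, dtype=float)
    group_ids = np.asarray(group_ids)
    for g in np.unique(group_ids):
        if wasserstein_distance(conf_attr[group_ids == g], conf_attr) > t:
            return False
    return True
```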
Differential privacy, as originally proposed for
interactive databases, assumes that an anonymiza-
tion mechanism mediates between the user submit-
ting queries and the database. In this way, instead of
getting responses to a query function f computed on
the database, the user gets responses to a randomized
query function κ. This randomized κ is said to satisfy ε-differential privacy if, for all data sets $D_1$, $D_2$ such that one can be obtained from the other by modifying a single record, and all subsets S of the range of κ, it holds that

$$\Pr(\kappa(D_1) \in S) \leq \exp(\varepsilon) \times \Pr(\kappa(D_2) \in S). \quad (1)$$
In plain words, Expression (1) means that the influ-
ence of any single record on the returned value of
κ is negligible. The computational procedure origi-
nally proposed to reach ε-differential privacy is to ob-
tain κ by adding Laplace noise to the query function
f (Dwork, 2006).
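For concreteness, here is a minimal sketch of the Laplace mechanism for a numerical query; the interface is illustrative, not part of any standard library.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, rng=None):
    """Perturb the true answer of a query f with zero-mean Laplace
    noise of scale sensitivity/epsilon, which suffices for
    epsilon-differential privacy (Dwork, 2006); sensitivity is the
    global L1 sensitivity of f, i.e. the largest change in f that
    modifying a single record can cause."""
    rng = rng if rng is not None else np.random.default_rng()
    f_value = np.asarray(f_value, dtype=float)
    return f_value + rng.laplace(scale=sensitivity / epsilon,
                                 size=f_value.shape)
```

For instance, the mean of n values known to lie in [0, 1] has L1 sensitivity 1/n, so laplace_mechanism(x.mean(), 1.0/len(x), epsilon) would yield an ε-differentially private mean under that assumption.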
We have recently shown in (Soria-Comas et al.,
2013) that microaggregation-based k-anonymity can
be used as a prior step towards achieving ε-differential
privacy of a data set. The advantage of doing so is
that much less Laplace noise addition is thereafter
needed to attain ε-differential privacy, in such a way
that the utility of the resulting differentially private
data is substantially higher.
1.1 Contribution and Plan of this Paper
In the same spirit of (Soria-Comas et al., 2013)
about finding connections between models based on
k-anonymity and differential privacy, we explore here
how t-closeness and ε-differential privacy are related
to each other regarding anonymization of data sets.
We highlight the formal similarities between t-
closeness and ε-differential privacy in Section 2.
In the same section, we give a lemma showing
that k-anonymity for the quasi-identifiers combined
with differential privacy for the confidential attributes
yields t-closeness in expectation. Section 3 gathers conclusions and sketches future work.
2 FROM DIFFERENTIAL
PRIVACY TO (EXPECTED)
T-CLOSENESS
Let X be a data set with quasi-identifier attributes $Q_1, \cdots, Q_m$ and confidential attributes $C_1, \cdots, C_n$. Let N be the number of records of X. Further, let $I_r(\cdot)$ be the function that returns all the attribute values contained in record $r \in X$; let $IC_r(\cdot)$ be the function that returns the values of the confidential attributes in record $r \in X$.
Consider the multivariate query $(I_1(X), \cdots, I_N(X))$; the answer to that query returns the entire data set X. Further, let $(Y_1(X), \cdots, Y_N(X))$ be the noise that needs to be added to the answer to that query to achieve ε-differential privacy. A differentially private version of the data set X can be obtained as:

$$(I_1(X), \cdots, I_N(X)) + (Y_1(X), \cdots, Y_N(X)).$$
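As an illustration, the sketch below instantiates this construction for a numerical data set; the L1 sensitivity of the identity query is bounded, as an assumption of the sketch, by the sum of the attribute ranges, since modifying one record can move each of its values across the full range of its column. The resulting noise is typically very large, which is the utility problem addressed in (Soria-Comas et al., 2013).

```python
import numpy as np

def dp_identity_release(data, col_ranges, epsilon, rng=None):
    """Sketch of a differentially private data release via the
    multivariate identity query: add Laplace noise to every value,
    with scale derived from an assumed L1 sensitivity bound equal
    to the sum of the per-attribute ranges."""
    rng = rng if rng is not None else np.random.default_rng()
    data = np.asarray(data, dtype=float)
    sensitivity = float(np.sum(col_ranges))  # assumed bound, see lead-in
    return data + rng.laplace(scale=sensitivity / epsilon, size=data.shape)
```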
From the definition of ε-differential privacy (Expression (1)), it holds that

$$\Pr\big((I_1(X_1), \cdots, I_N(X_1)) + (Y_1(X_1), \cdots, Y_N(X_1)) \in S^N\big)$$
$$\leq \exp(\varepsilon) \times \Pr\big((I_1(X_2), \cdots, I_N(X_2)) + (Y_1(X_2), \cdots, Y_N(X_2)) \in S^N\big) \quad (2)$$
for any pair of data sets $X_1$, $X_2$ such that one can be obtained from the other by suppressing/modifying a single record, and all $S \subseteq \mathrm{Range}(I_i(\cdot) + Y_i(\cdot))$, where we assume this range to be the same for all $i = 1, \cdots, N$.
Let us now introduce expected t-closeness, that is, t-closeness in expectation: it holds at the level of the distributions of the noise used to generate the anonymized confidential attributes, both within each group of records sharing a combination of quasi-identifier attribute values and in the overall data set. Actual t-closeness (Li et al., 2007), in contrast, is defined in terms of the actual values obtained for the confidential attributes.
Definition 1 (Expected t-closeness). Let $X'$ be an anonymized data set with N records obtained from an original data set X by k-anonymizing quasi-identifiers and adding random noise to the projection of X on its confidential attributes. Call the latter projection C and the corresponding noise-added projection $C'$. We say that $X'$ satisfies expected t-closeness if

$$\Pr\big((IC_{i_1}(Z'), \cdots, IC_{i_{|Z'|}}(Z')) \in S^{|Z'|}\big) \leq g(t) \times \Pr\big((IC_1(C'), \cdots, IC_N(C')) \in S^N\big) \quad (3)$$

for any subset $Z' \subseteq C'$ of records $i_1, \cdots, i_{|Z'|}$ sharing the same combination of quasi-identifier attribute values and all $S \subseteq \mathrm{Range}(IC_i(\cdot))$, where we assume this range to be the same for all $i = 1, \cdots, N$, and where $g(\cdot)$ is a non-decreasing function such that the expected values of $X'$ satisfy t-closeness in the sense of (Li et al., 2007).
Note that expected t-closeness is defined in terms of the sampling distribution of the noise added to obtain $C'$ from C. In other words, Definition 1 states that the noise added to X is expected to produce a data set $X'$ for which actual t-closeness holds. It may occur, however, that the actual $X'$ obtained does not satisfy t-closeness. Thus, in this respect, expected t-closeness is weaker than t-closeness.
The following lemma connects k-anonymity, ε-
differential privacy and expected t-closeness. It says
that if we k-anonymize the quasi-identifiers of an
original data set and we make its confidential at-
tributes ε-differentially private, then the resulting
anonymized data set is expected to satisfy t-closeness
for t a function of k and ε.
Lemma 1. Let X be an original data set and $X'$ be a corresponding anonymized data set such that its quasi-identifiers are k-anonymous and the projection of $X'$ on the confidential attributes is ε-differentially private. Then $X'$ satisfies expected t-closeness with $t = g^{-1}(\exp((N - k) \times \varepsilon))$.
Proof. The projection $C'$ of $X'$ on its confidential attributes is derived from the corresponding projection C of X as:

$$C' = (I_1(C), \cdots, I_N(C)) + (Y_1(C), \cdots, Y_N(C)).$$

Let $Z \subseteq C$ be a group of k records with indices $i_1, \cdots, i_k$ sharing the same combination of quasi-identifier attribute values. Note that Z can be obtained from C by suppressing $N - k$ records from C. Now, if we iterate Expression (2) $N - k$ times, we get

$$\Pr\big((I_{i_1}(Z), \cdots, I_{i_k}(Z)) + (Y_{i_1}(Z), \cdots, Y_{i_k}(Z)) \in S^k\big)$$
$$\leq \exp((N - k) \times \varepsilon) \times \Pr\big((I_1(C), \cdots, I_N(C)) + (Y_1(C), \cdots, Y_N(C)) \in S^N\big) \quad (4)$$

By comparing with Expression (3), it can be seen that Expression (4) guarantees that $X'$ satisfies expected t-closeness with $t = g^{-1}(\exp((N - k) \times \varepsilon))$.
Note 1. The previous lemma gives a computational
procedure to obtain t-closeness, albeit a greedy one:
just keep generating differentially private versions of
C by random noise addition until a version C
is ob-
tained which satisfies actual t-closeness in the sense
of (Li et al., 2007). Of course, the larger the number
of records, the larger the number of attributes in C and
the larger the variance of the noise distribution used,
the longer it will take to terminate this procedure.
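Under the same illustrative assumptions as the earlier sketches (one numerical confidential attribute, Laplace noise with the attribute range taken as the L1 sensitivity), the greedy procedure of Note 1 might be coded as follows; satisfies_t_closeness is the checker sketched after the t-closeness definition in Section 1.

```python
import numpy as np

def greedy_t_close_release(conf_attr, group_ids, attr_range, t, epsilon,
                           max_tries=10000, rng=None):
    """Greedy procedure of Note 1 for one numerical confidential
    attribute: keep drawing epsilon-differentially private noisy
    versions of the attribute until one draw happens to satisfy
    actual t-closeness, giving up after max_tries draws."""
    rng = rng if rng is not None else np.random.default_rng()
    conf_attr = np.asarray(conf_attr, dtype=float)
    for _ in range(max_tries):
        noisy = conf_attr + rng.laplace(scale=attr_range / epsilon,
                                        size=conf_attr.shape)
        if satisfies_t_closeness(noisy, group_ids, t):
            return noisy  # this draw satisfies actual t-closeness
    return None  # no suitable draw found within max_tries
```

As Note 1 warns, the expected number of draws grows quickly with the number of records and with the noise variance, so this rejection-style loop is only practical for small data sets.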
3 CONCLUSIONS AND FUTURE
WORK
In previous work, we showed how k-anonymity could
be used as a prior step to obtain differentially pri-
vate data releases with higher utility. In the same
line of finding synergies between privacy models, in
this paper we have highlighted the formal similar-
ity between ε-differential privacy and t-closeness for
anonymization of data sets. Furthermore, we have
shown how expected t-closeness can be obtained from
ε-differential privacy.
In future work we plan to build on the ideas in
this paper and leverage differential privacy to achieve
actual t-closeness in a way less greedy than the one
sketched in Note 1 above. This will address one of
the weak points of the original t-closeness proposal,
namely the lack of a computational procedure to reach
that property.
ACKNOWLEDGMENTS AND
DISCLAIMER
This work was partly supported by the Government of
Catalonia under grant 2009 SGR 1135, by the Span-
ish Government through projects TSI2007-65406-
C03-01 “E-AEGIS”, TIN2011-27076-C03-01 “CO-
PRIVACY” and CONSOLIDER INGENIO 2010
CSD2007-00004 "ARES", and by the European Commission under FP7 projects "DwB" and "Inter-
Trust”. The author is partially supported as an ICREA
Acadèmia researcher by the Government of Catalo-
nia. He is with the UNESCO Chair in Data Privacy,
but he is solely responsible for the views expressed in
this paper, which do not necessarily reflect the posi-
tion of UNESCO nor commit that organization.
SECRYPT2013-InternationalConferenceonSecurityandCryptography
480
REFERENCES
Domingo-Ferrer, J. (2008). A critique of k-anonymity
and some of its enhancements. In Proceedings of
ARES/PSAI 2008, IEEE Computer Society, pp. 990-
993.
Domingo-Ferrer, J. and Torra, V. (2005). Ordinal, continu-
ous and heterogeneous k-anonymity through microag-
gregation. Data Mining and Knowledge Discovery,
11(2):195-212.
Dwork, C. (2006). Differential privacy. In Proc. 33rd In-
ternational Colloquium on Automata, Languages and
Programming (ICALP), LNCS 4052, Springer, pp. 1-
12.
Dwork, C. (2011). A firm foundation for private data anal-
ysis. Communications of the ACM, 54(1):86-95.
Li, N., Li, T., and Venkatasubramanian, S. (2007).
t-Closeness: privacy beyond k-anonymity and l-
diversity. In Proceedings of IEEE ICDE 2007.
Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. (2006). l-Diversity: privacy beyond k-anonymity. In Proceedings of IEEE ICDE 2006.
Samarati, P. (2001). Protecting respondents’ identities in
microdata release. IEEE Transactions on Knowledge
and Data Engineering, 13(6):1010–1027.
Samarati, P., and Sweeney, L. (1998). Protecting privacy
when disclosing information: k-anonymity and its
enforcement through generalization and suppression.
SRI International Report.
Soria-Comas, J., Domingo-Ferrer, J., Sánchez, D., and Martínez, S. (2013). Improving the utility of differentially private data releases via k-anonymity. In 12th
entially private data releases via k-anonymity. In 12th
IEEE International Conference on Trust, Security and
Privacy in Computing and Communications -IEEE
TrustCom 2013, Melbourne, Australia, July 16-18,
2013 (to appear).
Truta, T.M. and Vinay, B. (2006). Privacy protection: p-
sensitive k-anonymity property. In 2nd International
Workshop on Privacy Data Management PDM 2006,
IEEE Computer Society, p. 94.
OntheConnectionbetweent-ClosenessandDifferentialPrivacyforDataReleases
481