1.1 Contribution and Plan of this Paper
Unfortunately, SDC methods and privacy models in the literature have been designed with small data in mind, whereas the requirements of big data are different. Faced with the tension between subject privacy and big data analytics, the following extreme positions are becoming increasingly common:
• Nihilists. They claim that no privacy is possible with big data. Some of them (typically governments¹) claim that privacy needs to be sacrificed to security; others (typically corporations and some researchers²) claim that privacy is a hindrance to business and progress; yet others (typically Internet companies) do not make many claims but offer enticing free services that result in privacy being overridden; finally, others claim nothing and offer nothing, but gather, package and sell as much personal data as possible (data brokers (FTC, 2014)).
• Fundamentalists. They prioritize privacy, whatever it costs in terms of utility loss. Players in this camp are basically a fraction of the academics working in security and privacy.
While fundamentalists are unlikely to prevail over nihilists, they could make themselves more useful if they invested their research effort in finding anonymization solutions that are tailored to the natural requirements of big data.
In Section 2 of this paper, we try to identify the main requirements of big data anonymization. Then we examine how well the two main families of privacy models satisfy those requirements: we deal with k-anonymity in Section 3 and with differential privacy in Section 4. Seeing that neither family is completely satisfactory, we explore connections of k-anonymity and differential privacy with other privacy models in Section 5; we characterize those connections in terms of two principles, deniability and permutation. In Section 6 (conclusions and future research), we conclude that focusing on these two principles is a promising way to tackle the adaptation of current privacy models to big data or the design of new privacy models.
¹ In 2009 the UK Cabinet Office’s former security and intelligence co-ordinator, Sir David Omand, warned: “citizens will have to sacrifice their right to privacy in the fight against terrorism”. Also, in April 2016, the European Parliament backed the EU directive enabling the European security services to share information on airline passengers.
² E.g. Stephen Brobst, the CTO of Teradata, stated: “I want to know every click and every search that led up to that purchase... interactions are orders of magnitude larger than the transactions... the interactions give you the behavior”, The Irish Times, Aug. 7, 2014.
2 BIG DATA ANONYMIZATION REQUIREMENTS
Leveraging the potential of available big data to improve human life and even to make a profit is perfectly legitimate. However, this should not encroach on the subjects’ privacy. Released big data should be protected, that is, transformed so that: i) they yield statistical results very close to those that would be obtained if the original big data were available, but ii) they do not allow unequivocal reconstruction of the profile of any specific subject.
SDC methods (Hundepool et al., 2012) (for example, noise addition, generalization, suppression, lower and upper recoding, microaggregation and others) specify transformations whose purpose is to limit the risk of disclosure. Nevertheless, they do not prescribe any mechanism to assess the disclosure risk that remains in the transformed data. In contrast, privacy models (such as k-anonymity (Samarati and Sweeney, 1998) and differential privacy (Dwork, 2006), as well as l-diversity (Machanavajjhala et al., 2007), t-closeness (Li et al., 2007) and probabilistic k-anonymity (Soria-Comas and Domingo-Ferrer, 2012), among others) specify some properties to be met by a data set to limit disclosure risk, but they do not prescribe any specific SDC method to satisfy those properties. Privacy models seem more attractive, because they state the privacy level to be attained and leave it to the data protector to adopt the least utility-damaging method. The reality, however, is that most privacy models were designed to protect a single static data set, and they have notorious shortcomings if used in a big data context.
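To make the distinction concrete, the following minimal Python sketch contrasts the two notions on a toy table (the records, attribute layout and the parameter k = 2 are illustrative assumptions, not taken from any real data set): generalization, an SDC method, recodes the quasi-identifiers, while a separate check tests whether the result satisfies the k-anonymity privacy model, i.e., whether every quasi-identifier combination is shared by at least k records.

```python
from collections import Counter

# Toy records: (age, zipcode, diagnosis); age and zipcode act as quasi-identifiers.
records = [
    (34, "08001", "flu"),
    (36, "08001", "cold"),
    (37, "08002", "flu"),
    (51, "08002", "asthma"),
    (53, "08003", "flu"),
    (58, "08003", "cold"),
]

def generalize(record):
    """SDC method: generalization/recoding of the quasi-identifiers.
    Age is recoded into 10-year bands and the zipcode is truncated."""
    age, zipcode, diagnosis = record
    band = (age // 10) * 10
    return (f"{band}-{band + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(recs, k, qi_indices=(0, 1)):
    """Privacy model: k-anonymity holds if every combination of
    quasi-identifier values appears in at least k records."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in recs)
    return all(c >= k for c in counts.values())

anonymized = [generalize(r) for r in records]
print(is_k_anonymous(records, k=2))     # False: all age values are unique
print(is_k_anonymous(anonymized, k=2))  # True after generalization
```

Note that the check says nothing about which transformation produced the 2-anonymous table; that is precisely the appeal of privacy models, since the data protector remains free to choose the least utility-damaging SDC method that achieves the required property.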
For a privacy model to be useful for big data, it must be compatible with the volume, the velocity and the variety of this kind of data. To assess this compatibility, we propose to consider to what extent the model satisfies the following properties (the last three of which are described in (Soria-Comas and Domingo-Ferrer, 2015)):
• Protection. Anonymized big data should not allow unequivocal reconstruction of any subject’s profile, let alone re-identification. While protection was also essential in the anonymization of small data, it is more difficult to achieve in big data: the availability of many data sources on overlapping sets of individuals implies that a lot of attributes may be available on a certain individual, which may facilitate her re-identification.
• Utility for exploratory analyses. Anonymized big data that are published should yield results similar to those obtained on the original big data for a broad range of exploratory analyses. While utility