1.1 Contribution and Plan of this Paper
Unfortunately, SDC methods and privacy models in the literature have been designed with small data in mind, whereas the requirements of big data are different. Faced with the tension between subject privacy and big data analytics, the following extreme positions are becoming increasingly common:
• Nihilists. They claim that no privacy is possible with big data. Some of them (typically governments¹) claim that privacy needs to be sacrificed to security; others (typically corporations and some researchers²) claim that privacy is a hindrance to business and progress; yet others (typically Internet companies) do not make many claims but offer enticing free services that result in privacy being overridden; finally, others claim nothing and offer nothing, but gather, package and sell as much personal data as possible (data brokers (FTC, 2014)).
• Fundamentalists. They prioritize privacy, whatever it costs in terms of utility loss. Players in this camp are basically a fraction of the academics working in security and privacy.
While fundamentalists are unlikely to prevail over nihilists, they could make themselves more useful if they invested their research effort in finding anonymization solutions that are tailored to the natural requirements of big data.
In Section 2 of this paper, we try to identify the main requirements of big data anonymization. Then we examine how well the two main families of privacy models satisfy those requirements: we deal with k-anonymity in Section 3 and with differential privacy in Section 4. Seeing that neither family is completely satisfactory, we explore connections of k-anonymity and differential privacy with other privacy models in Section 5; we characterize those connections in terms of two principles, deniability and permutation. In Section 6 (conclusions and future research), we conclude that focusing on these two principles is a promising way to tackle the adaptation of current privacy models to big data or the design of new privacy models.
¹ In 2009 the UK Cabinet Office’s former security and intelligence co-ordinator, Sir David Omand, warned: “citizens will have to sacrifice their right to privacy in the fight against terrorism”. Also, in April 2016, the European Parliament backed the EU directive enabling the European security services to share information on airline passengers.
² E.g. Stephen Brobst, the CTO of Teradata, stated: “I want to know every click and every search that led up to that purchase... interactions are orders of magnitude larger than the transactions... the interactions give you the behavior”, The Irish Times, Aug. 7, 2014.
2 BIG DATA ANONYMIZATION REQUIREMENTS
Leveraging the potential of available big data to improve human life and even to make a profit is perfectly legitimate. However, this should not encroach on the subjects’ privacy. Released big data should be protected, that is, transformed so that: i) they yield statistical results very close to those that would be obtained if the original big data were available, but ii) they do not allow unequivocal reconstruction of the profile of any specific subject.
SDC methods (Hundepool et al., 2012) (for example, noise addition, generalization, suppression, lower and upper recoding, microaggregation and others) specify transformations whose purpose is to limit the risk of disclosure. Nevertheless, they do not prescribe any mechanism to assess the disclosure risk that remains in the transformed data. In contrast, privacy models (such as k-anonymity (Samarati and Sweeney, 1998) and differential privacy (Dwork, 2006), as well as l-diversity (Machanavajjhala et al., 2007), t-closeness (Li et al., 2007) and probabilistic k-anonymity (Soria-Comas and Domingo-Ferrer, 2012), among others) specify some properties to be met by a data set to limit disclosure risk, but they do not prescribe any specific SDC method to satisfy those properties. Privacy models seem more attractive, because they state the privacy level to be attained and leave it to the data protector to adopt the least utility-damaging method. The reality, however, is that most privacy models were designed to protect a single static data set, and they have notorious shortcomings if used in a big data context.
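To make the distinction concrete, the following minimal Python sketch contrasts the two notions on a toy table (the records, attribute layout and the parameter k = 2 are illustrative assumptions, not taken from any real data set): generalization, an SDC method, recodes the quasi-identifiers, while a separate check tests whether the result satisfies the k-anonymity privacy model, i.e., whether every quasi-identifier combination is shared by at least k records.

```python
from collections import Counter

# Toy records: (age, zipcode, diagnosis); age and zipcode act as quasi-identifiers.
records = [
    (34, "08001", "flu"),
    (36, "08001", "cold"),
    (37, "08002", "flu"),
    (51, "08002", "asthma"),
    (53, "08003", "flu"),
    (58, "08003", "cold"),
]

def generalize(record):
    """SDC method: generalization/recoding of the quasi-identifiers.
    Age is recoded into 10-year bands and the zipcode is truncated."""
    age, zipcode, diagnosis = record
    band = (age // 10) * 10
    return (f"{band}-{band + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(recs, k, qi_indices=(0, 1)):
    """Privacy model: k-anonymity holds if every combination of
    quasi-identifier values appears in at least k records."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in recs)
    return all(c >= k for c in counts.values())

anonymized = [generalize(r) for r in records]
print(is_k_anonymous(records, k=2))     # False: all age values are unique
print(is_k_anonymous(anonymized, k=2))  # True after generalization
```

Note that the check says nothing about which transformation produced the 2-anonymous table; that is precisely the appeal of privacy models, since the data protector remains free to choose the least utility-damaging SDC method that achieves the required property.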
For a privacy model to be useful for big data, it must be compatible with the volume, the velocity and the variety of this kind of data. To assess this compatibility, we propose to consider to what extent the model satisfies the following properties (the last three of which are described in (Soria-Comas and Domingo-Ferrer, 2015)):
• Protection. Anonymized big data should not allow unequivocal reconstruction of any subject’s profile, let alone re-identification. While protection was also essential in the anonymization of small data, it is more difficult to achieve in big data: the availability of many data sources on overlapping sets of individuals implies that a lot of attributes may be available on a certain individual, which may facilitate her re-identification.
• Utility for exploratory analyses. Anonymized big data that are published should yield results similar to those obtained on the original big data for a broad range of exploratory analyses. While utility