research works have brought techniques and algorithms that aim to ensure the non-re-identification of sensitive information while maintaining the usefulness of the data. However, we noticed a lack of approaches guiding data holders in the choice of a technique and, given a technique, of an algorithm among all existing implementations of that technique. We therefore conducted a detailed review, dedicated to generalization techniques, aiming to elicit initial guidelines that help data publishers choose a generalization algorithm. We have compared the algorithms according to their four constituents: pre-requisites, inputs, process and outputs (Table 4). Some algorithms, such as Incognito and Samarati, are restricted to small data sets (Fung et al., 2010). All of them are limited to categorical and continuous microdata. Moreover, algorithms preserving classification or regression capabilities require correlation between multiple target attributes (LeFevre, DeWitt and Ramakrishnan, 2006b).
All generalization algorithms require input parameters. At a minimum, we need to decide the value of k (corresponding to k-anonymity), declare which columns constitute the QI and, finally, provide the generalization hierarchies. Let us note that some algorithms can compute the generalization hierarchy for continuous attributes. Moreover, for algorithms that include tuple suppression, the number of allowed suppressions (MaxSup) is also an input parameter. Finally, all the algorithms that preserve the quality of data for a specific data mining task, such as classification or regression, require the declaration of at least one target attribute.
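To make these inputs concrete, the following minimal Python sketch (the column names and data are hypothetical) shows the property that the chosen parameters must ultimately guarantee: once the table is generalized, each combination of QI values has to occur in at least k rows.

    from collections import Counter

    def is_k_anonymous(table, qi_columns, k):
        # Count rows per combination of QI values (equivalence class).
        classes = Counter(tuple(row[c] for c in qi_columns) for row in table)
        return all(count >= k for count in classes.values())

    # Hypothetical call mirroring the inputs listed above: the value
    # of k and the columns declared as the quasi-identifier (QI).
    rows = [
        {"zip": "752**", "age": "[20-30)", "disease": "flu"},
        {"zip": "752**", "age": "[20-30)", "disease": "cold"},
    ]
    print(is_k_anonymous(rows, qi_columns=["zip", "age"], k=2))  # True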
From a process point of view, we can notice that some algorithms are completely automatic. Most of them are iterative processes, guided (Sweeney, 1998) or not (Samarati, 2001) by metrics. Moreover, some of them are bottom-up processes (Sweeney, 1998), where small groups of tuples are constituted and then merged iteratively until each group contains at least k rows, thus satisfying k-anonymity (Fung et al., 2010). The others are top-down processes (Fung, Wang and Yu, 2005), i.e. they start from a group containing all rows and iteratively split each group into two subgroups while preserving k-anonymity; a minimal sketch of such a split follows.
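As an illustration of the top-down strategy, here is a minimal, Mondrian-like sketch in Python. It assumes numeric QI attributes and a median split; it is an idealized outline of the strategy, not a reproduction of any of the published algorithms (a bottom-up process would instead start from singleton groups and merge them).

    def spread(rows, col):
        # Range of a numeric QI attribute within a group of rows.
        vals = [r[col] for r in rows]
        return max(vals) - min(vals)

    def top_down_partition(rows, qi_columns, k):
        # Try to split along the QI attribute with the widest range first.
        for col in sorted(qi_columns, key=lambda c: spread(rows, c), reverse=True):
            values = sorted(r[col] for r in rows)
            median = values[len(values) // 2]
            left = [r for r in rows if r[col] < median]
            right = [r for r in rows if r[col] >= median]
            # Keep the split only if both halves still satisfy k-anonymity.
            if len(left) >= k and len(right) >= k:
                return (top_down_partition(left, qi_columns, k)
                        + top_down_partition(right, qi_columns, k))
        return [rows]  # no allowable cut: the group becomes an equivalence class

    ages = [{"age": a} for a in [21, 22, 23, 35, 36, 60, 61, 62]]
    groups = top_down_partition(ages, ["age"], k=2)
    # Each resulting group can then be generalized to its own interval.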
Finally, the generalization algorithms do not all provide the same outputs. Some deliver a unique anonymized table while others compute several alternative tables. Some algorithms compute an optimal k-anonymity solution but are limited to small data sets (Fung et al., 2010); others, based on heuristics, do not guarantee optimality. Finally, they may produce three different types of generalization, which we define as full-domain, sub-tree and multidimensional generalization. Full-domain means that, for a given generalized column, all the values in the output table belong to the same level of the generalization hierarchy. Sub-tree means that values sharing the same direct parent in the hierarchy are necessarily generalized at the same level, taking the value of one of their common ancestors. Finally, in multidimensional generalizations, two identical values in the original table may lead to different generalized values (i.e. they are not generalized at the same level).
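The following small Python example, built on a hypothetical ZIP code hierarchy, illustrates the distinction: full-domain generalization picks one single hierarchy level per column, whereas sub-tree and multidimensional generalizations relax that constraint.

    # Tiny generalization hierarchy for a ZIP code column
    # (index 0 = original value, higher indices = more general levels).
    hierarchy = {
        "75201": ("75201", "7520*", "752**", "*****"),
        "75205": ("75205", "7520*", "752**", "*****"),
        "75113": ("75113", "7511*", "751**", "*****"),
    }

    def full_domain(column, level):
        # Full-domain: every value of the column is replaced by its
        # ancestor at one single hierarchy level.
        return [hierarchy[v][level] for v in column]

    print(full_domain(["75201", "75205", "75113"], level=2))
    # -> ['752**', '752**', '751**']

    # A sub-tree generalization could instead map the siblings 75201 and
    # 75205 to their common parent "7520*" while leaving 75113 at another
    # level; in a multidimensional generalization, even two rows both
    # holding "75201" may be generalized differently, depending on the
    # group of rows they fall into.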
In terms of usage scenarios, let us note that the data resulting from anonymization are designed for specific usages. Bottom-up generalization, top-down specialization and InfoGain Mondrian produce data for classification tasks. LSD Mondrian is used when regression will be performed on the anonymized data.
Our comparative study helps us define patterns that capture knowledge about the main generalization algorithms. These patterns will be part of a knowledge base, which will be made available through a guidance approach to help data publishers choose an anonymization algorithm. We are convinced that the guidance depends on the data publisher's level of expertise. We expect several expertise levels and, for each level, at least one guidance scenario. A guidance scenario consists of a list of generalization algorithms (at least one) suited to the context. The context elements are linked to the set of criteria used in our comparative study. For instance, the size of the data set to be anonymized and the usage scenario are the two parameters that we consider relevant for defining guidance scenarios addressed to data publishers who do not have technical skills in anonymization. An example of such a scenario follows:
“If you do not plan a specific usage of your large data set, then you can apply Datafly, µ-argus or Median Mondrian”.
For a data publisher with a little expertise in anonymization, a guidance scenario could be: “If you do not plan a specific usage of your small data set and you wish to obtain an anonymized data set satisfying optimal k-anonymity, with all the values of each anonymized attribute at the same level of the generalization hierarchy (full-domain generalization), then you can apply Samarati or Incognito”. In this scenario, the criteria used to select the algorithms are, respectively: “Scenario of usage”, “Size of dataset”, “Quality” and “Generalization type”.
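As a purely illustrative sketch of how such guidance scenarios could be encoded in the planned knowledge base (the rule structure and criterion keys below are hypothetical), each scenario can be seen as a rule mapping context criteria to candidate algorithms:

    # Hypothetical encoding of the two guidance scenarios above: each
    # rule maps context criteria from our comparative study to a list
    # of candidate generalization algorithms.
    GUIDANCE_RULES = [
        ({"usage": "none", "size": "large"},
         ["Datafly", "µ-argus", "Median Mondrian"]),
        ({"usage": "none", "size": "small", "quality": "optimal",
          "generalization": "full-domain"},
         ["Samarati", "Incognito"]),
    ]

    def suggest_algorithms(context):
        # Return the algorithm list of every rule whose criteria all
        # hold in the data publisher's context.
        return [algos for criteria, algos in GUIDANCE_RULES
                if all(context.get(key) == value
                       for key, value in criteria.items())]

    print(suggest_algorithms({"usage": "none", "size": "large"}))
    # -> [['Datafly', 'µ-argus', 'Median Mondrian']]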