Example 2: Consider the taxonomy tree in Figure 1.
Let S_1 = {<Orange, Beef>, <Apple, Chicken, Beef>},
LCG(S_1) = <Fruit, Beef>. LCG(S_1) cannot be <Fruit,
Meat> since <Fruit, Beef> is a more specialized
common transaction. For S_2 = {<Orange, Milk>,
<Apple, Cheese, Butter>}, LCG(S_2) = <Fruit, Dairy>.
Dairy represents Milk in the first transaction and
represents one of Cheese and Butter in the second
transaction. Thus one of Cheese or Butter is
considered as a suppressed item. For S_3 = {<Orange,
Apple>, <Orange, Banana, Milk>, <Banana, Apple,
Beef>}, LCG(S_3) = <Fruit, Fruit>, which represents
that all three transactions contain at least two items
under Fruit. Milk and Beef are suppressed items. For
S_4 = {<Orange, Beef>, <Apple, Milk>}, LCG(S_4) =
<Fruit, Food>, where Food represents Beef in the
first transaction and Milk in the second transaction.
Here LCG contains both a parent and a child item.
Various metrics have been proposed in the
literature to measure the quality of generalized data
including Classification Metric (CM), Generalized
Loss Metric (LM) (Iyengar, 2002), and Discernibility
Metric (DM) (Bayardo et al., 2005). We use LM to
measure the distortion of item generalization. The similar
notion of NCP has also been employed for set-valued
data (Terrovitis et al., 2008; He et al., 2009).
Let M be the total number of leaf nodes in the
taxonomy tree T, and let M_p be the number of leaf
nodes in the subtree rooted at a node p. The Loss
Metric for an item p, denoted by LM(p), is defined
as (M_p - 1) / (M - 1). For the root item p, LM(p) = 1;
for a leaf item p, LM(p) = 0. In
words, LM captures the degree of generalization of
an item as the fraction of the leaf items in the
domain that are indistinguishable from it after the
generalization. For example, for the taxonomy
in Figure 1, LM(Fruit) = 2/7.
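The LM definition can be checked with a short Python sketch. Figure 1 is not reproduced here, so the taxonomy below is an assumption reconstructed from the items named in the examples (Food as root, with Fruit, Meat, and Dairy beneath it); the names PARENT, leaves_under, and loss_metric are illustrative only.

```python
from fractions import Fraction

# ASSUMED taxonomy (child -> parent), inferred from the items that appear
# in the examples; the actual Figure 1 may differ.
PARENT = {
    "Fruit": "Food", "Meat": "Food", "Dairy": "Food",
    "Orange": "Fruit", "Apple": "Fruit", "Banana": "Fruit",
    "Beef": "Meat", "Chicken": "Meat",
    "Milk": "Dairy", "Cheese": "Dairy", "Butter": "Dairy",
}


def leaves_under(node):
    """Return the set of leaf items in the subtree rooted at node."""
    children = [child for child, parent in PARENT.items() if parent == node]
    if not children:
        return {node}  # a node without children is itself a leaf
    result = set()
    for child in children:
        result |= leaves_under(child)
    return result


def loss_metric(node, root="Food"):
    """LM(p) = (M_p - 1) / (M - 1), returned as an exact fraction."""
    m_total = len(leaves_under(root))  # M: all leaves in the taxonomy
    m_p = len(leaves_under(node))      # M_p: leaves below node p
    return Fraction(m_p - 1, m_total - 1)


print(loss_metric("Fruit"))  # 2/7
print(loss_metric("Beef"))   # 0
print(loss_metric("Food"))   # 1
```

Under this assumed taxonomy there are 8 leaves and Fruit covers 3 of them, which reproduces LM(Fruit) = 2/7 as stated in the text.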
Suppose that we generalize every transaction in a
subset of transactions S to a common generalized
transaction t, and we want to measure the distortion
of this generalization. Recall that every item in t
represents one distinct item in each transaction in S
(Definition 1). Therefore, each item in t generalizes
exactly |S| items, one from each transaction in S,
where |S| is the number of transactions in S. The
remaining items in a transaction (that are not
generalized by any item in t) are suppressed items.
Therefore, the distortion of this generalization is the
sum of the distortion for generalized items,
|S| · Σ_{i∈t} LM(i), and the distortion for suppressed items. For
each suppressed item, we charge the same distortion
as if it is generalized to the root item, i.e., 1.
Definition 3 (GGD). Suppose that we generalize
every transaction in a set of transactions S to a
common generalized transaction t. The Group
Generalization Distortion of the generalization is
defined as GGD(S, t) = |S| · Σ_{i∈t} LM(i) + N_s, where N_s
is the number of occurrences of suppressed items.
To minimize the distortion, we shall generalize S
to the least common generalization LCG(S), which
has the distortion GGD(S, LCG(S)).
Example 3: Consider the taxonomy in Figure 1 and
S_1 = {<Orange, Beef>, <Apple, Chicken, Beef>}. We
have LCG(S_1) = <Fruit, Beef>. LM(Fruit) = 2/7,
LM(Beef) = 0, and |S_1| = 2. Since Chicken is the only
suppressed item, N_s = 1. Thus GGD(S_1, LCG(S_1)) =
2(2/7 + 0) + 1 = 11/7.
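The arithmetic of this example can be reproduced with a small Python sketch of the GGD formula. Two assumptions are made here: the LM values are hard-coded for a taxonomy with 8 leaves (Fruit and Dairy with 3 leaves each, Meat with 2), which is inferred from the examples rather than taken from Figure 1, and N_s is computed as the total number of items minus |S|·|t|, which follows from each item in t representing exactly one distinct item per transaction. The function does not check that t is a valid common generalization of S.

```python
from fractions import Fraction

# ASSUMED LM values for internal nodes of the Figure 1 taxonomy;
# leaf items are not listed and default to LM = 0.
LM = {
    "Fruit": Fraction(2, 7),
    "Meat": Fraction(1, 7),
    "Dairy": Fraction(2, 7),
    "Food": Fraction(1),
}


def lm(item):
    """Loss metric of an item; leaves have no generalization loss."""
    return LM.get(item, Fraction(0))


def ggd(transactions, t):
    """GGD(S, t) = |S| * sum of LM(i) over i in t, plus N_s."""
    size = len(transactions)
    generalized_cost = size * sum(lm(i) for i in t)
    # Each item of t accounts for one item per transaction; the rest
    # of the items are suppressed and cost 1 each.
    n_suppressed = sum(len(tr) for tr in transactions) - size * len(t)
    return generalized_cost + n_suppressed


s1 = [("Orange", "Beef"), ("Apple", "Chicken", "Beef")]
print(ggd(s1, ("Fruit", "Beef")))  # 11/7
```

Using exact fractions avoids floating-point rounding, so the result can be compared directly with the value derived by hand.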
2.3 Problem Definition
We adopt the transactional k-anonymity of (He et al.,
2009) as our privacy notion. A transaction database
D is k-anonymous if, for every transaction in D, there
are at least k-1 other identical transactions in D.
Therefore, for a k-anonymous D, if one transaction
is linked to an individual, so are at least k-1 other
transactions, and the adversary can link a specific
transaction to the individual with probability at
most 1/k. For example, the last column in Table 1
is a 2-anonymous transaction database.
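The k-anonymity condition can be tested mechanically. The following Python sketch is one possible check, under the assumption that a transaction is a multiset of items, so item order is ignored when comparing transactions; the sample database is made up for illustration and is not the content of Table 1.

```python
from collections import Counter


def is_k_anonymous(database, k):
    """True if every transaction has at least k-1 identical copies."""
    # Sort the items so that transactions are compared as multisets.
    counts = Counter(tuple(sorted(t)) for t in database)
    return all(count >= k for count in counts.values())


# Hypothetical generalized database (not taken from Table 1).
db = [
    ("Fruit", "Beef"),
    ("Beef", "Fruit"),
    ("Fruit", "Dairy"),
    ("Fruit", "Dairy"),
]
print(is_k_anonymous(db, 2))  # True
print(is_k_anonymous(db, 3))  # False
```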
Definition 5 (Transaction Anonymization). Given a
transaction database D, a taxonomy of items, and a
privacy parameter k, we want to find the clustering
C = {S_1, …, S_n} of D such that S_1, …, S_n are pairwise
disjoint subsets of D with each S_i containing at least
k transactions from D, and Σ_{i=1..|C|} GGD(S_i, LCG(S_i))
is minimized.
Let C = {S_1, …, S_n} be a solution to the above
anonymization problem. A k-anonymized database
of D can be obtained by generalizing every
transaction in S_i to LCG(S_i), i = 1, …, n.
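As a minimal sketch of this last step, the Python fragment below replaces every transaction in a cluster by the cluster's generalization. It does not compute LCG; the lcg argument is a placeholder, filled here with a lookup table containing the two results stated in Example 2. The names anonymize and KNOWN_LCG are hypothetical.

```python
def anonymize(clusters, lcg, k):
    """Replace every transaction in each cluster S_i by lcg(S_i)."""
    anonymized = []
    for cluster in clusters:
        if len(cluster) < k:
            raise ValueError("each cluster needs at least k transactions")
        generalized = lcg(cluster)
        # All |S_i| transactions become the same generalized transaction.
        anonymized.extend([generalized] * len(cluster))
    return anonymized


# Clusters from Example 2, with their LCGs as stated there.
S1 = (("Orange", "Beef"), ("Apple", "Chicken", "Beef"))
S2 = (("Orange", "Milk"), ("Apple", "Cheese", "Butter"))
KNOWN_LCG = {S1: ("Fruit", "Beef"), S2: ("Fruit", "Dairy")}

result = anonymize([S1, S2], lambda s: KNOWN_LCG[s], k=2)
print(result)
```

Because each cluster holds at least k transactions and all of them map to the same generalized transaction, every output transaction has at least k-1 identical copies.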
3 CLUSTERING APPROACH
In this section we present our algorithm Clump for
solving the problem defined in Definition 5. In
general, the problem of finding an optimal
k-anonymization is NP-hard for k ≥ 3 (Meyerson et al.,
2004). Thus, we focus on an efficient heuristic
solution to this problem and evaluate its
effectiveness empirically. Here, we assume
that the functions LCG(S) and GGD(S, LCG(S)) are
given. We will discuss the details of computing these
functions in Section 4.
The central idea of our algorithm is to group
transactions in order to reduce GGD(S_i, LCG(S_i)),
subject to the constraint that S_i contains at least k
SECRYPT 2010 - International Conference on Security and Cryptography