NOTES ON PRIVACY-PRESERVING DISTRIBUTED MINING AND
HAMILTONIAN CYCLES
Renren Dong and Ray Kresman
Department of Computer Science, Bowling Green State University, Bowling Green, OH, U.S.A.
Keywords:
Data mining, Privacy-preserving mining, Hamiltonian cycle, Edge disjoint Hamiltonian cycle.
Abstract:
Distributed storage and retrieval of data is both the norm and a necessity in today’s computing environment.
However, sharing and dissemination of this data is subject to privacy concerns. This paper addresses the role
of graph theory, especially Hamiltonian cycles, on privacy preserving algorithms for mining distributed data.
We propose a new heuristic algorithm for discovering disjoint Hamiltonian cycles in the underlying network.
Disjoint Hamiltonian cycles are useful in a number of applications; for example, to ensure that someone’s
private data remains private even when others collude to discover the data.
1 INTRODUCTION
In our everyday lives, we see the use and collection of
various kinds of data by a number of entities. Data
mining is the process of extracting hidden patterns
from the underlying data repository. It provides tech-
niques for new knowledge discovery, finding hidden
patterns in the data set such as classifiers and associ-
ation rules (Han and Kamber, 2006). With the explo-
sion of the internet, distributed storage and retrieval
of data is both the norm and a necessity. Distributed
data mining (DDM) helps mine this data by distribut-
ing the mining computation too.
The globally networked society places great demand
on the dissemination and sharing of private data.
For example, imagine a situation where competing
schools are trying to discover a set of core classes
associated with high grades. While it is useful to know,
on an aggregate level, that calculus, English 10, and
Spanish 101 are associated with high scholastic
achievement, it would be unfavorable for Underdog
Academy, in particular, to admit its rampant failure at
teaching any of these subjects with excellence. Hence,
Underdog Academy would like to know what classes to
put its small budget into without giving its competition
a leg up on its weaknesses (Shepard, 2007).
Privacy issues are also relevant to a host of other
domains, including banking (Bottcher and Obermeier,
2008), patient medical records (Karr et al., 2007),
electronic voting (Baiardi et al., 2005), and others
(Chaum, 1988; Verykios et al., 2004). Privacy preserving
distributed data mining (PPDM) algorithms
mine distributed data while addressing these privacy
concerns.
Secure multiparty computation (SMC) is the ba-
sic operation in PPDM. SMC was first introduced in
(Andrew, 1986). It is predicated on the notion that
a computation is secure if at the end of the compu-
tation, no party knows anything except its own input
and the results. Hamiltonian cycles, and their derivative,
edge-disjoint Hamiltonian cycles, play a crucial role
in SMC/PPDM.
This paper is about privacy preserving distributed
data mining algorithms and the role of these cycles,
specifically edge disjoint ones. The rest of this paper
is organized as follows. In Section 2 we formally de-
fine these cycles. We also provide the motivation for
our work and describe a popular approach that uses
these cycles to mine distributed data. In Section 3,
we take a closer look at the cycles and state two new
theorems. Then, we propose a heuristic method for
finding the edge disjoint cycles. Concluding remarks
are given in Section 4.
2 HAMILTONIAN CYCLES AND
MINING
2.1 Mathematical Preliminaries
Hamiltonian cycles (HC) play an important role in
graph theory. A Hamiltonian cycle, H, is a cycle
Dong R. and Kresman R. (2010).
NOTES ON PRIVACY-PRESERVING DISTRIBUTED MINING AND HAMILTONIAN CYCLES.
In Proceedings of the 5th International Conference on Software and Data Technologies, pages 103-107
DOI: 10.5220/0002922001030107
Copyright © SciTePress
that visits every node in the graph exactly once be-
fore returning to the source node. Alternatively, a
Hamiltonian cycle can be described as a sequence
of edges where any two adjacent edges share ex-
actly one unique node in the graph, where the first
and last edge in the cycle share exactly one unique
node in the graph, and where the number of edges
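in H equals the number of nodes in the graph. In a
graph $G = (V, E)$, if we define a Hamiltonian cycle as
$H = \{h_1, h_2, \ldots, h_N\}$, $h_i \in E$, $1 \le i \le N$, we will have
$N = |V|$ and $V = \left(\bigcup_{i=1}^{N-1} (h_i \cap h_{i+1})\right) \cup (h_N \cap h_1)$,
where each edge is treated as the set of its two end nodes.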
Now we define edge-disjoint Hamiltonian cycles
(EDHC). A set of Hamiltonian cycles is called
edge-disjoint if no edge of one Hamiltonian cycle
appears in any other Hamiltonian cycle of the set.
Formally, suppose we have a set
$\mathcal{H} = \{H^{(i)} \mid H^{(i)} \text{ is a Hamiltonian cycle in graph } G\}$,
where $|\mathcal{H}|$ is the size of $\mathcal{H}$. $\mathcal{H}$ is edge-disjoint if
$e \ne d$ for all $e \in H^{(i)}$ and all $d \in H^{(j)}$, $1 \le i \ne j \le |\mathcal{H}|$.
EDHC is useful in computer networks. It can be
used to improve the capacity of the network or to
provide fault-tolerance and computer security (Urabe
et al., 2007b). For example, if a route in a network is
faulty or compromised, traffic can be routed through
another route, an edge-disjoint cycle, that does not
share any edges with the previous route. So, an
interesting question is how to find the various EDHCs.
Other researchers have investigated this problem
for specialized network topologies. Alspach et al.
(1990) provide an algorithm to generate the EDHCs
for fully connected networks. Bae and Bose (2003)
give a method for generating EDHCs in torus networks
and k-ary n-cubes by using Gray codes. A polynomial
time distributed algorithm for k-ary n-cubes and
hypercubes is given in (Stewart, 2007). Our research,
in this paper, addresses the generation of EDHCs in
more general networks, and their application in
privacy preserving data mining.
2.2 An Example
A number of useful primitives for privacy preserving
distributed data mining are given in (Clifton et al.,
2002), including secure set union (Bottcher and Ober-
meier, 2008), secure size of intersection, and secure
sum, the last of which we use in this paper. Many
more algorithms exist; see (Verykios et al., 2004) for
a review. Secure sum (Schneier, 2007) can be defined
formally as follows.
Secure Sum (SS): The goal of SS is that the value
of each site’s individual input be masked while
the global sum of all inputs is universally known.
This means that at the end of the computation, no
party knows anything except its own input and the
global sum. Assume we have N sites. For each
site we have a value $V_i$, $1 \le i \le N$, and we want to
calculate the global sum $V = \sum_{i=1}^{N} V_i$.
Figure 1 shows an example of a HC (Dong and Kresman,
2009). Assume the presence of a HC as shown
in Figure 1. The algorithm initiates a SS computation
starting at the source station $p_1$, whose private value
is $x_1 = 8$. $p_1$ uses a random number $r_x$, here $r_x = 5$,
to mask its private value. Then $p_1$ sends
$x_1 + r_x = 13$ to the next station. $p_2$ receives
$x_1 + r_x = 13$ and adds its local value, $x_2$, to the sum
before sending it to the next station. At the end, the
start node, $p_1$, receives $r_x + \sum_{i=1}^{5} x_i$.
Since $p_1$ selected $r_x = 5$, the global sum can be easily
calculated by subtracting $r_x$. A similar algorithm can
also be used to compute secure set union and secure
set intersection (Clifton et al., 2002).
Figure 1: Secure Sum Computation - One Cycle.
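The single-cycle secure sum described above can be simulated in a few lines. The following is only a minimal sketch under stated assumptions: the function name secure_sum, the modulus, and the private values of the sites other than $p_1$ are hypothetical (the paper gives only $x_1 = 8$ and $r_x = 5$), and working modulo a fixed number is a common convention that the text itself does not prescribe.

```python
import random

def secure_sum(values, mask=None, modulus=1_000_000):
    """Simulate one pass of secure sum around a Hamiltonian cycle.

    values[0] is the private value of the initiating site p1; the
    remaining entries follow the order of the cycle. `modulus` is an
    assumption of this sketch, not something the paper specifies.
    """
    if mask is None:
        mask = random.randrange(modulus)      # p1's secret random number r_x
    running = (values[0] + mask) % modulus    # p1 sends x_1 + r_x
    messages = [running]
    for x in values[1:]:
        running = (running + x) % modulus     # each site adds its own value
        messages.append(running)
    total = (running - mask) % modulus        # p1 removes the mask
    return total, messages

# x_1 = 8 and r_x = 5 come from the example; the other values are made up.
total, messages = secure_sum([8, 3, 6, 2, 4], mask=5)
print(total, messages)
```

Each intermediate message is offset by the mask, so a site that sees only the incoming message cannot read off the partial sum of its predecessors.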
Clifton et al. (2002) lay the foundation for
a cycle-partitioned secure sum (CPSS) algorithm by
noting how any node in a secure sum can divide
its value into random shares such that each share is
passed along a different secure sum circuit or cycle.
Collusion protection is then achieved when no node
has the same neighboring node in more than one cycle.
(See (Urabe et al., 2007a) for additional discussions.)
An example of CPSS is given in Figure 2 (Dong and
Kresman, 2009). Here, $p_3$ is connected to 4 other
participants, which means that at least 4 other participants
have to collude, or join together, in order to discover
$p_3$'s contribution to the global sum.
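To make the share-splitting idea concrete, here is a hedged sketch of CPSS over two edge-disjoint cycles. The names cpss and split_into_shares, the modulus, the private values, and the two cycles (a pentagon and a pentagram on five sites) are all assumptions chosen for illustration; they are not taken from Figure 2.

```python
import random

def split_into_shares(value, k, modulus, rng):
    """Split value into k random shares that sum to value (mod modulus)."""
    shares = [rng.randrange(modulus) for _ in range(k - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def cpss(values, cycles, modulus=1_000_000, seed=None):
    """values: dict site -> private value; cycles: list of site orderings."""
    rng = random.Random(seed)
    k = len(cycles)
    shares = {s: split_into_shares(v, k, modulus, rng) for s, v in values.items()}
    total = 0
    for j, cycle in enumerate(cycles):
        mask = rng.randrange(modulus)          # initiator's mask for cycle j
        running = mask
        for site in cycle:                     # share j travels along cycle j
            running = (running + shares[site][j]) % modulus
        total = (total + running - mask) % modulus
    return total

# Hypothetical data: five sites and two edge-disjoint Hamiltonian cycles.
values = {1: 8, 2: 3, 3: 6, 4: 2, 5: 4}
cycles = [[1, 2, 3, 4, 5], [1, 3, 5, 2, 4]]
result = cpss(values, cycles)
print(result)
```

Because the two cycles share no edge, every site has four distinct neighbors, and each neighbor pair sees only one random-looking share of the site's value.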
Note that the HCs used in Figure 2 are edge-
disjoint. However, the number of EDHCs in a given
graph is usually very limited. The limited number
of EDHCs in the graph constrains the application of
CPSS algorithms. So an interesting problem is how
to enumerate the EDHCs in a graph: in the rest of
this paper, we refer to it as the EDHC problem or just
EDHC.
ICSOFT 2010 - 5th International Conference on Software and Data Technologies
Figure 2: Secure Sum Computation - Two Cycles.
3 OUR CONTRIBUTION
Recall that an edge-disjoint HC is a special type of
Hamiltonian cycle. It is well known that the HC problem
is NP-complete (Garey et al., 1979). We have proved the
following two theorems. However, because of the
applied nature of this conference, we state them
without the underlying mathematical proof (please
see (Dong and Kresman, 2010) for the proof).
Theorem 1: The EDHC problem is NP-complete.
Theorem 2: The order in which the various edge-disjoint
HCs are discovered may limit the number of
HCs that can be found.
In Section 2 we noted the importance of EDHC
for privacy preserving data mining algorithms. In this
section we focus on how we can find these EDHCs.
One obvious way to find EDHC is to use a greedy al-
gorithm that invokes the single HC algorithm repeat-
edly. However, due to Theorem 2 above, a simple
greedy algorithm is not guaranteed to find many edge-
disjoint HCs in a given graph.
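The greedy approach can be sketched as follows. This is an illustrative sketch only: find_hc is a plain backtracking search standing in for "the single HC algorithm", and the complete graph on five nodes is a hypothetical test input. Which cycle find_hc returns first determines what is left for later rounds, which is the effect Theorem 2 describes.

```python
from itertools import combinations

def find_hc(nodes, edges):
    """Backtracking search for one Hamiltonian cycle; node list or None."""
    nodes = sorted(nodes)
    if len(nodes) < 3:
        return None
    adj = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    start = nodes[0]
    path, seen = [start], {start}

    def extend():
        if len(path) == len(nodes):
            return start in adj[path[-1]]      # can we close the cycle?
        for nxt in sorted(adj[path[-1]] - seen):
            path.append(nxt)
            seen.add(nxt)
            if extend():
                return True
            path.pop()
            seen.remove(nxt)
        return False

    return list(path) if extend() else None

def cycle_edges(cycle):
    n = len(cycle)
    return {frozenset((cycle[i], cycle[(i + 1) % n])) for i in range(n)}

def greedy_edhc(nodes, edges):
    """Repeatedly find a HC and delete its edges until none is left."""
    remaining = set(edges)
    found = []
    while True:
        hc = find_hc(nodes, remaining)
        if hc is None:
            return found
        found.append(hc)
        remaining -= cycle_edges(hc)

nodes = range(5)
k5 = {frozenset(p) for p in combinations(nodes, 2)}
cycles = greedy_edhc(nodes, k5)
print(cycles)
```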
Theorem 1 notes the difficulty of solving the
EDHC problem. In fact, even the decision problem
of determining whether a given graph has a given
number of edge-disjoint Hamiltonian cycles can be
shown to be NP-complete; again, we omit the proof
here, but it can be found in (Dong and Kresman,
2010). As with many other NP-complete and NP-hard
problems, no efficient (polynomial time) algorithm is
known for the EDHC problem. Heuristic algorithms,
which use heuristic rules to guide their search of the
solution space, are powerful tools for tackling NP-hard
problems. However, they may converge to an
approximate or suboptimal solution.
In this section, we will first present a connection
between the EDHC problem and another graph theoretic
problem, maximum clique (MC): for MC, given
an undirected graph G and a positive integer k, we ask
if graph G has a fully-connected subgraph SG(V, E)
with the constraint $|SG(V)| \ge k$. In the later part of
this section, we will provide a heuristic algorithm for
the EDHC problem.
MC is one of the fundamental NP-complete prob-
lems (Garey et al., 1979) and is well researched.
The purpose of finding a connection between EDHC
and MC is to apply some of the results from MC
to the EDHC. We present a procedure to transform
the EDHC problem to MC. Note, we assume that we
have an oracle machine that lists all the HCs in a
given graph. The oracle machine we assume here is
only used to explain the transform and it will be re-
placed by another algorithm later (see Section 3.1).
The transform works as follows: take a graph G(v, e)
as an input to the EDHC problem; then use the oracle
machine to obtain a set HCPool which contains all
the HCs in the given graph; finally, construct another
graph $G'(v, e)$ in which each vertex corresponds to
a HC in the set HCPool, and there is an edge between
two vertices of $G'$ if and only if the two corresponding
elements of HCPool are edge-disjoint. Then, solving
the MC problem in $G'$ gives us
the results for the original EDHC problem.
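The transform can be tried out on a very small graph, where a brute-force enumeration can play the role of the oracle. The sketch below is an illustration under stated assumptions: all_hcs, build_g_prime and max_clique are hypothetical helper names, both the enumeration and the clique search are exhaustive and only feasible for tiny inputs, and the complete graph on five nodes is a made-up example.

```python
from itertools import combinations, permutations

def all_hcs(n, edges):
    """Brute-force stand-in for the oracle: every HC on nodes 0..n-1."""
    pool = set()
    for perm in permutations(range(1, n)):
        cycle = (0,) + perm
        es = frozenset(frozenset((cycle[i], cycle[(i + 1) % n]))
                       for i in range(n))
        if es <= edges:
            pool.add(es)       # the set drops the reversed copy of a cycle
    return list(pool)

def build_g_prime(pool):
    """Vertex i stands for pool[i]; i-j is an edge iff the HCs are disjoint."""
    adj = {i: set() for i in range(len(pool))}
    for i, j in combinations(range(len(pool)), 2):
        if pool[i].isdisjoint(pool[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def max_clique(adj):
    """Exhaustive maximum clique, largest size first."""
    for size in range(len(adj), 0, -1):
        for group in combinations(adj, size):
            if all(b in adj[a] for a, b in combinations(group, 2)):
                return list(group)
    return []

k5 = {frozenset(p) for p in combinations(range(5), 2)}
hc_pool = all_hcs(5, k5)
g_prime = build_g_prime(hc_pool)
clique = max_clique(g_prime)
edhc_set = [hc_pool[i] for i in clique]   # a largest set of EDHCs
print(len(hc_pool), len(edhc_set))
```

A clique in the derived graph is exactly a set of pairwise edge-disjoint HCs, so the size of the maximum clique equals the largest number of EDHCs in the input graph.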
Now, the question is what is a good MC algorithm
that can be used as a basis for our EDHC problem.
Pullan and Hoos (2006) present a stochastic local
search MC algorithm called Dynamic Local Search
(DLS-MC). According to them, DLS-MC outperforms
other state-of-the-art MC search
algorithms in many instances. See (Pullan and Hoos,
2006) for a discussion of other algorithms. In our
work, we will take DLS-MC as a basis to construct
a dynamic local search algorithm to solve EDHC.
We explain DLS-MC in the next paragraph. Then,
we note the process for constructing the oracle that
can enumerate the HCs. Finally, we formally describe
our dynamic local search algorithm in Section 3.1.
The DLS-MC algorithm works as follows: start with
a random initial vertex from the given graph as the
current clique C. Then the search alternates between
an iterative improvement phase and a plateau
search phase. In the iterative improvement (or expand)
phase, suitable vertices (those connected to every
element of the current clique) are repeatedly added to
the current clique. The expand phase terminates when
no more vertices can be added to C. In the plateau search
phase, one vertex of the current clique C is swapped with
a vertex currently not in C. The plateau search terminates
when all the possible swap operations have been
tried. Following the plateau search, a numerical value
is added to each vertex as a vertex penalty. The
algorithm repeats the above steps, expand phase and
plateau phase, until a large enough clique is found or
the step limit is reached. Algorithm details and
empirical results can be found in (Pullan and Hoos,
2006).
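The expand/plateau/penalty loop can be illustrated with a heavily simplified sketch. It is not the DLS-MC algorithm of Pullan and Hoos: it performs a single swap per plateau phase, uses a naive penalty rule, and restarts from a random vertex when no swap exists. The function name dls_mc_sketch and the small test graph are assumptions made for illustration.

```python
import random

def dls_mc_sketch(adj, target_size, max_steps, seed=0):
    """adj: dict vertex -> set of neighbours. Returns the best clique found."""
    rng = random.Random(seed)
    vertices = list(adj)
    penalty = {v: 0 for v in vertices}
    best = []
    clique = {rng.choice(vertices)}
    for _ in range(max_steps):
        # Expand phase: add vertices adjacent to every member of the clique,
        # preferring vertices with the lowest penalty.
        while True:
            candidates = [v for v in vertices
                          if v not in clique and clique <= adj[v]]
            if not candidates:
                break
            low = min(penalty[v] for v in candidates)
            clique.add(rng.choice([v for v in candidates if penalty[v] == low]))
        if len(clique) > len(best):
            best = sorted(clique)
        if len(best) >= target_size:
            break
        # Plateau phase (simplified): swap in one vertex that is adjacent
        # to all clique members except exactly one, which is swapped out.
        swaps = [(v, u) for v in vertices if v not in clique
                 for u in clique
                 if clique - {u} <= adj[v] and u not in adj[v]]
        for v in clique:
            penalty[v] += 1                   # penalise the current clique
        if swaps:
            v, u = min(swaps, key=lambda s: penalty[s[0]])
            clique.remove(u)
            clique.add(v)
        else:
            clique = {rng.choice(vertices)}   # restart from a fresh vertex
    return best

# Hypothetical graph: a 4-clique {0, 1, 2, 3} plus a pendant vertex 4.
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0}}
result = dls_mc_sketch(adj, target_size=4, max_steps=50)
print(result)
```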
As noted earlier, we can transform the EDHC
problem to MC, and there are many algorithms for
solving MC. The only barrier left is the construction
of the oracle machine to enumerate the HCs. Our
solution is to replace the oracle machine with a
heuristic (or backtracking) algorithm for the HC
problem. We will not try to generate all the HCs, an
impractical task due to Theorem 1 above. Instead, we
will dynamically construct the graph for the DLS-MC
program. In our algorithm, we treat the HC-finding
algorithm as a replaceable component, since it is
very hard to determine what the best heuristic HC
algorithm is. In practice, one can choose a heuristic HC
algorithm based on one's needs, because algorithms
that are more accurate may take up more computing
time. In our work, all HC algorithms take two
parameters as input, a graph and a random number.
We know that most HC algorithms start by
picking an edge (or vertex) in the graph. The purpose
of the random number is to give the algorithm
an opportunity to return a different HC each time. Each
algorithm returns a HC if it finds a new one, and
returns 'NULL' otherwise. For notational convenience,
we use the term singleHCAlg(G, r) to mean any HC
algorithm with parameters G and r.
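One possible shape for singleHCAlg(G, r) is a randomized backtracking search in which r seeds the choice of start vertex and the order in which neighbors are tried. This is a sketch under assumptions, not the component the authors used: the name single_hc_alg, the adjacency-dictionary graph format, and the sample graphs are hypothetical, and the sketch keeps no memory of cycles it has returned before, so "new" must be checked by the caller.

```python
import random

def single_hc_alg(graph, r):
    """graph: dict node -> set of neighbours; r: random seed.

    Returns a Hamiltonian cycle as a list of nodes, or None (the paper's
    'NULL') when the graph has no Hamiltonian cycle.
    """
    rng = random.Random(r)
    nodes = list(graph)
    if len(nodes) < 3:
        return None
    start = rng.choice(nodes)
    path, seen = [start], {start}

    def extend():
        if len(path) == len(nodes):
            return start in graph[path[-1]]    # close the cycle
        options = sorted(graph[path[-1]] - seen)
        rng.shuffle(options)                   # r controls the search order
        for nxt in options:
            path.append(nxt)
            seen.add(nxt)
            if extend():
                return True
            path.pop()
            seen.remove(nxt)
        return False

    return path if extend() else None

# Hypothetical inputs: a square with one chord, and a three-node path.
square = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
line = {0: {1}, 1: {0, 2}, 2: {1}}
hc = single_hc_alg(square, r=42)
print(hc, single_hc_alg(line, r=42))
```

Because the backtracking is exhaustive, this version is exact but exponential in the worst case; a faster heuristic could be substituted without changing the interface.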
3.1 A Heuristic Algorithm for EDHC
Based on the above considerations, we introduce a
new heuristic algorithm for finding EDHCs, Heuristic
Dynamic Local Search-EDHC (HDLS-EDHC). An outline
of the algorithm is given in Algorithm 1. For the given
graph G(v, e), we start by finding a set of EDHCs as the
current solution curEDHC using expandSearch. Then, we
execute expandSearch and plateauSearch alternately.
The process repeats until enough EDHCs are found
or the step limit is reached. For brevity, the code for
expandSearch and plateauSearch is omitted here,
but we explain their functionality in the next two
paragraphs.
In expandSearch, our goal is to expand the
curEDHC set as much as possible. To achieve this
goal, we construct a graph $G'$ by removing all the HCs
in curEDHC from G, and feed singleHCAlg with $G'$.
Any new HC found by singleHCAlg is edge-disjoint
from the HCs already in curEDHC and can therefore
be used to expand curEDHC. This process is repeated
until singleHCAlg returns NULL, which
means no more HCs can be found in the remaining
graph $G'$. expandSearch returns the updated version
of curEDHC.
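Since the paper omits the code for expandSearch, the following is only a guess at its shape based on the description above. The names expand_search, single_hc_alg and cycle_edges, the edge-set graph representation, and the complete graph used as input are all assumptions of this sketch; the HC finder is a randomized backtracking search standing in for any singleHCAlg.

```python
import random
from itertools import combinations

def cycle_edges(cycle):
    n = len(cycle)
    return frozenset(frozenset((cycle[i], cycle[(i + 1) % n])) for i in range(n))

def single_hc_alg(nodes, edges, r):
    """Randomized backtracking HC search; returns a node list or None."""
    rng = random.Random(r)
    adj = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    start = rng.choice(list(nodes))
    path, seen = [start], {start}

    def extend():
        if len(path) == len(adj):
            return start in adj[path[-1]]
        options = sorted(adj[path[-1]] - seen)
        rng.shuffle(options)
        for nxt in options:
            path.append(nxt)
            seen.add(nxt)
            if extend():
                return True
            path.pop()
            seen.remove(nxt)
        return False

    return path if len(adj) >= 3 and extend() else None

def expand_search(nodes, edges, cur_edhc, rng):
    remaining = set(edges)
    for hc in cur_edhc:
        remaining -= cycle_edges(hc)          # G' = G minus chosen cycles
    cur = list(cur_edhc)
    while True:
        hc = single_hc_alg(nodes, remaining, rng.randrange(2**32))
        if hc is None:                        # no HC left in G'
            return cur
        cur.append(hc)
        remaining -= cycle_edges(hc)          # keep later cycles disjoint

nodes = list(range(5))
k5 = {frozenset(p) for p in combinations(nodes, 2)}
cur_edhc = expand_search(nodes, k5, [], random.Random(1))
print(cur_edhc)
```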
Algorithm 1. Heuristic Dynamic Local Search-EDHC.
Procedure HDLS-EDHC(G,targetSize,maxStep)
Input:
  G: Graph G(v,e)
  targetSize: Number of HC we want to find
  maxStep: Steps limitation
Output:
  a set of EDHC with size at least targetSize
  or the best EDHC set found.
Begin
  numStep:=0;
  bestEDHC:={};
  curEDHC:={};
  (HCPool,HCPenalties):={};
  Do
    curEDHC:=expandSearch(G,curEDHC);
    If |curEDHC| is at least targetSize
      return curEDHC as result and exit;
    If curEDHC is better than bestEDHC
      replace the bestEDHC with curEDHC;
    curEDHC:=plateauSearch(G,curEDHC);
    update Penalties for HCs in curEDHC;
    numStep:=numStep+1;
  While numStep < maxStep;
  Return bestEDHC;
End
In plateauSearch, we try to find a different EDHC
set based on curEDHC. The EDHC set we find in
plateauSearch should also have |curEDHC| HCs. The
main idea behind plateauSearch is to replace some
of the HCs in curEDHC with new HCs found
by singleHCAlg. We first construct a new graph,
$G''$, by combining a HC in curEDHC with
the remaining graph $G'$ (obtained by removing all the
HCs in curEDHC from G). So we have |curEDHC| possible
graphs $G''$. We start with the HC with the least penalty
value. The penalty value here indicates how frequently
a HC has been used in the previous search process. By
selecting the HC with the least penalty value, we try
to avoid searching the same group of HCs. We feed
singleHCAlg with each $G''$; if any call to singleHCAlg
returns a new HC, we replace the HC used to construct
that $G''$ with the new HC. If all calls to singleHCAlg
return nothing, we set curEDHC to the empty set and
exit, so that the search can start over again.
At the end of plateauSearch, the cycle penalties
are updated by incrementing the penalty values of all
cycles in curEDHC by 1. So in HDLS-EDHC, if a
HC has less penalty, this means the HC is less fre-
quently used in the searching process and vice versa.
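As with expandSearch, the paper does not list plateauSearch, so the sketch below is a hypothetical reading of the two paragraphs above. The names plateau_search, update_penalties, single_hc_alg and cycle_edges are assumptions, as are the fixed number of retries per $G''$ and the use of a cycle's edge set as its penalty key.

```python
import random
from itertools import combinations

def cycle_edges(cycle):
    n = len(cycle)
    return frozenset(frozenset((cycle[i], cycle[(i + 1) % n])) for i in range(n))

def single_hc_alg(nodes, edges, r):
    """Randomized backtracking HC search; returns a node list or None."""
    rng = random.Random(r)
    adj = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    start = rng.choice(list(nodes))
    path, seen = [start], {start}

    def extend():
        if len(path) == len(adj):
            return start in adj[path[-1]]
        options = sorted(adj[path[-1]] - seen)
        rng.shuffle(options)
        for nxt in options:
            path.append(nxt)
            seen.add(nxt)
            if extend():
                return True
            path.pop()
            seen.remove(nxt)
        return False

    return path if len(adj) >= 3 and extend() else None

def plateau_search(nodes, edges, cur_edhc, penalty, rng, tries=10):
    used = set().union(*[cycle_edges(hc) for hc in cur_edhc])
    remaining = set(edges) - used                     # the graph G'
    order = sorted(range(len(cur_edhc)),              # least penalty first
                   key=lambda i: penalty.get(cycle_edges(cur_edhc[i]), 0))
    for i in order:
        old = cycle_edges(cur_edhc[i])
        g2 = remaining | old                          # the graph G''
        for _ in range(tries):
            hc = single_hc_alg(nodes, g2, rng.randrange(2**32))
            if hc is not None and cycle_edges(hc) != old:
                return cur_edhc[:i] + [hc] + cur_edhc[i + 1:]
    return []                                         # nothing new: start over

def update_penalties(cur_edhc, penalty):
    for hc in cur_edhc:                               # +1 for every cycle kept
        key = cycle_edges(hc)
        penalty[key] = penalty.get(key, 0) + 1

nodes = list(range(5))
k5 = {frozenset(p) for p in combinations(nodes, 2)}
penalty = {}
full = [[0, 1, 2, 3, 4], [0, 2, 4, 1, 3]]             # uses every edge of K5
swapped = plateau_search(nodes, k5, full, penalty, random.Random(0))
update_penalties(full, penalty)
print(swapped, sorted(penalty.values()))
```

In this example the two cycles already use every edge, so each $G''$ contains only the cycle being replaced and no new HC can be found; the function therefore returns an empty set, which corresponds to the restart case described above.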
In sum, following the computation of the algorithm
(Algorithm 1), we return a set of EDHCs with
a size of at least targetSize, or else the best solution
found within the search process, given by bestEDHC,
a set of EDHCs in G.
4 CONCLUDING REMARKS
This paper focused on distributed mining and the role
of Hamiltonian cycles in keeping information pri-
vate. We stated a couple of theorems on edge disjoint
Hamiltonian cycles, but without proof. Then, we pro-
posed a heuristic algorithm to enumerate these cycles.
These cycles have applications in data mining, net-
work routing and fault tolerant computing. In an ex-
tended version of this paper, we will formally prove
these theorems and compare the performance of the
heuristic algorithm with that of the greedy algorithm.
REFERENCES
Alspach, B., Bermond, J., and Sotteau, D. (1990). Decom-
position into cycles 1: Hamilton decompositions. Cy-
cles and Rays, page 9.
Andrew, C. (1986). How to generate and exchange secrets.
In Proc. of the 27th IEEE Symposium on Foundations
of Computer Science, pages 162–167.
Bae, M. and Bose, B. (2003). Edge disjoint Hamiltonian
cycles in k-ary n-cubes and hypercubes. IEEE Trans-
actions on Computers, 52(10):1271–1284.
Baiardi, F., Falleni, A., Granchi, R., Martinelli, F., Petroc-
chi, M., and Vaccarelli, A. (2005). SEAS, a secure
e-voting protocol: design and implementation. Com-
puters & Security, 24(8):642–652.
Bottcher, S. and Obermeier, S. (2008). Secure set union and
bag union computation for guaranteeing anonymity of
distrustful participants. Journal of Software, 3(1):9.
Chaum, D. (1988). The dining cryptographers prob-
lem: Unconditional sender and recipient untraceabil-
ity. Journal of Cryptology, 1(1):65–75.
Clifton, C., Vaidya, J., Kantarcioglu, M., Lin, X., and Zhu,
M. Y. (2002). Tools for privacy preserving distributed
data mining. SIGKDD Explor. Newsl., 4(2):28–34.
Dong, R. and Kresman, R. (2009). Indirect disclosures in
data mining. In Frontier of Computer Science and
Technology, Japan-China Joint Workshop, pages 346–
350, Shanghai, China. IEEE Computer Society.
Dong, R. and Kresman, R. (2010). Proof of certain proper-
ties of edge disjoint Hamiltonian cycles. Under prepa-
ration.
Garey, M., Johnson, D., et al. (1979). Computers
and Intractability: A Guide to the Theory of NP-
completeness. W. H. Freeman, San Francisco.
Han, J. and Kamber, M. (2006). Data mining: concepts and
techniques. Morgan Kaufmann.
Pullan, W. and Hoos, H. (2006). Dynamic local search for
the maximum clique problem. Journal of Artificial
Intelligence Research, 25:159–185.
Schneier, B. (2007). Applied cryptography: protocols, al-
gorithms, and source code in C. A1bazaar.
Shepard, S. (2007). Anonymous Opt-Out and Secure Com-
putation in Data Mining. Master’s thesis, Bowling
Green State University, Bowling Green, OH, USA.
Stewart, I. (2007). Distributed algorithms for building
Hamiltonian cycles in k-ary n-cubes and hypercubes
with faulty links. Journal of Interconnection Net-
works, 8(3):253.
Urabe, S., Wang, J., Kodama, E., and Takata, T. (2007a).
A high collusion-resistant approach to distributed
privacy-preserving data mining. IPSJ Digital Courier,
3(0):442–455.
Urabe, S., Wong, J., Kodama, E., and Takata, T. (2007b).
A high collusion-resistant approach to distributed
privacy-preserving data mining. In Proceedings of the
25th IASTED International Multi-Conference: Parallel
and Distributed Computing and Networks, page 331.
ACTA Press.
Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P.,
Saygin, Y., and Theodoridis, Y. (2004). State-of-the-
art in privacy preserving data mining. SIGMOD Rec.,
33(1):50–57.