Mining Encrypted Software Logs using Alpha Algorithm
Gamze Tillem, Zekeriya Erkin and Reginald L. Lagendijk
Cyber Security Group, Department of Intelligent Systems, Delft University of Technology, The Netherlands
Keywords:
Software Privacy, Homomorphic Encryption, Applied Cryptography, Software Process Mining.
Abstract:
The growing complexity of software with respect to technological advances encourages model-based analysis
of software systems for validation and verification. Process mining is one recently investigated technique
for such analysis which enables the discovery of process models from event logs collected during software
execution. However, the usage of logs in process mining can be harmful to the privacy of data owners. While
for a software user the existence of sensitive information in logs can be a concern, for a software company,
the intellectual property of their product and confidential company information within logs can pose a threat
to the company’s privacy. In this paper, we propose a privacy-preserving protocol for the discovery of process
models for software analysis that assures the privacy of users and companies. For this purpose, our proposal
uses encrypted logs and processes them using cryptographic protocols in a two-party setting. Furthermore, our
proposal applies data packing on the cryptographic protocols to optimize computations by reducing the number
of repetitive operations. The experiments show that, with data packing, the performance of our protocol is
promising for privacy-preserving software analysis. To the best of our knowledge, our protocol is the first of
its kind for software analysis that relies on processing encrypted logs using process mining techniques.
1 INTRODUCTION
Software systems have an evolving nature which en-
ables them to respond to the needs of technological
advances continuously (van der Aalst, 2015). While
this evolution is advantageous to improve service
quality for users, the drawback is growing complex-
ity which complicates the management of software
systems (Rubin et al., 2007). The complication oc-
curs especially in the verification and validation of the
system properties. Considering that current systems
can reach up to billions of lines of code (Levenberg,
2016), the classical analysis of software becomes impractical
(van der Aalst, 2015). The difficulties of the classical approach
can be overcome with model-based analysis techniques. In these
techniques, a formal model of a system is generated and the
conformance of its properties is checked by automated tools
to address defects in the design (Gluch et al., 2002).
A common approach in model-based analysis is
modeling the system behavior through event logs that
contain information about software execution (Pec-
chia and Cinque, 2013). A promising technique for
such an analysis is process mining that aims to dis-
cover, monitor and enhance processes using the in-
formation in event logs (van der Aalst, 2016). The
discovery, i.e. process discovery, aims to generate a
process model from the logs to observe system behavior.
Monitoring, or conformance checking, compares an existing model
with real logs of the same process to check whether the real
behavior conforms to the expected behavior. Finally, enhancement,
i.e. process enhancement, improves an existing model by replaying
the real event logs on it.
In every category of process mining, the content of event logs
is crucial to the system analysis. The logs may contain
information about users (e.g. user id or e-mail), duration of
execution, system properties (e.g. memory usage, OS type) or
component interactions. Although this information is useful in
modelling the behavior, it might leak sensitive information about
the data owners, namely the users and the software company. For a
user, sharing sensitive data with third parties may pose a
privacy threat. A recent discussion about GHTorrent (Gousios,
2013), a platform that monitors GitHub events and publishes them
as a dataset, exemplifies such a threat in shared logs. The
dataset used to include user e-mails, since they are already
public on GitHub (Gousios, 2016). However, this caused
displeasure when third-party companies used the dataset to send
survey e-mails to data owners (Gousios, 2016). The discussion
ended with the removal of personal data from the dataset
(Gousios, 2016). Sharing logs is also questionable for software
companies with regard to the intellectual property and
confidential information in logs. (Leemans and van der Aalst,
2015) show that it is possible to reverse engineer software logs
with process mining. Considering the risk of piracy through
reverse engineering (Naumovich and Memon, 2003), companies are
not willing to share information with external parties.
Tillem, G., Erkin, Z. and Lagendijk, R.
Mining Encrypted Software Logs using Alpha Algorithm.
DOI: 10.5220/0006408602670274
In Proceedings of the 14th International Joint Conference on e-Business and Telecommunications (ICETE 2017) - Volume 4: SECRYPT, pages 267-274
ISBN: 978-989-758-259-2
Copyright © 2017 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
The existing literature on software analysis for
security and privacy approaches the problem from
several aspects. The studies on the protection of intellectual
property mostly focus on cryptographic solutions such as code
obfuscation (Collberg et al., 1997), watermarking (Collberg and
Thomborson, 1999) and tamper-proofing (Aucsmith, 1996).
For the protection of user privacy, some studies ap-
proach the problem as the privacy of data in test-
ing applications (Grechanik et al., 2010; Lucia et al.,
2012) and provide solutions by applying anonymiza-
tion. Several studies attempt to protect user privacy
during log generation by reducing the sensitive infor-
mation in log reports (Castro et al., 2008; Broadwell
et al., 2003). Furthermore, the control of information
flow between software components is also a concern.
(Enck et al., 2014) and (Zhu et al., 2011) address the
problem of controlling sensitive information flow us-
ing taint tracking and analysis mechanisms.
While there are many efforts for securing log-
based software analysis in the literature, no studies
have focused on privacy issues in software analysis
with process mining. In this paper, we propose AlphaSec, a
protocol for privacy-preserving process discovery for software
analysis. As the discovery algorithm we select the alpha
algorithm (van der Aalst et al., 2004), whose relatively simple
structure makes it well suited to understanding the mechanism of
discovery.
Our scenario has three parties, namely users, a software company
(SC) and a process miner (PM). The users send the event logs to
SC and are not active in the rest of the protocol. PM executes
the process discovery protocol on the logs under the supervision
of SC. We assume a semi-honest setting where PM and SC do not
collude. In order to achieve privacy, we encrypt the logs under
a homomorphic cryptosystem. To identify the items in the logs
and the relations between them, we use several cryptographic
protocols such as secure equality checking, secure
multiplication and bit decomposition. Furthermore, we use data
packing to eliminate the repetition of the same operations and
to exploit the encryption modulus optimally. During the protocol
execution, PM and SC are not allowed to directly decrypt the
logs. Moreover, the decryptions of intermediate values are
secured. In this setting, our protocol guarantees the privacy of
data owners. To the best of our knowledge, our paper presents
the first protocol for privacy-preserving software analysis with
process mining which assures both user and software privacy.
Our protocol does not change the original structure of the alpha
algorithm and it can be adapted to other discovery algorithms
with slight modifications. While our proposal adopts well-known
cryptographic protocols, it reduces the cost of those protocols
significantly by using data packing. We provide computational
and communication complexity analysis along with experiments to
show the improvement of our protocol.
2 PRELIMINARIES
In this section we summarize the alpha algorithm and introduce
the cryptographic tools used in our protocol. Table 1 summarizes
the notation.
Table 1: Explanation of the notation.
Symbol              Explanation
$T$                 Set of activities $t_i$ s.t. $T = \{t_1, t_2, \cdots, t_\Delta\}$
$\sigma_i$          A trace with $\omega_i$ events s.t. $\sigma_i = \langle e_{1\sigma_i}, \cdots, e_{\omega_i\sigma_i} \rangle$
$e_{j\sigma_i}$     $j$-th event of $\sigma_i$, where $1 \le j \le \omega_i$ and $1 \le i \le \tau$
$L$                 Event log with $\tau$ traces, s.t. $L = \{\sigma_0, \cdots, \sigma_\tau\}$
$\otimes$           Secure multiplication operator
$\oplus$            Homomorphic addition operator
$M_{x \times y}$    A matrix $M$ of size $x \times y$
$M_{x,y}$           Index of matrix $M$ in row $x$ and column $y$
$M_{*,y}$           $y$-th column of matrix $M$
$\theta$            Compartment size for data packing
$N$                 Plaintext modulus for Paillier cryptosystem
$\mu_X$             Number of packs for the packed array $X$
2.1 The Alpha Algorithm
The alpha algorithm takes an event log
$L = \{\sigma_0, \cdots, \sigma_\tau\}$ as input, where $L$ is a
set of traces $\sigma_i$ such that every $\sigma_i$ is composed
of events $e_{j\sigma_i}$, scans it to find patterns and outputs
the result as a Petri net$^1$ (van der Aalst et al., 2004).
Moreover, every $e_{j\sigma_i}$ contains several attributes,
such as activity, timestamp or resource, which determine the
perspective of process discovery. Following the common approach
in process mining, in this work we assume that the activity
attribute is used for process discovery, so every
$e_{j\sigma_i}$ has only one attribute, which is activity.
The algorithm runs in 8 steps (van der Aalst, 2016). In Steps
1-3, the set of activities that appear in $L$,
$T_L \subseteq T$, and the sets of the first ($T_I \subseteq T$)
and last ($T_O \subseteq T$) activities are discovered. Step 4
aims to discover the ordering relations between activities. The
1
A modeling language used in process mining.
See (van der Aalst et al., 2004) for details.
SECRYPT 2017 - 14th International Conference on Security and Cryptography
ordering is based on direct succession, $t_b > t_c$, which means
$t_c$ directly follows $t_b$ in $\sigma_i$. The direct
successions are used to define 3 ordering relations, which are
1. Causality ($t_b \rightarrow t_c$ or $t_c \leftarrow t_b$): $t_b > t_c$, but not $t_c > t_b$,
2. Parallel ($t_b \,||\, t_c$): both $t_b > t_c$ and $t_c > t_b$, and
3. Choice ($t_b \,\#\, t_c$): neither $t_b > t_c$ nor $t_c > t_b$.
The result of orderings is represented as a footprint matrix.
Once the footprint matrix is created, the pairs with causality
relation are collected in $X_L$ and in Step 5 the maximal pairs
of $X_L$ are assigned to $Y_L$. In Steps 6-7 the set of places
$P_L$ and the set of arcs, $F_L$, which connects the elements of
$P_L$ are determined. Finally, Step 8 returns the result
$\alpha(L)$ as $(P_L, T_L, F_L)$.
To illustrate how the alpha algorithm works, we provide a toy
example in the following. Let
$L = \{\langle a,b,e,f \rangle, \langle a,b,e,c,d,b,f \rangle, \langle a,b,c,e,d,b,f \rangle, \langle a,b,c,d,e,b,f \rangle, \langle a,e,b,c,d,b,f \rangle\}$
be an event log. The 8 steps of the alpha algorithm for $L$ are:
$T_L = \{a,b,c,d,e,f\}$, $T_I = \{a\}$, $T_O = \{f\}$.
$X_L = \{(\{a\},\{b\}), (\{a\},\{e\}), (\{b\},\{c\}), (\{b\},\{f\}), (\{c\},\{d\}), (\{d\},\{b\}), (\{e\},\{f\}), (\{a,d\},\{b\}), (\{b\},\{c,f\})\}$.
See the footprint matrix in Table 2 for orderings.
Table 2: Footprint matrix for L.
     a    b    c    d    e    f
a    #    →    #    #    →    #
b    ←    #    →    ←    ||   →
c    #    ←    #    →    ||   #
d    #    →    ←    #    ||   #
e    ←    ||   ||   ||   #    →
f    #    ←    #    #    ←    #
$Y_L = \{(\{a\},\{e\}), (\{c\},\{d\}), (\{e\},\{f\}), (\{a,d\},\{b\}), (\{b\},\{c,f\})\}$.
$P_L = \{i_L, o_L, p_{(\{a\},\{e\})}, p_{(\{c\},\{d\})}, p_{(\{e\},\{f\})}, p_{(\{a,d\},\{b\})}, p_{(\{b\},\{c,f\})}\}$.
$F_L = \{(i_L,a), (f,o_L), (a,p_{(\{a\},\{e\})}), (p_{(\{a\},\{e\})},e), (c,p_{(\{c\},\{d\})}), \cdots, (p_{(\{b\},\{c,f\})},c), (p_{(\{b\},\{c,f\})},f)\}$.
The output $\alpha(L) = (P_L, T_L, F_L)$ is shown in Figure 1.
Figure 1: The output of the alpha algorithm for the example $L$ as a Petri net.
The output of the alpha algorithm is used in confor-
mance checking and process enhancement, to observe
the system behavior and to detect the deviations.
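Steps 1-4 on the toy log can be reproduced in plaintext with a few lines of code. The following is only an illustrative sketch, not the authors' implementation: the function names and the ASCII symbols `->`, `<-` are assumptions made for the example, and traces are written as strings of single-letter activities.

```python
from collections import Counter
from itertools import product

# Example log L from the text; each trace is a string of activity labels.
LOG = ["abef", "abecdbf", "abcedbf", "abcdebf", "aebcdbf"]

def direct_successions(log):
    """Count how often x is directly followed by y in some trace (x > y)."""
    counts = Counter()
    for trace in log:
        for x, y in zip(trace, trace[1:]):
            counts[(x, y)] += 1
    return counts

def footprint(log):
    """Return activities, first activities, last activities and the footprint."""
    succ = direct_successions(log)
    acts = sorted({t for trace in log for t in trace})
    fp = {}
    for x, y in product(acts, repeat=2):
        fwd, bwd = (x, y) in succ, (y, x) in succ
        if fwd and bwd:
            fp[(x, y)] = "||"   # parallel
        elif fwd:
            fp[(x, y)] = "->"   # causality x -> y
        elif bwd:
            fp[(x, y)] = "<-"   # causality y -> x
        else:
            fp[(x, y)] = "#"    # choice
    first = {trace[0] for trace in log}
    last = {trace[-1] for trace in log}
    return acts, first, last, fp

acts, first, last, fp = footprint(LOG)
for x in acts:
    print(x, " ".join(f"{fp[(x, y)]:>2}" for y in acts))
```

The printed rows correspond to the rows of the footprint matrix for $L$, with `->` and `<-` standing for the two causality directions.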
2.2 Paillier Cryptosystem
For our protocol we select the Paillier cryptosystem (Paillier,
1999) for the encryption of $L$ due to its homomorphic property.
In Paillier, encryption of a message $m$ under modulus
$N = p \cdot q$ is performed as
$E(m) = g^m \cdot r^N \bmod N^2$, where $p, q$ are large primes,
$g = N + 1$ and $r \in_R \mathbb{Z}_N$. We refer readers to
(Paillier, 1999) for details of the decryption scheme. The
Paillier cryptosystem enables homomorphic addition on
ciphertexts as $E(m_1) \times E(m_2) = E(m_1 + m_2)$. In the
rest of the paper, we represent a Paillier ciphertext by
$[\cdot]$ and a homomorphic addition by $\oplus$, for the sake
of simplicity.
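The additive property can be checked with a toy Paillier instance. This is a minimal sketch under stated assumptions: the primes 1009 and 1013 are chosen only for illustration and are far too small to be secure, the helper names are hypothetical, and decryption follows the standard Paillier scheme for $g = N + 1$ rather than anything specific to this paper.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation from two distinct primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = g^m * r^n mod n^2 with g = n + 1 and a random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """Standard Paillier decryption: m = L(c^lam mod n^2) * mu mod n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)

n, sk = keygen(1009, 1013)
c_sum = add(n, encrypt(n, 17), encrypt(n, 25))
print(decrypt(n, sk, c_sum))
```

Raising a ciphertext to a plaintext constant $k$ multiplies the underlying message by $k$, which is the scalar multiplication used later for blinding.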
2.3 Data Packing
In our protocol, to eliminate the cost of repeated operations,
we use data packing as in (Erkin et al., 2012). The bit size of
the plaintext inputs determines the compartment size, $\theta$,
in a packed ciphertext. The number of items in one pack is
computed as $\rho = \lfloor \log_2 N / \theta \rfloor$, where
$\log_2 N$ is the length of the plaintext modulus. Let
$[W] = \{[w_0], \cdots, [w_{s-1}]\}$ be an encrypted array of
$s$ elements $w_i$. We pack $[W]$ into
$\mu = \lceil s/\rho \rceil$ ciphertexts such that
$[W^{pack}] = \{[W^{pack}_0], \cdots, [W^{pack}_{\mu-1}]\}$,
where data packing for every $[W^{pack}_t]$ is performed as
$[W^{pack}_t] = \sum_{j=0}^{\rho-1} [w_j] \cdot (2^\theta)^j$,
s.t. $0 \le t \le \mu - 1$. Using $[W^{pack}]$, we can perform
homomorphic addition on all packed items simultaneously and also
reduce the total cost of decryption. In the rest of the paper,
we represent data packing as pack$([W], \theta, N)$.
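The packing arithmetic is easiest to see on plaintext integers. The sketch below is an assumption-laden illustration, not the protocol itself: it packs plain values rather than ciphertexts, the function names are hypothetical, and the tiny parameters ($\theta = 4$ bits, a 12-bit modulus length) are chosen only to keep the numbers readable.

```python
def pack(values, theta, n_bits):
    """Pack non-negative integers (each < 2**theta) into integers of at most n_bits bits."""
    rho = n_bits // theta                      # compartments per pack
    packs = []
    for start in range(0, len(values), rho):
        chunk = values[start:start + rho]
        # compartment j holds chunk[j], shifted by theta * j bits
        packs.append(sum(w << (theta * j) for j, w in enumerate(chunk)))
    return packs, rho

def unpack(packs, theta, rho, count):
    """Recover the first `count` compartment values from the packs."""
    mask = (1 << theta) - 1
    out = []
    for p in packs:
        for j in range(rho):
            out.append((p >> (theta * j)) & mask)
    return out[:count]

a = [3, 0, 7, 1, 5]
b = [1, 2, 0, 4, 1]
packs_a, rho = pack(a, theta=4, n_bits=12)
packs_b, _ = pack(b, theta=4, n_bits=12)

# One addition per pack adds all compartments at once, provided no
# compartment overflows its theta bits.
packed_sum = [x + y for x, y in zip(packs_a, packs_b)]
print(unpack(packed_sum, 4, rho, len(a)))
```

With five values and $\rho = 3$, the array fits into $\mu = \lceil 5/3 \rceil = 2$ packs, so two additions replace five.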
2.4 Homomorphic Protocols
For encrypted data processing, we use secure equal-
ity check (Nateghizad et al., 2016), secure multi-
plication (Erkin et al., 2012) and bit decomposi-
tion (Lazzeretti, 2012) protocols.
2.4.1 Secure Equality Check (SEQ)
The common approach to securely check whether $[x] = [y]$ is to
check if $[q] = [x - y]$ is 0. One way to test if $[q] = 0$ is
to use Hamming distance as in (Lipmaa and Toft, 2013). In our
work, we use the NEL-I SEQ protocol from (Nateghizad et al.,
2016), which is an efficient version of (Lipmaa and Toft, 2013).
We refer the reader to (Lipmaa and Toft, 2013) and (Nateghizad
et al., 2016) for the details.
2.4.2 Secure Multiplication Protocol (SMP)
(Erkin et al., 2012) presents an SMP protocol where Alice has
$[a]$ and $[b]$ and Bob holds the secret key, as follows. Alice
selects random values $r_a, r_b \in_R \mathbb{Z}_N$, blinds the
inputs as $[a'] = [a] \cdot [r_a]$, $[b'] = [b] \cdot [r_b]$ and
sends $[a']$, $[b']$ to Bob. After decryption, Bob computes
$a' \cdot b'$ and sends $[a' \cdot b']$ to Alice. Computing
$[a \cdot b] = [a' \cdot b'] \cdot [b]^{-r_a} \cdot [a]^{-r_b} \cdot [-r_a \cdot r_b]$,
Alice gets the encrypted multiplication.
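The unblinding step relies on the identity $a'b' - b\,r_a - a\,r_b - r_a r_b = ab$ for $a' = a + r_a$ and $b' = b + r_b$. The sketch below checks this on plaintext values modulo a toy $N$; it is an illustration only, with hypothetical names, and it replaces every ciphertext operation by the plaintext arithmetic it corresponds to.

```python
import random

N = 1009 * 1013  # toy modulus standing in for the Paillier plaintext modulus

def smp_plain(a, b, n):
    """Plaintext walk-through of the blinded multiplication."""
    # Alice: additive blinding of both inputs
    r_a, r_b = random.randrange(n), random.randrange(n)
    a_blind = (a + r_a) % n          # plaintext of [a'] = [a] * [r_a]
    b_blind = (b + r_b) % n          # plaintext of [b'] = [b] * [r_b]
    # Bob: sees only the blinded values and multiplies them
    prod_blind = (a_blind * b_blind) % n
    # Alice: removes the cross terms introduced by the blinding
    return (prod_blind - b * r_a - a * r_b - r_a * r_b) % n

print(smp_plain(6, 7, N))
```

Bob only ever handles $a'$ and $b'$, which are uniformly distributed modulo $N$ when $r_a$ and $r_b$ are uniform, so the product is computed without either input being revealed to him.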
2.4.3 Bit Decomposition (BD)
Using the BD protocol in (Lazzeretti, 2012), Alice and Bob can
compute the encrypted bits of an $\ell$-bit $x$ as follows.
Assume Alice has $[x]$, and Bob holds the secret key. Alice
blinds $[x]$ as $[z] = [x + r]$, where
$r \in_R \{0,1\}^{\ell+\kappa}$, and sends $[z]$ to Bob. After
decryption, Bob sends the least significant $\ell$ bits of $z$
to Alice in encrypted form. Using
$[c_i] = [z_i]^{r_i} \cdot [c_{i-1}]^{r_i} \cdot [z_i \cdot c_{i-1}]$ and
$[x_i] = [z_i] \cdot [r_i] \cdot [c_{i-1}] \cdot [c_i]^2$,
Alice computes the set $\{[x_0], [x_1], \cdots, [x_{\ell-1}]\}$,
which is the BD of $[x]$.
3 ALPHASEC: SECURE ALPHA
ALGORITHM
In this section we introduce the privacy-preserving al-
pha algorithm protocol, namely AlphaSec.
3.1 Scenario
Our scenario has three parties:
1. Software Company (SC) is the owner of the software product, who holds the public and private keys $(pk, sk)$ and stores the encrypted logs.
2. Users are the users of the software, who send the encrypted logs to SC and are not active in the rest of the protocol.
3. Process Miner (PM) is a service provider for SC who models the software. PM has the knowledge and resources to perform process mining techniques, thus, SC needs PM’s expertise to analyze the software.
Our goal is to minimize the information leakage
for users and SC during the protocol execution. Thus,
PM must not access the content of encrypted logs and
his statistical observations should be restricted. He
should not learn the frequencies, but can only observe
the ordering relation between two encrypted activi-
ties. For instance, for activities a and b, PM can see
that [a] > [b] without knowing the values of [a] and
[b] and the frequencies of [a], [b] and [a] > [b]. On the
other hand, SC is only allowed to decrypt the inter-
mediate blinded values and the output of the protocol
which contains his own information. In this setting,
our protocol is based on semi-honest security model
where PM and SC are non-colluding.
3.2 Setup
In the setup phase, SC generates $(pk, sk)$ and shares $pk$ with
PM and users. We assume that SC shares $T$ with PM as
$[T] = \{[t_1], \cdots, [t_\Delta]\}$. Furthermore, SC collects
$[L] = \{\langle [e_{1\sigma_1}], \cdots, [e_{\omega_1\sigma_1}] \rangle, \cdots, \langle [e_{1\sigma_\tau}], \cdots, [e_{\omega_\tau\sigma_\tau}] \rangle\}$
from users and shares it with PM to run AlphaSec.
3.3 Process Model Discovery
The AlphaSec protocol focuses on the first 4 steps of the
original alpha algorithm, since the sensitive data is processed
in these steps. Accordingly, the first task is the discovery of
activities $T_L$, $T_I$ and $T_O$ in the encrypted domain, i.e.
Steps 1-3. The second task is to find the ordering relations,
i.e. Step 4. Afterwards, a footprint matrix is constructed and
Steps 5-8 of the original algorithm are performed in plaintext.
Thus, our protocol is based on 3 subprotocols, which are
1. Secure Activity Discovery, where the activities are
discovered, 2. Secure Direct Succession Discovery, where the
orderings are determined, and 3. Secure Modelling, where the
eventual process model is generated.
Protocol 1 shows how AlphaSec works. When SC requests a process
model, in Step 1, PM creates 3 matrices, namely
$R_{\Delta \times \Delta}$, $ID_{\Delta \times 1}$ and
$FD_{\Delta \times 1}$. While $R$ is used to store direct
successions and discovered activities, $ID$ and $FD$ are used to
store the initial and final activities. In Steps 2-5, for each
$[\sigma_i]$ of $[L]$, the Secure Activity Discovery and Secure
Direct Succession Discovery subprotocols are run in sequence.
After all $[\sigma_i]$ are scanned, a Petri net is generated in
Step 6 by the Secure Modelling subprotocol.
Protocol 1 AlphaSec
Input: $[L]$, $[T]$
1: create $R$, $ID$, $FD$
2: for all $[\sigma_i] \in [L]$ do
3:   $(AD^{\sigma_i}, ID, FD)$ = SecureActivityDiscovery($[\sigma_i]$)
4:   $R$ = SecureDirectSuccessionDiscovery($AD^{\sigma_i}$)
5: end for
6: $\alpha([L])$ = SecureModelling($R$, $ID$, $FD$)
Output: $\alpha([L])$
3.3.1 Secure Activity Discovery
The first subprotocol aims to securely discover $T_L$, $T_I$ and
$T_O$ as shown in Subprotocol 1. Accordingly, PM collaborates
with SC to compare every $[e_{j\sigma_i}]$ with every $[t_m]$
using SEQ and the result is stored in
$AD^{\sigma_i}_{\Delta \times \omega_i}$. As shown in Step 3, if
$[e_{j\sigma_i}] = [t_m]$, $AD^{\sigma_i}_{m,j}$ is set to
$[1]$, else to $[0]$. Finally, in Step 6, $ID$ and $FD$ are
updated with $AD^{\sigma_i}_{*,1}$ and
$AD^{\sigma_i}_{*,\omega_i}$, respectively. In Figure 2(a), we
illustrate the procedure for the sample $[L]$.
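Subprotocol 1 Secure Activity Discovery
Input: $[\sigma_i]$, $ID$, $FD$
1: for all $[e_{j\sigma_i}] \in [\sigma_i]$ where $1 \le j \le \omega_i$ do
2:   for all $[t_m] \in [T]$ where $1 \le m \le \Delta$ do
3:     $AD^{\sigma_i}_{m,j} = ([e_{j\sigma_i}] \stackrel{?}{=} [t_m])\ ?\ [1] : [0]$
4:   end for
5: end for
6: $ID = ID \oplus AD^{\sigma_i}_{*,1}$, $FD = FD \oplus AD^{\sigma_i}_{*,\omega_i}$
Output: $AD^{\sigma_i}$, $ID$, $FD$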
(a) $AD^{\sigma_1}$ for $\sigma_1$ of $L$.
       [a]  [b]  [e]  [f]
[a]    [1]  [0]  [0]  [0]
[b]    [0]  [1]  [0]  [0]
[c]    [0]  [0]  [0]  [0]
[d]    [0]  [0]  [0]  [0]
[e]    [0]  [0]  [1]  [0]
[f]    [0]  [0]  [0]  [1]
(b) Final $R$ matrix.
       [a]  [b]  [c]  [d]  [e]  [f]
[a]    [0]  [4]  [0]  [0]  [1]  [0]
[b]    [0]  [0]  [3]  [0]  [2]  [4]
[c]    [0]  [0]  [0]  [3]  [1]  [0]
[d]    [0]  [3]  [0]  [0]  [1]  [0]
[e]    [0]  [2]  [1]  [1]  [0]  [1]
[f]    [0]  [0]  [0]  [0]  [0]  [0]
(c) Result of zero-check.
       [a]  [b]  [c]  [d]  [e]  [f]
[a]    0    1    0    0    1    0
[b]    0    0    1    0    1    1
[c]    0    0    0    1    1    0
[d]    0    1    0    0    1    0
[e]    0    1    1    1    0    1
[f]    0    0    0    0    0    0
(d) Footprint matrix.
       [a]  [b]  [c]  [d]  [e]  [f]
[a]    #    →    #    #    →    #
[b]    ←    #    →    ←    ||   →
[c]    #    ←    #    →    ||   #
[d]    #    →    ←    #    ||   #
[e]    ←    ||   ||   ||   #    →
[f]    #    ←    #    #    ←    #
Figure 2: Illustrating AlphaSec protocol on the sample log.
Since SEQ is an expensive protocol that has to be repeated
$\Delta \cdot \omega_i$ times for each $\sigma_i$, we use data
packing in our protocol. Notice that only a number of
intermediate steps of the adopted SEQ protocol (Nateghizad
et al., 2016) can be modified for data packing. We use
pack$([e_{j\sigma_i} - t_m], \theta, N)$ as packing function,
where $\theta = (\lceil \log_2 \Delta \rceil + \kappa)$,
$\mu = \Delta \cdot \omega_i / \rho$ and
$\rho = \lfloor \log_2 N / \theta \rfloor$.
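A plaintext analogue of Secure Activity Discovery may help to fix the shapes of $AD^{\sigma_i}$, $ID$ and $FD$. The sketch below is illustrative only: the `==` comparison stands in for SEQ on ciphertexts, list addition stands in for homomorphic addition, and all function names are assumptions made for the example.

```python
ACTIVITIES = list("abcdef")
LOG = ["abef", "abecdbf", "abcedbf", "abcdebf", "aebcdbf"]

def activity_matrix(trace, activities):
    """AD[m][j] = 1 if the j-th event of the trace equals activity t_m, else 0.
    The == below stands in for the SEQ protocol on ciphertexts."""
    return [[1 if event == t else 0 for event in trace] for t in activities]

def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]

def add_columns(u, v):
    """Element-wise addition, standing in for homomorphic addition."""
    return [x + y for x, y in zip(u, v)]

ID = [0] * len(ACTIVITIES)   # accumulated first columns (initial activities)
FD = [0] * len(ACTIVITIES)   # accumulated last columns (final activities)
for trace in LOG:
    ad = activity_matrix(trace, ACTIVITIES)
    ID = add_columns(ID, column(ad, 0))
    FD = add_columns(FD, column(ad, -1))

AD1 = activity_matrix(LOG[0], ACTIVITIES)
for label, row in zip(ACTIVITIES, AD1):
    print(label, row)
print("ID:", ID, "FD:", FD)
```

For the sample log every trace starts with a and ends with f, so only the first entry of `ID` and the last entry of `FD` become non-zero.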
3.3.2 Secure Direct Succession Discovery
The next step in AlphaSec is to identify direct successions
between activities. To detect subsequent events in $[\sigma_i]$,
we merge two subsequent columns of $AD^{\sigma_i}$ by SMP. Thus,
every element in the former column, $AD^{\sigma_i}_{*,j}$, is
securely multiplied with every element in the transpose of the
latter column, $(AD^{\sigma_i}_{*,j+1})^T$. Then, the result is
added to the corresponding index of $R$.
This subprotocol has two bottlenecks in terms of efficiency.
First, the inputs of SMP are encrypted bits, so the plaintext
space is not optimally used. Second, for every $\sigma_i$ the
SMP protocol runs $\Delta^2 \cdot (\omega_i - 1)$ times. These
bottlenecks require us to use data packing. Accordingly, we pack
the column $AD^{\sigma_i}_{*,j+1}$ as
pack$(AD^{\sigma_i}_{*,j+1}, \theta, N)$ where
$\theta = \lceil \log_2 \Gamma \rceil$ and the column
$AD^{\sigma_i}_{*,j}$ as pack$(AD^{\sigma_i}_{*,j}, \theta, N)$
where $\theta = \lceil \log_2 \Gamma \rceil \cdot \Delta$ and
$\Gamma$ is the number of events in $L$. Since the protocol
requires adding the result to $R$, we select a compartment size
large enough to hold the total number of events, which is the
worst case. The result of SMP is a packed ciphertext with
$\theta = \lceil \log_2 \Gamma \rceil \cdot \Delta$. The number
of compartments in one pack and the number of packs are
$\rho_1 = \lfloor \log_2 N / (\lceil \log_2 \Gamma \rceil \cdot \Delta) \rfloor$,
$\mu_1 = \Delta \cdot \omega_i / \rho_1$ and
$\rho_2 = \lfloor \log_2 N / \lceil \log_2 \Gamma \rceil \rfloor$,
$\mu_2 = \Delta \cdot \omega_i / \rho_2$, respectively. In this
setting, SMP runs $\mu_1 \cdot \mu_2 \cdot (\omega_i - 1)$ times
for every $\sigma_i$. In Subprotocol 2, we show how to perform
secure direct succession discovery with packing. The result of
SMP, mult, is stored in $R^{pack}$, whose size is
$\mu_1 \cdot \mu_2$.
Subprotocol 2 Secure Direct Succession Discovery
Input: $AD^{\sigma_i}$
1: for $1 \le j \le \omega_i - 1$ do
2:   $AD^{p_1}$ = pack($AD^{\sigma_i}_{*,j}$, $\theta$, $N$), $AD^{p_2}$ = pack($AD^{\sigma_i}_{*,j+1}$, $\theta$, $N$)
3:   for $1 \le k \le \mu_1$ do
4:     for $1 \le m \le \mu_2$ do
5:       mult = $AD^{p_1}_k \otimes AD^{p_2}_m$
6:       $R^{pack}_{k,m} = R^{pack}_{k,m} \oplus$ mult
7:     end for
8:   end for
9: end for
Output: $R^{pack}$
After the execution of the subprotocol, the result $R^{pack}$ is
unpacked using BD to create $R$. It is important to mention that
BD outputs individual bits, but every index of $R$ is a
$\lceil \log_2 \Gamma \rceil$-bit integer. Thus, after BD, we
perform data packing for every $\lceil \log_2 \Gamma \rceil$
bits to create $R$. Figure 2(b) shows the $R$ matrix for the
sample $L$.
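Without encryption and packing, merging two subsequent columns of $AD^{\sigma_i}$ amounts to an outer product of indicator vectors that is accumulated into $R$. The following sketch shows this on the sample log; it is a plaintext illustration with hypothetical names, in which ordinary multiplication and addition stand in for SMP and homomorphic addition.

```python
ACTIVITIES = list("abcdef")
LOG = ["abef", "abecdbf", "abcedbf", "abcdebf", "aebcdbf"]

def succession_matrix(log, activities):
    """R[b][c] counts how often activity c directly follows activity b."""
    d = len(activities)
    R = [[0] * d for _ in range(d)]
    for trace in log:
        # indicator matrix AD: one row per activity, one column per event
        ad = [[1 if event == t else 0 for event in trace] for t in activities]
        for j in range(len(trace) - 1):
            for b in range(d):
                for c in range(d):
                    # product of two indicator bits; SMP would compute this
                    # on ciphertexts, followed by a homomorphic addition
                    R[b][c] += ad[b][j] * ad[c][j + 1]
    return R

R = succession_matrix(LOG, ACTIVITIES)
for label, row in zip(ACTIVITIES, R):
    print(label, row)
```

The inner double loop performs $\Delta^2$ multiplications per pair of columns, which is the count that packing is meant to reduce.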
3.3.3 Secure Modelling
In the last step of AlphaSec, the output $\alpha([L])$ is
generated using $R$, $ID$, $FD$. Here PM needs to know which
activity pairs have an ordering relation, but the frequency of
the relation should be hidden from him. Thus, we perform a
zero-check function on the inputs to observe whether two
encrypted activities have an ordering relation and whether an
activity is a first or last activity. For zero-check, PM blinds
$R_{i,j}$ with $r \in_R \mathbb{Z}_N$ as
$[R'_{i,j}] = [R_{i,j}]^r$, where $1 \le i, j \le \Delta$, and
sends $[R'_{i,j}]$ to SC for a secure decryption. If the result
of the decryption is non-zero, which means the activity pair has
a direct succession relation, then SC sends 1 and otherwise
sends 0 to PM. Hence, PM can only observe the relation between
two encrypted activities, but nothing else. Using the result of
zero-check, the footprint matrix can be constructed and then the
output is generated as in the original alpha algorithm. The only
difference is that activities are encrypted and only SC can
decrypt them. In Figure 2(c)-2(d), we illustrate the result of
zero-check on $R$ and the footprint matrix, respectively.
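The effect of the multiplicative blinding can be illustrated on plaintext values. In this sketch, which is not the authors' implementation, multiplying by a random $r$ modulo a toy $N$ stands in for the exponentiation $[R_{i,j}]^r$ followed by decryption at SC; the modulus, the restriction of $r$ to values coprime to $N$, and the function names are assumptions made for the example.

```python
import math
import random

N = 1009 * 1013  # toy modulus standing in for the Paillier plaintext modulus

# R matrix of the sample log (rows and columns ordered a..f).
R = [
    [0, 4, 0, 0, 1, 0],
    [0, 0, 3, 0, 2, 4],
    [0, 0, 0, 3, 1, 0],
    [0, 3, 0, 0, 1, 0],
    [0, 2, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

def zero_check(value, n):
    """Blind the value multiplicatively and report only whether it is non-zero."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # keep r invertible so non-zero stays non-zero
        r = random.randrange(1, n)
    blinded = (value * r) % n           # what SC would see after decryption
    return 1 if blinded != 0 else 0

def relation(z, i, j):
    """Ordering relation of activities i and j from the zero-check matrix."""
    fwd, bwd = z[i][j], z[j][i]
    if fwd and bwd:
        return "||"
    if fwd:
        return "->"
    if bwd:
        return "<-"
    return "#"

Z = [[zero_check(v, N) for v in row] for row in R]
for row in Z:
    print(row)
```

SC sees a randomized multiple of each count rather than the count itself, and PM receives only the resulting bits, from which `relation` derives the footprint entries.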
4 PROTOCOL ANALYSIS
In this section, we first provide a security analysis
for our protocol, then analyze its computational and
communicational complexity and show experimental
results. In Table 3, we summarize the notation.
Table 3: Summary of the notation for complexity analysis.
Notation    Explanation
$\Gamma$    Total number of events in $L$, s.t. $\Gamma = \sum_{i=1}^{\tau} \omega_i$
HAD         Homomorphic addition
HSM         Homomorphic scalar multiplication
ZCF         Zero check function
SEQ         Secure Equality Check
SMP         Secure Multiplication
BD          Bit Decomposition
SAD         Secure Activity Discovery
SDS         Secure Direct Succession Discovery
SM          Secure Modelling
4.1 Security Analysis
The privacy considerations in our protocol are
twofold: user privacy and software company privacy.
On one hand, users want to protect their sensitive in-
formation from PM and SC. On the other hand, SC
wants to protect the intellectual property of his prod-
uct from PM. In the following, we analyze how these
concerns are overcome against each party.
Users are not active during protocol execution.
They only take part in generation of [L], so they do
not have an active adversarial role in our setting.
PM has access to [L] and the results of SEQ, SMP
and HAD. The cryptographic protocols are proven to
be secure, thus, we assume that PM cannot infer any
additional information. Furthermore, to prevent sta-
tistical inferences, we hide the frequencies from PM
by zero-check. PM can only observe the ordering be-
tween two encrypted activities. However, it is not an
advantage for PM since the real values are unknown.
SC holds sk and collaborates with PM to operate
SEQ and SMP protocols. As the owner of sk, he does
not have direct access to [L] to assure user privacy.
During SMP, decryption result is blinded, thus, SC
cannot infer the original values. For SEQ, we rely on
the security of the underlying protocol.
4.2 Computational Analysis
Prior to the analysis of AlphaSec, we analyze the computational
complexity of the original alpha algorithm. The operations in
the original algorithm are mostly integer or string comparisons
which detect distinct activities and the orderings. Thus, $T_L$,
$T_I$ and $T_O$ can be discovered in $\Gamma$ comparisons. For
the discovery of direct successions, every $e_{j\sigma_i}$ can
be paired with its successor in $\Gamma$ operations. Then, the
footprint matrix can be generated with at most $\Delta^2$
comparisons.
For the analysis of AlphaSec, we count the number of operations
in every subprotocol and list them in Table 4 without packing
(w/o Packing) and with packing (w/ Packing). Apart from the
operations in Table 4, $\Gamma$ and $\Delta$ encryptions are
performed to encrypt $L$ and $T$ in setup. In AlphaSec, SDS
dominates the computations due to the quadratic complexity of
SMP and HAD. Using data packing, the number of SMP reduces from
$\Delta^2$ to
$\lceil \Delta/\rho_1 \rceil \cdot \lceil \Delta/\rho_2 \rceil$,
where
$\rho = \lfloor \log_2 N / (\kappa + \lceil \log_2 \Delta \rceil) \rfloor$,
$\rho_1 = \lfloor (\log_2 N - \kappa) / (\Delta \lceil \log_2 \Gamma \rceil) \rfloor$
and
$\rho_2 = \lfloor (\log_2 N - \kappa) / (\lceil \log_2 \Gamma \rceil) \rfloor$.
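Table 4: The number of operations performed in AlphaSec.
Subprotocol  Operation  w/o Packing                      w/ Packing
SAD          SEQ        $\Delta\Gamma$                   $\lceil \Delta\Gamma/\rho \rceil$
SAD          HAD        $2\Delta \cdot (\tau - 1)$       $2\Delta \cdot (\tau - 1)$
SDS          SMP        $\Delta^2 (\omega_i - 1)\tau$    $\lceil \Delta/\rho_1 \rceil \lceil \Delta/\rho_2 \rceil (\omega_i - 1)\tau$
SDS          HAD        $\Delta^2 (\omega_i - 1)\tau$    $\lceil \Delta/\rho_1 \rceil \lceil \Delta/\rho_2 \rceil (\omega_i - 1)\tau$
SDS          BD         -                                $\lceil \Delta/\rho_1 \rceil \lceil \Delta/\rho_2 \rceil$
SM           HSM        $\Delta^2$                       $\Delta^2$
SM           ZCF        $\Delta^2$                       $\Delta^2$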
4.3 Communicational Analysis
In Table 5, we summarize the communication complexity of
AlphaSec in terms of the number of ciphertexts exchanged, for
both the packed and the unpacked version. The numbers show that
data packing cannot reduce the bandwidth usage of SEQ in
proportion to the number of packed ciphertexts, but it reduces
the bandwidth usage in intermediate steps. On the other hand,
for SMP, the reduction in bandwidth usage is directly
proportional to the number of packs.
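Table 5: Bandwidth usage of AlphaSec in terms of the number of exchanged ciphertexts, where $\chi = (\log_2 \log_2 \Delta)$.
Protocol  w/o Packing                                                              w/ Packing
SEQ       $\Delta\Gamma(3 + \lceil \log_2 \Delta \rceil + 2\lceil \chi \rceil)$    $3\Delta\Gamma/\rho + \Delta\Gamma(\lceil \log_2 \Delta \rceil + 2\lceil \chi \rceil)$
SMP       $3\Delta^2 (\omega_i - 1) \cdot \tau$                                    $3\lceil \Delta/\rho_1 \rceil \lceil \Delta/\rho_2 \rceil (\omega_i - 1)\tau$
BD        -                                                                        $(3(\log_2 N - \kappa) - 1)\lceil \Delta/\rho_1 \rceil \lceil \Delta/\rho_2 \rceil$
ZCF       $\Delta^2$                                                               $\Delta^2$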
For numerical analysis, we measure the bandwidth usage for a
dataset with $\Gamma = 10000$ events, $\Delta = 20$ activities,
$\tau = 1000$ traces and $\omega_i = 10$, with and without
packing, where the ciphertext size is 4096 bits. The comparison
results in Figure 3(a) show that data packing can reduce the
communication cost significantly. The total improvement in
communication cost is 83%, which is mainly due to SDS, where the
bandwidth usage of SMP is reduced by a factor of 133. We provide
a zoomed-in view to show the communication cost of SDS and BD
for w/ Pack, but SM is not visible due to its insignificant
cost.
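The reduction factor for SMP can be re-derived from the packing parameters. The sketch below assumes the expressions $\rho_1 = \lfloor (\log_2 N - \kappa)/(\Delta \lceil \log_2 \Gamma \rceil) \rfloor$ and $\rho_2 = \lfloor (\log_2 N - \kappa)/\lceil \log_2 \Gamma \rceil \rfloor$ from Section 4.2, together with $\log_2 N = 2048$ and $\kappa = 80$ as in the experiments; the function name is hypothetical.

```python
import math

def smp_counts(delta, gamma, log2_n, kappa):
    """SMP runs per pair of columns, without and with data packing."""
    bits = math.ceil(math.log2(gamma))            # bits needed per counter in R
    rho1 = (log2_n - kappa) // (delta * bits)     # compartments of size bits * delta
    rho2 = (log2_n - kappa) // bits               # compartments of size bits
    unpacked = delta ** 2
    packed = math.ceil(delta / rho1) * math.ceil(delta / rho2)
    return rho1, rho2, unpacked, packed

rho1, rho2, unpacked, packed = smp_counts(delta=20, gamma=10000, log2_n=2048, kappa=80)
print(rho1, rho2, unpacked, packed, unpacked / packed)
```

With these parameters, 400 multiplications per pair of columns shrink to 3, which agrees with the reported factor of about 133.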
(a) Bandwidth usage of AlphaSec with and without packing (number of bits exchanged for SAD, SDS, BD and SM; w/o Pack vs. w/ Pack, with a zoomed-in view).
(b) Performance of AlphaSec in seconds without and with data packing (time in seconds for SAD, SDS and BD; w/o Pack vs. w/ Pack).
(c) Execution time of AlphaSec on different datasets (time in seconds for SAD, SDS, BD, SM and Total against the size of $L$ in number of events: 100, 1000, 10000).
Figure 3: Evaluating the performance of AlphaSec protocol.
4.4 Experiments
To measure the performance of AlphaSec in practice, we
implemented it in C++ with the GMP-6.1.2 library. The machine we
use runs OSX El Capitan with an Intel Core i5 2.7 GHz processor.
We choose $\log_2 N = 2048$ for Paillier and $\kappa = 80$ as
security parameter. As dataset, we select 3 synthetic datasets
($D_1$, $D_2$, $D_3$) from the event log dataset of IEEE TF on
Process Mining$^2$, where for $D_1$ $\Gamma = 109$, $\tau = 13$
and $\Delta = 10$, for $D_2$ $\Gamma = 1226$, $\tau = 100$ and
$\Delta = 16$, and for $D_3$ $\Gamma = 10696$, $\tau = 1000$ and
$\Delta = 20$.
As the first experiment, we measure the effect of packing on
performance. Thus, we run AlphaSec on $D_1$ to compare the
timing for SAD, SDS and BD on packed and unpacked inputs. Since
BD is only used when data is packed, we separate it from SDS.
Furthermore, we do not include SM in the results, since it is
the same for packed and unpacked data. As the results in Figure
3(b) show, applying packing in SDS reduces the computation time
significantly. The improvement in the computation of SDS is 96%,
while the total improvement is approximately 71%. On the other
hand, SAD is not affected significantly by packing, since
packing cannot be fully applied to SEQ.
In the second experiment, we observe the performance on
different dataset sizes. Thus we compare the timing of AlphaSec
on $D_1$, $D_2$, $D_3$. We run this experiment only on the
packed version and measure the time required for SAD, SDS, BD,
SM and the total time as illustrated in Figure 3(c). For $D_3$
it takes 65133 seconds to run AlphaSec, of which 61885 seconds
are spent for SAD, i.e. SEQ. In contrast, performing SDS
requires 3135 seconds including BD, which takes around 210
seconds. Finally, SM can be performed in approximately 3
seconds.
2
http://data.4tu.nl/repository/collection:event logs
5 CONCLUSION
In this paper, we present the first privacy-preserving protocol in process mining for model-based software analysis with the alpha algorithm. The output of our protocol can be used as an input for other process mining techniques, such as conformance checking or process enhancement, under a privacy-preserving setting. As a first attempt to provide dual privacy for users and SC, we propose a solution based on cryptographic primitives, which provides provable security and privacy. To achieve our goal, we use homomorphic encryption along with two-party cryptographic protocols. To reduce the number of operations, we apply data packing to our computations. The performance analyses show that employing cryptographic techniques for log analysis yields encouraging results. Furthermore, applying data packing improves the performance significantly.
Although state-of-the-art process mining techniques are efficient in the plaintext domain, our protocol offers a way to protect sensitive data at the cost of additional computational overhead, which is promising for the future of this research line. The research challenge is to improve the efficiency of our protocol further by designing custom-tailored cryptographic protocols to replace costly operations such as SEQ, and to deploy our ideas on more complex process discovery algorithms. With our proposal, we aim to attract the attention of the research community to the privacy aspects of model-based software analysis, which is a distinct and important topic that deserves to be investigated.
REFERENCES
Aucsmith, D. (1996). Tamper resistant software: An implementation. In Information Hiding, First International Workshop, Cambridge, U.K., May 30 - June 1, 1996, Proceedings, pages 317–333.
Broadwell, P., Harren, M., and Sastry, N. (2003). Scrash: A system for generating secure crash information. In Proceedings of the 12th USENIX Security Symposium, Washington, D.C., USA, August 4-8, 2003.
Castro, M., Costa, M., and Martin, J. (2008). Better bug reporting with better privacy. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2008, Seattle, WA, USA, March 1-5, 2008, pages 319–328.
Collberg, C., Thomborson, C., and Low, D. (1997). A taxonomy of obfuscating transformations. Technical Report 148, Department of Computer Science, The University of Auckland, New Zealand.
Collberg, C. S. and Thomborson, C. D. (1999). Software watermarking: Models and dynamic embeddings. In POPL ’99, Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Antonio, TX, USA, January 20-22, 1999, pages 311–324.
Enck, W., Gilbert, P., Han, S., Tendulkar, V., Chun, B., Cox, L. P., Jung, J., McDaniel, P., and Sheth, A. N. (2014). Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst., 32(2):5:1–5:29.
Erkin, Z., Veugen, T., Toft, T., and Lagendijk, R. L. (2012). Generating private recommendations efficiently using homomorphic encryption and data packing. IEEE Trans. Information Forensics and Security, 7(3):1053–1066.
Gluch, D., Cornella-Dorda, S., Hudak, J. J., Lewis, G. A., Walker, J., Weinstock, C. B., and Zubrow, D. (2002). Model-based verification: An engineering practice. Technical Report CMU/SEI-2002-TR-021, Carnegie Mellon University, PA.
Gousios, G. (2013). The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 233–236, Piscataway, NJ, USA. IEEE Press.
Gousios, G. (2016). The issue 32 incident: an update. Accessed May 3, 2016.
Grechanik, M., Csallner, C., Fu, C., and Xie, Q. (2010). Is data privacy always good for software testing? In IEEE 21st International Symposium on Software Reliability Engineering, ISSRE 2010, San Jose, CA, USA, 1-4 November 2010, pages 368–377.
Lazzeretti, R. (2012). Privacy preserving processing of biomedical signals with application to remote healthcare systems. PhD thesis, PhD school of the University of Siena, Information Engineering and Mathematical Science Department.
Leemans, M. and van der Aalst, W. M. P. (2015). Process mining in software systems: Discovering real-life business transactions and process models from distributed systems. In 18th ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, MoDELS 2015, Ottawa, ON, Canada, September 30 - October 2, 2015, pages 44–53.
Levenberg, J. (2016). Why Google stores billions of lines of code in a single repository. Commun. ACM, 59(7):78–87.
Lipmaa, H. and Toft, T. (2013). Secure equality and greater-than tests with sublinear online complexity. In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part II, pages 645–656.
Lucia, Lo, D., Jiang, L., and Budi, A. (2012). kbe-anonymity: test data anonymization for evolving programs. In IEEE/ACM International Conference on Automated Software Engineering, ASE’12, Essen, Germany, September 3-7, 2012, pages 262–265.
Nateghizad, M., Erkin, Z., and Lagendijk, R. L. (2016). Efficient and secure equality tests. In IEEE International Workshop on Information Forensics and Security, WIFS 2016, Abu Dhabi, United Arab Emirates, December 4-7, 2016, pages 1–6.
Naumovich, G. and Memon, N. D. (2003). Preventing piracy, reverse engineering, and tampering. IEEE Computer, 36(7):64–71.
Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology - EUROCRYPT ’99, International Conference on the Theory and Application of Cryptographic Techniques, Prague, Czech Republic, May 2-6, 1999, Proceeding, pages 223–238.
Pecchia, A. and Cinque, M. (2013). Log-Based Failure Analysis of Complex Systems: Methodology and Relevant Applications, pages 203–215. Springer Milan, Milano.
Rubin, V. A., Günther, C. W., van der Aalst, W. M. P., Kindler, E., van Dongen, B. F., and Schäfer, W. (2007). Process mining framework for software processes. In Software Process Dynamics and Agility, International Conference on Software Process, ICSP 2007, Minneapolis, MN, USA, May 19-20, 2007, Proceedings, pages 169–181.
van der Aalst, W. M. P. (2015). Big software on the run: in vivo software analytics based on process mining (keynote). In Proceedings of the 2015 International Conference on Software and System Process, ICSSP 2015, Tallinn, Estonia, August 24 - 26, 2015, pages 1–5.
van der Aalst, W. M. P. (2016). Process Mining - Data Science in Action, Second Edition. Springer.
van der Aalst, W. M. P., Weijters, T., and Maruster, L. (2004). Workflow mining: Discovering process models from event logs. IEEE Trans. Knowl. Data Eng., 16(9):1128–1142.
Zhu, D. Y., Jung, J., Song, D., Kohno, T., and Wetherall, D. (2011). Tainteraser: protecting sensitive data leaks using application-level taint tracking. Operating Systems Review, 45(1):142–154.
SECRYPT 2017 - 14th International Conference on Security and Cryptography