As described previously, aLINb ⇔ aSEQb ∨ bSEQa.
Finally, the OR relation is based on the # relation, but
is much more limited. It holds only when the SEQ
relation holds in neither direction, i.e., when LIN
does not hold: aORb ⇔ a#b ∧ ¬(aLINb).
Consequently, the first step of the mining algorithm
is to partition the set of all task pairs $(t_i, t_j)$,
$i = 1, \dots, N$, $j = 1, \dots, N$, $i \neq j$, into three
subsets of task pairs that obey the original three
relations $\rightarrow$, $\parallel$, and $\#$, respectively.
This is done by first establishing the $>$ relation in a
single scan of all traces in the workflow log, identically
to the operation of the α algorithm (van der Aalst et al.,
2004). The computational complexity of this step is linear
in the length of the workflow log, but is independent of
the number of tasks $N$. Establishing $\rightarrow$,
$\parallel$, and $\#$ from $>$ can be done in time $O(N^2)$.
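The first step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual α-algorithm reading in which a > b holds when b directly follows a in some trace, and the function names and the label strings ("causal", "reverse", "parallel", "unrelated") are hypothetical choices made for this example. The "reverse" label marks an ordered pair (a, b) for which only b > a holds, which is one way to see why the resulting matrix is not symmetric.

```python
from itertools import permutations


def direct_succession(log):
    """Collect every pair (a, b) such that b directly follows a in some trace.

    One pass over the log: the cost is linear in the total trace length.
    """
    follows = set()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            follows.add((a, b))
    return follows


def basic_relations(tasks, follows):
    """Label every ordered pair of distinct tasks; O(N^2) pairs in total.

    Labels are assumptions of this sketch:
      "causal"    : a > b and not b > a   (a -> b)
      "reverse"   : b > a and not a > b   (b -> a)
      "parallel"  : a > b and b > a       (a || b)
      "unrelated" : neither a > b nor b > a (a # b)
    """
    labels = {}
    for a, b in permutations(tasks, 2):
        ab = (a, b) in follows
        ba = (b, a) in follows
        if ab and ba:
            labels[(a, b)] = "parallel"
        elif ab:
            labels[(a, b)] = "causal"
        elif ba:
            labels[(a, b)] = "reverse"
        else:
            labels[(a, b)] = "unrelated"
    return labels
```

For a toy log with the two traces a b c d and a c b d, the pair (a, b) is labelled "causal", (b, c) is labelled "parallel", and (a, d) is labelled "unrelated".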
The resulting partition of task pairs can be represented
conveniently in the matrix $M^{\alpha}$, whose entry
$M^{\alpha}_{i,j}$ contains the relation label for the pair
$(t_i, t_j)$, $i = 1, \dots, N$, $j = 1, \dots, N$, $i \neq j$.
The diagonal entries $M^{\alpha}_{i,i}$, $i = 1, \dots, N$,
are undefined and excluded from consideration. Note that
$M^{\alpha}$ is not symmetric, in general.
The second step is to build the relation matrix $M$ of the
workflow tree mining algorithm, whose entries are based on
the entries of $M^{\alpha}$ and the definitions described
above. The matrix $M$ is filled in strictly in the order
listed above: AND, SEQ, LIN, and OR. Since LIN labels
overwrite SEQ labels, the end result is a partition of the
task pair set into three relation subsets labelled AND, OR,
and LIN. Again, the diagonal elements of $M$ are undefined
and excluded from consideration. Note that, in contrast to
$M^{\alpha}$, $M$ is symmetric. The complexity of this step
is $O(N^2)$.
The third, and central, step of the algorithm is to find
the difference $\Delta_{i,j}$ between each distinct pair of
rows $(i, j)$, $i \neq j$, in the matrix $M$, defined as
$\Delta_{i,j} = \sum_{k=1}^{N} \delta(i, j, k)$, for

$$
\delta(i, j, k) \doteq
\begin{cases}
1 & \text{iff } i \neq k \wedge j \neq k \wedge M_{i,k} \neq M_{j,k} \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$
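As an illustration of Equation (1), the row differences can be computed with a direct triple loop. The sketch below is only an assumed encoding: the matrix is a list of lists indexed from 0, diagonal entries are stored as None and never compared, and the function name row_differences is hypothetical.

```python
def row_differences(M):
    """Return the symmetric matrix delta, where delta[i][j] counts the
    columns k (k != i and k != j) in which rows i and j of M disagree.

    Three nested loops over N tasks give O(N^3) time overall.
    """
    n = len(M)
    delta = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(
                1
                for k in range(n)
                if k != i and k != j and M[i][k] != M[j][k]
            )
            # delta is symmetric, so fill both entries at once.
            delta[i][j] = d
            delta[j][i] = d
    return delta
```

For example, with three tasks where tasks 0 and 1 are related by AND and both are related to task 2 by LIN, rows 0 and 1 agree in the only remaining column (k = 2), so delta[0][1] is 0, whereas delta[0][2] and delta[1][2] are both 1.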
If $\Delta_{i,j} = 0$ for a distinct pair of tasks $(i, j)$,
$i \neq j$, these two tasks have identical respective
relations with respect to all remaining tasks, and by
virtue of Theorem 1 applied in the forward direction, they
must have the same parent. In such a case, we can build a
workflow subtree that has a root node labeled with
$M_{i,j}$, and children $t_i$ and $t_j$.
When more than one element $\Delta_{i,j} = 0$ (excluding
the symmetric element $\Delta_{j,i}$, which is also
necessarily zero because of the symmetry of $\Delta$), it
is not necessarily true that all corresponding nodes are
children of the same relation node. This is true only when
every pair $(i, j)$ of them has pairwise distance
$\Delta_{i,j} = 0$ (from Theorem 1, applied in the reverse
direction). In the general case, there might be several
sets of tasks, such that each pair of tasks $t_i$ and $t_j$
within the same set has distance $\Delta_{i,j} = 0$, but
distances between tasks from different sets are not zero.
Such distinct, non-overlapping sets of tasks can be
determined easily by scanning the matrix $\Delta$ row-wise,
and adding a task $t_i$ to an existing set only if its
distance $\Delta_{i,j}$ to all other tasks $t_j$ in that set
is zero; when it has distance $\Delta_{i,k} = 0$ to some
other task $t_k$ that is not in any existing set, a new set
is initiated with members $t_i$ and $t_k$.
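The row-wise scan described above might look as follows. This is a sketch under stated assumptions rather than the published procedure: delta is a symmetric list of lists with a zero diagonal, tasks are indexed from 0, and the function name zero_distance_sets is hypothetical. Because delta is symmetric, a partner for task i only needs to be sought among later tasks.

```python
def zero_distance_sets(delta):
    """Group tasks into non-overlapping sets whose members are all at
    pairwise distance zero from one another.

    A task joins an existing set only if its distance to every member
    is zero; otherwise it may start a new set with one unassigned
    partner at distance zero.
    """
    n = len(delta)
    sets = []
    assigned = set()
    for i in range(n):
        if i in assigned:
            continue
        for members in sets:
            if all(delta[i][j] == 0 for j in members):
                members.append(i)
                assigned.add(i)
                break
        else:
            # No existing set accepts task i: look for a free partner.
            for k in range(i + 1, n):
                if k not in assigned and delta[i][k] == 0:
                    sets.append([i, k])
                    assigned.update((i, k))
                    break
    return sets
```

With five tasks in which tasks 0, 1, 2 are mutually at distance zero, tasks 3 and 4 are at distance zero from each other, and all cross distances are nonzero, the scan yields the two sets [0, 1, 2] and [3, 4].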
Once all such sets have been found, a sub-tree is
constructed for each of them. The root of this sub-tree is
labeled with the relation that holds among these tasks. Due
to the semantics of WF-trees, a sub-tree is simply a
composite task that can participate in a higher-level block
just like any other atomic task. Because of this, we can
create a new task label for each sub-tree so identified.
Let the set of these new composite tasks be
$T_{\mathrm{new}}$; this set complements the initial set of
atomic tasks $T$. The tasks $t_i \in T_{\mathrm{new}}$ are
given successive ordinal numbers beyond $N$. Let also the
atomic tasks that are members of one of the cliques (the
zero-distance sets found above) be defined as
$T_{\mathrm{inc}}$; each task in $T_{\mathrm{inc}}$ is a
child of a member of $T_{\mathrm{new}}$. Finally, let also a
set $T_{\mathrm{act}}$ of active tasks be identified, and
initialized at this point as $T_{\mathrm{act}} := T$.
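The bookkeeping for composite tasks can be illustrated with a short sketch. It assumes 0-based indexing (atomic tasks 0 to N-1, composite tasks numbered from N upward, mirroring the "ordinal numbers beyond N" in the text), a relation matrix M given as a list of lists, and zero-distance sets given as lists of task indices. The function name build_subtrees and the dictionary layout are hypothetical.

```python
def build_subtrees(sets, M):
    """Create one composite task per zero-distance set.

    Returns:
      subtrees: maps each new composite task id to (root label, children);
                its keys play the role of T_new.
      t_inc:    the atomic tasks that became children of some composite
                task (the role of T_inc).
    """
    n = len(M)
    subtrees = {}
    t_inc = set()
    for offset, members in enumerate(sets):
        # All pairs in the set share the same relation, so any pair
        # of members supplies the root label.
        i, j = members[0], members[1]
        subtrees[n + offset] = (M[i][j], list(members))
        t_inc.update(members)
    return subtrees, t_inc
```

In a three-task example where tasks 0 and 1 form the only zero-distance set and are related by AND, this produces one composite task with id 3, label AND, and children 0 and 1.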
The complexity of this (third) step is $O(N^3)$, because it
is dominated by the cost of computing the matrix $\Delta$.
(As noted, identifying sets and building sub-trees requires
a single scan of $\Delta$, or only $O(N^2)$.) The next
series of steps is largely similar to the one just
described, except that these steps work on a progressively
modified active set of tasks. The newly created composite
tasks in $T_{\mathrm{new}}$ are added to the set of active
tasks $T_{\mathrm{act}}$, while atomic tasks that have
already been included in some sub-tree are excluded from
$T_{\mathrm{act}}$. New sub-trees are again constructed on
the current tasks in $T_{\mathrm{act}}$, and the process is
repeated until only one task remains in $T_{\mathrm{act}}$:
the root of the entire WFT. For a detailed description of
these steps, see (Nikovski and Baba, 2007). The overall
complexity of this series of steps is again $O(N^3)$,
because new rows and columns of the matrices $M$ and
$\Delta$ are introduced only for new composite tasks, and
there can be at most $N - 1$ such tasks. Each new row or
column has $O(N)$ elements, and the computation of each
element takes $O(N)$.
The last step of the algorithm is to re-order the children
of all LIN nodes, so that the SEQ relation among them holds,
and re-label those nodes with the label SEQ. This completes
the construction of the workflow tree. Since, by
construction, each composite node has at least two children,
this workflow tree is also compact. The complexity of this
step is $O(N^2 \log N)$, since the induced tree has at most