Frequent and Significant Episodes in Sequences of Events
Computation of a New Frequency Measure based on Individual Occurrences of the
Events
Oscar Quiroga, Joaquim Meléndez and Sergio Herraiz
Institute of Informatics and Applications, University of Girona, 17071 Girona, Catalonia, Spain
Keywords:
Data Mining, Event Sequences, Frequent Episodes, Pattern Discovery.
Abstract:
Pattern discovery in event sequences is based on the mining of frequent episodes. Patterns are the result of the
assessment of frequent episodes using episode rules. However, a simple search usually yields a huge number
of frequent episodes and rules; therefore, methods to recognise the most significant patterns and to properly
measure the frequency of the episodes are required. In this paper, two new indexes, called cohesion and
backward-confidence of an episode, are proposed to help in the extraction of significant patterns. In addition,
two methods to find the maximal number of non-redundant occurrences of serial and parallel episodes are
presented. Experimental results demonstrate the compactness of the mining result and the efficiency of our
mining algorithms.
1 INTRODUCTION
Frequent episode mining in sequences of events can be applied in many domains, for example telecommunication networks (Mannila et al., 1997), web access pattern discovery, protein family analysis (Casas-Garriga, 2003), fault prognosis based on logs of a manufacturing plant (Laxman et al., 2007), the study of multi-neuronal spike train recordings (Patnaik, 2006) or event tracking for news stories (Iwanuma et al., 2005). The starting point is a data set organised as a single long sequence of events, where each event is described by its type and its time of occurrence.
Pattern discovery in event sequences includes two main steps: frequent episode extraction and significant episode recognition. Frequent episode extraction is usually a process that starts by looking for frequent single events (frequent events) and, following an iterative procedure, builds candidate episodes from the frequent episodes found at the previous level. Candidates that reach a minimum frequency threshold are classified as frequent episodes (Agrawal and Srikant, 1995). Several methods have been proposed to cope with particularities of events and episodes, considering for example duration, maximal gap between events, overlapping/non-overlapping of episodes, minimal occurrences, etc. The frequency of an episode, i.e., the number of occurrences over a sequence, may vary among different algorithms; results depend on these procedures but also on how events are distributed in the sequence (Gan and Dai, 2010).
The relevance of frequent episodes has to be evaluated to find representative connections between events describing patterns. The confidence of the episode rule can be used for that purpose (Agrawal and Srikant, 1994). However, a simple search usually finds a huge number of frequent episodes and rules. Thus, to extract the most relevant information it is necessary to use auxiliary criteria (Gan and Dai, 2011).
In this paper, an improved frequency measure method and two new indexes to assess the significance of episodes are proposed. The method locates and counts the largest number of non-redundant occurrences of an episode, and is applicable to both serial and parallel episodes. The new indexes are proposed as complementary criteria to the confidence of the episode rules. The first one, named cohesion of the episode, is based on the comparison of the number of serial and parallel occurrences, whereas the second, named backward-confidence of the episode, is analogous to the confidence of the episode rule but evaluates the beginning of the episode.
The usefulness of the method and the proposed indexes is evaluated with synthetic event sequences.
2 FMINEVENT
Fminevent is the short name of the frequency measure method based on individual occurrences of the events proposed in this paper. The method is described using terms commonly found in the sequence data mining literature.
Event: An event is defined by the pair (e,t), where t denotes the occurrence time (timestamp) and e represents the event attributes (one or several) that contain the information useful to characterise the event. Event attributes can be a single label or a vector of continuous/discrete attribute-value pairs defined in a given range or set of predefined values.
Sequence of Events: A sequence of events S is defined as an ordered list of events, i.e., an n-tuple S = ⟨(e_1,t_1),(e_2,t_2),...,(e_n,t_n)⟩, where t_i ≤ t_{i+1} for all i ∈ {1,2,...,n-1}. The length of S, |S|, is n. In single sequence mining, the events are represented categorically from a finite set of event types.
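For concreteness, the following minimal Python sketch shows one possible in-memory representation of such an event sequence; the names Event and sequence are illustrative choices, not notation from the paper.

from typing import List, NamedTuple

class Event(NamedTuple):
    etype: str   # event type (categorical label)
    time: float  # occurrence time (timestamp)

# Example sequence S = <(A,0.01), (B,0.02), (A,0.03), (C,0.05)>
sequence: List[Event] = [
    Event("A", 0.01), Event("B", 0.02), Event("A", 0.03), Event("C", 0.05),
]
# Events must be ordered by non-decreasing timestamp.
assert all(sequence[i].time <= sequence[i + 1].time for i in range(len(sequence) - 1))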
Episode: An episode α is an ordered list of characterised events of E of the form α = ⟨a_1,a_2,...,a_m⟩ with a_j ∈ E for all j = 1,...,m. The size of α is the number of elements of α, that is |α| = m. An episode α imposes a constraint on the relative order of occurrences of the a_j: if the event type a_j occurs before the event type a_{j+1} for all j = 1,...,m-1, it is a serial episode. If there are no constraints on the order of their appearances, it is called a parallel episode and is denoted as α = ⟨a_1 · a_2 · ... · a_m⟩.
Sub-episode, Super-episode and Maximal Frequent Episode: An episode β = ⟨b_1,b_2,...,b_m⟩ is a sub-episode of another episode α = ⟨a_1,a_2,...,a_n⟩ if there exist 1 ≤ i_1 < i_2 < ... < i_m ≤ n such that b_j = a_{i_j} for all j = 1,2,...,m. In this case, α is a super-episode of β. If i_1 = 1, i_2 = 2, ..., i_m = m, then α is called the forward-extension super-episode of β. If i_1 = n-m+1, i_2 = n-m+2, ..., i_m = n, then α is called the backward-extension super-episode of β. When an episode does not have a super-episode it is called a maximal frequent episode.
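As an illustration of the sub-episode relation, the short Python sketch below checks whether one serial episode is a sub-episode of another; the function name is ours and is not part of the paper's notation.

from typing import Sequence

def is_subepisode(beta: Sequence[str], alpha: Sequence[str]) -> bool:
    # beta is a sub-episode of alpha if its elements appear in alpha
    # in the same order, not necessarily contiguously.
    j = 0
    for a in alpha:
        if j < len(beta) and a == beta[j]:
            j += 1
    return j == len(beta)

# <B,D> is a sub-episode of <A,B,C,D>; <D,B> is not.
assert is_subepisode(["B", "D"], ["A", "B", "C", "D"])
assert not is_subepisode(["D", "B"], ["A", "B", "C", "D"])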
Occurrences: An episode α = ⟨a_1,a_2,...,a_m⟩ occurs in a sequence of events S = ⟨(e_1,t_1),(e_2,t_2),...,(e_n,t_n)⟩ if there is at least one ordered sequence of events S' = ⟨(e_{i_1},t_{i_1}),(e_{i_2},t_{i_2}),...,(e_{i_m},t_{i_m})⟩ such that S' ⊆ S and a_j = e_{i_j} for all j = 1,2,...,m. Usually an occurrence is denoted as o = ⟨i_1,i_2,...,i_m⟩, where o[j] = i_j and j = 1,2,...,m.
Non-redundant Occurrences: A set of occurrences of an episode α is called non-redundant if for any two occurrences o = ⟨i_1,i_2,...,i_m⟩ and o' = ⟨i'_1,i'_2,...,i'_m⟩ no event occurs simultaneously in both, i.e., e_{i_j} ≠ e_{i'_j} for all j ∈ {1,2,...,m}.
Suffix, Prefix and Large Suffix of an Episode: The suffix of an episode α is defined as the episode composed of the last element of α, the prefix of α is the episode composed of all elements in α except the last one, and the large suffix is the episode composed of the elements in α except the first one. That is, if α = ⟨a_1,a_2,...,a_m⟩, then suffix(α) = ⟨a_m⟩, prefix(α) = ⟨a_1,a_2,...,a_{m-1}⟩ and the large suffix lsuffix(α) = ⟨a_2,...,a_m⟩. For example, if α = ⟨A,B,C⟩, then suffix(α) = ⟨C⟩, prefix(α) = ⟨A,B⟩ and lsuffix(α) = ⟨B,C⟩.
Anti-monotonicity: It is a common principle that frequency measure methods should obey in frequent pattern mining. This principle states that the frequency of an episode must be less than or equal to the frequency of its sub-episodes, i.e., two episodes α and β in a sequence, where α is a sub-episode of β, follow the principle of anti-monotonicity if freq(β) ≤ freq(α).
2.1 Serial Occurrences
Given a sequence of events S = ⟨(e_1,t_1),(e_2,t_2),...,(e_n,t_n)⟩, a candidate episode α = ⟨a_1,a_2,...,a_m⟩ and a maximal gap between events max_gap = k, Algorithm 1 returns the set of maximal non-redundant occurrences, maxnO. First, for m = 1, the occurrences of the episode are the same minimal occurrences of the event a_1, maxnO(S,α,k) = mo(a_1). Then, for m > 1, maxnO(S,α,k) is obtained by properly joining each occurrence of a_1 with the occurrences of a_2,...,a_m located between the corresponding t_1 and t_1 + (m-1)k.
For simplicity, let each t_i in S take values from j = 1,2,..., so that t_i = j means the i-th data element occurs at the j-th timestamp. The algorithm has a two-phase structure. In the first phase (lines 4-9), a list is created for each occurrence of a_1, mo(a_1)(i), containing the occurrences of the other events (a_j) within the constraint k, where list.a_1 = mo(a_1)(i) and list.a_j = mo(a_j) such that list.a_{j-1}(1) < mo(a_j) ≤ list.a_{j-1}(end) + k for j = 2,...,m.
In the second phase (lines 11-17), the most proper serial occurrence sO is selected from the list. The most proper occurrence is composed of the leftmost occurrence of each event found in list.a_j that meets the restrictions of k between events. This is done starting with the first occurrence of the last event from the list, that is list.a_m(1) (line 12), and in an iterative procedure the leftmost occurrences of the other events within k are located (lines 13-17). Each serial occurrence sO is added to maxnO (line 18) and constitutes the output of the algorithm.
Note that to search for the occurrences of an episode (of any size), the method requires only the single-event occurrences, without using its sub-episodes.
FrequentandSignificantEpisodesinSequencesofEvents-ComputationofaNewFrequencyMeasurebasedonIndividual
OccurrencesoftheEvents
325
Algorithm 1: serialMethod.
Input: An event sequence S, a candidate episode α = ⟨a_1,a_2,...,a_m⟩, the maximal gap k, occurrences of the events in α, i.e., mo(a_1),...,mo(a_m).
Output: The maximal non-redundant occurrences of α, maxnO(S,α,k).
Procedure:
1: Initialise maxnO(S,α,k) ← {}
2: for i = 1 to |mo(a_1)| do
3:   //From each mo(a_1) create a list of candidate occurrences.
4:   if mo(a_1)(i) ∉ maxnO(S,α,k) then
5:     list.a_1 ← mo(a_1)(i)
6:     for j = 2 to |α| do
7:       oc ← mo(a_j) such that list.a_{j-1}(1) < mo(a_j) ≤ list.a_{j-1}(end) + k and mo(a_j) ∉ maxnO(S,α,k)
8:       if oc ≠ {} then
9:         list.a_j ← mo(a_j)(oc)
10:    //From list select the most proper occurrence.
11:    if size(list) = |α| then
12:      sO ← list.a_m(1)
13:      for j = m-1 to 1 do
14:        for kk = 1 to |list.a_j| do
15:          if sO(1) - list.a_j(kk) ≤ k then
16:            sO ← [list.a_j(kk) sO]
17:            break
18:      Add sO to maxnO(S,α,k)
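As a complement to the pseudocode, the following Python sketch gives a simplified greedy search for non-redundant serial occurrences under a maximal gap constraint. It is an illustration under our own assumptions (greedy left-to-right matching, each event used at most once), not a line-by-line transcription of Algorithm 1.

from typing import List, Tuple

def find_serial_occurrences(sequence: List[Tuple[str, float]],
                            episode: List[str],
                            max_gap: float) -> List[List[int]]:
    # Greedy left-to-right search: consecutive episode events must be at most
    # max_gap apart in time, and every sequence event is used in at most one
    # occurrence, which keeps the occurrence set non-redundant.
    occurrences: List[List[int]] = []
    match: List[int] = []          # indices of the partial occurrence
    for i, (etype, t) in enumerate(sequence):
        j = len(match)             # next episode position to fill
        if j > 0 and t - sequence[match[-1]][1] > max_gap:
            match = []             # gap constraint violated: restart the match
            j = 0
        if j < len(episode) and etype == episode[j]:
            match.append(i)
            if len(match) == len(episode):
                occurrences.append(match)
                match = []         # consume the events of this occurrence
    return occurrences

# Toy usage: the serial episode <E,F,G> occurs twice with max_gap = 0.2.
S = [("E", 0.0), ("F", 0.1), ("X", 0.15), ("G", 0.2), ("E", 0.3), ("F", 0.35), ("G", 0.5)]
print(find_serial_occurrences(S, ["E", "F", "G"], max_gap=0.2))  # [[0, 1, 3], [4, 5, 6]]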
2.2 Parallel Occurrences
In a parallel episode there are no constraints on the partial order of the events. The occurrences of a parallel episode include the occurrences of the serial episodes composed of the same events, and its frequency is equal to or greater than that of any serial episode composed of the same event types. Methods to measure frequency based on fixed window width and non-overlapped occurrences have been reported in (Mannila et al., 1997) and (Laxman et al., 2004), respectively.
Given a sequence of events S, a candidate parallel episode α = ⟨a_1 · a_2 · ... · a_m⟩ and a maximal gap between events max_gap = k, Algorithm 2 returns the set of maximal non-redundant occurrences maxnO. For m = 1, the occurrences of the episode are the same minimal occurrences of the event a_1, maxnO(S,α,k) = mo(a_1). For m > 1, the occurrences of all events in α are sorted in a structure vm, where vm.o = unique(mo(a_1),...,mo(a_m)) contains the occurrences and vm.e contains the corresponding event types; the set maxnO(S,α,k) is then obtained from it. For each occurrence vm.o(i), the corresponding set of events located between tm(i) and tm(i + (m-1)k) is evaluated to search for the most proper occurrence.
The structure of the algorithm is as follows. Each
parallel occurrence of an episode is selected in three
Algorithm 2: parallelMethod.
Input: An event sequence S, a candidate episode α = ⟨a_1 · a_2 · ... · a_m⟩, the maximal gap k, occurrences of the events in α, i.e., mo(a_1),...,mo(a_m).
Output: The maximal non-redundant occurrences of α, maxnO(S,α,k).
Procedure:
1: Initialise maxnO(S,α,k) ← {}
2: vm ← unique(mo(a_1),...,mo(a_m))
3: for i = 1 to |vm| do
4:   if vm(i) ∉ maxnO(S,α,k) then
5:     //Create a list of likely occurrences
6:     list ← vm(i) to vm(i + (m-1)k) for all vm.o ∉ maxnO(S,α,k)
7:     //Sort the most probable serial episode
8:     α_s ← unique(list.e)
9:     for j = 1 to |α| do
10:      Oaux.α_j ← list.o such that list.e = α_j
11:    //Find the most proper occurrence
12:    if α ⊆ α_s then
13:      pO ← serialMethod(list, α_s, Oaux, k)
14:      if pO = {} then
15:        for j = 2 to |α| - 1 do
16:          α_s ← reorder(α_s)
17:          pO ← serialMethod(list, α_s, Oaux, k)
18:          if pO ≠ {} then
19:            break
20:      if pO ≠ {} then
21:        Add pO to maxnO(S,α,k)
phases. In the first phase (line 6), for each occurrence in vm.o(i) that has not been considered in previous occurrences, a list with the occurrences between vm.o(i) and vm.o(i + (m-1)k) is created.
In the second phase (lines 8-10), the occurrences of each event are saved in an auxiliary list Oaux. The most probable serial episode α_s is extracted (line 8) from list.e using the function unique. This function selects the first event of each type in α that appears in list.e.
Finally, the most proper occurrence pO is extracted (lines 13-21) using the method for serial episodes with α_s, list, Oaux and k as inputs. Each parallel occurrence pO is added to maxnO (line 21) and constitutes the output of the algorithm.
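For illustration, a simplified Python sketch of parallel-occurrence counting is given below. It treats a parallel occurrence as one unused event per episode type lying within a window of (m-1)·k from the starting event; this approximates, but does not reproduce, Algorithm 2.

from typing import List, Tuple

def find_parallel_occurrences(sequence: List[Tuple[str, float]],
                              episode: List[str],
                              max_gap: float) -> List[List[int]]:
    m = len(episode)
    used = [False] * len(sequence)          # each event used at most once
    occurrences: List[List[int]] = []
    for start, (etype, t0) in enumerate(sequence):
        if used[start] or etype not in episode:
            continue
        needed = list(episode)              # event types still to be matched
        picked: List[int] = []
        for i in range(start, len(sequence)):
            e, t = sequence[i]
            if t - t0 > (m - 1) * max_gap:
                break                       # window of (m-1)*k exceeded
            if used[i] or e not in needed:
                continue
            needed.remove(e)
            picked.append(i)
            if not needed:
                break
        if not needed:                      # one event per episode type found
            for i in picked:
                used[i] = True
            occurrences.append(picked)
    return occurrences

# Toy usage: <E.F.G> occurs twice here, irrespective of the event order.
S = [("F", 0.00), ("E", 0.01), ("G", 0.02), ("E", 0.10), ("G", 0.11), ("F", 0.12)]
print(find_parallel_occurrences(S, ["E", "F", "G"], max_gap=0.05))  # [[0, 1, 2], [3, 4, 5]]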
3 SIGNIFICANT EPISODES
Frequent episodes are those with a number of occurrences greater than a fixed threshold (min_fr), usually predefined by the user; but only some of them are really significant for knowledge discovery purposes. The relevance of frequent episodes has to be evaluated to find representative connections between events describing patterns. The confidence of the episode rule can be used for that purpose.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
326
Table 1: Number of frequent and maximal episodes, and patterns found in the synthetic sequence for several values of k.

                                                 k=0.01 s  k=0.02 s  k=0.03 s  k=0.04 s  k=0.05 s
Frequent episodes                                      28       132       389      1149      4354
Maximal episodes                                       16        35       103       252       902
Patterns:
Q_f{conf ≥ 0.8}                                         0        12        32        89       456
Q_f{conf_B ≥ 0.5}                                       1        16        67       208       824
Q_f{coh ≥ 0.8}                                          6         9        17        24        46
Q_f{conf ≥ 0.8 ∧ conf_B ≥ 0.5}                          0         5        20        77       413
Q_f{coh ≥ 0.8 ∧ conf_B ≥ 0.5}                           1         4         6        13        32
Q_f{conf ≥ 0.8 ∧ coh ≥ 0.8}                             0         4         5         6        14
Q_f{conf ≥ 0.8 ∧ coh ≥ 0.8 ∧ conf_B ≥ 0.5}              0         3         4         5        11
In the following paragraphs, this evaluation criterion and two new ones proposed by the authors (the level of cohesion and the level of backward-confidence) are defined.
Confidence of an Episode: The confidence of an episode α, conf(α), is the ratio between the frequency of the episode and the frequency of its prefix (Mannila et al., 1997), (Gan and Dai, 2011). The episodes whose confidence is greater than a threshold, min_conf, are called episode rules and can be considered relevant for reasoning tasks. These rules can be interpreted as the probability of occurrence of the whole episode once its prefix has occurred.

conf(α) = fr(α) / fr(prefix(α))    (1)
Cohesion of an Episode: The cohesion of an episode α, coh(α), is defined as the ratio between the number of serial and parallel occurrences. This index measures the strength of the order relation expressed by the serial episode with respect to other episodes in the sequence containing the same events in a different order (parallel episodes).

coh(α) = fr_serial(α) / fr_parallel(α)    (2)
Backward-confidence of an Episode: Given that an episode α is the backward-extension super-episode of its large suffix (Zhou et al., 2010), we define the backward-confidence of an episode α, conf_B(α), as the ratio between the frequency of the episode and the frequency of its large suffix. This index measures the probability of occurrence of an episode given the frequency of its large suffix, i.e., it reveals information about the origin of the episode.

conf_B(α) = fr(α) / fr(lsuffix(α))    (3)
Extraction of Patterns: The significance of frequent episodes can be obtained from their corresponding levels of confidence, cohesion and backward-confidence as a quality factor Q_f defined as:

Q_f(α) = f(conf(α), coh(α), conf_B(α))    (4)

Restrictions of this quality factor are set by the user according to the discovery goals. The criterion can include one or several indexes combined in different ways. As an example, a possible index could be defined by Q_f ≡ {conf ≥ min_conf} or Q_f ≡ {conf ≥ min_conf ∧ coh ≥ min_coh ∧ conf_B ≥ min_conf_B}.
Finally, to avoid redundant information, only the maximal frequent episodes with a significant Q_f are retained as patterns, following the criteria proposed by (Doucet and Ahonen-Myka, 2006).
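As a hedged illustration of how such a quality factor might be applied, the sketch below computes the three indexes from pre-computed frequency counts and keeps an episode only when all user-set thresholds are met. The function names and the way frequencies are passed in are our assumptions, not the authors' implementation; the formulas follow Eqs. (1)-(3) and the thresholds match those used in Section 4.

def confidence(fr_alpha: int, fr_prefix: int) -> float:
    # Eq. (1): fr(alpha) / fr(prefix(alpha))
    return fr_alpha / fr_prefix if fr_prefix else 0.0

def cohesion(fr_serial: int, fr_parallel: int) -> float:
    # Eq. (2): fr_serial(alpha) / fr_parallel(alpha)
    return fr_serial / fr_parallel if fr_parallel else 0.0

def backward_confidence(fr_alpha: int, fr_lsuffix: int) -> float:
    # Eq. (3): fr(alpha) / fr(lsuffix(alpha))
    return fr_alpha / fr_lsuffix if fr_lsuffix else 0.0

def is_pattern(fr_alpha: int, fr_prefix: int, fr_lsuffix: int, fr_parallel: int,
               min_conf: float = 0.8, min_coh: float = 0.8, min_conf_b: float = 0.5) -> bool:
    # Quality factor Q_f combining the three criteria, as in Table 1.
    return (confidence(fr_alpha, fr_prefix) >= min_conf
            and cohesion(fr_alpha, fr_parallel) >= min_coh
            and backward_confidence(fr_alpha, fr_lsuffix) >= min_conf_b)

# Example: an episode seen 24 times, prefix 26, large suffix 30, parallel 25.
print(is_pattern(fr_alpha=24, fr_prefix=26, fr_lsuffix=30, fr_parallel=25))  # True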
4 EXPERIMENTAL RESULTS
Frequent episodes and patterns are extracted using the new frequency measure method and the indexes proposed in this paper. Some results are presented and discussed using a synthetic sequence as a toy example.
The synthetic sequence was generated by embedding two patterns ⟨L,M,N⟩ and ⟨E,F,G,H⟩ into a random stream of events using an alphabet of 14 event types. The total sequence time is 5 s and 0.01 s is the average time between events. The details of the sequence generator can be consulted in (Patnaik, 2006).
Table 1 summarises the main results for this sequence, using as minimum threshold min_fr = 20 occurrences (the frequency of the least frequent event) and several maximal gaps k between events. As quality factor Q_f for pattern extraction, we have fixed min_conf = 0.8, min_coh = 0.8 and min_conf_B = 0.5 (based on the corresponding values of the embedded patterns). All the frequent episodes are evaluated, using one or several indexes, but only the maximal episodes are retained to avoid redundant information.
The first row of Table 1 shows the total number of frequent episodes for different values of k. It can be observed that their number increases rapidly as k is relaxed. The corresponding number of maximal episodes is shown in the second row of the table. Their number is still significantly high, given that the number of embedded patterns is only two.
From the third row onwards, the number of patterns
FrequentandSignificantEpisodesinSequencesofEvents-ComputationofaNewFrequencyMeasurebasedonIndividual
OccurrencesoftheEvents
327
Table 2: Patterns extracted using Q_f{conf ≥ 0.8 ∧ coh ≥ 0.8 ∧ conf_B ≥ 0.5} as selection criteria.

k=0.02 s      k=0.03 s      k=0.04 s
⟨L,M,N⟩       ⟨L,M,N⟩       ⟨L,M,N⟩
⟨G,H,G,H⟩     ⟨G,H,G,H⟩     ⟨M,N,M,N⟩
⟨E,F,G,H⟩     ⟨F,G,F,G⟩     ⟨E,F,G,H⟩
              ⟨E,F,G,H⟩     ⟨E,F,G,F,G⟩
                            ⟨G,H,G,H,G,H⟩
extracted using one or several of the proposed criteria is shown. For values of k other than 0.01 s, the number of patterns obtained using the min_coh criterion is smaller than the number extracted using the min_conf or min_conf_B criteria, respectively. With a combination of two criteria, the best result (the smallest number of patterns) is obtained using min_conf and min_coh. However, the combination of the three criteria delivers much better results.
Table 2 shows the patterns extracted with the combination of the three criteria (conf ∧ coh ∧ conf_B) for k = 0.02 s to k = 0.04 s. The two patterns ⟨L,M,N⟩ and ⟨E,F,G,H⟩ embedded in the sequence were extracted satisfactorily (except for k = 0.01 s). As the constraint on the maximal gap is relaxed (k increases), other frequent patterns involving mainly the frequent events F, G and H begin to be significant.
This example shows that the proposed indexes of cohesion (coh) and backward-confidence (conf_B) may be helpful in the selection of the most significant patterns, improving the results obtained by the simple extraction of maximal episodes or episode rules.
5 CONCLUSIONS
The problem of discovering the significance of episodes (patterns) has been analysed, and two new indexes called cohesion and backward-confidence of the episodes have been proposed to improve pattern discovery from frequent episodes. A new method to find the maximal number of serial and parallel occurrences has also been presented. Experimental results using a synthetic sequence show that both the indexes and the algorithms proposed are useful to search for significant patterns in sequences of events.
Establishing the properties of the method, as well as assessing its performance against similar frameworks using real and synthetic data, is part of the work in progress.
ACKNOWLEDGEMENTS
This work was supported by the research project "Monitorización Inteligente de la Calidad de la Energía Eléctrica" (DPI2009-07891) funded by the Ministerio de Ciencia e Innovación (Spain) and FEDER. It also received the support of the Comissionat per a Universitats i Recerca del Departament d'Innovacio, Universitats i Empresa of the Generalitat de Catalunya and of the European Social Fund under the FI grant 2012FI B2 00119.
REFERENCES
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Int. Conf. Very Large Data Bases (VLDB'94).
Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. In Int. Conf. Data Engineering (ICDE'95), pages 3-14.
Casas-Garriga, G. (2003). Discovering unbounded episodes in sequential data. In Lavrac, N., Gamberger, D., Todorovski, L., and Blockeel, H., editors, Knowledge Discovery in Databases: PKDD 2003, volume 2838 of Lecture Notes in Computer Science, pages 83-94. Springer Berlin / Heidelberg.
Doucet, A. and Ahonen-Myka, H. (2006). Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences. Proceedings of IS-LTC, 2006:186-191.
Gan, M. and Dai, H. (2010). A study on the accuracy of frequency measures and its impact on knowledge discovery in single sequences. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pages 859-866.
Gan, M. and Dai, H. (2011). Fast mining of non-derivable episode rules in complex sequences. In Torra, V., Narakawa, Y., Yin, J., and Long, J., editors, Modeling Decision for Artificial Intelligence, volume 6820 of Lecture Notes in Computer Science, pages 67-78. Springer Berlin / Heidelberg.
Iwanuma, K., Ishihara, R., Takano, Y., and Nabeshima, H. (2005). Extracting frequent subsequences from a single long data sequence: a novel anti-monotonic measure and a simple on-line algorithm. In Data Mining, Fifth IEEE International Conference on, page 8 pp.
Laxman, S., Sastry, P., and Unnikrishnan, K. (2007). Discovering frequent generalized episodes when events persist for different durations. Knowledge and Data Engineering, IEEE Transactions on, 19(9):1188-1201.
Laxman, S., Sastry, P. S., and Unnikrishnan, K. P. (2004). Fast algorithms for frequent episode discovery in event sequences. Technical report, CL-2004-04/MSR, GM R&D Center, Warren.
Mannila, H., Toivonen, H., and Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259-289.
Patnaik, D. (2006). Application of frequent episode framework in microelectrode array data analysis. Master's thesis, Dept. Electrical Engineering, Indian Institute of Science, Bangalore.
Zhou, W., Liu, H., and Cheng, H. (2010). Mining closed episodes from event sequences efficiently. In Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (1), pages 310-318.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
328