Frequent and Significant Episodes in Sequences of Events
Computation of a New Frequency Measure based on Individual Occurrences of the
Events
Oscar Quiroga, Joaquim Meléndez and Sergio Herraiz
Institute of Informatics and Applications, University of Girona, 17071 Girona, Catalonia, Spain
Keywords:
Data Mining, Event Sequences, Frequent Episodes, Pattern Discovery.
Abstract:
Pattern discovery in event sequences is based on the mining of frequent episodes. Patterns are the result of the
assessment of frequent episodes using episode rules. However, a simple search usually yields a huge number
of frequent episodes and rules; therefore, methods to recognise the most significant patterns and to properly
measure the frequency of the episodes are required. In this paper, two new indexes, called cohesion and
backward-confidence of an episode, are proposed to help in the extraction of significant patterns. In addition,
two methods to find the maximal number of non-redundant occurrences of serial and parallel episodes are
presented. Experimental results demonstrate the compactness of the mining result and the efficiency of our
mining algorithms.
1 INTRODUCTION
Frequent episode mining in sequences of events can be applied in many domains, for example telecommunication networks (Mannila et al., 1997), web access pattern discovery, protein family analysis (Casas-Garriga, 2003), fault prognosis based on logs of a manufacturing plant (Laxman et al., 2007), the study of multi-neuronal spike train recordings (Patnaik, 2006) or event tracking for news stories (Iwanuma et al., 2005). The starting point is a data set organised as a single long sequence of events, where each event is described by its type and its time of occurrence.
Pattern discovery in event sequences includes two main steps: frequent episode extraction and significant episode recognition. Frequent episode extraction is usually a process that starts by looking for frequent single events (frequent events) and, following an iterative procedure, builds candidate episodes from the frequent episodes found at the previous level. Candidates that reach a minimum frequency threshold are classified as frequent episodes (Agrawal and Srikant, 1995). Several methods have been proposed to cope with particularities of events and episodes, considering for example duration, maximal gap between events, overlapping/non-overlapping of episodes, minimal occurrences, etc. The frequency of an episode, i.e., the number of occurrences over a sequence, may vary among different algorithms; results depend on these procedures but also on how events are distributed in the sequence (Gan and Dai, 2010).
The relevance of frequent episodes has to be evaluated to find representative connections between events describing patterns. The confidence of the episode rule can be used for that purpose (Agrawal and Srikant, 1994). However, a simple search usually finds a huge number of frequent episodes and rules. Thus, to extract the most relevant information it is necessary to use auxiliary criteria (Gan and Dai, 2011).
In this paper, an improved frequency measure method and two new indexes to assess the significance of episodes are proposed. The method locates and counts the largest number of non-redundant occurrences of an episode, and is applicable to both serial and parallel episodes. The new indexes are proposed as complementary criteria to the confidence of the episode rules. The first one, named cohesion of the episode, is based on the comparison of the number of serial and parallel occurrences, whereas the second, named backward-confidence of the episode, is analogous to the confidence of the episode rule but evaluates the beginning of the episode.
The usefulness of the method and the proposed indexes is evaluated with synthetic event sequences.
2 FMINEVENT
Fminevent is the short name of the frequency measure method based on individual occurrences of the events proposed in this paper. The method is described using terms commonly found in the sequence data mining literature.
Event: An event is defined by the pair (e,t), where t denotes the occurrence time (timestamp) and e represents the event attributes (one or several) that contain the information useful to characterise the event. Event attributes can be a single label or a vector of continuous/discrete attribute-value pairs defined in a given range or set of predefined values.
Sequence of Events: A sequence of events S is defined as an ordered list of events, i.e., an n-tuple S = ⟨(e_1,t_1),(e_2,t_2),...,(e_n,t_n)⟩, where t_i ≤ t_{i+1} for all i ∈ {1,2,...,n-1}. The length of S, |S|, is n. In single sequence mining, the events are represented categorically from a finite set of event types.
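For concreteness, the following minimal Python sketch shows one possible in-memory representation of such an event sequence; the names Event and sequence are illustrative choices, not notation from the paper.

from typing import List, NamedTuple

class Event(NamedTuple):
    etype: str   # event type (categorical label)
    time: float  # occurrence time (timestamp)

# Example sequence S = <(A,0.01), (B,0.02), (A,0.03), (C,0.05)>
sequence: List[Event] = [
    Event("A", 0.01), Event("B", 0.02), Event("A", 0.03), Event("C", 0.05),
]
# Events must be ordered by non-decreasing timestamp.
assert all(sequence[i].time <= sequence[i + 1].time for i in range(len(sequence) - 1))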
Episode: An episode α is an ordered list of characterised events of E of the form α = ⟨a_1,a_2,...,a_m⟩ with a_j ∈ E for all j = 1,...,m. The size of α is the number of elements of α, that is |α| = m. An episode α imposes a constraint on the relative order of occurrences of the a_j: if the event type a_j occurs before the event type a_{j+1} for all j = 1,...,m-1, it is a serial episode. If there are no constraints on the order of their appearances, it is called a parallel episode and is denoted as α = ⟨a_1 · a_2 · ... · a_m⟩.
Sub-episode, Super-episode and Maximal Frequent Episode: An episode β = ⟨b_1,b_2,...,b_m⟩ is a sub-episode of another episode α = ⟨a_1,a_2,...,a_n⟩ if there exist 1 ≤ i_1 < i_2 < ... < i_m ≤ n such that b_j = a_{i_j} for all j = 1,2,...,m. In this case, α is a super-episode of β. If i_1 = 1, i_2 = 2, ..., i_m = m, then α is called the forward-extension super-episode of β. If i_1 = n-m+1, i_2 = n-m+2, ..., i_m = n, then α is called the backward-extension super-episode of β. When an episode does not have a super-episode it is called a maximal frequent episode.
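As an illustration of the sub-episode relation, the short Python sketch below checks whether one serial episode is a sub-episode of another; the function name is ours and is not part of the paper's notation.

from typing import Sequence

def is_subepisode(beta: Sequence[str], alpha: Sequence[str]) -> bool:
    # beta is a sub-episode of alpha if its elements appear in alpha
    # in the same order, not necessarily contiguously.
    j = 0
    for a in alpha:
        if j < len(beta) and a == beta[j]:
            j += 1
    return j == len(beta)

# <B,D> is a sub-episode of <A,B,C,D>; <D,B> is not.
assert is_subepisode(["B", "D"], ["A", "B", "C", "D"])
assert not is_subepisode(["D", "B"], ["A", "B", "C", "D"])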
Occurrences: An episode α = ⟨a_1,a_2,...,a_m⟩ occurs in a sequence of events S = ⟨(e_1,t_1),(e_2,t_2),...,(e_n,t_n)⟩ if there is at least one ordered sequence of events S' = ⟨(e_{i_1},t_{i_1}),(e_{i_2},t_{i_2}),...,(e_{i_m},t_{i_m})⟩ such that S' ⊆ S and a_j = e_{i_j} for all j = 1,2,...,m. Usually an occurrence is denoted as o = ⟨i_1,i_2,...,i_m⟩, where o[j] = i_j and j = 1,2,...,m.
Non-redundant Occurrences: A set of occurrences of an episode α is called non-redundant if for any two occurrences o = ⟨i_1,i_2,...,i_m⟩ and o' = ⟨i'_1,i'_2,...,i'_m⟩ no event occurs simultaneously in both, i.e., e_{i_j} ≠ e_{i'_j} for all j ∈ {1,2,...,m}.
Suffix, Prefix and Large Suffix of an Episode: The suffix of an episode α is defined as the episode composed of the last element of α, the prefix of α is the episode composed of all elements in α except the last one, and the large suffix is the episode composed of the elements in α except the first one. That is, if α = ⟨a_1,a_2,...,a_m⟩, then suffix(α) = ⟨a_m⟩, prefix(α) = ⟨a_1,a_2,...,a_{m-1}⟩ and the large suffix lsuffix(α) = ⟨a_2,...,a_m⟩. For example, if α = ⟨A,B,C⟩, then suffix(α) = ⟨C⟩, prefix(α) = ⟨A,B⟩ and lsuffix(α) = ⟨B,C⟩.
Anti-monotonicity: It is a common principle that frequency measure methods should obey in frequent pattern mining. This principle states that the frequency of an episode must be less than or equal to the frequency of its sub-episodes, i.e., two episodes α and β in a sequence, where α is a sub-episode of β, follow the principle of anti-monotonicity if freq(β) ≤ freq(α).
2.1 Serial Occurrences
Given a sequence of events S = ⟨(e_1,t_1),(e_2,t_2),...,(e_n,t_n)⟩, a candidate episode α = ⟨a_1,a_2,...,a_m⟩ and a maximal gap between events max_gap = k, Algorithm 1 returns the set of maximal non-redundant occurrences, maxnO. First, for m = 1, the occurrences of the episode are the same minimal occurrences of the event a_1, maxnO(S,α,k) = mo(a_1). Then, for m > 1, maxnO(S,α,k) is obtained by properly joining each occurrence of a_1 with the occurrences of a_2,...,a_m located between the corresponding t_1 and t_1 + (m-1)k.
For simplicity, let each t_i in S take values from j = 1,2,..., so that t_i = j means the i-th data element occurs at the j-th timestamp. The algorithm has a two-phase structure. In the first phase (lines 4-9), a list is created for each occurrence of a_1, mo(a_1)(i), containing the occurrences of the other events (a_j) within the constraint k, where list.a_1 = mo(a_1)(i) and list.a_j = mo(a_j) such that list.a_{j-1}(1) < mo(a_j) ≤ list.a_{j-1}(end) + k for j = 2,...,m.
In the second phase (lines 11-17), the most proper serial occurrence sO is selected from the list. The most proper occurrence is composed of the leftmost occurrence of each event found in list.a_j that meets the restrictions of k between events. This is done starting with the first occurrence of the last event from the list, that is list.a_m(1) (line 12), and in an iterative procedure the leftmost occurrences of the other events within k are located (lines 13-17). Each serial occurrence sO is added to maxnO (line 18) and constitutes the output of the algorithm.
Note that to search for the occurrences of an episode (of any size), the method requires only the single-event occurrences, without using its sub-episodes.
FrequentandSignificantEpisodesinSequencesofEvents-ComputationofaNewFrequencyMeasurebasedonIndividual
OccurrencesoftheEvents
325
Algorithm 1: serialMethod.
Input: An event sequence S, a candidate episode α = ⟨a_1,a_2,...,a_m⟩, the maximal gap k, occurrences of the events in α, i.e., mo(a_1),...,mo(a_m).
Output: The maximal non-redundant occurrences of α, maxnO(S,α,k).
Procedure:
1: Initialise maxnO(S,α,k) ← {}
2: for i = 1 to |mo(a_1)| do
3:   //From each mo(a_1) create a list of candidate occurrences.
4:   if mo(a_1)(i) ∉ maxnO(S,α,k) then
5:     list.a_1 ← mo(a_1)(i)
6:     for j = 2 to |α| do
7:       oc ← mo(a_j) such that list.a_{j-1}(1) < mo(a_j) ≤ list.a_{j-1}(end) + k and mo(a_j) ∉ maxnO(S,α,k)
8:       if oc ≠ {} then
9:         list.a_j ← mo(a_j)(oc)
10:    //From list select the most proper occurrence.
11:    if size(list) = |α| then
12:      sO ← list.a_m(1)
13:      for j = m-1 to 1 do
14:        for kk = 1 to |list.a_j| do
15:          if sO(1) - list.a_j(kk) ≤ k then
16:            sO ← [list.a_j(kk) sO]
17:            break
18:      Add sO to maxnO(S,α,k)
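As a complement to the pseudocode, the following Python sketch gives a simplified greedy search for non-redundant serial occurrences under a maximal gap constraint. It is an illustration under our own assumptions (greedy left-to-right matching, each event used at most once), not a line-by-line transcription of Algorithm 1.

from typing import List, Tuple

def find_serial_occurrences(sequence: List[Tuple[str, float]],
                            episode: List[str],
                            max_gap: float) -> List[List[int]]:
    # Greedy left-to-right search: consecutive episode events must be at most
    # max_gap apart in time, and every sequence event is used in at most one
    # occurrence, which keeps the occurrence set non-redundant.
    occurrences: List[List[int]] = []
    match: List[int] = []          # indices of the partial occurrence
    for i, (etype, t) in enumerate(sequence):
        j = len(match)             # next episode position to fill
        if j > 0 and t - sequence[match[-1]][1] > max_gap:
            match = []             # gap constraint violated: restart the match
            j = 0
        if j < len(episode) and etype == episode[j]:
            match.append(i)
            if len(match) == len(episode):
                occurrences.append(match)
                match = []         # consume the events of this occurrence
    return occurrences

# Toy usage: the serial episode <E,F,G> occurs twice with max_gap = 0.2.
S = [("E", 0.0), ("F", 0.1), ("X", 0.15), ("G", 0.2), ("E", 0.3), ("F", 0.35), ("G", 0.5)]
print(find_serial_occurrences(S, ["E", "F", "G"], max_gap=0.2))  # [[0, 1, 3], [4, 5, 6]]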
2.2 Parallel Occurrences
In a parallel episode there are no constraints on the partial order of the events. The occurrences of a parallel episode include the occurrences of the serial episodes composed of the same events, and its frequency is equal to or greater than that of any serial episode composed of the same event types. Methods to measure frequency based on fixed window width and non-overlapped occurrences have been reported in (Mannila et al., 1997) and (Laxman et al., 2004), respectively.
Given a sequence of events S, a candidate parallel episode α = ⟨a_1 · a_2 · ... · a_m⟩ and a maximal gap between events max_gap = k, Algorithm 2 returns the set of maximal non-redundant occurrences maxnO. For m = 1, the occurrences of the episode are the same minimal occurrences of the event a_1, maxnO(S,α,k) = mo(a_1). For m > 1, the occurrences of all events in α are sorted in a structure vm, where vm.o = unique(mo(a_1),...,mo(a_m)) contains the occurrences and vm.e contains the corresponding event types; the set maxnO(S,α,k) is then obtained from it. For each occurrence vm.o(i), the corresponding set of events located between tm(i) and tm(i + (m-1)k) is evaluated to search for the most proper occurrence.
The structure of the algorithm is as follows. Each
parallel occurrence of an episode is selected in three
Algorithm 2: parallelMethod.
Input: An event sequence S, a candidate episode α = ⟨a_1 · a_2 · ... · a_m⟩, the maximal gap k, occurrences of the events in α, i.e., mo(a_1),...,mo(a_m).
Output: The maximal non-redundant occurrences of α, maxnO(S,α,k).
Procedure:
1: Initialise maxnO(S,α,k) ← {}
2: vm ← unique(mo(a_1),...,mo(a_m))
3: for i = 1 to |vm| do
4:   if vm(i) ∉ maxnO(S,α,k) then
5:     //Create a list of likely occurrences
6:     list ← vm(i) to vm(i + (m-1)k) for all vm.o ∉ maxnO(S,α,k)
7:     //Sort the most probable serial episode
8:     α_s ← unique(list.e)
9:     for j = 1 to |α| do
10:      Oaux.α_j ← list.o such that list.e = α_j
11:    //Find the most proper occurrence
12:    if α ⊆ α_s then
13:      pO ← serialMethod(list, α_s, Oaux, k)
14:      if pO = {} then
15:        for j = 2 to |α| - 1 do
16:          α_s ← reorder(α_s)
17:          pO ← serialMethod(list, α_s, Oaux, k)
18:          if pO ≠ {} then
19:            break
20:      if pO ≠ {} then
21:        Add pO to maxnO(S,α,k)
phases. In the first phase (line 6), for each occurrence in vm.o(i) that has not been considered in previous occurrences, a list with the occurrences between vm.o(i) and vm.o(i + (m-1)k) is created.
In the second phase (lines 8-10), the occurrences of each event are saved in an auxiliary list Oaux. The most probable serial episode α_s is extracted (line 8) from list.e using the function unique. This function selects the first event of each type in α that appears in list.e.
Finally, the most proper occurrence pO is extracted (lines 13-21) using the method for serial episodes with α_s, list, Oaux and k as inputs. Each parallel occurrence pO is added to maxnO (line 21) and constitutes the output of the algorithm.
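For illustration, a simplified Python sketch of parallel-occurrence counting is given below. It treats a parallel occurrence as one unused event per episode type lying within a window of (m-1)·k from the starting event; this approximates, but does not reproduce, Algorithm 2.

from typing import List, Tuple

def find_parallel_occurrences(sequence: List[Tuple[str, float]],
                              episode: List[str],
                              max_gap: float) -> List[List[int]]:
    m = len(episode)
    used = [False] * len(sequence)          # each event used at most once
    occurrences: List[List[int]] = []
    for start, (etype, t0) in enumerate(sequence):
        if used[start] or etype not in episode:
            continue
        needed = list(episode)              # event types still to be matched
        picked: List[int] = []
        for i in range(start, len(sequence)):
            e, t = sequence[i]
            if t - t0 > (m - 1) * max_gap:
                break                       # window of (m-1)*k exceeded
            if used[i] or e not in needed:
                continue
            needed.remove(e)
            picked.append(i)
            if not needed:
                break
        if not needed:                      # one event per episode type found
            for i in picked:
                used[i] = True
            occurrences.append(picked)
    return occurrences

# Toy usage: <E.F.G> occurs twice here, irrespective of the event order.
S = [("F", 0.00), ("E", 0.01), ("G", 0.02), ("E", 0.10), ("G", 0.11), ("F", 0.12)]
print(find_parallel_occurrences(S, ["E", "F", "G"], max_gap=0.05))  # [[0, 1, 2], [3, 4, 5]]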
3 SIGNIFICANT EPISODES
Frequent episodes are those with a number of occurrences greater than a fixed threshold (min_fr), usually predefined by the user; but only some of them are really significant for knowledge discovery purposes. The relevance of frequent episodes has to be evaluated to find representative connections between events describing patterns. The confidence of the episode rule can be used for that purpose.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
326
Table 1: Number of frequent and maximal episodes, and patterns found in the synthetic sequence for several values of k.

                                                 k=0.01 s  k=0.02 s  k=0.03 s  k=0.04 s  k=0.05 s
Frequent episodes                                      28       132       389      1149      4354
Maximal episodes                                       16        35       103       252       902
Patterns:
Q_f{conf ≥ 0.8}                                         0        12        32        89       456
Q_f{conf_B ≥ 0.5}                                       1        16        67       208       824
Q_f{coh ≥ 0.8}                                          6         9        17        24        46
Q_f{conf ≥ 0.8 ∧ conf_B ≥ 0.5}                          0         5        20        77       413
Q_f{coh ≥ 0.8 ∧ conf_B ≥ 0.5}                           1         4         6        13        32
Q_f{conf ≥ 0.8 ∧ coh ≥ 0.8}                             0         4         5         6        14
Q_f{conf ≥ 0.8 ∧ coh ≥ 0.8 ∧ conf_B ≥ 0.5}              0         3         4         5        11
In the following paragraphs, this evaluation criterion and two new ones proposed by the authors (the level of cohesion and the level of backward-confidence) are defined.
Confidence of an Episode: The confidence of an episode α, conf(α), is the ratio between the frequency of the episode and the frequency of its prefix (Mannila et al., 1997), (Gan and Dai, 2011). The episodes whose confidence is greater than a threshold, min_conf, are called episode rules and can be considered relevant for reasoning tasks. These rules can be interpreted as the probability of occurrence of the whole episode once its prefix has occurred.

conf(α) = fr(α) / fr(prefix(α))    (1)
Cohesion of an Episode: The cohesion of an episode α, coh(α), is defined as the ratio between the number of serial and parallel occurrences. This index measures the strength of the order relation expressed by the serial episode with respect to other episodes in the sequence containing the same events in a different order (parallel episodes).

coh(α) = fr_serial(α) / fr_parallel(α)    (2)
Backward-confidence of an Episode: Given that an episode α is the backward-extension super-episode of its large suffix (Zhou et al., 2010), we define the backward-confidence of an episode α, conf_B(α), as the ratio between the frequency of the episode and the frequency of its large suffix. This index measures the probability of occurrence of an episode given the frequency of its large suffix, i.e., it reveals information about the origin of the episode.

conf_B(α) = fr(α) / fr(lsuffix(α))    (3)
Extraction of Patterns: The significance of frequent episodes can be obtained from their corresponding levels of confidence, cohesion and backward-confidence as a quality factor Q_f defined as:

Q_f(α) = f(conf(α), coh(α), conf_B(α))    (4)

Restrictions of this quality factor are set by the user according to the discovery goals. The criterion can include one or several indexes combined in different ways. As an example, a possible index could be defined by Q_f ≡ {conf ≥ min_conf} or Q_f ≡ {conf ≥ min_conf ∧ coh ≥ min_coh ∧ conf_B ≥ min_conf_B}.
Finally, to avoid redundant information, only the maximal frequent episodes with a significant Q_f are retained as patterns, following the criteria proposed by (Doucet and Ahonen-Myka, 2006).
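As a hedged illustration of how such a quality factor might be applied, the sketch below computes the three indexes from pre-computed frequency counts and keeps an episode only when all user-set thresholds are met. The function names and the way frequencies are passed in are our assumptions, not the authors' implementation; the formulas follow Eqs. (1)-(3) and the thresholds match those used in Section 4.

def confidence(fr_alpha: int, fr_prefix: int) -> float:
    # Eq. (1): fr(alpha) / fr(prefix(alpha))
    return fr_alpha / fr_prefix if fr_prefix else 0.0

def cohesion(fr_serial: int, fr_parallel: int) -> float:
    # Eq. (2): fr_serial(alpha) / fr_parallel(alpha)
    return fr_serial / fr_parallel if fr_parallel else 0.0

def backward_confidence(fr_alpha: int, fr_lsuffix: int) -> float:
    # Eq. (3): fr(alpha) / fr(lsuffix(alpha))
    return fr_alpha / fr_lsuffix if fr_lsuffix else 0.0

def is_pattern(fr_alpha: int, fr_prefix: int, fr_lsuffix: int, fr_parallel: int,
               min_conf: float = 0.8, min_coh: float = 0.8, min_conf_b: float = 0.5) -> bool:
    # Quality factor Q_f combining the three criteria, as in Table 1.
    return (confidence(fr_alpha, fr_prefix) >= min_conf
            and cohesion(fr_alpha, fr_parallel) >= min_coh
            and backward_confidence(fr_alpha, fr_lsuffix) >= min_conf_b)

# Example: an episode seen 24 times, prefix 26, large suffix 30, parallel 25.
print(is_pattern(fr_alpha=24, fr_prefix=26, fr_lsuffix=30, fr_parallel=25))  # True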
4 EXPERIMENTAL RESULTS
Frequent episodes and patterns are extracted using the new frequency measure method and the indexes proposed in this paper. Some results are presented and discussed using a synthetic sequence as a toy example.
The synthetic sequence was generated by embedding two patterns ⟨L,M,N⟩ and ⟨E,F,G,H⟩ into a random stream of events using an alphabet of 14 event types. The total sequence time is 5 s and 0.01 s is the average time between events. The details of the sequence generator can be consulted in (Patnaik, 2006).
Table 1 summarises the main results for this sequence, using as minimum threshold min_fr = 20 occurrences (the frequency of the least frequent event) and several maximal gaps k between events. As quality factor Q_f for pattern extraction, we have fixed min_conf = 0.8, min_coh = 0.8 and min_conf_B = 0.5 (based on the corresponding values of the embedded patterns). All the frequent episodes are evaluated, using one or several indexes, but only the maximal episodes are retained to avoid redundant information.
The first row of Table 1 shows the total number of frequent episodes for different values of k. It can be observed that their number increases rapidly as k is relaxed. The corresponding number of maximal episodes is shown in the second row of the table. Their number is still significantly high, given that the number of embedded patterns is only two.
From the third row onwards, the number of patterns
FrequentandSignificantEpisodesinSequencesofEvents-ComputationofaNewFrequencyMeasurebasedonIndividual
OccurrencesoftheEvents
327
Table 2: Patterns extracted using Q_f{conf ≥ 0.8 ∧ coh ≥ 0.8 ∧ conf_B ≥ 0.5} as selection criteria.

k=0.02 s      k=0.03 s      k=0.04 s
⟨L,M,N⟩       ⟨L,M,N⟩       ⟨L,M,N⟩
⟨G,H,G,H⟩     ⟨G,H,G,H⟩     ⟨M,N,M,N⟩
⟨E,F,G,H⟩     ⟨F,G,F,G⟩     ⟨E,F,G,H⟩
              ⟨E,F,G,H⟩     ⟨E,F,G,F,G⟩
                            ⟨G,H,G,H,G,H⟩
extracted using one or several of the proposed criteria is shown. For values of k other than 0.01 s, the number of patterns obtained using the min_coh criterion is smaller than the number extracted using the min_conf or min_conf_B criteria, respectively. With a combination of two criteria, the best result (the smallest number of patterns) is obtained using min_conf and min_coh. However, the combination of the three criteria delivers much better results.
Table 2 shows the patterns extracted with the combination of the three criteria (conf ∧ coh ∧ conf_B) for k = 0.02 s to k = 0.04 s. The two patterns ⟨L,M,N⟩ and ⟨E,F,G,H⟩ embedded in the sequence were extracted satisfactorily (except for k = 0.01 s). As the constraint on the maximal gap is relaxed (k increases), other frequent patterns involving mainly the frequent events F, G and H begin to be significant.
This example shows that the proposed indexes of cohesion (coh) and backward-confidence (conf_B) may be helpful in the selection of the most significant patterns, improving the results obtained by the simple extraction of maximal episodes or episode rules.
5 CONCLUSIONS
The problem of discovering the significance of episodes (patterns) has been analysed, and two new indexes called cohesion and backward-confidence of the episodes have been proposed to improve pattern discovery from frequent episodes. A new method to find the maximal number of serial and parallel occurrences has also been presented. Experimental results using a synthetic sequence show that both the indexes and the algorithms proposed are useful to search for significant patterns in sequences of events.
Establishing the properties of the method, as well as assessing its performance against similar frameworks using real and synthetic data, is part of the work in progress.
ACKNOWLEDGEMENTS
This work was supported by the research project "Monitorización Inteligente de la Calidad de la Energía Eléctrica" (DPI2009-07891) funded by the Ministerio de Ciencia e Innovación (Spain) and FEDER. It also received the support of the Comissionat per a Universitats i Recerca del Departament d'Innovacio, Universitats i Empresa of the Generalitat de Catalunya and of the European Social Fund under the FI grant 2012FI B2 00119.
REFERENCES
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Int. Conf. Very Large Data Bases (VLDB'94).
Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. In Int. Conf. Data Engineering (ICDE'95), pages 3-14.
Casas-Garriga, G. (2003). Discovering unbounded episodes in sequential data. In Lavrac, N., Gamberger, D., Todorovski, L., and Blockeel, H., editors, Knowledge Discovery in Databases: PKDD 2003, volume 2838 of Lecture Notes in Computer Science, pages 83-94. Springer Berlin / Heidelberg.
Doucet, A. and Ahonen-Myka, H. (2006). Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences. Proceedings of IS-LTC, 2006:186-191.
Gan, M. and Dai, H. (2010). A study on the accuracy of frequency measures and its impact on knowledge discovery in single sequences. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pages 859-866.
Gan, M. and Dai, H. (2011). Fast mining of non-derivable episode rules in complex sequences. In Torra, V., Narakawa, Y., Yin, J., and Long, J., editors, Modeling Decision for Artificial Intelligence, volume 6820 of Lecture Notes in Computer Science, pages 67-78. Springer Berlin / Heidelberg.
Iwanuma, K., Ishihara, R., Takano, Y., and Nabeshima, H. (2005). Extracting frequent subsequences from a single long data sequence: a novel anti-monotonic measure and a simple on-line algorithm. In Data Mining, Fifth IEEE International Conference on, page 8 pp.
Laxman, S., Sastry, P., and Unnikrishnan, K. (2007). Discovering frequent generalized episodes when events persist for different durations. Knowledge and Data Engineering, IEEE Transactions on, 19(9):1188-1201.
Laxman, S., Sastry, P. S., and Unnikrishnan, K. P. (2004). Fast algorithms for frequent episode discovery in event sequences. Technical report, CL-2004-04/MSR, GM R&D Center, Warren.
Mannila, H., Toivonen, H., and Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259-289.
Patnaik, D. (2006). Application of frequent episode framework in microelectrode array data analysis. Master's thesis, Dept. Electrical Engineering, Indian Institute of Science, Bangalore.
Zhou, W., Liu, H., and Cheng, H. (2010). Mining closed episodes from event sequences efficiently. In Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (1), pages 310-318.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
328