LEARNING DYNAMIC BAYESIAN NETWORKS
WITH THE TOM4L PROCESS
Ahmad Ahdab and Marc Le Goc
LSIS, UMR CNRS 6168, Université Paul Cézanne, Domaine Universitaire St Jérôme, 13397 Marseille cedex 20, France
Keywords: Machine Learning, Bayesian Network, Stochastic Representation, Data Mining, Knowledge Discovery.
Abstract: This paper addresses the problem of learning a Dynamic Bayesian Network from timed data without prior
knowledge of the system. One of the main problems of learning a Dynamic Bayesian Network is building
and orienting the edges of the network while avoiding loops. The problem is more difficult when the data are timed.
This paper proposes a new algorithm to learn the structure of a Dynamic Bayesian Network and to orient its
edges from the timed data contained in a given timed database. This algorithm is based on an adequate
representation of a set of sequences of timed data and uses an information-based measure of the relation
between two classes of timed observations. This algorithm is a part of the Timed Observation Mining for Learning (TOM4L)
process that is based on the Theory of Timed Observations. The paper illustrates the algorithm with a
theoretical example before presenting the results of an application to the Apache system of the ArcelorMittal
Steel Group, a real world knowledge based system that diagnoses a galvanization bath.
1 INTRODUCTION
The theory of Timed Observations is the mathematical framework that defines a Knowledge Engineering methodology called the Timed Observation Modeling for Diagnosis methodology (TOM4D) (Le Goc, 2008) (Figure 1) and a learning process called Timed Observation Mining for Learning (TOM4L) (Le Goc, 2006). TOM4D and TOM4L are defined to discover temporal knowledge about a set of functions of the continuous time xi(t) considered as a dynamic system X(t) = {xi(t)} called a process.
According to TOM4D, a model of a process X(t) is a quadruple <PM(X(t)), SM(X(t)), BM(X(t)), FM(X(t))>. The Perception Model PM(X(t)) defines the goals of the process X(t). The Structural Model SM(X(t)) contains the knowledge about the components and their organization in structures. The Behavioral Model BM(X(t)) defines the states and the state transitions that govern the process evolution over time. Finally, the Functional Model FM(X(t)) of the process X(t) defines the mathematical functions linking the values of the process variables xi(t) of X(t). We propose to represent the Functional Model FM(X(t)) of a process as a Bayesian Network.
The problem is then to define the learning
Figure 1: Global Structure of the Project.
principles of a Bayesian Network (BN) from a set of sequences of timed data, without prior knowledge of the process. Most of the proposed algorithms deal with un-timed data and face difficulties in orienting the edges of the resulting graph and in building the conditional probability tables. These two problems become more difficult when the data are timed.
The theory of Timed Observations provides the tools to solve these two problems: the BJ-Measure of (Benayadi, 2008) and an adequate representation of a set of sequences of timed data called the Stochastic Representation (Le Goc, 2005). The BJ-Measure is an informational measure designed to evaluate the quantity of information flowing in the Stochastic Representation of a set of sequences of
timed data. This representation also facilitates the building of the CP Tables.
The next section presents a short description of the state of the art concerning the learning of Dynamic Bayesian Networks (DBN). Section 3 introduces the basis of the Stochastic Representation of TOM4L and the BJ-Measure. Section 4 describes the learning principles we define from the properties of the BJ-Measure; the DBN learning algorithm that we propose is presented in section 5, and an application to a theoretical example is given in section 6 before a real life application of the algorithm is shown in section 7. Our conclusions are presented in section 8.
2 RELATED WORKS
A BN is a couple <G, Θ> where G denotes a Directed Acyclic Graph in which the nodes represent the variables and the edges represent the dependencies between the variables (Pearl, 1988), and Θ is the set of Conditional Probability Tables (CP Tables) defining the conditional probabilities of the values of a variable given the values of the upstream variables of G. BN learning algorithms aim at discovering the couple <G, Θ> from a given database.
BN learning algorithms fall into two main categories: “search and scoring” and “dependency analysis” algorithms. The “search and scoring” learning algorithms can be used when the knowledge of the edge orientation between the variables of the system is given (Cooper, 1992), (Heckerman, 1997). To avoid this problem, dependency analysis algorithms use conditional independence tests (Cheng, 1997), (Cheeseman, 1995), (Friedman, 1998), (Myers et al., 1999). But the number of tests increases exponentially, and so does the computation time (Chickering, 1994).
I(X, Y) = Σx,y P(x, y) · log( P(x, y) / (P(x) · P(y)) )    (1)
For example, Cheng’s algorithm (Cheng, 1997) for learning a BN from data falls in the dependency analysis category and is representative of most of the proposed algorithms. It is based on the d-separation concept of (Pearl, 1988) to infer the structure G of the Bayesian Network, and on the mutual information to detect conditional independence relations. The idea is that the mutual information I(X, Y) (eq. 1) tells (1) when two variables are dependent and (2) how close their relationship is. The algorithm computes the mutual information I(X, Y) between all the pairs of variables (X, Y), producing a list L sorted in descending order: pairs of higher mutual information are supposed to be more related than those having low mutual information values. The list L is then pruned given an arbitrary value of the parameter ε: each pair (X, Y) such that I(X, Y) < ε is eliminated from L. In real world applications, the list L should be kept as small as possible using the parameter ε. This first step (Drafting) creates a structure to start with, but it might miss some edges or add some incorrect ones.
I(X, Y | E) = Σx,y,e P(x, y, e) · log( P(x, y | e) / (P(x | e) · P(y | e)) )    (2)
The second step (Thickening) tries to separate each pair (X, Y) in L using the conditional mutual information I(X, Y | E) (eq. 2), where E is a set of nodes that forms a path between the currently tested nodes X and Y of L. When I(X, Y | E) > ε, the edges of the path E should be added between the current nodes X and Y. This phase continues until the end of the list L is reached. The last step of the algorithm (Thinning) searches, for each edge in the graph, whether there are other paths besides this edge between the two nodes. In that case, the algorithm removes this edge temporarily and tries to separate the two nodes using equation (2). If the two nodes cannot be separated, the temporarily removed edge is put back. After building the DBN structure, the orientation of the edges and the computation of the CP Tables remain to be done. The procedure used by (Cheng, 1997) is based on the idea of searching for the nodes forming a V-structure X→Y←Z using the conditional mutual information, and then trying to deduce the other edges from the discovered ones. This procedure has a major limitation: if a network does not contain a V-structure, no edge can be oriented.
The two main limitations of the methods of the dependency analysis category are thus the need to define the parameter ε and the exponential number of Conditional Independence tests required to orient the edges of the graph.
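To make the drafting step concrete, the following is a minimal Python sketch (not Cheng's original code) of equation (1) and of the pruning of the list L with the parameter ε; the function names and the data layout are assumptions introduced for the illustration.

from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(xs, ys):
    # I(X, Y) of eq. (1) over two aligned lists of discrete values
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def draft(data, epsilon):
    # data: dict variable -> list of observed values; returns the sorted list L
    pairs = [((a, b), mutual_information(data[a], data[b]))
             for a, b in combinations(data, 2)]
    kept = [(p, mi) for p, mi in pairs if mi >= epsilon]    # prune pairs below epsilon
    return sorted(kept, key=lambda t: -t[1])                # descending mutual information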
3 TOM4L FRAMEWORK
The Timed Observation Mining for Learning process (TOM4L) proposes a solution to these two problems (Le Goc, 2006).
The TOM4L framework defines a message timed at tk contained in a database as an occurrence Ci(tk), denoted Ci(k), of an observation class Ci = {(xi, δi)}, which is an arbitrary set of couples (xi, δi) where δi is one of the discrete values of a variable xi. An observation class is often a singleton because, in that case, two classes Ci = {(xi, δi)} and Cj = {(xj, δj)} are only linked with the variables xi and xj when the constants δi and δj are independent (Le Goc, 2006).
The TOM4L framework represents a sequence ω = (…, Ci(k), …) of m occurrences Ci(k), defining a set Cl = {Ci} of n observation classes, under a specific representation called the Stochastic Representation, which is made of a set of matrices.
The TOM4L framework also proposes the BJ-Measure (Benayadi, 2008), which evaluates the homogeneity of the crisscross of the occurrences of two observation classes Ci and Cj in a sequence. This measure considers two abstract binary variables X and Y linked through a discrete binary memoryless channel of a communication system (Shannon, 1949), where X(tk) takes a value Cx in {Ci, ¬Ci} and Y(tk+1) a value Cy in {Cj, ¬Cj} when reading ω (Figure 2, where ¬Ca denotes any class but Ca). With this model, a sequence ω of m occurrences Ci(k) is a sequence of m-1 instances r(Cx → Cy) of a relation r(X → Y).
Figure 2: Discrete Binary Memoryless Channel.
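As an illustration, the following Python sketch builds the pairwise counting part of such a representation, assuming it reduces to counting, for every couple of classes (Ci, Cj), the number nij of times an occurrence of Ci is immediately followed by an occurrence of Cj in ω; this simplified reading of the Stochastic Representation is an assumption made for the example.

from collections import defaultdict

def counting_matrix(sequence):
    # sequence: list of class labels Ci(k) ordered by time tk; returns N = [nij]
    N = defaultdict(int)
    for current, following in zip(sequence, sequence[1:]):   # the m-1 pairs r(Cx -> Cy)
        N[(current, following)] += 1
    return N

# Example: the sequence (C1, C7, C9, C2, C7, C9) yields n17 = 1, n79 = 2, n92 = 1, n27 = 1.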
The BJ-measure is built on the Kullback-Leibler distance D(P(Y|X=Ci) || P(Y)) that evaluates the relation between the distribution of the conditional probability of Y knowing that X(tk) = Ci and the prior probability distribution of Y. One of the properties of this distance is that D(P(Y|X=Ci) || P(Y)) = 0 when P(Y|X=Ci) = P(Y) (i.e. P(Y) and P(X) are independent). The BJ-measure decomposes the Kullback-Leibler distance into two terms around this independence point. The BJL-measure BJL(Ci → Cj) of a binary relation r(Ci → Cj) is the right part of the Kullback-Leibler distance D(P(Y|X=Ci) || P(Y)), so that:
P(Y=Cj | X=Ci) < P(Y=Cj) ⇒ BJL(Ci → Cj) = 0
P(Y=Cj | X=Ci) ≥ P(Y=Cj) ⇒ BJL(Ci → Cj) = D(P(Y|X=Ci) || P(Y))
BJL(Ci → Cj) is non-zero when the observation Ci(k) provides some information about the observation Cj(k+1). Symmetrically, when BJL(Ci → Cj) = 0, the observation Ci(k) provides some information about ¬Cj(k+1). The BJL-measure BJL(Ci → ¬Cj) of a binary relation r(Ci → ¬Cj) is then the left part of the Kullback-Leibler distance:
P(Y=Cj | X=Ci) < P(Y=Cj) ⇒ BJL(Ci → ¬Cj) = D(P(Y|X=Ci) || P(Y))
P(Y=Cj | X=Ci) ≥ P(Y=Cj) ⇒ BJL(Ci → ¬Cj) = 0
Consequently:
D(P(Y|X=Ci) || P(Y)) = BJL(Ci → Cj) + BJL(Ci → ¬Cj)    (3)
Similarly, the BJW-measure evaluates the information distribution between the predecessors (Ci(k) or ¬Ci(k)) of an observation Cj(k+1) at time tk+1:
D(P(X|Y=Cj) || P(X)) = BJW(Ci → Cj) + BJW(Ci → ¬Cj)    (4)
Because (P(Cj|Ci) < P(Cj)) ⇔ (P(Ci|Cj) < P(Ci)), the two measures are null at the same independence point and can be combined in a single measure, called the BJM-measure, which is the norm of the vector (BJL(Ci → Cj), BJW(Ci → Cj)):
M(Ci → Cj) = || (BJL(Ci → Cj), BJW(Ci → Cj)) ||    (5)
The BJ-Measure is not justifiable when the ratio θi,j = ni / nj is greater than 4 or less than 1/4 (Benayadi, 2008). This property is called the θ property. In most real world cases, when this condition is satisfied, the M(Ci → Cj) value is not zero but the eventual relation r(Ci → Cj) cannot be justified with the BJ-measure.
The main property of the BJ-measure is the following: M(Ci → Cj) > 0 means that knowing the timed observation distribution of the Ci class brings information about the timed observation distribution of the Cj class. Consequently, when the BJ-measure of a relation r(Ci → Cj) is negative or null, the relation cannot be used to build the structure of a dynamic Bayesian network.
In other words, considering the positive values only, the BJ-measure M(Ci → Cj) satisfies the following properties:
1. Dissymmetry: M(Ci → Cj) ≠ M(Cj → Ci) (generally)
2. Positivity: ∀Ci, ∀Cj, M(Ci → Cj) ≥ 0
3. Independence: M(Ci → Cj) = 0 ⇔ Ci and Cj are independent (i.e. P(Cj|Ci) = P(Cj))
4. Triangular inequality: M(Ci → Cj) < M(Ci → Ck) + M(Ck → Cj)
This latter property can be used to reason with the BJ-measure in order to deduce the structure of a dynamic Bayesian network.
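The following Python sketch computes BJL, BJW and the BJM-measure of a relation r(Ci → Cj) from a counting matrix such as the one built above, assuming the probabilities are estimated by the relative frequencies of Ci as predecessor and of Cj as successor, and assuming the norm of eq. (5) is the Euclidean norm; these estimation choices are assumptions made for the illustration.

from math import log2

def kl_binary(p, q):
    # Kullback-Leibler distance between the binary distributions (p, 1-p) and (q, 1-q)
    def term(a, b):
        if a == 0:
            return 0.0
        return float('inf') if b == 0 else a * log2(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def bj_measures(N, ci, cj, classes):
    n = sum(N.get((a, b), 0) for a in classes for b in classes)
    n_i = sum(N.get((ci, b), 0) for b in classes)        # occurrences of Ci as predecessor
    n_j = sum(N.get((a, cj), 0) for a in classes)        # occurrences of Cj as successor
    p_i, p_j = n_i / n, n_j / n
    p_j_given_i = N.get((ci, cj), 0) / n_i if n_i else 0.0
    p_i_given_j = N.get((ci, cj), 0) / n_j if n_j else 0.0
    bjl = kl_binary(p_j_given_i, p_j) if p_j_given_i >= p_j else 0.0   # right part, eq. (3)
    bjw = kl_binary(p_i_given_j, p_i) if p_i_given_j >= p_i else 0.0   # right part, eq. (4)
    bjm = (bjl ** 2 + bjw ** 2) ** 0.5                                 # norm of (BJL, BJW), eq. (5)
    return bjl, bjw, bjm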
4 LEARNING PRINCIPLES
Let us consider a set R = {…, r(Ci → Cj), …} of n binary relations. The operation that removes a binary relation r(Ci → Cj) from the set R is denoted Remove(r(Ci → Cj)): R ← R – {r(Ci → Cj)}.
The positivity property leads to removing the r(Ci → Cj) relations having a negative or null value of the BJ-measure (“Positivity rule”):
Rule 1: ∀r(Ci → Cj) ∈ R, M(Ci → Cj) ≤ 0 ⇒ Remove(r(Ci → Cj))
The dissymmetry property allows deducing the orientation of a hypothetical relation between two timed observation classes Ci and Cj:
Rule 2: ∀r(Ci → Cj), r(Cj → Ci) ∈ R, M(Ci → Cj) > M(Cj → Ci) ⇒ Remove(r(Cj → Ci))
This rule means that when M(Ci → Cj) > M(Cj → Ci), the Ci class brings more information about the Cj class than the reverse. The relation r(Cj → Ci) can then be removed from the set R without any consequence. This rule is therefore called the “Orientation rule”.
Figure 3: Loops.
Now, let us consider a set R = {r(Ci → Ci+1), r(Ci+1 → Ci+2), ..., r(Ci+n → Cj), r(Cj → Ci)} of n+2 binary relations defining a loop (Figure 3), where:
∀r(Cx → Cy) ∈ R, M(Cx → Cy) > 0
The problem with the set R is that computing the distribution of a class Cx requires knowing its own distribution: loops must then be avoided. In other words, a relation r(Ci → Cj) must be removed from R to break the loop. To solve this problem, the idea is to use the monotonous property of the BJ-measure: find the two classes Ci and Cj so that the BJ-measure of the relation r(Ci → Cj) is the lowest of the loop (“Loop rule”):
Rule 3: ∀r(Cx → Cy) ∈ R, ∃r(Ci → Cj) ∈ R, x ≠ i, y ≠ j, M(Cx → Cy) > M(Ci → Cj) ⇒ Remove(r(Ci → Cj))
When M(Cx → Cy) = M(Ci → Cj), any of the relations can be removed. The extreme case of a loop can be found in a set R containing a reflexive relation r(Ci → Ci) where M(Ci → Ci) > 0. Rule 3 must then be adapted to this extreme (but frequent) case (“Reflexivity rule”):
Rule 4: ∀r(Ci → Ci) ∈ R, M(Ci → Ci) > 0 ⇒ Remove(r(Ci → Ci))
Finally, to build naïve Bayesian Networks, the algorithm must avoid multiple paths leading to a same Ci class (Figure 4). To avoid this problem, as for loops, the idea is to use the monotonous property of the BJ-measure: find the two classes Ci and Cj so that the BJ-measure of the relation r(Ci → Cj) is the lowest of the paths. To use this idea, all the paths leading to a particular Ci class must be found in R. Let us suppose that R contains n paths R1 ⊆ R, R2 ⊆ R, …, Rn ⊆ R leading to the Ci class (i.e. each Ri is of the form Ri = {r(Ci → Ck-n), r(Ck-n → Ck-n+1), ..., r(Ck → Cj), r(Ci → Cl-n), r(Cl-n → Cl-n+1), ..., r(Cl → Cj)}). The algorithm must find the r(Ci → Cj) relation with the lowest BJ-measure and remove it from R (“Transitivity rule”):
Rule 5: ∀r(Cx → Cy) ∈ R1 ∪ R2 ∪ … ∪ Rn, ∃r(Ci → Cj) ∈ R1 ∪ R2 ∪ … ∪ Rn, x ≠ i, y ≠ j, M(Cx → Cy) > M(Ci → Cj) ⇒ Remove(r(Ci → Cj))
Figure 4: Multiple Paths.
Again, in case of equality (i.e. M(Cx → Cy) = M(Ci → Cj)), any relation can be removed.
These five rules are necessary (but not sufficient) to design an algorithm that builds naïve Bayesian Networks from timed data, but its efficiency depends mainly on the number of relations in the initial set R. The TOM4L framework provides the mathematical tools to remove the relations that cannot play a significant role in the building of a naïve Bayesian Network.
Given the set R = {…, r(Ci → Cj), …} of n binary relations that can be built from a sequence ω of timed observations Ci(k) defining a set C = {Cx} of N(C) classes Cx, the size of the Stochastic Representation matrix of the TOM4L framework is N(C) × N(C) = N(C)². This provides two ways to eliminate a relation r(Ci → Cj) having no interest for building a naïve Bayesian Network:
Test 1: P(Cj|Ci) · P(Ci, Cj) ≤ 1/N(C)³ ⇒ Remove(r(Ci → Cj))
This first test compares bij ≡ P(Cj|Ci) · P(Ci, Cj) with the “absolute” hazard according to the discrete binary memoryless channel (Figure 2): because ω defines N(C) classes, when supposing that all the classes are independent and have the same Poisson
rate of occurrences, the probability of having an occurrence Ci(k) of a Ci class followed by an occurrence Cj(k+1) of the Cj class is simply P(Cj|Ci) = 1/N(C), and the probability of reading a couple (Ci(k), Cj(k+1)) in ω is P(Ci, Cj) = 1/N(C)². So each bij value can be compared with the “absolute” hazard 1/(N(C) · N(C)²) = 1/N(C)³.
Test 2: P(Cj|Ci) · P(Ci, Cj) ≤ 1/(N(C) · P(Ci) · P(Cj)) ⇒ Remove(r(Ci → Cj))
This second test defines the “relative” hazard when supposing that the Ci and Cj classes are independent. In that case, the probability P(Ci(k), Cj(k+1)) of having a couple (Ci(k), Cj(k+1)) in ω is P(Ci) · P(Cj) and, having an occurrence Ci(k) of a Ci class, the hazard is to read any occurrence Cj(k+1) of the Cj class: P(Cj|Ci) = 1/N(C). So each bij value can be compared with the “relative” hazard 1/(N(C) · P(Ci) · P(Cj)).
The θ property of the BJ-measure completes these two tests to eliminate the relations having no meaning according to the BJ-measure:
Test 3: θi,j > 4 ∨ θi,j < 1/4 ⇒ Remove(r(Ci → Cj))
Within the TOM4L framework, these three tests are implemented in the F0/1 = [fij] matrix:
(bij > 1/N(C)³) ∧ (bij > 1/(N(C) · P(Ci) · P(Cj))) ∧ (1/4 ≤ θi,j ≤ 4) ⇒ fij = 1
This leads to rule number 6:
Rule 6: ∀r(Ci → Cj) ∈ R, fij = 0 ⇒ Remove(r(Ci → Cj))
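As an illustration, a minimal Python sketch of the F0/1 filter follows, reading the three tests literally as written above; bij, θi,j and the prior probabilities are assumed to be already available.

def f01(b_ij, theta_ij, n_classes, p_i, p_j):
    # returns 1 if the relation r(Ci -> Cj) passes the three tests, 0 otherwise
    test1 = b_ij > 1.0 / n_classes ** 3                   # above the "absolute" hazard (Test 1)
    test2 = b_ij > 1.0 / (n_classes * p_i * p_j)          # above the "relative" hazard (Test 2)
    test3 = 0.25 <= theta_ij <= 4.0                       # theta property (Test 3)
    return 1 if (test1 and test2 and test3) else 0

# Rule 6: every relation r(Ci -> Cj) with f01(...) == 0 is removed from R
# (equivalently, mij is forced to 0 in the M matrix).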
These six rules are used by the algorithm, inspired by Cheng's method, to build a naïve Bayesian Network from timed data.
5 THE BJM4BN ALGORITHM
The proposed algorithm is called “BJM4BN” for “BJ-Measure for Bayesian Networks”. This algorithm takes as inputs a sequence ω of m timed observations Ci(k), defining a set Cl = {Cx} of N(Cl) classes Cx, and an output class Cj, which is the class for which the DBN is computed. It produces a set G = {…, r(Ci → Cj), …} of n binary relations that form the structure of a naïve Bayesian Network (G, Θ).
The “BJM4BN” algorithm contains six stages. The first stage computes the Stochastic Representation of ω to produce the initial M = [mij] matrix containing the BJ-measure values mij of the N(Cl)² binary relations r(Ci → Cj) defined by ω (line 1). Next, the F0/1 = [fij] matrix is computed using the three tests of section 4 (line 3), so that rule 6 can be applied (line 4). Finally, the M matrix is normalized using rules 1 (line 5.1) and 4 (line 5.2).
Stage 2 computes the list L from the normalized matrix M. Stage 3 builds recursively the initial G graph from the Cj class. This stage uses a recursive function called “Build(G, Cx)”, where Cx is the class whose graph is to be built.
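A minimal Python sketch of this backward construction is given below; L is assumed to be a list of ((source, target), m) pairs and the visited set is an addition, not in the pseudocode, used only to keep the recursion finite before the loops are removed in stage 4.

def build(G, cx, L, visited=None):
    # add to G every relation leading to cx, then recurse on the new sources (Stage 3)
    visited = set() if visited is None else visited
    if cx in visited:
        return G
    visited.add(cx)
    for (cy, target), _ in L:
        if target == cx and (cy, cx) not in G:
            G.add((cy, cx))
            build(G, cy, L, visited)
    return G

# Usage: G = build(set(), cj, L) builds the initial graph backward from the output class cj.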
Stage 4 finds and removes the loops in G with Rule 3. This stage finds all the loops Ri in G of the form Ri ≡ {r(Ci → Ci+1), r(Ci+1 → Ci+2), ..., r(Ci+n → Cj), r(Cj → Ci)} and puts them in a set R (line 12). Next, a new list L1 is built, containing all the relations r(Cx → Cy) of R with their associated mxy BJ-measure values (line 13). All loops Ri in R are then removed using Rule 3 (line 14). At the end of this stage, G contains no more loops. Note that, the L1 list being global (i.e. containing all the relations r(Cx → Cy) participating in a loop), it is guaranteed that the set of removed relations r(Cx → Cy) is optimal: it is minimal and the removed relations are the smallest of the G graph.
Similarly, stage 5 removes the multiple paths in the G graph with Rule 5, but the R set contains only paths Ri of the form Ri ≡ {r(Ci → Ck-n), r(Ck-n → Ck-n+1), ..., r(Ck → Cj), r(Ci → Cl-n), r(Cl-n → Cl-n+1), ..., r(Cl → Cj)} (line 16). Again, it is guaranteed that the set of removed relations r(Cx → Cy) is optimal.
// Stage 1
1. Compute the M = [mij] matrix
2. ∀i = 0…N(Cl), ∀j = 0…N(Cl), fij = 0
3. ∀i = 0…N(Cl), ∀j = 0…N(Cl),
   (bij > 1/N(C)³) ∧ (bij > 1/(N(C)·P(Ci)·P(Cj))) ∧ (1/4 ≤ θi,j ≤ 4) ⇒ fij = 1
4. M = M ⊗ F0/1 (element-wise product)
5. ∀i = 0…N(Cl), ∀j = 0…N(Cl),
   5.1. mij ≤ 0 ⇒ mij = 0 // rule 1
   5.2. i = j ⇒ mij = 0 // rule 4
// Stage 2
6. L = {∅}
7. ∀i = 0…N(Cl), ∀j = 0…N(Cl), mij > 0 ⇒ L = L + {(r(Ci → Cj), mij)}
// Stage 3
8. Cx = Cj, G = {∅}
9. ∀r(Cy → Cx) ∈ L ⇒ G = G + {r(Cy → Cx)}
10. Build(G, Cx) {
      ∀r(Cy → Cx) ∈ G, ∀r(Cz → Cy) ∈ L,
        G = G + {r(Cz → Cy)}
        Build(G, Cy)
    } // End Build Function
// Stage 4
11. R = {∅}
12. ∀Ri ⊆ G, Ri ≡ {r(Ci → Ci+1), r(Ci+1 → Ci+2), ..., r(Ci+n → Cj), r(Cj → Ci)} ⇒ R = R + {Ri}
13. ∀Ri ∈ R, ∀r(Cx → Cy) ∈ Ri, r(Cx → Cy) ∉ L1 ⇒ L1 = L1 + {(r(Cx → Cy), mxy)}
14. While R ≠ {∅} repeat
      r(Cx → Cy) ∈ L1, mxy = Min(mij, L1)
      ∀Ri ∈ R, r(Cx → Cy) ∈ Ri ⇒ R = R - {Ri}
      G = G - {r(Cx → Cy)}
      L1 = L1 - {(r(Cx → Cy), mxy)}
// Stage 5
15. R = {∅}
16. ∀Ri ⊆ G, Ri ≡ {r(Ci → Ck-n), r(Ck-n → Ck-n+1), ..., r(Ck → Cj), r(Ci → Cl-n), r(Cl-n → Cl-n+1), ..., r(Cl → Cj)} ⇒ R = R + {Ri}
17. ∀Ri ∈ R, ∀r(Cx → Cy) ∈ Ri, r(Cx → Cy) ∉ L1 ⇒ L1 = L1 + {(r(Cx → Cy), mxy)}
18. While R ≠ {∅} repeat
      r(Cx → Cy) ∈ L1, mxy = Min(mij, L1)
      ∀Ri ∈ R, r(Cx → Cy) ∈ Ri ⇒ R = R - {Ri}
      G = G - {r(Cx → Cy)}
      L1 = L1 - {(r(Cx → Cy), mxy)}
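A compact Python sketch of the pruning performed in stages 4 and 5 follows; it keeps only the removal logic of Rules 3 and 5 (remove the relation with the smallest BJ-measure among those participating in a loop or in multiple paths, until none remains) and assumes a hypothetical helper find_structures() that enumerates the loops, or the multiple paths, of G as lists of edges.

def prune(G, m, find_structures):
    # G: set of edges (ci, cj); m: dict edge -> BJ-measure; returns the pruned G
    R = find_structures(G)                           # loops (stage 4) or multiple paths (stage 5)
    while R:
        L1 = {edge for structure in R for edge in structure}
        worst = min(L1, key=lambda e: m[e])          # lowest BJ-measure (Rules 3 and 5)
        G = G - {worst}
        R = [s for s in R if worst not in s]         # structures broken by the removal
    return G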
Stage 6 computes the conditional probability tables Θ for G and finalizes the algorithm. The computing of the Conditional Probability Tables (CP Tables) is based on the numbering table N = [nij] of the Stochastic Representation of the sequence ω. The following property is a consequence of the model of the discrete memoryless communication channel (Figure 2):
P(Y=Co | X=Ci) + P(Y=¬Co | X=Ci) = 1
The computing of Θ uses this property (for simplicity, P(Cy|Cx) is rewritten P(y|x)). For a root node Cx:
P(x) = (Σj nxj) / (Σi Σj nij)
For a single relation r(Cx → Cy):
P(y|x) = nxy / (Σj nxj)
P(y|¬x) = ((Σi niy) - nxy) / (Σi Σj nij - Σj nxj)
For a set R = {r(Cx → Cy), r(Cz → Cy)} of two relations converging to the same Cy class:
P(y|x,z) = (nxy + nzy) / (Σj nxj + Σj nzj)
P(y|¬x,z) = (Σi niy - nxy) / (Σi Σj nij - Σj nxj)
P(y|x,¬z) = (Σi niy - nzy) / (Σi Σj nij - Σj nzj)
P(y|¬x,¬z) = (Σi niy - nxy - nzy) / (Σi Σj nij - Σj nxj - Σj nzj)
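The following Python sketch implements these stage 6 formulas, with n[x][y] the numbering table of the Stochastic Representation (row x = predecessor class, column y = successor class); the dictionary layout is an assumption made for the illustration.

def totals(n):
    grand = sum(sum(row.values()) for row in n.values())        # Σi Σj nij
    row = {x: sum(n[x].values()) for x in n}                     # Σj nxj
    col = {y: sum(n[x][y] for x in n) for y in n}                # Σi niy
    return grand, row, col

def cpt_single(n, x, y):
    # P(y|x) and P(y|not x) for a single relation r(x -> y)
    grand, row, col = totals(n)
    return {'y|x': n[x][y] / row[x],
            'y|not x': (col[y] - n[x][y]) / (grand - row[x])}

def cpt_converging(n, x, z, y):
    # CP Table of y for two relations r(x -> y) and r(z -> y) converging to y
    grand, row, col = totals(n)
    return {'y|x,z': (n[x][y] + n[z][y]) / (row[x] + row[z]),
            'y|not x,z': (col[y] - n[x][y]) / (grand - row[x]),
            'y|x,not z': (col[y] - n[z][y]) / (grand - row[z]),
            'y|not x,not z': (col[y] - n[x][y] - n[z][y]) / (grand - row[x] - row[z])}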
6 A THEORETICAL EXAMPLE
This section illustrates the usage of the proposed algorithm on the theoretical car example of (Le Goc, 2007). This example is inspired by the (simple) car technical diagnosis knowledge base of (Schreiber, 2000) (Figure 5).
Figure 5: Car Diagnosis Knowledge Base.
Figure 5 shows a knowledge base of 9 rules that can be used to diagnose a (very simplified) car. These rules provide the reasons that might cause the car to stop functioning: a car might “stop” or “does not start” if the fuse is blown, the battery is low or the fuel tank is empty.
Using the TOM4D methodology, the underlying structural model of the system considered in this knowledge base is provided in Figure 6. This figure shows a set of connected components ci and defines a set of variables xi. The evolution of a variable xi is denoted with a function of time xi(t).
Figure 6: Structural Model of the Car Example.
The TOM4D methodology considers that the variables x4, x5 and x6 are associated to the sensor components c4, c5 and c6, and that these components never fail. So this figure defines a set X = {x1, x2, x3, x7, x8, x9} of 6 variables. The values of these variables are a set of constants: Δ = {Δx1 = {Blown, Not_Blown}, Δx2 = {Low, Not_Low}, Δx3 = {Empty, Not_Empty}, Δx7 = {On, Off}, Δx8 = {True, False}, Δx9 = {Start, Does_Not_Start}}. It is to note that (Le Goc, 2007) eliminates the constant “Stops”: in the TOM4D
framework, this corresponds to rewriting the constant “Stops” as “Does_Not_Start”.
Figure 7 shows the functional model of the car
according to the TOM4D methodology.
Figure 7: Functional Model of the Car Example.
A functional model is an organized set of logical relations between the possible values the variables can take over time. The functions are denoted at the right of Figure 7 and are specified at the left (functions f4 and f5 being similar to function f6, they are not specified in the figure).
The role of the Bayesian Network <G, Θ> to learn is to make a link between the probabilities of the values of the set X of variables. The discovered Bayesian network must then be compatible with the functional model of Figure 7.
Because the “BJM4BN” algorithm works on timed data, a sequence must be built according to rules 2, 3, 6, 7 and 8 of the knowledge base of Figure 5. To this aim, let us suppose that the car is monitored; the abnormal behaviour of the car can then be defined with a set Ω = {ω1, ω2, ω3} of three models of sequences:
ω1 = {x1(t1)=Blown, x7(t1+Δt7)=Off, x9(t1+Δt7+Δt9)=Does_Not_Start}
ω2 = {x2(t2)=Low, x7(t2+Δt7)=Off, x9(t2+Δt7+Δt9)=Does_Not_Start}
ω3 = {x3(t3)=Empty, x8(t3+Δt8)=False, x9(t3+Δt8+Δt9)=Does_Not_Start}
These sequences define a set Cl = {Ci} of 6 observation classes, each being a singleton: Cl = {C1 = {(x1, Blown)}, C2 = {(x2, Low)}, C3 = {(x3, Empty)}, C7 = {(x7, Off)}, C8 = {(x8, False)}, C9 = {(x9, Does_Not_Start)}}.
Consequently, each constant δi of Δ being linked with a unique variable xi of X, there is a bijection between a class Ci and a variable xi. The three sequence models of the set Ω can be rewritten in terms of class occurrences:
ω1 = {C1(t1), C7(t1+Δt7), C9(t1+Δt7+Δt9)}
ω2 = {C2(t2), C7(t2+Δt7), C9(t2+Δt7+Δt9)}
ω3 = {C3(t3), C8(t3+Δt8), C9(t3+Δt8+Δt9)}
These sequence models are used to produce a theoretical sequence according to the method described in (Bouché, 2008). To this aim, let us assign, by hand, probabilities to the occurrences of each observation class with the following principle: the observations of the C1 class (x1(t1)=Blown) are less probable than the observations of the C2 class (x2(t2)=Low), while the occurrences of the C3 class (x3(t3)=Empty) are more frequent (with a carefree driver, for example). This leads, for example, to the probabilities of Table 1:
Table 1: Prior Probabilities of the car example.

P(C1)  P(C2)  P(C3)  P(C7)  P(C8)  P(C9)
0.05   0.15   0.3    0.2    0.2    0.1
According to the method of (Bouché, 2008), these probabilities and the three sequence models of Ω allow building a sequence ω of 100 occurrences (Figure 8) that satisfies the probabilities of Table 1.
Figure 8: Theoretical Sequence (beginning).
Table 2: The numbering table N between variables.

N       x1   x2   x3   x7   x8   x9   TOTAL
x1       1    0    0    4    0    0     5
x2       1    1    2    9    1    0    14
x3       1    6    8    4   11    0    30
x7       1    2    8    2    2    5    20
x8       1    3    8    0    3    5    20
x9       0    2    4    1    3    0    10
Total    5   14   30   20   20   10    99
The numbering matrix N computed from the sequence ω is given in Table 2. Next, the M matrix (Table 3), the B matrix (Table 4, where each cell is multiplied by 1000 to give readable values) and the θ matrix (Table 5) are computed to produce the F0/1 matrix (Table 6) that implements rule 6.
The M ⊗ F0/1 matrix is then computed and normalized (Table 7) using rule 1 (pink cells) and rule 4
Table 3: The M Matrix for the Car Example.

M      x1       x2       x3       x7       x8       x9
x1      0.3112  -0.7698  -1.4340   0.3319  -1.0257  -0.5980
x2      0.0217  -0.0629  -0.0851   0.1386  -0.1434  -0.7709
x3     -0.0386   0.0156  -0.0016  -0.0214   0.0343  -1.2864
x7      0.0000  -0.0168   0.0081  -0.0639  -0.0639   0.1087
x8      0.0000   0.0005   0.0081  -0.9955  -0.0112   0.1087
x9     -0.5980   0.0177   0.0124  -0.0749   0.0231  -0.6683
(the yellow cells define the orientation of a relation r(xi → xj)). This achieves the first stage of the BJM4BN algorithm.
Table 4: The B (×1000) matrix for the car Example.

B×1000   x1      x2      x3      x7      x8      x9
x1       20.20    0.00    0.00  323.23    0.00    0.00
x2        7.22    7.22   28.86  584.42    7.22    0.00
x3        3.37  121.21  215.49   53.87  407.41    0.00
x7        5.05   20.20  323.23   20.20   20.20  126.26
x8        5.05   45.45  323.23    0.00   45.45  126.26
x9        0.00   40.40  161.62   10.10   90.91    0.00
Table 5: The θ Matrix for the car Example.

θ     x1    x2       x3       x7    x8    x9
x1    1     0.35714  0.16667  0.25  0.25  0.5
x2    2.8   1        0.46667  0.7   0.7   1.4
x3    6     2.14286  1        1.5   1.5   3
x7    4     1.42857  0.66667  1     1     2
x8    4     1.42857  0.66667  1     1     2
x9    2     0.71429  0.33333  0.5   0.5   1
Table 6: The F0/1 Matrix for the car Example.

F0/1  x1  x2  x3  x7  x8  x9
x1    1   0   0   1   0   0
x2    0   0   0   1   0   0
x3    0   1   0   0   1   0
x7    0   0   1   0   0   1
x8    0   0   1   0   0   1
x9    0   1   1   0   1   0
Table 7: Normalized M Matrix for the Car Example.

Norm M  x1      x2      x3      x7      x8      x9
x1      0.0000  0.0000  0.0000  0.3319  0.0000  0.0000
x2      0.0000  0.0000  0.0000  0.1386  0.0000  0.0000
x3      0.0000  0.0156  0.0000  0.0000  0.0343  0.0000
x7      0.0000  0.0000  0.0081  0.0000  0.0000  0.1087
x8      0.0000  0.0000  0.0081  0.0000  0.0000  0.1087
x9      0.0000  0.0177  0.0124  0.0000  0.0231  0.0000
The second stage of the algorithm computes the L list that contains only the yellow cells of the normalized M matrix. Table 8 provides the L list sorted by decreasing values of the BJ-measure mij = M(xi → xj) of the corresponding relation r(xi → xj). The third stage transforms the normalized M matrix into the initial G graph with a depth first algorithm (Figure 9).
Stage 4 is dedicated to finding and removing the eventual loops in the initial G graph.
Table 8: The L list of the Car Example (L = {(r(xi → xj), mij)}).

x1 → x7   0.3319
x2 → x7   0.1386
x7 → x9   0.1087
x8 → x9   0.1087
x3 → x8   0.0343
x9 → x2   0.0177
x3 → x2   0.0156
x9 → x3   0.0124
x7 → x3   0.0081
The initial G graph of the car example contains a lot of loops: all the relations participate in at least one loop except the r(x1 → x7) relation. The loops are suppressed with the iterative removal of the r(xi → xj) relations with the minimal BJ-measure mij. To this aim, the algorithm duplicates the L list without the r(x1 → x7) relation to constitute the L1 list. It is then easy to see that the first removed relation is r(x7 → x3), and that the algorithm will then successively remove the r(x9 → x3) and r(x3 → x2) relations (Figure 10).
Figure 9: Initial G Graph for the Car Example.
The resulting G graph having no multiple paths, stage 5 modifies nothing. The final stage of the algorithm is dedicated to the computing of the CP Tables from the N matrix (Table 2) with the equations provided in section 5. For the three root nodes x1, x2 and x3:
P(x1) = 5 / 99 ≈ 0.050
P(x2) = 14 / 99 ≈ 0.141
P(x3) = 30 / 99 ≈ 0.303
For the single relation r(x3 → x8):
P(x8 | x3) = 11 / 30 ≈ 0.366
P(x8 | ¬x3) = (20-11) / (99-30) ≈ 0.130
For the converging node x7 corresponding to the set R7 = {r(x1 → x7), r(x2 → x7)}:
P(x7 | x1, x2) = (4+9) / (5+14) ≈ 0.684
P(x7 | ¬x1, x2) = (20-4) / (99-5) ≈ 0.170
P(x7 | x1, ¬x2) = (20-9) / (99-14) ≈ 0.129
P(x7 | ¬x1, ¬x2) = (20-4-9) / (99-5-14) ≈ 0.087
For the converging node x9 corresponding to the set R9 = {r(x7 → x9), r(x8 → x9)}:
P(x9 | x7, x8) = (5+5) / (20+20) = 0.250
P(x9 | ¬x7, x8) = (10-5) / (99-20) ≈ 0.063
P(x9 | x7, ¬x8) = (10-5) / (99-20) ≈ 0.063
P(x9 | ¬x7, ¬x8) = (10-5-5) / (99-20-20) = 0
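These values can be checked with the numbering table of Table 2, assuming the cpt_single and cpt_converging functions from the stage 6 sketch at the end of section 5 are in scope; the Python lines below reproduce the figures above.

labels = ['x1', 'x2', 'x3', 'x7', 'x8', 'x9']
rows = [[1, 0, 0, 4, 0, 0],     # x1
        [1, 1, 2, 9, 1, 0],     # x2
        [1, 6, 8, 4, 11, 0],    # x3
        [1, 2, 8, 2, 2, 5],     # x7
        [1, 3, 8, 0, 3, 5],     # x8
        [0, 2, 4, 1, 3, 0]]     # x9
n = {r: dict(zip(labels, values)) for r, values in zip(labels, rows)}

print(sum(n['x1'].values()) / 99)            # P(x1) = 5/99, about 0.050
print(cpt_single(n, 'x3', 'x8'))             # P(x8|x3) about 0.366, P(x8|not x3) about 0.130
print(cpt_converging(n, 'x1', 'x2', 'x7'))   # 0.684, 0.170, 0.129, 0.087
print(cpt_converging(n, 'x7', 'x8', 'x9'))   # 0.250, 0.063, 0.063, 0.000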
This leads to the final naïve Bayesian Network for
the car example (Figure 10), which is compatible
with the functional model of Figure 7.
Figure 10: Bayesian Network for the Car Example (x1 → x7, x2 → x7, x3 → x8, x7 → x9, x8 → x9, annotated with the CP Tables computed above).
Naturally, this result comes from the way the sequence ω has been made: Figure 11 shows the signature tree of the C9 class as provided by the “BJT4S” algorithm of the TOM4L tools. This tree allows recognizing the 10 observations of the C9 class in the sequence (i.e. the cover rate is equal to 100%): 5 observations of the C9 class are recognized by the {r(C2, C7, [0, 6s]), r(C1, C7, [0, 4s]), r(C7, C9, [0, 6s])} signature, the other 5 observations being recognized by the {r(C3, C8, [0, 6s]), r(C8, C9, [0, 12s])} signature.
Figure 11: Signature tree of the C9 class.
Despite the simplicity of the knowledge base of the car example (Figure 5), this shows a posteriori the difficulty of computing the car example Bayesian network from the observations of the sequence ω.
The Bayesian Network of Figure 10 allows the building of the functional model for the car example of Figure 12: this functional model is identical to the functional model of Figure 7, but it adds probabilities to the functions f7, f8 and f9. These probabilities provide some confidence about the existence of the corresponding functions. This example shows the way the TOM4D methodology and the TOM4L process complement each other.
Figure 12: Functional Model for the Car Example, where each function is annotated with the confidence of its rows:
x8 = f8(x3): Empty → False (37%); Not_Empty → True (87%)
x7 = f7(x1, x2): Not_Blown, Not_Low → On (91%); Not_Blown, Low → Off (17%); Blown, Not_Low → Off (13%); Blown, Low → Off (68%)
x9 = f9(x7, x8): On, True → Start (100%); On, False → Does_Not_Start (6%); Off, True → Does_Not_Start (6%); Off, False → Does_Not_Start (25%)
This example shows the simplicity and the efficiency of the BJM4BN algorithm. The complexity of the algorithm is proportional to the number of timed data in the given set of sequences and to the number of classes N(C) in ω. This is to be compared with the exponential complexity of the methods of the dependency analysis category.
The next section shows the use of the “Transitivity rule” and the ease of use of the proposed algorithm in an industrial environment.
7 REAL WORLD APPLICATION
The Apache system is a clone of Sachem, the knowledge based system that the Arcelor Group, one of the most important steel companies in the world, has developed to monitor and diagnose its production tools (Le Goc, 2004). Apache aims at controlling a zinc bath, a hot bath containing a liquid mixture of aluminum and zinc, continuously fed with aluminum and zinc ingots, in which a hot steel strip is immersed. Apache monitors and diagnoses around 11 variables and is able to detect around 24 types of alarms. The analyzed sequence ω contains 687 events of 13 classes for 11 discrete variables. The counting matrix N then contains 156 cells nij (Bouché, 2005), (Le Goc, 2005).
The node of interest being the 1006 class, the initial G graph resulting from stage 3 of the “BJM4BN” algorithm is given in Figure 13. This graph having no loops,
stage 4 modifies nothing and stage 5 builds the final G graph of Figure 14.
Figure 13: Initial G graph.
Figure 14: Final G graph, annotated with the computed CP Tables: P(1031) = 0.176, P(1004) = 0.026, P(1014|1002) = 0.385, P(1014|¬1002) = 0.024, P(1026|1014) = 0.400, P(1026|¬1014) = 0.010, P(1024|1026) = 0.059, P(1024|¬1026) = 0.025, P(1001|1031, 1002) = 0.088, P(1020|1031, ¬1002) = 0.053, P(1020|¬1031, 1002) = 0.053, P(1020|¬1031, ¬1002) = 0.048, P(1020|1024) = 0.120, P(1020|¬1024) = 0.033, P(1006|1001) = 0.385, P(1006|¬1001) = 0.037.
Figure 15 shows a handmade sketch of the functional model of the galvanization bath problem produced by the experts of the Arcelor Group in 2003. The dotted lines in the graph indicate the experts' relations that are not included in the final G graph of Figure 14: the G graph is entirely contained in the experts' graph. But the experts' graph does not contain the 1024 class: corresponding to an operator query for a chemical analysis, this class had been removed by the experts.
Figure 15: Handmade graph of experts.
Finally, Figure 16 shows the graph made with the signatures of the 1006 class that can be found in (Bouché, 2005) and (Le Goc, 2005). The signatures are produced with specific Timed Data Mining techniques. This graph is identical to the initial G graph of Figure 13.
Figure 16: Handmade graph from the 1006 class signatures.
The main problem is the disappearance of the r(1020 → 1006) relation in the final G graph. A first analysis seems to lead to the conclusion that the 1024 class is responsible for the removal of the r(1020 → 1006) relation.
8 CONCLUSIONS
This paper proposes a new algorithm, called the “BJM4BN” algorithm, to learn a Bayesian Network from timed data.
The originality of the algorithm comes from the fact that it is designed in the framework of the Timed Observation Theory. This theory represents a set of sequences of timed data in a structure, the Stochastic Representation, that allows the definition of a new information measure called the BJ-Measure. This paper defines the principles of a learning algorithm based on the BJ-Measure.
The BJM4BN algorithm is efficient both in terms of pertinence and simplicity. These properties come from the BJ-measure, which provides an operational way to orient the edges of a Bayesian Network without the exponential number of CI tests of the methods of the dependency analysis category.
Our current works are concerned with the combination of the Timed Data Mining techniques of the TOM4L framework with the “BJM4BN” algorithm to define a global validation of the TOM4L learning process.
REFERENCES
Benayadi, N., Le Goc, M., (2008). Discovering Temporal
Knowledge from a Crisscross of Timed Observations.
To appear in the proceedings of the 18th European
Conference on Artificial Intelligence (ECAI'08),
University of Patras, Patras, Greece.
Bouché, P., Le Goc, M., Giambiasi, N., (2005). Modeling
discrete event sequences for discovering diagnosis
signatures. Proceedings of the Summer Computer
Simulation Conference (SCSC05) Philadelphia, USA.
Cheeseman, P., Stutz, J., (1995). Bayesian classification
(Auto-Class): Theory and results. Advances in
Knowledge Discovery and Data Mining, AAAI Press,
Menlo Park, CA, p. 153-180.
Cheng, J., Bell, D., Liu, W., (1997). Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory.
Cheng, J., Greiner, R., Kelly, J., Bell, D., Liu, W., (2002).
Learning Bayesian Networks from Data: An
Information-Theory Based Approach. Artificial
Intelligence, 137, 43-90.
Chickering, D. M., Geiger, D., Heckerman, D., (1994).
Learning Bayesian Networks is NP-Hard. Technical
Report MSR-TR-94-17, Microsoft Research, Microsoft
Corporation.
Cooper, G. F., Herskovits, E., (1992). A Bayesian Method
for the induction of probabilistic networks from data.
Machine Learning, 9, 309-347.
Friedman, N., (1998). The Bayesian structural EM
algorithm. Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence. Morgan
Kaufmann, San Francisco, CA, p. 129-138.
Heckerman, D., Geiger, D., Chickering, D. M., (1997).
Learning Bayesian Networks: the combination of
knowledge and statistical data. Machine Learning
Journal, 20(3).
Le Goc, M., Bouché, P., and Giambiasi, N., (2005).
Stochastic modeling of continuous time discrete event
sequence for diagnosis. Proceedings of the 16th
International Workshop on Principles of Diagnosis
(DX’05) Pacific Grove, California, USA.
Le Goc, M., (2006). Notion d’observation pour le
diagnostic des processus dynamiques: Application a
Sachem et a la découverte de connaissances
temporelles. Hdr, Faculté des Sciences et Techniques
de Saint Jérôme.
Le Goc, M., Masse, E., (2007). Towards A Multimodeling Approach of Dynamic Systems For Diagnosis. Proceedings of the 2nd International Conference on Software and Data Technologies (ICSoft'07), Barcelona, Spain.
Le Goc, M., Masse, E., (2008). Modeling Processes from
Timed Observations. Proceedings of the 3rd
International Conference on Software and Data
Technologies (ICSoft 2008), Porto, Portugal.
Myers, J., Laskey, K., Levitt, T., (1999). Learning
Bayesian Networks from Incomplete Data with
Stochastic Search Algorithms.
Pearl, J., (1988). Probabilistic Reasoning in Intelligent
Systems: Networks of Plausible Inference. San Mateo,
Calif.: Morgan Kaufmann.
Shannon, C., Weaver, W., (1949). The mathematical
theory of communication. University of Illinois Press,
27:379–423.