A Statistic Criterion for Reducing Indeterminacy in Linear Causal Modeling

Gianluca Bontempi

Machine Learning Group, Computer Science Department, Faculty of Sciences, Université Libre de Bruxelles, Brussels, Belgium

Keywords: Graphical Models, Causal Inference, Feature Selection.

Abstract: Inferring causal relationships from observational data is still an open challenge in machine learning. State-of-the-art approaches often rely on constraint-based algorithms which detect v-structures in triplets of nodes in order to orient arcs. These algorithms are bound to fail when confronted with completely connected triplets. This paper proposes a criterion to deal with arc orientation also in the presence of completely linearly connected triplets. This criterion is then used in a Relevance-Causal (RC) algorithm, which combines the original causal criterion with a relevance measure, to infer causal dependencies from observational data. A set of simulated experiments on the inference of the causal structure of linear networks shows the effectiveness of the proposed approach.

1 INTRODUCTION

One of the most difficult aspects of causal inference from observational data is the indeterminacy of causal structures, due to the existence of dependency structures that imply different causal directions but are indistinguishable in terms of statistical likelihood or fit indexes. For instance, it is well known that the detection of causal directionality requires strong assumptions (e.g. nonlinearity, high-dimensional observations) in a bivariate (i.e., single cause, single effect) context (Janzing et al., 2010; Janzing et al., 2011). This is the reason why existing techniques address triplet configurations to reconstruct the directionality of causal relationships. Well-known examples are the algorithms which infer causal structures in Bayesian networks by searching for unshielded colliders (Spirtes et al., 2000), i.e. patterns where two variables are both direct causes of a third one, without being each a direct cause of the other. Under the assumptions of the Causal Markov Condition and Faithfulness, this structure is statistically distinguishable, and so-called constraint-based algorithms (notably the PC and the SGS algorithms) rely on conditional independence tests to orient a graph at least partially (Koller and Friedman, 2009).

Other research works take advantage of conditional independence and propose information-theoretic methods for network inference and feature selection (Brown, 2009; Watkinson et al., 2009; Bontempi and Meyer, 2010; Bontempi et al., 2011). In particular, these works use the notion of feature interaction, a three-way mutual information that differs from zero when groups of attributes are complementary, which makes it possible to prioritize causes with respect to irrelevant and effect variables.

However, trivariate settings may present strong problems of indeterminacy, too. Consider for instance a fully connected triplet made of two causes and one common effect. In this case the lack of independence makes conditional independence tests or interaction measures ineffective for inferring the direction of the arrows. As stressed in (Guyon et al., 2007), when there are no independencies the direction of the arrows can be anything. Though a possible remedy to indeterminacy comes from the use of additional instrumental variables (IV) (Bowden and Turkington, 1984), this strategy is not always feasible in real settings lacking a priori knowledge.

This paper focuses on the definition of a data-dependent measure able to reduce the statistical indistinguishability of completely and linearly connected triplets. In particular, we propose a modification of the covariance formula of a structural equation model (Bollen, 1989; Mulaik, 2009) which results in a statistic taking opposite signs for different causal patterns when the unexplained variations of the variables are of the same magnitude. Though this assumption could appear too limiting, our rationale is that assumptions of comparable strength (e.g. the existence of unshielded colliders) have been commonly used so far in causal inference. We expect that this alternative approach could shed additional light on the issue of causality, in the perspective of extending it to more general configurations.

Bontempi G. (2013). A Statistic Criterion for Reducing Indeterminacy in Linear Causal Modeling. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 159-166. DOI: 10.5220/0004254301590166. Copyright SciTePress.

For this reason the paper also proposes a Relevance-Causal (RC) inference algorithm which integrates the proposed causal measure with a relevance measure to prioritize direct causes for a given target variable. In order to assess the effectiveness of the algorithm with respect to state-of-the-art algorithms, a set of experiments aiming to infer linear and nonlinear networks from observed data is carried out. The experimental comparison with state-of-the-art techniques shows that such an approach is promising for reducing indeterminacy in causal inference.

2 COVARIANCE OF A LINEARLY CONNECTED TRIPLET

The use of directed acyclic graphs (DAG) to encode causal dependencies and independencies is common to the two best-known formalisms for causal modeling (Anderson and Vastage, 2004): Bayesian networks and structural equation models (SEM). These formalisms can accommodate both nonlinear and linear causal relationships. Here we will restrict our attention to the linear causal structure represented in Figure 1, where the variables x1 and x2 are causes of the random variable x3. Since a DAG can always be translated into a set of recursive structural equations, this linear dependency can be written as

x1 = w1
x2 = b1 x1 + w2
x3 = b3 x1 + b2 x2 + w3   (1)

where it is assumed that each variable has mean 0, the coefficients bi ≠ 0 are known as structural coefficients, and the disturbances, supposed to be independent, are designated by w. This set of equations can be put in the matrix form

x = Ax + w (2)

where x = [x1, x2, x3]^T,

A = [ 0, 0, 0 ; b1, 0, 0 ; b3, b2, 0 ]

and w = [w1, w2, w3]^T. The multivariate variance-covariance matrix (Mulaik, 2009) has no zero entries

Figure 1: Collider pattern: completely connected triplet where the variable x3 is a common effect of x1 and x2.

and is given by

Σ = (I − A)^{−1} G ((I − A)^T)^{−1}   (3)

  = [ σ1², b1σ1², b3σ1² + b1b2σ1² ;
      b1σ1², b1²σ1² + σ2², b1b3σ1² + b2(b1²σ1² + σ2²) ;
      b3σ1² + b1b2σ1², b1b3σ1² + b2(b1²σ1² + σ2²), b2²(b1²σ1² + σ2²) + 2b1b2b3σ1² + b3²σ1² + σ3² ]   (4)

where I is the identity matrix and

G = [ σ1², 0, 0 ; 0, σ2², 0 ; 0, 0, σ3² ]

is the diagonal covariance matrix of the disturbances.

It is worth noting here that the lack of zero entries in the covariance matrix (as well as in its inverse) illustrates the lack of conditional or unconditional independencies in the data. Constraint-based approaches (Spirtes et al., 2000) which rely on independence tests to retrieve v-structures are consequently useless in this context. In the following section we will discuss whether SEM techniques can tackle such a case.
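As an illustration, the absence of zero entries can be checked numerically. The following sketch (with arbitrary illustrative values for the structural coefficients, not taken from the paper) computes the covariance (3) with numpy and verifies that neither Σ nor its inverse contains zeros:

```python
import numpy as np

# Illustrative structural coefficients and unit disturbance variances (assumed values).
b1, b2, b3 = 0.7, 0.6, 0.5
s1 = s2 = s3 = 1.0

A = np.array([[0, 0, 0],
              [b1, 0, 0],
              [b3, b2, 0]])
G = np.diag([s1, s2, s3])

M = np.linalg.inv(np.eye(3) - A)
Sigma = M @ G @ M.T        # covariance implied by equation (3)

# No zero entries in Sigma or in its inverse: no (conditional) independencies
# are available for a constraint-based test to exploit.
assert np.all(np.abs(Sigma) > 1e-12)
assert np.all(np.abs(np.linalg.inv(Sigma)) > 1e-12)
```

The entries of `Sigma` can be checked against (4) term by term, e.g. Σ12 = b1σ1² and Σ13 = (b3 + b1b2)σ1².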

3 INDETERMINACY IN A CONNECTED TRIPLET

Structural equation modeling techniques for causal inference proceed by 1) making some assumptions on the structure underlying the data, 2) performing the related parameter estimation, usually based on maximum likelihood, and 3) assessing by significance testing the discrepancy between the sample covariance matrix and the covariance matrix implied by the hypothesis.


This section shows that, as known in the literature (Stelzl, 1986; Hershberger, 2006; Mulaik, 2009), conventional SEM is not able to reconstruct the right directionality of the connections in a completely connected triplet. Let us observe a set of data generated according to the dependency illustrated in Figure 1 and described algebraically by the set of structural equations (1). Suppose we want to test two alternative hypotheses, represented by the two directed graphs in Figure 2a and Figure 2b, respectively. Note that the hypothesis of Figure 2a is correct, while the hypothesis illustrated by Figure 2b reverses the directionality of the link between x2 and x3 and consequently misses the causal role of the variable x2. Let us consider the following question: is it possible to discriminate between structures 1 and 2 by simply relying on parameter estimation (in this case regression fitting) according to the hypothesized dependencies? The answer is unfortunately negative. Suppose we assess hypothesis 1 by performing the two linear fittings implied by the hypothesis itself

x2 = b̂1 x1 + w2
x3 = b̂3 x1 + b̂2 x2 + w3

where (Graybill, 1976)

b̂1 = Σ12 / Σ11 = b1

[ b̂3 ; b̂2 ] = [ Σ11, Σ12 ; Σ21, Σ22 ]^{−1} [ Σ13 ; Σ23 ] = [ b3 ; b2 ]

Since the above estimators are unbiased, if we compute the triplet covariance matrix by plugging the above estimates into formula (3) we obtain again the covariance matrix in (4).

Let us now consider the second hypothesis (Figure 2b) and perform the two least-squares fittings

x2 = b̂1 x1 + b̂2 x3 + w2
x3 = b̂3 x1 + w3

where the estimates are returned by

b̂3 = Σ13 / Σ11 = b3 + b1b2

[ b̂1 ; b̂2 ] = [ Σ11, Σ13 ; Σ13, Σ33 ]^{−1} [ Σ12 ; Σ23 ] = [ (b1σ3² − b2b3σ2²) / (b2²σ2² + σ3²) ; b2σ2² / (b2²σ2² + σ3²) ]

Standard results also give the variance of the residuals. For instance the variance of w2 is returned by

σ̂2² = Σ22 − [ Σ12, Σ23 ] [ Σ11, Σ13 ; Σ13, Σ33 ]^{−1} [ Σ12 ; Σ23 ]

Figure 2: a) Hypothesis 1. b) Hypothesis 2.

We remark that, though the estimated parameters differ from the real structural coefficients, if we

compute the complete covariance matrix by using (3), where

Â = [ 0, 0, 0 ; b̂1, 0, b̂2 ; b̂3, 0, 0 ],   Ĝ = [ σ̂1², 0, 0 ; 0, σ̂2², 0 ; 0, 0, σ̂3² ],

we obtain again the expression (4). In other terms, fitting different causal structures to the connected triplet does not allow us to distinguish between the configuration in Figure 2a and the one in Figure 2b.
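This indistinguishability can be reproduced numerically. The sketch below (coefficient values are illustrative, and population covariances are used in place of sample estimates) fits the wrong structure of Figure 2b by least squares and verifies that the covariance matrix it implies coincides with the true one:

```python
import numpy as np

b1, b2, b3 = 0.7, 0.6, 0.5          # illustrative structural coefficients
s1 = s2 = s3 = 1.0                   # disturbance variances

A = np.array([[0, 0, 0], [b1, 0, 0], [b3, b2, 0]])
M = np.linalg.inv(np.eye(3) - A)
Sigma = M @ np.diag([s1, s2, s3]) @ M.T   # true covariance of (x1, x2, x3)

# Hypothesis 2: x2 = b1h*x1 + b2h*x3 + w2,  x3 = b3h*x1 + w3.
b3_h = Sigma[0, 2] / Sigma[0, 0]
coef = np.linalg.solve(Sigma[np.ix_([0, 2], [0, 2])], Sigma[np.ix_([0, 2], [1])])
b1_h, b2_h = coef.ravel()

# Residual variances implied by the two least-squares fits.
s2_h = Sigma[1, 1] - (Sigma[np.ix_([1], [0, 2])] @ coef).item()
s3_h = Sigma[2, 2] - b3_h * Sigma[0, 2]

A2 = np.array([[0, 0, 0], [b1_h, 0, b2_h], [b3_h, 0, 0]])
M2 = np.linalg.inv(np.eye(3) - A2)
Sigma2 = M2 @ np.diag([s1, s2_h, s3_h]) @ M2.T

# The wrong structure reproduces the observed covariance exactly.
assert np.allclose(Sigma, Sigma2)
```

The estimates also match the closed forms above, e.g. b̂3 = b3 + b1b2.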

4 A CRITERION TO DETECT CAUSAL ASYMMETRY

A main characteristic of a causal relationship is its asymmetry. For this reason, if we wish to infer causal directionality from observational data we need to define discriminant criteria able to distinguish causes from effects. Let us suppose we want to discriminate between the causal pattern in Figure 1, where both x1 and x2 are direct causes of x3, and alternative patterns like the ones in Figure 3a and Figure 3b. As we have seen in the previous section, the conventional SEM procedure does not allow us to distinguish between different patterns. What we propose here is an alternative criterion able to perform such a distinction.

The computation of our criterion requires fitting the two hypothetical structures in Figures 2a and 2b to the data, as done in the previous section. What is different is that, instead of computing the term (3), we consider the term

S = (I − A)^{−1} ((I − A)^T)^{−1}.   (5)

Let us see the impact of such a modification on the detection of causality by analyzing in the following sections three different causal patterns. In all cases we will make the assumption that σ1 = σ2 = σ3 = σ, i.e. that the unexplained variations of the variables are of comparable magnitude. Though we are aware that this assumption is quite specific, some considerations are worth making. So far, most approaches to causal inference from data have relied on similar, if not stronger, assumptions, like postulating the existence of unshielded colliders. At the same time, the following derivation is expected to shed new light on the issue of causality, with the aim of applying it to more general configurations.

Figure 3: a) Chain pattern: completely connected triplet where the variable x2 is the common effect of x1 and x3. b) Fork pattern: completely connected triplet where the variables x2 and x1 have the common cause x3.

4.1 Collider Pattern

Let us suppose that the data are generated according to the structure in Figure 1, where the node x3 is a collider.

If we fit hypothesis 1 to the data and compute the term (5) we obtain

Ŝ1 = (I − A1)^{−1} ((I − A1)^T)^{−1} =
  [ 1, b1, b3 + b1b2 ;
    b1, b1² + 1, b1(b3 + b1b2) + b2 ;
    b3 + b1b2, b1(b3 + b1b2) + b2, (b3 + b1b2)² + b2² + 1 ]

If we fit hypothesis 2 to the data and compute the term (5) we obtain

Ŝ2 = (I − A2)^{−1} ((I − A2)^T)^{−1} =
  [ 1, b1, b3 + b1b2 ;
    b1, b2²/(b2² + 1)² + b1² + 1, b2/(b2² + 1) + b1(b3 + b1b2) ;
    b3 + b1b2, b2/(b2² + 1) + b1(b3 + b1b2), (b3 + b1b2)² + 1 ]

Let us denote by S[i, j] the ij-th element of a matrix S. Since bi ≠ 0 for all i, it follows that the quantity

C(x1, x2, x3) = Ŝ1[3,3] − Ŝ2[3,3] + Ŝ1[2,2] − Ŝ2[2,2]
             = b2⁴ (b2² + 2) / (b2² + 1)²   (6)

is greater than zero for any sign of the structural coefficients. Interestingly enough, the sign is preserved for σ1 = σ2 = σ3 also when the direction of the link between x1 and x2 is inverted (see (18)).

4.2 Chain Pattern

We suppose here that the data have been generated by the triplet in Figure 3a, where x3 is part of the chain pattern x1 → x3 → x2. This configuration is represented by the matrix

A = [ 0, 0, 0 ; b1, 0, b2 ; b3, 0, 0 ]

Let us proceed by computing the quantity C in (6) for such a generative model under the assumption σ1 = σ2 = σ3. For the sake of space we report here only the components of the submatrices Ŝ1[2:3, 2:3] and Ŝ2[2:3, 2:3]. If data have been generated according to the structure in Figure 3a and we fit hypothesis 1 we obtain

Ŝ1[2:3, 2:3] =
  [ (b1 + b2b3)² + 1, b2/(b2² + 1) + b3(b1 + b2b3) ;
    b2/(b2² + 1) + b3(b1 + b2b3), b2²/(b2² + 1)² + b3² + 1 ]   (7)

If data have been generated according to the structure in Figure 3a and we fit hypothesis 2 we obtain

Ŝ2[2:3, 2:3] =
  [ (b1 + b2b3)² + b2² + 1, b2 + b3(b1 + b2b3) ;
    b2 + b3(b1 + b2b3), b3² + 1 ]   (8)

It follows that

C(x1, x2, x3) = Ŝ1[3,3] − Ŝ2[3,3] + Ŝ1[2,2] − Ŝ2[2,2]
             = − b2⁴ (b2² + 2) / (b2² + 1)²   (9)

This term is less than zero whatever the sign of the structural coefficients bi in Figure 3a.

Note that we do not discuss here the configuration with the edge pointing from x2 to x1 since it is cyclic.


4.3 Fork Pattern

Suppose now that observations are generated by the triplet in Figure 3b, corresponding to the matrix

A = [ 0, 0, b3 ; b1, 0, b2 ; 0, 0, 0 ]

Like in the previous section, we report the components of the submatrices Ŝ1[2:3, 2:3] and Ŝ2[2:3, 2:3]:

Ŝ1[2:3, 2:3] =
  [ (b1b3² + b2b3 + b1)²/(b3² + 1)² + 1,
    b2/(b2² + b3² + 1) + b3(b1b3² + b2b3 + b1)/(b3² + 1)² ;
    b2/(b2² + b3² + 1) + b3(b1b3² + b2b3 + b1)/(b3² + 1)²,
    b2²/(b2² + b3² + 1)² + b3²/(b3² + 1)² + 1 ]   (10)

Ŝ2[2:3, 2:3] =
  [ (b1b3² + b2b3 + b1)²/(b3² + 1)² + b2² + 1,
    b2 + b3(b1b3² + b2b3 + b1)/(b3² + 1)² ;
    b2 + b3(b1b3² + b2b3 + b1)/(b3² + 1)²,
    b3²/(b3² + 1)² + 1 ]   (11)

It follows that

C(x1, x2, x3) = Ŝ1[3,3] − Ŝ2[3,3] + Ŝ1[2,2] − Ŝ2[2,2]
             = b2² ( 1/(b2² + b3² + 1)² − 1 )   (12)

This term is less than zero whatever the sign of the structural coefficients bi ≠ 0 in Figure 3b. In the Appendix we compute the value of C when the direction of the link between x1 and x2 is reversed (matrix (19)). From (22) we obtain that this value remains negative when (b2² + b3²) > b1², for instance when the absolute value of one of the coefficients associated with the edges leaving x3 is bigger than |b1|. In plain words, if the cause-effect relationship between x3 and the other variables is strong enough, the statistic C takes a negative value.

Equations (6), (9) and (12) show that the computation of the quantity C on the basis of observational data only can help in discriminating between the collider configuration in Figure 1, where the nodes x1 and x2 are direct causes of x3 (C > 0), and the non-collider configurations (i.e. fork or chain) in Figures 3a and 3b (C < 0).

In other terms, given a completely connected triplet of variables, the quantity C(x1, x2, x3) returns useful information about the causal role of x1 and x2 with respect to x3, whatever the strength or the direction of the link between x1 and x2.
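The sign-based discrimination derived above can be checked numerically. The sketch below (illustrative coefficient values, unit disturbance variances as required by the σ1 = σ2 = σ3 assumption, population covariances) fits both hypotheses of Figure 2 to each generative pattern, computes the term (5), and compares the resulting C with the closed forms (6), (9) and (12):

```python
import numpy as np

def hat_S(A):
    """Term (5): (I - A)^{-1} ((I - A)^T)^{-1}; with unit disturbance
    variances this is also the population covariance implied by A."""
    M = np.linalg.inv(np.eye(3) - A)
    return M @ M.T

def fit(Sigma, hypothesis):
    """Population least-squares fit of hypothesis 1 or 2 (Figure 2) to Sigma."""
    if hypothesis == 1:               # x2 ~ x1 ;  x3 ~ x1 + x2
        b1h = Sigma[0, 1] / Sigma[0, 0]
        b3h, b2h = np.linalg.solve(Sigma[:2, :2], Sigma[:2, 2])
        return np.array([[0, 0, 0], [b1h, 0, 0], [b3h, b2h, 0]])
    # hypothesis 2:                   # x2 ~ x1 + x3 ;  x3 ~ x1
    b3h = Sigma[0, 2] / Sigma[0, 0]
    idx = [0, 2]
    b1h, b2h = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, 1])
    return np.array([[0, 0, 0], [b1h, 0, b2h], [b3h, 0, 0]])

def C(Sigma):
    """Criterion C = S1[3,3] - S2[3,3] + S1[2,2] - S2[2,2] (1-based indices)."""
    S1, S2 = hat_S(fit(Sigma, 1)), hat_S(fit(Sigma, 2))
    return S1[2, 2] - S2[2, 2] + S1[1, 1] - S2[1, 1]

b1, b2, b3 = 0.7, 0.6, 0.5   # illustrative structural coefficients
collider = np.array([[0, 0, 0], [b1, 0, 0], [b3, b2, 0]])   # Figure 1
chain    = np.array([[0, 0, 0], [b1, 0, b2], [b3, 0, 0]])   # Figure 3a
fork     = np.array([[0, 0, b3], [b1, 0, b2], [0, 0, 0]])   # Figure 3b

assert np.isclose(C(hat_S(collider)),  b2**4*(b2**2 + 2)/(b2**2 + 1)**2)  # (6)
assert np.isclose(C(hat_S(chain)),    -b2**4*(b2**2 + 2)/(b2**2 + 1)**2)  # (9)
assert np.isclose(C(hat_S(fork)), b2**2*(1/(b2**2 + b3**2 + 1)**2 - 1))   # (12)
```

With sample data, the same computation would start from an estimated covariance matrix instead of `hat_S(A_true)`.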

5 A RELEVANCE CAUSAL ALGORITHM TO INFER DIRECTIONALITY

The properties of the quantity C encourage its use in an algorithm to infer directionality from observational data. We therefore propose an RC (Relevance-Causal) algorithm for linear causal modeling inspired by the mIMR causal filter selection algorithm (Bontempi and Meyer, 2010). The mIMR algorithm is characterized by two terms: a relevance term, assessing the relevance of each input variable with respect to a target variable, and a causation term, aiming to prioritize causal variables by minimizing the interaction of triplets of variables. The causation term is designed to reward variables which belong to a collider pattern and penalize variables within a fork pattern. Let us suppose that we want to identify the set of causes of a target variable y among a set X of inputs. The mIMR is a forward selection algorithm which, given a set X_S of d already selected variables, updates this set by adding the (d+1)-th variable which satisfies

x*_{d+1} = arg max_{x_k ∈ X − X_S} [ (1 − λ) I(x_k; y) − (λ/d) Σ_{x_i ∈ X_S} I(x_i; x_k; y) ]   (13)

where I(x_k; y) denotes the mutual information between x_k and y, I(x_i; x_k; y) denotes the interaction information, and the coefficient λ ∈ [0, 1] is used to weight the mutual information and the interaction term.

As discussed previously, this algorithm might suffer from bad performance when common causes are directly connected, since the interaction term I(x_i; x_k; y) could take positive values for a v-structure x_i → y ← x_k. For that reason we propose to replace the interaction term (to be minimized) with the criterion C (to be maximized) to infer causal dependency from observed data also in the presence of completely connected triplets. The resulting algorithm is a reformulation of the mIMR where the update formula is now

the mIMR where the update formula is now

x

∗

d+1

= arg max

x

k

∈X

−

X

S

h

(1 − λ)R({X

S

,x

k

};y)+

λ

d

∑

x

i

∈X

S

C(x

i

;x

k

;y)

i

(14)


where λ ∈ [0, 1] weights the R and the C contributions, the R term quantifies the relevance of the subset {X_S, x_k}, and the C term quantifies the causal role of an input x_k with respect to the set of selected variables x_i ∈ X_S.

The proposed RC algorithm is then a forward selection algorithm which sequentially adds variables according to the update rule (14). Note that for λ = 0 the algorithm boils down to a conventional forward selection wrapper which assesses the subsets according to the measure R. The RC algorithm is initialized by selecting the couple of variables {x_i, x_j} maximizing the quantity

(1 − λ) R({x_i, x_j}; y) + λ C(x_i; x_j; y)

In the implementation used in the experimental section, we adopt a linear leave-one-out measure to quantify the relevance of a subset, i.e. R(X, y) is set equal to the negative of the linear leave-one-out mean squared error of the regression with input X and target y. In order to have comparable values for the R and the C terms, at each step these quantities are normalized over the interval [0, 1] before performing their weighted sum.
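The forward selection loop can be sketched as follows (a minimal Python sketch, not the original implementation: the causal term `C(i, k)` is assumed to be supplied by the user and computed from the data as in Section 4, and the initial-pair search and the per-step [0, 1] normalization of R and C are omitted for brevity):

```python
import numpy as np

def loo_mse(X, y):
    """Negative linear leave-one-out MSE, used as the relevance term R."""
    n = X.shape[0]
    Xb = np.column_stack([np.ones(n), X])       # add intercept
    H = Xb @ np.linalg.pinv(Xb)                 # hat matrix
    press = (y - H @ y) / (1.0 - np.diag(H))    # leave-one-out residuals
    return -np.mean(press**2)

def rc_select(X, y, n_select, C, lam=0.5):
    """Forward selection by the update rule (14).

    C(i, k) is a user-supplied causal term for the selected input i and the
    candidate input k with respect to the target y."""
    candidates = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_select and candidates:
        scores = []
        for k in candidates:
            R = loo_mse(X[:, selected + [k]], y)
            Cterm = np.mean([C(i, k) for i in selected]) if selected else 0.0
            scores.append((1 - lam) * R + lam * Cterm)
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0` this reduces to the conventional leave-one-out wrapper mentioned above.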

6 EXPERIMENTS

In this section we assess the efficacy of the RC algorithm by performing a set of causal network inference experiments. The aim of the experiments is to reverse engineer both linear and nonlinear scale-free causal networks, i.e. networks where the distribution of the degree follows a power law, from a limited amount of observational data. We consider a set of networks with a large number n = 5000 of nodes and where the degree exponent α of the power law ranges between 2.1 and 3. The inference is done on the basis of a small amount of N = 200 observations. The structural coefficients of the linear dependencies have an absolute value distributed uniformly between 0.5 and 0.8, and the measurement error follows a standard normal distribution. Nonlinear networks are obtained by transforming the linear dependencies between nodes with a sigmoid function.

We compare the accuracy of several algorithms in terms of the mean F-measure (the higher, the better), averaged over 10 runs and over all the nodes with a number of parents and children greater than or equal to two. The F-measure, also known as the balanced F-score, is the weighted harmonic mean of precision and recall and is conventionally used to provide a compact measure of the quality of a network inference algorithm.
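For concreteness, the equally-weighted version of the score used here can be written as a direct transcription of the definition:

```python
def f_measure(true_parents, inferred):
    """Balanced F-score: harmonic mean of precision and recall."""
    tp = len(set(true_parents) & set(inferred))  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(inferred)
    recall = tp / len(true_parents)
    return 2 * precision * recall / (precision + recall)

# Example: precision = 2/3, recall = 2/3, so F = 2/3.
f_measure({1, 2, 3}, {2, 3, 4})
```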

We considered the following algorithms for comparison: the IAMB algorithm (Tsamardinos et al., 2003), implemented by the Causal Explorer software (Aliferis et al., 2003), which estimates for a given variable the set of variables belonging to its Markov blanket; the mIMR algorithm (Bontempi and Meyer, 2010); the mRMR algorithm (Peng et al., 2005); and three versions of the RC algorithm with three different values λ = 0, 0.5, 1. Note that the RC algorithm with λ = 0 boils down to a conventional wrapper algorithm based on the leave-one-out assessment of the variables' subsets.

We also remark that the RC algorithm aims to return, for a given node, a prioritization of the other nodes according to their causal role, while the Causal Explorer implementation of IAMB returns a specific subset (for a given p-value). For the sake of comparison, we decided to compute the F-measure by setting the number of putative causes to the number of variables returned by IAMB.

Tables 1 and 2 report the average F-measures for different values of α in the linear and nonlinear case, respectively.

The results show the potential of the criterion C and of the RC algorithm in network inference tasks where dependencies between parents are frequent because of direct links or common ancestors. According to the F-measures reported in the tables, the RC accuracy with λ = 0.5 and λ = 1 is consistently better than that of the mIMR, mRMR and IAMB algorithms for all the considered degree distributions. However, the most striking result is the clear improvement, with respect to a conventional wrapper approach targeting only prediction accuracy (λ = 0), obtained when the causal criterion C is taken into account together with a predictive one (λ = 0.5). These results confirm previous findings (Bontempi and Meyer, 2010; Bontempi et al., 2011) putting into evidence that an effective causal inference task should combine a relevance criterion targeting prediction accuracy with a causal term able to prioritize direct causes and penalize effects.

Table 1: Linear case: F-measure (averaged over all nodes with a number of parents and children ≥ 2 and over 10 runs) of the accuracy of the inferred networks on the basis of N = 100 observations.

α     IAMB    mIMR    mRMR    RC (λ=0)  RC (λ=0.5)  RC (λ=1)
2.2   0.375   0.324   0.319   0.386     0.421       0.375
2.3   0.378   0.337   0.333   0.387     0.437       0.401
2.4   0.376   0.342   0.342   0.385     0.441       0.414
2.5   0.348   0.322   0.313   0.358     0.422       0.413
2.6   0.347   0.318   0.311   0.355     0.432       0.414
2.7   0.344   0.321   0.311   0.352     0.424       0.423
2.8   0.324   0.304   0.293   0.334     0.424       0.422
2.9   0.342   0.333   0.321   0.353     0.448       0.459
3.0   0.321   0.319   0.297   0.326     0.426       0.448

Table 2: Nonlinear case: F-measure (averaged over all nodes with a number of parents and children ≥ 2 and over 10 runs) of the accuracy of the inferred network on the basis of N = 100 observations.

α     IAMB    mIMR    mRMR    RC (λ=0)  RC (λ=0.5)  RC (λ=1)
2.2   0.312   0.310   0.304   0.314     0.356       0.324
2.3   0.317   0.328   0.316   0.320     0.375       0.349
2.4   0.304   0.317   0.304   0.306     0.366       0.351
2.5   0.321   0.327   0.328   0.325     0.379       0.359
2.6   0.306   0.325   0.306   0.309     0.379       0.365
2.7   0.313   0.319   0.303   0.316     0.380       0.359
2.8   0.297   0.326   0.300   0.300     0.392       0.382
2.9   0.310   0.329   0.313   0.313     0.389       0.377
3.0   0.299   0.324   0.300   0.303     0.399       0.392

7 CONCLUSIONS

Causal inference from complex large-dimensional data is taking on a growing importance in machine learning and knowledge discovery. Currently, most existing algorithms are limited by the fact that the discovery of causal directionality is subordinated to the detection of a limited set of distinguishable patterns, like unshielded colliders. However, the scarcity of data and the intricacy of dependencies in networks could make the detection of such patterns so rare that the resulting precision would be unacceptable. This paper shows that it is possible to identify new statistical measures that help reduce indistinguishability under the assumption of equal variances of the unexplained variations of the three variables. Though this assumption could be questioned, we deem it important to define new statistics that help discriminate between causal structures for completely connected triplets in linear causal modeling. Future work will focus on assessing whether such a statistic is useful in reducing indeterminacy also when the assumption of equal variance is not satisfied.

REFERENCES

Aliferis, C., Tsamardinos, I., and Statnikov, A. (2003). Causal Explorer: A probabilistic network learning toolkit for biomedical discovery. In The 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS '03).

Anderson, R. and Vastage, G. (2004). Causal modeling alternatives in operations research: overview and application. European Journal of Operational Research, 156:92-109.

Bollen, K. (1989). Structural Equations with Latent Variables. John Wiley and Sons.

Bontempi, G., Haibe-Kains, B., Desmedt, C., Sotiriou, C., and Quackenbush, J. (2011). Multiple-input multiple-output causal strategies for gene selection. BMC Bioinformatics, 12(1):458.

Bontempi, G. and Meyer, P. (2010). Causal filter selection in microarray data. In Proceedings of the ICML2010 conference.

Bowden, R. and Turkington, D. (1984). Instrumental Variables. Cambridge University Press.

Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS).

Graybill, F. (1976). Theory and Application of the Linear Model. Duxbury Press.

Guyon, I., Aliferis, C., and Elisseeff, A. (2007). Computational Methods of Feature Selection, chapter Causal Feature Selection, pages 63-86. Chapman and Hall.

Hershberger, S. (2006). Structural Equation Modeling: A Second Course, chapter The problems of equivalent structural models, pages 13-41. Springer.

Janzing, D., Hoyer, P. O., and Scholkopf, B. (2010). Telling cause from effect based on high-dimensional observations. In Proceedings of the ICML2010 conference.

Janzing, D., Sgouritsa, E., Stegle, O., Peters, J., and Scholkopf, B. (2011). Detecting low-complexity unobserved causes. In Conference on Uncertainty in Artificial Intelligence (UAI2011).

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models. The MIT Press.

Mulaik, S. (2009). Linear Causal Modelling with Structural Equations. CRC Press.

Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238.

Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction and Search. Springer Verlag, Berlin.

Stelzl, I. (1986). Changing a causal hypothesis without changing the fit: Some rules for generating equivalent path models. Multivariate Behavioral Research, 21:309-331.

Tsamardinos, I., Aliferis, C., and Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference (FLAIRS 2003).

Watkinson, J., Liang, K., Wang, X., Zheng, T., and Anastassiou, D. (2009). Inference of regulatory gene interactions from expression data using three-way mutual information. Annals of N.Y. Academy of Sciences, 1158:302-313.

APPENDIX

Let

A = [ 0, b1, 0 ; 0, 0, 0 ; b3, b2, 0 ]   (15)

be the matrix associated with the collider pattern with an edge heading from x2 to x1. We compute the quantity C for such a generative model under the assumption σ1 = σ2 = σ3.

If data have been generated according to the structure (15) and we fit hypothesis 1 we obtain

Ŝ1[2:3, 2:3] =
  [ b1²/(b1² + 1)² + 1,
    b2 + b1(b3b1² + b2b1 + b3)/(b1² + 1)² ;
    b2 + b1(b3b1² + b2b1 + b3)/(b1² + 1)²,
    (b3b1² + b2b1 + b3)²/(b1² + 1)² + b2² + 1 ]   (16)

If we fit hypothesis 2 we obtain

Ŝ2[2:3, 2:3] =
  [ b2²/(b1² + b2² + 1)² + b1²/(b1² + 1)² + 1,
    b2/(b1² + b2² + 1) + b1(b3b1² + b2b1 + b3)/(b1² + 1)² ;
    b2/(b1² + b2² + 1) + b1(b3b1² + b2b1 + b3)/(b1² + 1)²,
    (b3b1² + b2b1 + b3)²/(b1² + 1)² + 1 ]   (17)

It follows that

C(x1, x2, x3) = Ŝ1[3,3] − Ŝ2[3,3] + Ŝ1[2,2] − Ŝ2[2,2]
             = b2² − b2²/(b1² + b2² + 1)² > 0   (18)

In other words, the sign is positive also in the case of a link from x2 to x1.

Let us consider now the fork pattern described by the matrix

A = [ 0, b1, b3 ; 0, 0, b2 ; 0, 0, 0 ]   (19)

If data have been generated according to the structure (19) and we fit hypothesis 1 we obtain (writing K = b1²b2² + b1² + 2b1b2b3 + b3² + 1)

Ŝ1[2:3, 2:3] =
  [ (b1b2² + b3b2 + b1)²/K² + 1,
    (b2 − b1b3)/(b2² + b3² + 1) + (b3 + b1b2)(b1b2² + b3b2 + b1)/K² ;
    (b2 − b1b3)/(b2² + b3² + 1) + (b3 + b1b2)(b1b2² + b3b2 + b1)/K²,
    (b3 + b1b2)²/K² + (b2 − b1b3)²/(b2² + b3² + 1)² + 1 ]   (20)

If we fit hypothesis 2 we obtain

Ŝ2[2:3, 2:3] =
  [ (b2 − b1b3)²/(b1² + 1)² + (b1b2² + b3b2 + b1)²/K² + 1,
    (b2 − b1b3)/(b1² + 1) + (b3 + b1b2)(b1b2² + b3b2 + b1)/K² ;
    (b2 − b1b3)/(b1² + 1) + (b3 + b1b2)(b1b2² + b3b2 + b1)/K²,
    (b3 + b1b2)²/K² + 1 ]   (21)

It follows that

C(x1, x2, x3) = Ŝ1[3,3] − Ŝ2[3,3] + Ŝ1[2,2] − Ŝ2[2,2]
             = (b2 − b1b3)² ( 1/(b2² + b3² + 1)² − 1/(b1² + 1)² )   (22)

Note that this quantity is negative when (b2² + b3²) > b1².
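The sign condition can be checked directly on the closed form (22); a small sketch with illustrative coefficient values:

```python
def C_reversed_fork(b1, b2, b3):
    """Closed form (22) for the fork pattern with the x2 -> x1 edge."""
    return (b2 - b1*b3)**2 * (1/(b2**2 + b3**2 + 1)**2 - 1/(b1**2 + 1)**2)

# Negative whenever b2^2 + b3^2 > b1^2 ...
assert C_reversed_fork(0.5, 0.7, 0.6) < 0
# ... and positive when the condition is violated.
assert C_reversed_fork(2.0, 0.5, 0.5) > 0
```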
