that the marginal likelihood of each CEG is therefore the product of the marginal likelihoods of its component florets. Explicitly, the marginal likelihood of a CEG C is
$$\prod_{u} \frac{\Gamma\left(\sum_{n} \alpha_{un}\right)}{\Gamma\left(\sum_{n} (\alpha_{un} + x_{un})\right)} \prod_{n} \frac{\Gamma(\alpha_{un} + x_{un})}{\Gamma(\alpha_{un})}$$
where, as above,
• u indexes the stages of C,
• n indexes the outgoing edges of each stage,
• $\alpha_{un}$ are the exponents of our Dirichlet priors,
• $x_{un}$ are the data counts.
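As a minimal numerical sketch of how this expression can be evaluated (floret by floret, on the log scale), the following Python code computes the score from per-stage hyperparameter and count vectors; the stage labels and numbers are invented purely for illustration.

```python
from math import lgamma

def stage_log_score(alpha, x):
    """Log of one stage's factor:
    Gamma(sum_n alpha_un) / Gamma(sum_n (alpha_un + x_un))
    * prod_n Gamma(alpha_un + x_un) / Gamma(alpha_un)."""
    return (lgamma(sum(alpha))
            - lgamma(sum(a + c for a, c in zip(alpha, x)))
            + sum(lgamma(a + c) - lgamma(a) for a, c in zip(alpha, x)))

def ceg_log_marginal_likelihood(stages):
    """stages maps each stage u to a pair (alpha_u, x_u) of equal-length
    vectors indexed by the outgoing edges n of that stage."""
    return sum(stage_log_score(alpha, x) for alpha, x in stages.values())

# Hypothetical two-stage CEG; priors and edge counts are made up.
stages = {"u0": ([1.0, 1.0, 1.0], [10, 4, 6]),
          "u1": ([2.0, 2.0], [7, 13])}
print(ceg_log_marginal_likelihood(stages))
```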
As we are actually interested in p(model | data),
and this is proportional to p(data | model) ×
p(model), we need to set both parameter priors and
prior probabilities for the possible models.
Exactly analogously with BNs, parameter modu-
larity in CEGs implies that whenever CEG models
share some aspect of their topology, we assign this as-
pect the same prior distribution in each model. When
such priors reflect our beliefs in a given context, this
can reduce our problem dramatically to one of sim-
ply expressing prior beliefs about the possible floret
distributions (i.e. the local differences in model struc-
ture). As each CEG model is essentially a partition of
the vertices in the underlying tree into sets of stages,
this requirement ensures that when two partitions dif-
fer only in whether or not some subset of vertices be-
longs to the same stage, the prior expressions for the
models differ only in the term relating to this stage.
The separation of the likelihood means that this local
difference property is retained in the posterior distri-
bution.
Now, our candidate set is much richer than the
corresponding candidate BN set, and will probably
contain models we have not previously considered in
our analysis. Again, invoking modularity, if we have
no information to suggest otherwise, we follow stan-
dard BN practice and let p(model) be constant for all
models in the class of CEGs. We now use the loga-
rithm of the marginal likelihood of a CEG model as
its score, and maximise this score over our set of can-
didate models to find the MAP model.
Our expression has the nice property that the
difference in score between two models which are
identical except for a particular subset of florets is a function only of the subscores of the probability tables on the florets where they differ. Vari-
ous fast deterministic and stochastic algorithms can
therefore be derived to search over the model space,
even when this is large – see (Freeman and Smith,
2009). This property is of course shared by the class
of BNs.
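To illustrate (this is only a sketch of the local-score property such algorithms exploit, not the specific procedures of Freeman and Smith (2009)): when scoring the merger of two stages against keeping them separate, every other floret's subscore cancels, so only the three terms below need evaluating. The code reuses the hypothetical stage_log_score above and combines hyperparameters by addition, one natural choice under their dummy-count interpretation.

```python
def merge_gain(alpha1, x1, alpha2, x2):
    """Log-score change from merging two stages whose florets have the
    same number of outgoing edges; all other florets' terms cancel."""
    merged_alpha = [a + b for a, b in zip(alpha1, alpha2)]
    merged_x = [c + d for c, d in zip(x1, x2)]
    return (stage_log_score(merged_alpha, merged_x)
            - stage_log_score(alpha1, x1)
            - stage_log_score(alpha2, x2))

# A positive gain favours the coarser partition (the merged stage).
print(merge_gain([1.0, 1.0], [5, 7], [1.0, 1.0], [6, 6]))
```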
We set the hyperparameters of our priors so that they correspond to counts of dummy units passing through the graph. This can be done by setting a Dirichlet distribution on the root-to-sink paths, and for simplicity we choose a uniform distribution for this.
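A sketch of this construction, under the convention (consistent with the example below, where the root's hyperparameters sum to the 11 paths) that each root-to-sink path carries one dummy unit, so an edge's hyperparameter is the number of paths using it, pooled over the positions making up its stage; the tree and stage map here are hypothetical.

```python
from collections import defaultdict

def path_prior_hyperparameters(paths, stage_of_position):
    """Each root-to-sink path carries one dummy unit; an edge's
    hyperparameter is the number of paths traversing it, pooled over
    the positions that make up its stage.
    paths: list of paths, each a list of (position, edge_label) steps."""
    alpha = defaultdict(lambda: defaultdict(float))
    for path in paths:
        for position, label in path:
            alpha[stage_of_position[position]][label] += 1.0
    return {u: dict(edges) for u, edges in alpha.items()}

# Hypothetical four-path tree in which positions w1 and w2 share a stage.
paths = [[("w0", "a"), ("w1", "yes")], [("w0", "a"), ("w1", "no")],
         [("w0", "b"), ("w2", "yes")], [("w0", "b"), ("w2", "no")]]
stage_of_position = {"w0": "u0", "w1": "u1", "w2": "u1"}
print(path_prior_hyperparameters(paths, stage_of_position))
# {'u0': {'a': 2.0, 'b': 2.0}, 'u1': {'yes': 2.0, 'no': 2.0}}
```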
It is then easy to check that, in the special case where the CEG is expressible as a BN, the CEG score above is equal to the standard score for a BN using the usual prior settings as recommended in, for example, (Cooper and Herskovits, 1992; Heckerman et al., 1995). As a comparison with our CEG expression, given Dirichlet priors and a multivariate likelihood, the marginal likelihood of a BN is expressible as
$$\prod_{i \in V} \left[ \prod_{m} \frac{\Gamma\left(\sum_{n} \alpha_{imn}\right)}{\Gamma\left(\sum_{n} (\alpha_{imn} + x_{imn})\right)} \prod_{n} \frac{\Gamma(\alpha_{imn} + x_{imn})}{\Gamma(\alpha_{imn})} \right]$$
where
• i indexes the set of variables of the BN,
• n indexes the levels of the variable $X_i$,
• m indexes vectors of levels of the parental variables of $X_i$.
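Term by term this has the same form as the CEG score, with one factor for each (variable, parent configuration) pair in place of one per stage; a brief sketch, again reusing the hypothetical stage_log_score above with invented numbers:

```python
def bn_log_marginal_likelihood(families):
    """families maps each pair (i, m) -- a variable and one configuration
    of its parents -- to (alpha_imn, x_imn) vectors indexed by the levels
    n of X_i; each pair contributes a factor of the same form as a stage."""
    return sum(stage_log_score(alpha, x) for alpha, x in families.values())

# Hypothetical binary X1 with a single binary parent X0.
families = {("X0", ()): ([1.0, 1.0], [12, 8]),
            ("X1", (0,)): ([0.5, 0.5], [9, 3]),
            ("X1", (1,)): ([0.5, 0.5], [2, 6])}
print(bn_log_marginal_likelihood(families))
```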
The importance of this result is that were we first
to search the space of BNs for the MAP model, then
we could seamlessly refine this model using the CEG
search score described above. Such embellishments
will allow us to search over models containing signif-
icant amounts of context-specific information. Furthermore, any model we find will have an associated
interpretation which can be stated in common lan-
guage, and can be discussed and critiqued by our
client/expert for its phenomenological plausibility.
Example 1 Continued
For the CEG in Figure 2, we put a uniform prior over the 11 root-to-leaf paths, which in turn allows us to assign our stage priors as follows: we assign a Di(3, 4, 4) prior to the stage identified by $w_0$, a Di(3, 4) prior to the stage $u_1 \equiv (w_1, w_2)$, a Di(2, 2) prior to each of the stages identified by $w_3$ and $w_5$, and a Di(3, 3) prior to the stage identified by $w_4$. We would then have a marginal likelihood of