tell us which sentences are well-formed and which
sentences are not. If there are no constraints on how
different actions can follow each other, then any se-
quence of events is possible and (for finite numbers
of possible actions) the sequence of actions follows
multinomial statistics. This in turn allows one to
predict the most likely distribution function of ob-
served events by minimizing the Kullback-Leibler di-
vergence (Kullback and Leibler, 1951; Hanel, 2014).
Alternatively one can maximize Shannon entropy
(Shannon, 1948) under constraining conditions im-
plemented by so-called cross-entropy terms (Hanel,
2014; Hanel, 2015). Entropy emerges asymptotically,
as the logarithm of the multiplicity of a given se-
quence divided by the effective number of degrees of
freedom, which in the multinomial case is the number
of observations N. Since in the purely multinomial
case all permutations of a sequence are well formed
and have identical probabilities, the multiplicity of
such sequences is given by the multinomial factor.
To sketch how Shannon entropy and Kullback-Leibler
divergence depend on the multinomial statistics of
the underlying process we may consider a Bernoulli
process with states $i = 1, \cdots, W$ and prior probabilities $q = (q_1, \cdots, q_W)$. We note that $1 = (\sum_i q_i)^N = \sum_{|k|_1 = N} \binom{N}{k} \prod_i q_i^{k_i}$, where $k = (k_1, \cdots, k_W)$ is the histogram of the process after $N$ observations, i.e. $k_i$ is the number of occurrences of state $i$. In particular, the probability $P(k|q) = M(k)G(k|q)$ of the histogram $k$ factorizes into the probability of finding a sequence with histogram $k$, given by $G(k|q) = \prod_i q_i^{k_i}$, and the multiplicity of such sequences, given by the multinomial coefficient $M(k) = \binom{N}{k}$. It follows that Shannon entropy asymptotically (for large $N$) is given by $H(p) = -\sum_i p_i \log p_i = \frac{1}{N} \log M(k)$, where $p = k/N$ are the relative frequencies of observing states $i$. Similarly, $-\frac{1}{N} \log G(k|q) = -\sum_i p_i \log q_i$ is the cross-entropy, and $D_{\mathrm{KL}}(p\|q) = \sum_i p_i (\log p_i - \log q_i) = -\frac{1}{N} \log P(k|q)$ is the Kullback-Leibler divergence. Maximum entropy
estimates therefore correspond to the so-called maximum configuration, the most likely histogram of a
process after N observations. If generative rules con-
strain sequences, the rules of a regular grammar for
instance, then the number of sequences with identi-
cal histograms becomes smaller than the multinomial
factor, directly affecting the functional form of en-
tropy (the scaled logarithm of multiplicity) and divergence (Hanel, 2014; Hanel, 2015).
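These identities are easy to check numerically. The following sketch (our own illustration, with an arbitrarily chosen prior $q$ and histogram $k$, not an example from the source) compares $H(p)$ with $\frac{1}{N}\log M(k)$, and $D_{\mathrm{KL}}(p\|q)$ with $-\frac{1}{N}\log P(k|q)$; the pairs agree up to Stirling corrections of order $\log N / N$.

```python
import math

# Illustration (not from the source): W = 3 states, prior q, histogram k.
q = [0.5, 0.3, 0.2]
k = [36, 12, 12]            # counts of each state after N observations
N = sum(k)                  # N = 60
p = [ki / N for ki in k]    # relative frequencies p = k / N

# Log of the multinomial coefficient M(k) = N! / (k_1! ... k_W!)
log_M = math.lgamma(N + 1) - sum(math.lgamma(ki + 1) for ki in k)

# Shannon entropy H(p) vs. its multiplicity form (1/N) log M(k)
H = -sum(pi * math.log(pi) for pi in p)
H_mult = log_M / N

# Kullback-Leibler divergence vs. -(1/N) log P(k|q),
# with P(k|q) = M(k) G(k|q) and log G(k|q) = sum_i k_i log q_i
D_kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
log_G = sum(ki * math.log(qi) for ki, qi in zip(k, q))
D_from_P = -(log_M + log_G) / N

print(H, H_mult)       # close; gap is the O(log N / N) Stirling correction
print(D_kl, D_from_P)  # same gap
```

Note that $\log M(k) \le N H(p)$ holds exactly, so the multiplicity-based estimate approaches the entropy from below as $N$ grows.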
Consider a network (e.g. the streets of London) consisting of nodes and sets of links connecting those nodes. A walk on such a network can be interpreted as a process composing elementary actions
symbolized by links i → j from a node i to another
node j. Typically, not all actions can be freely com-
posed. We can only compose those actions where the
end-node j of one link i → j is the starting-node of
another link j → k. If one moves from one place,
X, in town to another, Y, then the next move has
to start in Y. In processes of even higher complexity, language for instance, well-formed sequences of states may follow rules of succession that are more complicated than the simple groupoid induced by an underlying network topology. In order to develop the information theory of such processes we need an appropriate generalization of the multinomial coefficient that counts only well-formed sequences. In other words, the syntactic
rules governing a complex process become important
for correctly counting the numbers of well-formed se-
quences of length N. We note that beneath the statistical description of a system we again require a structural one that allows us to identify well-formed or typical sequences. The effort required to identify a particular process, or at least the class a process belongs to, cannot be avoided, reminding us of the non-existence of a free lunch (Wolpert and Macready, 1995).
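The composition constraint can be made concrete with a small sketch (a hypothetical three-node directed network of our own choosing, not an example from the source): the number of well-formed sequences of N composable links is the sum of the entries of the N-th power of the adjacency matrix, typically far smaller than the unconstrained multinomial count (number of links)^N.

```python
# Hypothetical directed network with 3 nodes; A[i][j] = 1 if link i -> j exists.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]

def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

num_links = sum(sum(row) for row in A)  # 4 elementary actions (links)
N = 6                                   # walk length in links

# Well-formed sequences of N links are walks in which the end-node of each
# link is the start-node of the next; their number is the sum of entries of A^N.
P = A
for _ in range(N - 1):
    P = matmul(P, A)
well_formed = sum(sum(row) for row in P)

# Without the composition constraint, any link could follow any other,
# giving the multinomial count num_links**N.
unconstrained = num_links ** N

print(well_formed, unconstrained)  # -> 16 4096
```

Here only 16 of the 4096 unconstrained length-6 sequences are well formed, illustrating how syntactic rules shrink the multiplicity below the multinomial factor.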
In the following we show that if a process possesses a description in terms of a directed multi-graph, i.e. if the process can be understood with a finite number of states for sequences of arbitrary length (regular grammars, finite automata), then enough of the multinomial structure of the process is preserved to implement decision rules into a matrix representation of words a ∈ A, which automatically takes care of the syntactic rules. We will demonstrate the power of the methodology in an example, providing an elegant way of counting the number of stable attractor states that exist in the Oslo sandpile model (Corral, 2004), depending on the size of the base of the Oslo sandpile.
2 FINITE STATE TRANSITION
SYSTEMS
Let $\sigma_0$ be the initial state of a process before we
sample the first step. Let A be the lexicon, each
word in the lexicon representing a possible event. Let
$\lambda = (\lambda_1, \cdots, \lambda_N)$ be a sequence of events $\lambda_n \in A$. Any
sequence λ can either be well formed or not. Those
sequences that are not well-formed again have to be
distinguished into two sub-classes. The first class
contains transient sequences that are not well formed
at length N but are part of a well formed sequence, i.e.
there exists a $\lambda'$ with length $N' > N$ such that $\lambda'_n = \lambda_n$ for all $n = 1, \cdots, N$ and $\lambda'$ is well formed. What re-
mains are sequences that are not well formed and do
not form the beginning of a longer well formed se-