vector should move in the direction of the gradient. The smallest possible step size in integer space is 1, i.e., any parameter can either be increased or decreased by 1. At the beginning of an integer gradient-based optimization, the gradient suggests increasing a rather large number of parameters. This results in rather slow convergence, since, due to the fixed step size of 1, most of the parameters are worse than before the update. To compensate for this, we suggest updating, for each clique, only the parameter whose partial derivative has the largest magnitude. This method is used when estimating the CRF parameters in the following section.
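To make the update rule concrete, the following is a minimal sketch in Python; the array layout, the clique index structure, and the gradient-ascent sign convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def integer_gradient_step(theta, grad, cliques):
    """One integer update: per clique, move only the parameter whose
    partial derivative has the largest magnitude, by +1 or -1.

    theta   -- integer parameter vector (np.ndarray of integer dtype)
    grad    -- gradient of the log-likelihood w.r.t. theta
    cliques -- list of index arrays, one per clique (hypothetical layout)
    """
    for idx in cliques:
        j = idx[np.argmax(np.abs(grad[idx]))]  # steepest coordinate in clique
        theta[j] += 1 if grad[j] > 0 else -1   # smallest possible integer step
    return theta
```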
4 NUMERICAL RESULTS
The previous sections pointed out various factors that may influence the training error, test performance, or runtime of the integer approximation. In order to show that integer undirected models are a quite general approach for approximate learning in discrete state spaces, generative and discriminative variants of undirected models are evaluated on synthetic and real-world data. In particular, the following methods are considered: RealMRF: the classic generative undirected model as described in Section 2. RealCRF: the discriminative classifier as defined in (Lafferty et al., 2001; Sutton and McCallum, 2012). IntMRF: the integer approximation of generative undirected models as described in Section 3. IntCRF: the integer approximation of discriminative undirected models; further details are explained in Section 4.4. Both real variants are based on floating point arithmetic. In the MRF experiments, the model parameters are estimated from the empirical expectations by Eqs. (5), (6) and (13). Parameters of discriminative models are estimated by stochastic gradient methods (Sutton and McCallum, 2012). Each MRF experiment was repeated 100 times on random input distributions and graphs. In most cases, only the average is reported, since the standard deviation was too small to be visualized in a plot. Whenever MAP accuracy is reported, it corresponds to the percentage of correctly labeled vertices, where the prediction is computed with Eq. (3).
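For clarity, this accuracy measure amounts to the following; the function name and argument types are illustrative.

```python
def map_accuracy(true_labels, predicted_labels):
    """Percentage of correctly labeled vertices (predictions from Eq. (3))."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)
```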
The implementations of all evaluated methods are equally efficient, e.g., the message computation (and therefore the probability computation) executes exactly the same code, except for the arithmetic instructions. For reproducibility, all data and code are available at http://sfb876.tu-dortmund.de/intmodels. Unless explicitly stated otherwise, the experiments are run on an Intel Core i7-2600K 3.4 GHz (Sandy Bridge architecture) with 16 GB 1333 MHz DDR3 main memory.
Synthetic Data. In order to achieve robust results that capture the average behavior of the integer approximation, a synthetic data generator has been implemented that samples random empirical marginals with corresponding MAP states. To this end, a sequential algorithm for random trees with given degrees (Blitzstein and Diaconis, 2011) generates random tree-structured graphs. For each random graph, the weights θ*_i ∼ N(0,1) are sampled from a Gaussian distribution. Additionally, for each vertex, a random state is selected that receives a constant extra amount of weight, thus enforcing low entropy. The weights are then used to generate marginals and MAP states with the double precision floating point variant of belief propagation. The generated marginals serve as empirical input distribution, and the MAP state is compared to the MAP states estimated by IntMRF and RealMRF.
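A minimal sketch of such a generator is given below, with two deliberate simplifications: a random recursive tree stands in for the Blitzstein and Diaconis (2011) sampler for trees with given degrees, and the belief propagation step that turns weights into marginals and MAP states is omitted. All names and the boost constant are assumptions.

```python
import numpy as np

def sample_synthetic_model(n_vertices, n_states, boost=2.0, seed=None):
    """Sample a random tree with Gaussian weights theta*_i ~ N(0,1);
    one random state per vertex gets extra weight to enforce low entropy."""
    rng = np.random.default_rng(seed)
    # random recursive tree: attach each vertex v >= 1 to a uniformly
    # chosen earlier vertex (stand-in for the given-degrees sampler)
    edges = [(v, int(rng.integers(v))) for v in range(1, n_vertices)]
    theta = rng.normal(0.0, 1.0, size=(n_vertices, n_states))
    for v in range(n_vertices):
        theta[v, rng.integers(n_states)] += boost  # low-entropy peak
    # marginals and the MAP state would then be computed from theta with
    # the double precision variant of belief propagation (omitted here)
    return edges, theta
```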
CoNLL-2000 Data. This data set was proposed for the shared task at the Conference on Computational Natural Language Learning in 2000 and is based on the Wall Street Journal corpus. It contains word features and one label, called chunk tag, per word. In total, there are 22 chunk tags that correspond to the vertex states, i.e., |X| = 22. For the computation of the per-chunk F1-score, a chunk is treated as correct if and only if all consecutive tags that belong to the same chunk are correct. The data set contains 8936 training instances and 2012 test instances. Because of the inherent dependency between neighboring vertex states, this data set is well suited to evaluate whether the dependency structure between vertices is preserved by the integer approximation.
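A sketch of this chunk-level evaluation, assuming well-formed BIO tag sequences (every chunk opens with a B- tag); the function names are illustrative.

```python
def chunk_spans(tags):
    """Extract (start, end, type) spans from a BIO tag sequence."""
    spans, start, typ = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != typ):
            spans.add((start, i, typ))
            start = None
        if tag.startswith("B-"):
            start, typ = i, tag[2:]
    return spans

def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F1: a chunk is correct iff its whole span and type match."""
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```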
4.1 The Impact of |X_v| and |N_v| on Quality and Runtime
In Section 3, the estimate of the error in marginal probabilities computed with bit-length BP (Section 3.1) indicates that the size of the vertex state space |X_v| and the degree |N_v| have an impact on the training error. Figure 2 shows the training error in terms of normalized negative log-likelihood, the test error in terms of MAP accuracy, and the runtime in seconds for two values of |X_v| and |N_v|, for an increasing number of vertices on the synthetic data. Each point in each curve is the average over 100 random trees with random parameters. The results with varying |X_v| are generated with a maximum degree of 8, and those for varying |N_v| with |X_v| = 4.
In terms of training error, the mid-right plot shows
a clear offset between integer and floating point es-
timates for the same number of states. In terms of