vector should move in the direction of the gradient. The smallest possible step size in integer space is 1, i.e., any parameter can either be increased or decreased by 1. At the beginning of an integer gradient-based optimization, the gradient suggests increasing a rather large number of parameters. This results in rather slow convergence, since, due to the fixed step size of 1, most of the parameters are worse than before the update. To compensate for this, we suggest updating, for each clique, only the parameter whose partial derivative has the largest magnitude. This method is used when estimating the CRF parameters in the following section.
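To make the update rule concrete, the following is a minimal sketch in Python; the array layout, the clique index structure, and the gradient-ascent sign convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def integer_gradient_step(theta, grad, cliques):
    """One integer update: per clique, move only the parameter whose
    partial derivative has the largest magnitude, by +1 or -1.

    theta   -- integer parameter vector (np.ndarray of integer dtype)
    grad    -- gradient of the log-likelihood w.r.t. theta
    cliques -- list of index arrays, one per clique (hypothetical layout)
    """
    for idx in cliques:
        j = idx[np.argmax(np.abs(grad[idx]))]  # steepest coordinate in clique
        theta[j] += 1 if grad[j] > 0 else -1   # smallest possible integer step
    return theta
```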
4 NUMERICAL RESULTS
The previous sections pointed out various factors that may influence the training error, test performance, or runtime of the integer approximation. In order to show that integer undirected models are a quite general approach for approximate learning in discrete state spaces, generative and discriminative variants of undirected models are evaluated on synthetic and real-world data. In particular, the following methods are considered: RealMRF: the classic generative undirected model as described in Section 2. RealCRF: the discriminative classifier as defined in (Lafferty et al., 2001; Sutton and McCallum, 2012). IntMRF: the integer approximation of generative undirected models as described in Section 3. IntCRF: the integer approximation of discriminative undirected models; further details are explained in Section 4.4. Both real variants are based on floating point arithmetic. In the MRF experiments, the model parameters are estimated from the empirical expectations by Eqs. (5), (6) and (13). Parameters of discriminative models are estimated by stochastic gradient methods (Sutton and McCallum, 2012). Each MRF experiment was repeated 100 times on random input distributions and graphs. In most cases, only the average is reported, since the standard deviation was too small to be visualized in a plot. Whenever MAP accuracy is reported, it corresponds to the percentage of correctly labeled vertices, where the prediction is computed with Eq. (3).
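For clarity, this accuracy measure amounts to the following; the function name and argument types are illustrative.

```python
def map_accuracy(true_labels, predicted_labels):
    """Percentage of correctly labeled vertices (predictions from Eq. (3))."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)
```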
The implementations of all evaluated methods are equally efficient, e.g., the message computation (and therefore the probability computation) executes exactly the same code, except for the arithmetic instructions. For reproducibility, all data and code are available at http://sfb876.tu-dortmund.de/intmodels. Unless explicitly stated otherwise, the experiments are run on an Intel Core i7-2600K 3.4 GHz (Sandy Bridge architecture) with 16 GB 1333 MHz DDR3 main memory.
Synthetic Data. In order to achieve robust results that capture the average behavior of the integer approximation, a synthetic data generator has been implemented that samples random empirical marginals with corresponding MAP states. To this end, a sequential algorithm for random trees with given degrees (Blitzstein and Diaconis, 2011) generates random tree-structured graphs. For each random graph, the weights θ*_i ∼ N(0,1) are sampled from a Gaussian distribution. Additionally, for each vertex, a random state is selected that receives a constant extra amount of weight, thus enforcing low entropy. The weights are then used to generate marginals and MAP states with the double precision floating point variant of belief propagation. The generated marginals serve as empirical input distribution, and the MAP state is compared to the MAP states estimated by IntMRF and RealMRF.
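A minimal sketch of such a generator is given below, with two deliberate simplifications: a random recursive tree stands in for the Blitzstein and Diaconis (2011) sampler for trees with given degrees, and the belief propagation step that turns weights into marginals and MAP states is omitted. All names and the boost constant are assumptions.

```python
import numpy as np

def sample_synthetic_model(n_vertices, n_states, boost=2.0, seed=None):
    """Sample a random tree with Gaussian weights theta*_i ~ N(0,1);
    one random state per vertex gets extra weight to enforce low entropy."""
    rng = np.random.default_rng(seed)
    # random recursive tree: attach each vertex v >= 1 to a uniformly
    # chosen earlier vertex (stand-in for the given-degrees sampler)
    edges = [(v, int(rng.integers(v))) for v in range(1, n_vertices)]
    theta = rng.normal(0.0, 1.0, size=(n_vertices, n_states))
    for v in range(n_vertices):
        theta[v, rng.integers(n_states)] += boost  # low-entropy peak
    # marginals and the MAP state would then be computed from theta with
    # the double precision variant of belief propagation (omitted here)
    return edges, theta
```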
CoNLL-2000 Data. This data set was proposed for the shared task at the Conference on Computational Natural Language Learning in 2000 and is based on the Wall Street Journal corpus. It contains word features and one label, called chunk tag, per word. In total, there are 22 chunk tags that correspond to the vertex states, i.e., |X| = 22. For the computation of the per-chunk F1-score, a chunk is treated as correct if and only if all consecutive tags that belong to the same chunk are correct. The data set contains 8936 training instances and 2012 test instances. Because of the inherent dependency between neighboring vertex states, this data set is well suited to evaluate whether the dependency structure between vertices is preserved by the integer approximation.
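A sketch of this chunk-level evaluation, assuming well-formed BIO tag sequences (every chunk opens with a B- tag); the function names are illustrative.

```python
def chunk_spans(tags):
    """Extract (start, end, type) spans from a BIO tag sequence."""
    spans, start, typ = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != typ):
            spans.add((start, i, typ))
            start = None
        if tag.startswith("B-"):
            start, typ = i, tag[2:]
    return spans

def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F1: a chunk is correct iff its whole span and type match."""
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```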
4.1 The Impact of |X_v| and |N_v| on Quality and Runtime
In Section 3, the estimate of the error in marginal probabilities computed with bit-length BP (Section 3.1) indicates that the size of the vertex state space |X_v| and the degree |N_v| have an impact on the training error. Figure 2 shows the training error in terms of normalized negative log-likelihood, the test error in terms of MAP accuracy, and the runtime in seconds for two values of |X_v| and |N_v|, for an increasing number of vertices on the synthetic data. Each point in each curve is the average over 100 random trees with random parameters. The results with varying |X_v| are generated with a maximum degree of 8, and those for varying |N_v| with |X_v| = 4.
In terms of training error, the mid-right plot shows
a clear offset between integer and floating point es-
timates for the same number of states. In terms of