Word Alignment Quality in the IBM 2 Mixture Model
Jorge Civera and Alfons Juan
ITI/DSIC, Universidad Politécnica de Valencia, Spain
* Work partially supported by the Ministerio de Educación y Ciencia, the EC (FEDER) and the
Spanish MEC under grant TIN2006-15694-CO2-01, and the Spanish research programme
Consolider Ingenio 2010: MIPRCV (CSD2007-00018).
Abstract. Finite mixture modelling is a standard pattern recognition technique.
However, in statistical machine translation (SMT), the use of mixture modelling
is currently being explored. Two main advantages of the mixture approach are
first, its flexibility to find an appropriate tradeoff between model complexity and
the amount of training data available and second, its capability to learn specific
probability distributions that better fit subsets of the training dataset. This latter
advantage is even more important in SMT, since it is widely accepted that most
state-of-the-art translation models proposed have limited application to restricted
semantic domains. In this work, we revisit the mixture extension of the well-known
M2 translation model (known in the literature as the IBM 2 model). The M2 mixture
model is evaluated on a large-scale word alignment task, obtaining encouraging results
that prove the applicability of finite mixture modelling in SMT.
1 Introduction
Finite mixture modelling is a popular approach for density estimation in many scientific
areas [1]. On the one hand, mixtures are flexible enough for finding an appropriate
tradeoff between model complexity and the amount of training data available. Usually,
model complexity is controlled by varying the number of mixture components while
keeping the same parametric form for all components. On the other hand, maximum
likelihood estimation of mixture parameters can be reliably accomplished by the well-
known Expectation-Maximisation (EM) algorithm [2,3].
One of the most interesting properties of mixture modelling is its capability to learn
specific probability distributions that better explain the data generation process of a
multimodal dataset. In translation tasks, these multimodal datasets are not an ex-
ception, but the general case. Indeed, it is easy to find corpora from which several topics
could be drawn. These topics define sets of topic-specific lexicons that need to be
translated taking into account the semantic context in which they are found. This semantic ambiguity
problem could be overcome by learning topic-dependent translation models that cap-
ture together the semantic context and the translation process. The application of finite
mixture modelling to SMT is currently being explored with successful results [4–6].
Previous work on finite mixture modelling applied to SMT has mainly focused on
the mixture extension of word-based alignment models, more precisely, the well-known
IBM alignment models [7,8]. In [4], a mixture extension of the M2 model is proposed,
reporting appealing results on a small synthetic task. However, the question that
arises is whether these positive results on a small task can be extrapolated to large-scale
tasks. This paper presents an alternative evaluation of the M2 mixture model on a word
alignment shared task that serves as a reference task in SMT [9–13].
Indeed, word alignment is the first step towards the construction of modern phrase-
based SMT systems [14–17]. It involves the induction of a word mapping from a
(source) language into another (target) language over bilingual sentences. The second
phase uses statistics over these learnt word alignments to translate new sentences.
In this paper, we first review the M2 model in Section 2, before deriving the M2 mixture
model in Section 3. In Section 4, we introduce the evaluation metrics that are used to
assess word alignment quality of the proposed model on the shared task presented in
Section 5. Section 6 is devoted to experimental results and Section 7 concludes and
provides an outlook on future work.
2 The M2 model
2.1 The Model
Let (x, y) be a pair of source-target sentences; i.e. x is a sentence in a certain source
language and y is its corresponding translation in a different target language. Let X and
Y denote the source and target vocabularies, respectively. The IBM alignment models
are parametric models for the translation probability p(x | y); i.e., the probability that x
is the source sentence from which we get a given translation y.
The IBM alignment models assume that each source word is connected to exactly
one target word. Also, it is assumed that the target sentence has an initial NULL or
empty word to which source words with no direct translation are connected. Formally, a
hidden variable a = a_1 a_2 \cdots a_{|x|} is introduced to reveal, for each source word
position j, the target word position a_j ∈ {0, 1, . . . , |y|} to which it is connected. Thus,

    p(x \mid y) = \sum_{a \in A(x,y)} p(x, a \mid y)    (1)
where A(x, y) denotes the set of all possible alignments between x and y. The term
p(x, a | y) can be factorised as source position-dependent probabilities
    p(x, a \mid y) = \prod_{j=1}^{|x|} p(x_j, a_j \mid a_1^{j-1}, x_1^{j-1}, y)    (2)
In the case of the IBM model 2, it is assumed that a_j only depends on j and |y|, and
that x_j only depends on the target word to which it is connected, y_{a_j}. Hence,

    p(x_j, a_j \mid a_1^{j-1}, x_1^{j-1}, y) := p(a_j \mid j, |y|)\, p(x_j \mid y_{a_j})    (3)
and the set of unknown parameters Θ comprises
    \Theta = \Big\{\, p(i \mid j, |y|) \;\; \forall i \in \{0, 1, \ldots, |y|\},\; \forall j \in \{1, \ldots, |x|\} \text{ and } \forall |y|;\;\;
              p(u \mid v) \;\; \forall u \in X,\; \forall v \in Y \,\Big\}    (4)
Note that the alignment parameters defined here are slightly different from those defined
in the original parametrisation [8], which also depend on |x|, p(i | j, |x|, |y|).
Putting Eqs. (1), (2) and (3) together, we define the M2 model, after some
straightforward manipulations, as follows:

    p(x \mid y) = \prod_{j=1}^{|x|} \sum_{i=0}^{|y|} p(i \mid j, |y|)\, p(x_j \mid y_i).    (5)
2.2 Maximum Likelihood Estimation
It is not difficult to derive an EM algorithm to perform maximum likelihood estimation
of Θ with respect to a collection of N independent training samples (X, Y) =
{(x_1, y_1), . . . , (x_N, y_N)}. The log-likelihood function is:

    L(\Theta) = \sum_{n=1}^{N} \log \sum_{a_n} p(x_n, a_n \mid y_n)    (6)
with

    p(x_n, a_n \mid y_n) = \prod_{j=1}^{|x_n|} p(a_{nj} \mid j, |y_n|)\, p(x_{nj} \mid y_{n a_{nj}})
                         = \prod_{j=1}^{|x_n|} \prod_{i=0}^{|y_n|} \big[ p(i \mid j, |y_n|)\, p(x_{nj} \mid y_{ni}) \big]^{a_{nji}}    (7)
where, for convenience, the alignment variable a_{nj} ∈ {0, 1, . . . , |y_n|} has been
rewritten in Eq. (7) as an indicator vector a_{nj} = (a_{nj0}, . . . , a_{nj|y_n|}), with a 1 at the
aligned target position and zeros elsewhere.
Now, we can define A as the set of alignment indicator vectors associated with the
bilingual pairs (X, Y),

    A = (a_1, \ldots, a_n, \ldots, a_N)^t    (8)

where the variable A is the missing data in the M2 model.
The EM algorithm maximises Eq. (6) iteratively, through the application of two
basic steps in each iteration: the E(xpectation) step and the M(aximisation) step.
The E step computes the expected value of the logarithm of p(X, A | Y), given the
(incomplete) data samples (X, Y) and a current estimate of Θ, Θ^(k). Given that the
alignment variables in A are independent from each other, we can compute the E step
as the Q function in the EM terminology,

    Q(\Theta \mid \Theta^{(k)}) = \sum_{n=1}^{N} E\big( \log p(x_n, a_n \mid y_n; \Theta) \;\big|\; x_n, y_n, \Theta^{(k)} \big)    (9)

                              = \sum_{n=1}^{N} \sum_{j=1}^{|x_n|} \sum_{i=0}^{|y_n|} a_{nji}^{(k)} \big[ \log p(i \mid j, |y_n|) + \log p(x_{nj} \mid y_{ni}) \big]    (10)
with

    a_{nji}^{(k)} = \frac{p(i \mid j, |y_n|)^{(k)}\, p(x_{nj} \mid y_{ni})^{(k)}}{\sum_{i'=0}^{|y_n|} p(i' \mid j, |y_n|)^{(k)}\, p(x_{nj} \mid y_{ni'})^{(k)}}.    (11)
That is, the expectation that word x_{nj} is connected to y_{ni} is our current estimate of
the probability that x_{nj} is translated into y_{ni}, rather than into any other word in y_n
(including the NULL word).
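For a single sentence pair, the E step of Eq. (11) can be sketched as follows; it reuses the hypothetical `lex` and `align` tables of the previous sketch and returns, for every source position, its posterior distribution over target positions (including NULL).

```python
def e_step_responsibilities(x, y, lex, align):
    """Eq. (11): resp[j][i] is the posterior probability that source position j
    (1-based) is aligned to target position i (0 denotes the NULL word)."""
    y0 = ["<NULL>"] + y
    resp = {}
    for j, src_word in enumerate(x, start=1):
        scores = [align.get((i, j, len(y)), 0.0) * lex.get((src_word, tgt), 0.0)
                  for i, tgt in enumerate(y0)]
        total = sum(scores)
        # Normalise over all target positions; fall back to uniform if all scores are zero.
        resp[j] = [s / total if total > 0 else 1.0 / len(y0) for s in scores]
    return resp
```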
Then, the M step finds a new estimate of Θ, Θ^(k+1), by maximising Eq. (9), using
Eq. (11) instead of the missing a_{nji}. This results in:
    p(i \mid j, |y|)^{(k+1)} =
        \frac{\sum_{\substack{n=1 \\ j \le |x_n|,\; |y_n| = |y|}}^{N} a_{nji}^{(k)}}
             {\sum_{i'=0}^{|y|} \sum_{\substack{n=1 \\ j \le |x_n|,\; |y_n| = |y|}}^{N} a_{nji'}^{(k)}}
        \qquad \forall i, j \text{ and } |y|;    (12)
and

    p(u \mid v)^{(k+1)} =
        \frac{\sum_{n=1}^{N} \sum_{\substack{j=1 \\ x_{nj} = u}}^{|x_n|} \sum_{\substack{i=0 \\ y_{ni} = v}}^{|y_n|} a_{nji}^{(k)}}
             {\sum_{u' \in X} \sum_{n=1}^{N} \sum_{\substack{j=1 \\ x_{nj} = u'}}^{|x_n|} \sum_{\substack{i=0 \\ y_{ni} = v}}^{|y_n|} a_{nji}^{(k)}}
        \qquad \forall u \in X \text{ and } \forall v \in Y.    (13)
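The M step of Eqs. (12) and (13) then amounts to accumulating the expected counts over the corpus and renormalising. A minimal sketch, building on the hypothetical helpers above (the corpus layout is also an assumption):

```python
from collections import defaultdict

def m_step(corpus, lex, align):
    """One M step (Eqs. (12) and (13)). `corpus` is a list of (x, y) pairs,
    with y given without the NULL word; new parameter tables are returned."""
    lex_counts, lex_totals = defaultdict(float), defaultdict(float)   # for p(u | v)
    al_counts, al_totals = defaultdict(float), defaultdict(float)     # for p(i | j, |y|)
    for x, y in corpus:
        resp = e_step_responsibilities(x, y, lex, align)   # E step from the previous sketch
        y0 = ["<NULL>"] + y
        for j, u in enumerate(x, start=1):
            for i, v in enumerate(y0):
                c = resp[j][i]                             # expected count a_nji^(k)
                lex_counts[(u, v)] += c
                lex_totals[v] += c
                al_counts[(i, j, len(y))] += c
                al_totals[(j, len(y))] += c
    new_lex = {(u, v): c / lex_totals[v] for (u, v), c in lex_counts.items()}
    new_align = {(i, j, J): c / al_totals[(j, J)] for (i, j, J), c in al_counts.items()}
    return new_lex, new_align
```

Starting from the M1-style uniform initialisation of Eq. (14) below and iterating the E and M steps gives the complete training procedure.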
An initial estimate for Θ, Θ^(0), is required for the EM algorithm to start. In the case
of the M2 model, we use the initial solution given by the M1 model, which is a particular
case of the M2 model in which the alignment probabilities are uniformly distributed; i.e.,

    p(i \mid j, |y|) = \frac{1}{|y| + 1} \qquad \forall i, j \text{ and } |y|.    (14)
3 Mixture of M2 models
3.1 The Model
A finite mixture model is a probability (density) function of the form:

    p(z) = \sum_{t=1}^{T} p(t)\, p(z \mid t)    (15)

where T is the number of mixture components and, for each component t, p(t) ∈ [0, 1]
is its prior or coefficient and p(z | t) is its component-conditional probability (density)
function. It can be seen as a generative model that first selects the t-th component with
probability p(t) and then generates z in accordance with p(z | t). It is clear that finite
mixture modelling allows generalisation of any given probabilistic model by simply
using more than one component.
In this work, we are interested in modelling the translation probability p(x | y) using
a T-component, y-conditional mixture of M2 models:

    p(x \mid y) = \sum_{t=1}^{T} p(t)\, p(x \mid y, t)    (16)

where

    p(x \mid y, t) = \prod_{j=1}^{|x|} \sum_{i=0}^{|y|} p(i \mid j, |y|, t)\, p(x_j \mid y_i, t)    (17)
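Under the same hypothetical table layout, Eqs. (16) and (17) are a straightforward extension of the earlier likelihood sketch; `priors`, `lex_t` and `align_t` (per-component tables indexed by t) are assumed names.

```python
def m2_mixture_likelihood(x, y, priors, lex_t, align_t):
    """Eqs. (16)-(17): p(x | y) as a T-component mixture of M2 models.
    priors[t] is p(t); lex_t[t] and align_t[t] are the component-conditional tables."""
    return sum(p_t * m2_likelihood(x, y, lex_t[t], align_t[t])   # reuses the earlier sketch
               for t, p_t in enumerate(priors))
```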
Note that we could have made p(t) depend on y in Eq. (16) but, for simplicity, this is
left for future work. Thus, the global vector of parameters Θ is

    \Theta = \big( p(1), \ldots, p(t), \ldots, p(T);\; \Theta_1, \ldots, \Theta_t, \ldots, \Theta_T \big)^t    (18)

where, for each component t, p(t) is its mixture prior or coefficient and Θ_t comprises
the component-conditional parameters

    \Theta_t = \Big\{\, p(i \mid j, |y|, t) \;\; \forall i \in \{0, 1, \ldots, |y|\},\; \forall j \in \{1, \ldots, |x|\} \text{ and } \forall |y|;\;\;
                p(u \mid v, t) \;\; \forall u \in X,\; \forall v \in Y \,\Big\}    (19)
It is easy to extend the EM algorithm developed in the previous section to the case
of M2 mixtures. The log-likelihood function of Θ with respect to N training samples is
    L(\Theta) = \sum_{n=1}^{N} \log \sum_{z_n} \sum_{a_n} p(x_n, z_n, a_n \mid y_n)    (20)
where z_n = (z_{n1}, \ldots, z_{nT}) is an indicator vector for the component generating x_n, and

    p(x_n, z_n, a_n \mid y_n) = \prod_{t=1}^{T} \big[ p(t)\, p(x_n, a_n \mid y_n, t) \big]^{z_{nt}}    (21)
with

    p(x_n, a_n \mid y_n, t) = \prod_{j=1}^{|x_n|} \prod_{i=0}^{|y_n|} \big[ p(i \mid j, |y_n|, t)\, p(x_{nj} \mid y_{ni}, t) \big]^{a_{nji}}

where, as in the previous section, a_{nji} = 1 means that the nth training pair has its source
position j connected to target position i. Note that data completion in the mixture case
includes the alignments A and the component labels
    Z = (z_1, \ldots, z_n, \ldots, z_N)^t    (22)
as well. Thus, the Q function for the M2 mixture model becomes
    Q(\Theta \mid \Theta^{(k)}) = \sum_{n=1}^{N} \sum_{t=1}^{T} \Big( z_{nt}^{(k)} \log p(t)
        + \sum_{j=1}^{|x_n|} \sum_{i=0}^{|y_n|} (z_{nt} a_{nji})^{(k)} \big[ \log p(i \mid j, |y_n|, t) + \log p(x_{nj} \mid y_{ni}, t) \big] \Big)    (23)
with

    z_{nt}^{(k)} = \frac{p(t)^{(k)}\, p(x_n \mid y_n, t)^{(k)}}{\sum_{t'=1}^{T} p(t')^{(k)}\, p(x_n \mid y_n, t')^{(k)}}    (24)
and the expected value of z_{nt} a_{nji},

    (z_{nt} a_{nji})^{(k)} = z_{nt}^{(k)}\, a_{njit}^{(k)}    (25)
with

    a_{njit}^{(k)} = \frac{p(i \mid j, |y_n|, t)^{(k)}\, p(x_{nj} \mid y_{ni}, t)^{(k)}}{\sum_{i'=0}^{|y_n|} p(i' \mid j, |y_n|, t)^{(k)}\, p(x_{nj} \mid y_{ni'}, t)^{(k)}}    (26)
Note that Eq. (26) is just a component-conditional version of Eq. (11).
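In code, the mixture E step combines the component posteriors of Eq. (24) with the component-conditional alignment posteriors of Eq. (26). A sketch for one sentence pair, again reusing the hypothetical helpers defined earlier:

```python
def mixture_e_step(x, y, priors, lex_t, align_t):
    """Return (z, a): z[t] follows Eq. (24) and a[t] holds the per-component
    alignment responsibilities of Eq. (26) for every source position."""
    T = len(priors)
    # Eq. (24): posterior probability of each component given (x, y).
    joint = [priors[t] * m2_likelihood(x, y, lex_t[t], align_t[t]) for t in range(T)]
    norm = sum(joint)
    z = [jt / norm if norm > 0 else 1.0 / T for jt in joint]
    # Eq. (26): component-conditional version of the E step of Eq. (11).
    a = [e_step_responsibilities(x, y, lex_t[t], align_t[t]) for t in range(T)]
    return z, a
```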
The M step now includes an updating rule for the mixture coefficients,
    p(t)^{(k+1)} = \frac{1}{N} \sum_{n=1}^{N} z_{nt}^{(k)} \qquad \forall t    (27)
and component-conditional versions of Eqs. (12) and (13):

    p(i \mid j, |y|, t)^{(k+1)} =
        \frac{\sum_{\substack{n=1 \\ j \le |x_n|,\; |y_n| = |y|}}^{N} z_{nt}^{(k)}\, a_{njit}^{(k)}}
             {\sum_{i'=0}^{|y|} \sum_{\substack{n=1 \\ j \le |x_n|,\; |y_n| = |y|}}^{N} z_{nt}^{(k)}\, a_{nji't}^{(k)}}
        \qquad \forall t, i, j \text{ and } |y|    (28)
and
    p(u \mid v, t)^{(k+1)} =
        \frac{\sum_{n=1}^{N} \sum_{\substack{j=1 \\ x_{nj} = u}}^{|x_n|} \sum_{\substack{i=0 \\ y_{ni} = v}}^{|y_n|} z_{nt}^{(k)}\, a_{njit}^{(k)}}
             {\sum_{u' \in X} \sum_{n=1}^{N} \sum_{\substack{j=1 \\ x_{nj} = u'}}^{|x_n|} \sum_{\substack{i=0 \\ y_{ni} = v}}^{|y_n|} z_{nt}^{(k)}\, a_{njit}^{(k)}}
        \qquad \forall t, u \text{ and } v.    (29)
The initialisation technique for the M2 model can be easily extended to the mixture
case; i.e. by using a solution from a simpler mixture of IBM1 models.
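A sketch of the corresponding M step, accumulating the weighted counts z_nt^(k) a_njit^(k) of Eqs. (27)-(29) over the corpus with the hypothetical helpers above:

```python
from collections import defaultdict

def mixture_m_step(corpus, priors, lex_t, align_t):
    """One M step for the M2 mixture model (Eqs. (27)-(29))."""
    T = len(priors)
    new_priors = [0.0] * T
    lex_counts = [defaultdict(float) for _ in range(T)]
    lex_totals = [defaultdict(float) for _ in range(T)]
    al_counts = [defaultdict(float) for _ in range(T)]
    al_totals = [defaultdict(float) for _ in range(T)]
    for x, y in corpus:
        z, a = mixture_e_step(x, y, priors, lex_t, align_t)
        y0 = ["<NULL>"] + y
        for t in range(T):
            new_priors[t] += z[t]                     # numerator of Eq. (27)
            for j, u in enumerate(x, start=1):
                for i, v in enumerate(y0):
                    c = z[t] * a[t][j][i]             # expected count (z_nt a_nji)^(k)
                    lex_counts[t][(u, v)] += c
                    lex_totals[t][v] += c
                    al_counts[t][(i, j, len(y))] += c
                    al_totals[t][(j, len(y))] += c
    new_priors = [p / len(corpus) for p in new_priors]                      # Eq. (27)
    new_lex_t = [{(u, v): c / lex_totals[t][v] for (u, v), c in lex_counts[t].items()}
                 for t in range(T)]                                         # Eq. (29)
    new_align_t = [{(i, j, J): c / al_totals[t][(j, J)] for (i, j, J), c in al_counts[t].items()}
                   for t in range(T)]                                       # Eq. (28)
    return new_priors, new_lex_t, new_align_t
```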
3.2 Viterbi Alignment
In Eq. (1), we introduced the concept of alignment as an assignment between source
and target words, more precisely between source and target positions. However, this
alignment information was missing in the translation process, and we had to marginalise
over all possible values of the alignment variable.
In practice, we are interested in the most probable alignment, also known as the
Viterbi alignment,

    \hat{a} = \operatorname*{argmax}_{a}\; p(x, a \mid y; \Theta).    (30)
Assuming a conventional M2 model, Eq. (30) can be trivially maximised:

    \hat{a} = \operatorname*{argmax}_{a} \prod_{j=1}^{|x|} \max_{a_j}\; p(a_j \mid j, |y|)\, p(x_j \mid y_{a_j}).    (31)

In other words, the Viterbi alignment for the M2 model is computed as a local
maximisation for each source position, with asymptotic cost O(|x| · |y|).
Nevertheless, the computation of the Viterbi alignment for the M2 mixture model is
approximated by maximising over the components in the mixture,
    \hat{a} \approx \operatorname*{argmax}_{a} \max_{t=1,\ldots,T}\; p(t) \prod_{j=1}^{|x|} \max_{a_j}\; p(a_j \mid j, |y|, t)\, p(x_j \mid y_{a_j}, t)    (32)

with asymptotic cost O(T · |x| · |y|).
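The approximation of Eq. (32) can be sketched as follows; with T = 1 it reduces to the exact M2 Viterbi alignment of Eq. (31). Parameter tables and names are the same illustrative assumptions as before.

```python
def viterbi_alignment(x, y, priors, lex_t, align_t):
    """Eq. (32): approximate Viterbi alignment, maximising over mixture components."""
    y0 = ["<NULL>"] + y
    best_score, best_alignment = -1.0, None
    for t, p_t in enumerate(priors):
        score, alignment = p_t, []
        for j, u in enumerate(x, start=1):
            # Local maximisation over target positions for this component.
            contrib = [align_t[t].get((i, j, len(y)), 0.0) * lex_t[t].get((u, y0[i]), 0.0)
                       for i in range(len(y0))]
            i_best = max(range(len(y0)), key=lambda i: contrib[i])
            score *= contrib[i_best]
            alignment.append(i_best)
        if score > best_score:
            best_score, best_alignment = score, alignment
    return best_alignment            # best_alignment[j-1] = a_j, with 0 denoting NULL
```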
4 Evaluation Metrics
Word alignment is considered to be a complex and ambiguous task [18], and therefore
we need an annotation scheme that allows ambiguous alignments to be defined. The
experts conducting the annotation process are permitted to use two types of alignments:
S (sure) and P (probable), such that S ⊆ P. Both of them may contain many-to-one
and one-to-many relationships. P alignments are especially useful in cases like idiomatic
expressions, free translations and missing function words.
Given a Viterbi alignment A defined as

    A = \{ (j, a_j) \mid 1 \le a_j \le |y| \} \qquad \forall j : 1 \le j \le |x|    (33)

where the NULL alignments have been intentionally left out of the evaluation, precision
and recall measures can be computed as

    \text{recall} = \frac{|A \cap S|}{|S|}, \qquad \text{precision} = \frac{|A \cap P|}{|A|}    (34)
as well as the alignment error rate (AER) [9] that is related to the well-known F-measure,

    \text{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}    (35)
These definitions of precision, recall and AER are based on the assumption that a
recall error can occur only if an S alignment is not found and a precision error can occur
only if the found alignment is not even P .
AER has been widely used in the scientific community to evaluate word alignment
quality until very recently [9–13,19]. However, in [20], Fraser and Marcu claim that
AER, though derived from the F-measure, does not penalise unbalanced precision and
recall when S ⊊ P. As a result, AER is poorly correlated with translation quality, as
previously reported in [21]. For this reason, they suggest using an α-optimised F-measure
that controls the contribution of precision and recall,

    \text{F-measure}(\alpha) = \frac{\text{precision} \cdot \text{recall}}{\alpha \cdot \text{recall} + (1 - \alpha) \cdot \text{precision}}    (36)

so that the metric is highly correlated with SMT performance.
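These metrics are straightforward to compute once A, S and P are represented as sets of (source position, target position) pairs; the following sketch (an illustration, not the shared-task evaluation script) implements Eqs. (34)-(36).

```python
def alignment_metrics(A, S, P, alpha=0.2):
    """Precision and recall (Eq. (34)), AER (Eq. (35)) and F-measure(alpha) (Eq. (36)).
    A, S and P are sets of (j, i) pairs, with S a subset of P."""
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    f = (precision * recall) / (alpha * recall + (1.0 - alpha) * precision)
    return precision, recall, aer, f
```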
5 Corpora
The corpus employed in the experiments was the French-English Hansard task consist-
ing of the debates of the Canadian parliament. This corpus is one of the resources that
were used during the word alignment shared task organised during the HLT/NAACL
2003 workshop on “Building and Using Parallel Texts” [22].
The independent test set is that defined in [23], which was manually labelled by two
annotators. Each annotator comes up with an S and a P alignment set. The S alignment
sets from each annotator are intersected to define the reference S alignment set, while
the reference P alignment set is the result of the union of the P alignment sets from
both annotators. The definition of the S and P alignment sets in this way guarantees
an alignment error rate of zero percent when we compare the S alignments of each
annotator with the reference alignment. The corpus statistics are shown in Table 1.
Table 1. Statistics on the French-English Hansard task (K denotes ×10^3, and M denotes ×10^6).

                      Training set       Trial set        Test set
                      Fr      En         Fr      En       Fr      En
    sentence pairs        1.1M               37                447
    average length    20      17         19      17       17      15
    vocabulary size   87K     68K        0.3K    0.3K     1.9K    1.7K
    running words     24M     20M        0.7K    0.7K     7.8K    7.0K
    singletons        27K     20K        0.3K    0.2K     1.3K    1.1K
6 Experimental Results
The objective of these experiments is to study the evolution of AER and α-optimised F-
measure on the Hansard task as a function of the number of components in the M2 mix-
ture model. The results obtained with the GIZA++ toolkit are included as a sanity check.
Smoothing parameters were manually tuned on the trial partition to minimise AER.
Table 2 presents AER figures on the test partition for the M2 mixture model. Each
number in Table 2 is an average over values obtained from 10 randomised initialisations,
which are used to estimate confidence intervals computed as twice the standard deviation.
These experiments were performed for both directions, English-French (En-Fr) and
French-English (Fr-En), varying the number of components in the mixture model
(T = 1, 2, 3). Experiments beyond 3 components per mixture were not run because of
memory requirements.
Table 2. AER figures on the test partition of the Hansard corpus for the M2 mixture model,
varying the number of components in the mixture (T = 1, 2, 3), and the conventional M2 model
implemented in the GIZA++ toolkit.

    AER     GIZA++   T = 1   T = 2        T = 3
    Fr-En   20.0     19.6    19.0 ± 0.1   18.8 ± 0.1
    En-Fr   18.3     17.6    17.2 ± 0.1   16.8 ± 0.1

Table 3. F-measure (α = 0.2) figures on the test partition of the Hansard corpus for the M2
mixture model, varying the number of components in the mixture (T = 1, 2, 3), and the
conventional M2 model implemented in the GIZA++ toolkit.

    F-measure   GIZA++   T = 1   T = 2        T = 3
    Fr-En       85.5     86.1    86.6 ± 0.2   86.8 ± 0.1
    En-Fr       85.8     86.6    87.1 ± 0.1   87.4 ± 0.1
The number of iterations per model was 1^5 2^5 for the M2 mixture model, i.e., five
iterations of the M1 mixture initialisation followed by five iterations of the M2 mixture.
Viterbi alignments were calculated according to Eq. (32).
In Table 2, there is a statistically significant improvement when we go from the con-
ventional single-component M2 model to the multiple-component M2 mixture model
for both language directions. Moreover, the decrease in AER on the English-French di-
rection from two to three components is also statistically significant.
To have a broader view of the benefits and properties of the models in question,
we decided to carry out an evaluation in terms of α-optimised F-measure shown in
Table 3. According to [20], and being aware of the differences between our work and
that presented in [20], we set α = 0.2 in order to compute the corresponding F-measure,
which would be fairly correlated with phrase-based SMT performance.
Similarly to the AER results in Table 2, the computed F-measure shows that there
is a significant improvement when we compare the conventional M2 model to the
multiple-component M2 mixture model. However, the small difference between two
and three components in terms of AER is diminished in the evaluation with F-measure.
In any case, the figures in Table 3 suggest an improvement in translation quality if we
train a phrase-based SMT system with the Viterbi alignments of the
multiple-component M2 mixture model, instead of the conventional M2 model. This
hypothesis has to be corroborated with translation experiments on the Hansard corpus.
7 Conclusions and Future Work
In this paper, we have revisited the M2 mixture model to perform an alternative eval-
uation based on Viterbi alignment quality. AER and F-measure results reported on a
large-scale shared task, the Hansard corpus, reveal statistically significant improvements
of the multiple-component M2 mixture model over the conventional M2 model.
These encouraging results call for further evaluation of the M2 mixture model. This
would entail training a phrase-based SMT system using the word alignments supplied
by the M2 mixture model. For this purpose, we
can employ the publicly available Moses toolkit [24], which implements a state-of-the-
art phrase-based SMT system, and study the evolution of the translation quality of the
resulting system as a function of the number of components in the M2 mixture model.
These results would corroborate the relation between alignment quality and translation
quality, thereby demonstrating the appropriateness of finite mixture modelling in SMT.
Alternatively, it would be interesting to develop mixture extensions of higher-order IBM
models, such as Models 4 and 5, or the log-linear Model 6 [9], to fairly assess the
contribution of mixture modelling to state-of-the-art alignment results.
References
1. McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley (2000)
2. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via
the EM algorithm (with discussion). Journal of the Royal Stat. Society B 39 (1977) 1–38
3. Wu, C.: On the convergence properties of the EM algorithm. The Annals of Statistics 11
(1983) 95–103
4. Civera, J., Juan, A.: Mixtures of IBM Model 2. In: Proc. of EAMT’06. (2006) 159–167
5. Zhao, B., Xing, E.P.: BiTAM: Bilingual Topic AdMixture Models for Word Alignment. In:
Proc. of COLING/ACL’06. (2006)
6. Civera, J., Juan, A.: Domain adaptation in statistical machine translation with mixture mod-
elling. In: Proc. of the 2nd Workshop in Statistical Machine Translation. (2007) 177–180
7. Brown, et al.: A Statistical Approach to Machine Translation. Comp.Ling. 16 (1990) 79–85
8. Brown, et al.: The Mathematics of Statistical Machine Translation: Parameter Estimation.
Comp.Ling. 19 (1993) 263–311
9. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models.
Comp.Ling. 29 (2003) 19–51
10. Mihalcea, R., Pedersen, T.: An evaluation exercise for word alignment. In: Proc. of the
HLT-NAACL’03: Workshop on Building and using parallel texts. (2003) 1–10
11. Taskar, B., Lacoste-Julien, S., Klein, D.: A discriminative matching approach to word align-
ment. In: Proc. of HLT’05. (2005) 73–80
12. Blunsom, P., Cohn, T.: Discriminative word alignment with conditional random fields. In:
Proc. of ACL ’06. (2006) 65–72
13. Zhao, B., Vogel, S.: Word alignment based on bilingual bracketing. In: Proc. of the HLT-
NAACL’03: Workshop on Building and using parallel texts. (2003) 15–18
14. Koehn, P., et al.: Statistical phrase-based translation. In: Proc. of NAACL’03. (2003) 48–54
15. Och, F.J., Ney, H.: The alignment template approach to statistical machine translation.
Comp.Ling. 30 (2004) 417–449
16. Chiang, D.: Hierarchical phrase-based translation. Comp.Ling. 33 (2007) 201–228
17. Callison-Burch, C., et al.: (meta-) evaluation of machine translation. In: Proc. of the Second
Workshop on Statistical Machine Translation, Prague, Czech Republic (2007) 136–158
18. Melamed, I.: Manual annotation of translational equivalence: The blinker project. Technical
Report 98-07, Institute for Research in Cognitive Science (1998)
19. Fraser, A., Marcu, D.: Getting the structure right for word alignment: LEAF. In: Proc. of
EMNLP-CoNLL’07. (2007) 51–60
20. Fraser, A., Marcu, D.: Measuring word alignment quality for statistical machine translation.
Comp.Ling. 33 (2007) 293–303
21. Ayan, N.F., Dorr, B.J.: Going beyond AER: an extensive analysis of word alignments and
their impact on MT. In: Proc. of COLING/ACL'06. (2006) 9–16
22. Germann, U.: Aligned Hansards of the 36th Parliament of Canada.
http://www.isi.edu/natural-language/download/hansard/index.html (2001)
23. Och, F., Ney, H.: Improved statistical alignment models. In: Proc. of ACL. (2000) 440–447
24. Koehn, P., et al.: Moses: Open source toolkit for statistical machine translation. In: Proc. of
ACL’07: Demo and Poster Sessions. (2007) 177–180