English-Turkish Parallel Treebank with Morphological Annotations and
its Use in Tree-based SMT
Onur Görgün¹, Olcay Taner Yıldız², Ercan Solak² and Razieh Ehsani²
¹Alcatel-Lucent Teletaş Telekomünikasyon A.Ş., İstanbul, Turkey
²Department of Computer Engineering, Işık University, İstanbul, Turkey
Keywords:
Machine Translation, Tree-based Translation.
Abstract:
In this paper, we report our tree-based statistical translation study from English to Turkish. We describe
our data generation process and report the initial results of tree-based translation under a simple model. For
corpus construction, we used the Penn Treebank on the English side. We manually translated about 5K trees
from English to Turkish under grammar constraints, with adaptations to accommodate the agglutinative nature
of Turkish morphology. We used a permutation model for subtrees together with a word-to-word mapping. We
report BLEU scores under simple choices of inference algorithms.
1 INTRODUCTION
Statistical machine translation has come a long way
in recent years. After the IBM models' superior performance
over classical rule-based approaches, machine
translation research has flourished with statistical
models of increasing complexity. With the increase
of computational power and the availability of
parallel corpora, researchers switched from manually
crafted linguistic models to empirically learned
statistical models, and from string/word-based models
to tree/phrase-based models (Jurafsky and Martin,
2009).
Early approaches (Hutchinson, 1994) in statistical
machine translation were word-based models. These
models treat words as the main translation unit
and map the source word(s) into target word(s) by
inserting, deleting, and reordering. More formally,
source word(s) are aligned to their corresponding
word(s) in the target language. However, since word-based models take words as the
basic translation units, the alignment produced by these
models sometimes does not reflect the real alignment.
For languages with different fertilities, like English-Turkish,
some words are left unaligned. For example,
the Turkish word "gideceğim" (I will go) should be translated
as a verbal phrase. However, its English match
is usually extracted as "go" by word-based models.
Hence, the translation does not reflect the real meaning
of the Turkish phrase.
Phrase-based models and their derivatives have been
used as machine translation models for the English-Turkish
language pair. (El-Kahlout and Oflazer,
2006) proposed the first approach to statistical machine
translation between English and Turkish. They improved
the BLEU score from the baseline of 7.52 to 9.13
with 23K sentences. Continuing these efforts, (Yeniterzi
and Oflazer, 2010) offered a factored phrase-based
translation model. In their approach, they define custom
complex syntactic tags on the English side and input
the augmented sentence into a phrase-based translation
system. In another work, (El-Kahlout, 2009)
investigated the effects of different sub-lexical representational
structures, obtaining a BLEU score of
25 with nearly 20K sentences.
Integration of syntactic structure into machine
translation models is known to improve performance.
In particular, tree-based models are good
at incorporating the recursive structure of language
and offer better alignment results compared to phrase-based
models (Koehn, 2010). Ordering problems are
more difficult in language pairs like English and Turkish,
where the unmarked constituent orders differ and the
latter has many scrambling options.
In this paper, we study a tree-based approach
to English-to-Turkish translation. Our contributions
are twofold: (i) we manually translate and annotate,
syntactically and morphologically, over 5K sentences
from the Penn Treebank (Marcus et al., 1993), and (ii) we
propose a tree-based translation approach with two phases. In the first
phase, we obtain the best permutation of the English
parse tree by permuting the subtrees at each internal node, obtaining
a Turkish tree up to the English leaves. In the second
phase, we replace the leaves of the permuted tree with
the appropriate tokens from a word-for-word model.
Although this is the first tree-based approach to English-Turkish
translation, the results are promising.

Görgün, O., Yıldız, O., Solak, E. and Ehsani, R.
English-Turkish Parallel Treebank with Morphological Annotations and its Use in Tree-based SMT.
DOI: 10.5220/0005653905100516
In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2016), pages 510-516
ISBN: 978-989-758-173-1
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
The paper is organized as follows. In Section 2,
we give the details of our data preparation steps. We
give the training phase of our tree-based algorithm in
Section 3, and translation using that trained model in
Section 3.3. We give the experimental results in Sec-
tion 4 and conclude in Section 5.
2 DATA PREPARATION
The Penn Treebank (Marcus et al., 1993) contains
more than 40K syntactically annotated sentences in
its WSJ section. We used a subset of about 5K of
those sentences. Each sentence contains at most 15
tokens, including punctuation.
The tag set in Penn Treebank II is quite exten-
sive. Besides POS and phrase tags, the Treebank
also includes markings to identify semantic functions,
subject-predicate dependencies and movement traces.
At this preliminary stage of our work, we dropped all
of these extra markings and used only the bare tags.
For example, for the tag NP-SBJ-1, we use only NP.
In our translation application, a parse tree of an
English sentence is translated into Turkish through the
repeated application of subtree permutation and label
replacement. We confine the label replacement to leaf
nodes. In training, the model parameters are tuned
over a set of parallel trees. In translating a given En-
glish parse tree, we go over its every internal node and
apply the most probable permutation to its children.
For leaf nodes, we use a count-based word-for-word
replacement of English words with Turkish stems or
meta morphemes. The final surface form of the trans-
lated Turkish sentence is generated by applying mor-
phophonemic constraints to meta morphemes.
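As a rough illustration of this last step, meta morphemes use capital letters for segments resolved by morphophonemic constraints. The sketch below is our own minimal example, not the paper's generator: it handles only the two-way a/e vowel harmony for the meta-vowel A (as in the ablative "-dAn") and ignores four-way I-harmony and consonant assimilation.

```python
# Resolve the meta-vowel A in a meta morpheme such as "-dAn" (ablative)
# against the last vowel of the stem, per Turkish two-way vowel harmony:
# back vowels (a, ı, o, u) select "a"; front vowels (e, i, ö, ü) select "e".
# This sketch ignores four-way I-harmony and consonant assimilation.
BACK, FRONT = set("aıou"), set("eiöü")

def attach(stem, meta):
    last_vowel = next(v for v in reversed(stem) if v in BACK | FRONT)
    vowel = "a" if last_vowel in BACK else "e"
    return stem + meta.lstrip("-").replace("A", vowel)

print(attach("ev", "-dAn"))    # evden ("from the house")
print(attach("okul", "-dAn"))  # okuldan ("from the school")
```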
In constructing our corpus, we manually per-
formed the subtree permutation and leaf replacement
of the sentences in our restricted Penn Treebank cor-
pus. We divided the data generation into three
phases. Below, we describe the data in each phase
and its relation to our translation task.
2.1 Phase 0
Phase 0 data consists of the original English parse
trees from the Penn Treebank II. These English trees are
used to obtain the translated trees in the following phases,
using the transformation heuristics described below.
2.2 Phase 1
In this phase, subtrees are manually permuted and
the leaf tokens are manually replaced with Turkish
glosses. In doing so, the translators obeyed the fol-
lowing heuristics as closely as possible.
1. The majority of Turkish sentences have SOV order.
We tried to permute subtrees to reflect this tendency.
2. Turkish is a head-final language. To reflect
this restriction, we permuted the constituents in
noun, prepositional, adjective and adverb phrases.
3. We permuted modals, particles and verb tense
markers to align with the order of corresponding
Turkish morphemes as dictated by the morphotac-
tical rules.
4. We use only whole words in the leaves. We did
not allow any dangling morphemes.
5. When we embedded a functional word in the En-
glish leaf as a morpheme in the stem of a Turkish
leaf, we replaced the English leaf with *NONE*.
6. In Turkish there is no determiner corresponding
to “the” in English. We replaced “the” with
*NONE*.
7. We usually replaced prepositions with *NONE*
and attached their corresponding case to the nom-
inal head of the Turkish phrase.
8. In Turkish grammar, number agreement between
the verb and the subject is somewhat relaxed. So,
we let the translators deviate from the literal gloss
whenever it seemed natural to do so.
9. Verb tenses do not usually form exactly match-
ing categories across pairs of languages. We re-
placed English verb tenses with their closest se-
mantic classes in Turkish.
10. When we translated the English tree for a ques-
tion sentence, we replaced its wh- constituent with
*NONE* and replaced its trace with the appropri-
ate question pronoun in Turkish. This reflects the
question formation process in Turkish, where any
constituent can be questioned by simply replac-
ing it with a suitable question pronoun and case
inflection.
11. We translated proper nouns as their common
Turkish gloss if there is one, and kept the original
otherwise.
12. We replaced subordinating conjunctions, marked
as “IN” in English sentences, with *NONE* and
appended the appropriate participle morpheme to
the stem in the Turkish translation.
13. As it often happens when translating between
languages from different families, a multiword
expression in Turkish may correspond to a sin-
gle English word. Conversely, more than one
word in English may correspond to a single
word in Turkish. In the latter case, we replaced
some of the English words in the expression with
*NONE*.
The pairs of English and Turkish parallel trees in
Figure 1 illustrate some of the heuristics above.
At the end of Phase 1, two sets of trees are gener-
ated. The first set has only the permuted trees where
the leaves are still English tokens. We call this data
set Phase-1-English. The second set has either the
leaves replaced with Turkish glosses or the *NONE*
tag. We call this data set Phase-1-Turkish. Having
two such sets of data proved useful in pinning down
sources of errors in the final translation sentence. The
tree generated in Phase-1-English phase for the trees
in Figure 1 is given in Figure 2.
Figure 1: Phase-1-English tree for the parallel pair in Figure 2.
2.3 Phase 2
Many constituents in English sentences are translated
into inflectional morphemes in corresponding Turkish
words. Thus, any task of automatic translation to or
from Turkish needs to use morphological structure of
Turkish words.
In the Phase 2 of our data generation, the human
annotators generated the morphological analyses of
the Turkish words at the leaf nodes of the trees gen-
erated in Phase 1. In order to speed up the annotation
and reduce the number of errors, we built a FST based
morphological analyzer for this step. The analyzer is
embedded within our annotation tool. When the an-
notator selects a Turkish word, the tool displays all
the analyses of the word and the annotator selects the
correct analysis. Thus, the analysis is automatic and
the disambiguation is manual.
Figure 4 illustrates the morphological analysis of
the sentence

(1) Stewart & Stevenson dizel ve gaz türbinleri
ile çalıştırılan ekipman üretir.
Stewart & Stevenson makes equipment powered
with diesel and gas turbines.
2.4 Phase 3
Inflectional morphemes of Turkish words usually cor-
respond to functional words in English sentences. For
example, the following pair of English sentence and
its translation has the morpheme-word pairs given be-
low.
(2) İş-e git-me-yecek-sin.
work.DAT go.NEG.FUT.2SG
You will not go to work.
Thus, a natural extension of permute-and-replace
translation is to allow the English functional words to
be replaced by suitable morphemes in Turkish. In this
phase, we detach the individual morphological tags of
the words analyzed in Phase 2. Then, the annotators
move these tags to the suitable empty slots identified
by the tag *NONE* at the leaves. The tag movement
respects the morphotactics of Turkish. In most cases,
the order of morphemes in the Turkish word matches
the order of leaves vacated by the English functional
words in the permuted tree. Figure 3 illustrates the
process.
In Phase 3, we generate two sets of data. The first
one has the tags identifying the morphemes. In the
second one, the tags are replaced by their canonical
forms. For example, the tag -ABL is replaced by -dAn,
the tag -NEG is replaced by -mA, and so on. We
identify the latter set of trees as Phase-3-Canonical.
We used canonical forms instead of symbols because
the human annotators found working with canonical
morphemes more intuitive in moving them around.
Moreover, some tags correspond to null morphemes
and do not have visible surface forms.
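The replacement of tags by canonical forms is essentially a table lookup. The sketch below is illustrative only: it includes just the two mappings named in the text (-ABL to -dAn, -NEG to -mA); any further entries would be assumptions.

```python
# Canonical forms for morphological tags (Phase-3-Canonical data).
# Only -ABL and -NEG are given in the text; treat this dict as a stub.
CANONICAL = {"-ABL": "-dAn", "-NEG": "-mA"}

def to_canonical(tag):
    # Unknown tags are returned unchanged for the annotator to resolve.
    return CANONICAL.get(tag, tag)

print(to_canonical("-ABL"))  # -dAn
```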
Although we have described the three phases as if
they were completely independent, in practice we saw
that the human annotators usually go back and forth
among the phases and use feedback across them.
For example, in trying to move canonical morphemes
in Phase 3, human annotators sometimes encounter
Figure 2: A pair of trees illustrating the heuristics given in 1, 2, 5, 6, 7, 8 and 11.
Figure 3: Movement of morphemes to empty slots in Phase 3.
difficulties. In some cases, they end up with mor-
phemes for which they can not find empty *NONE*
slots where they can move them. In yet other cases,
they are left with too many *NONE* slots which they
can not fill with suitable morphemes. Often, they re-
alize that they had made a mistake when choosing a
morphologically complex gloss for a phrase in Phase
1. Therefore, they go back and modify the translation
in Phase 1.
3 APPLICATION TO
TREE-BASED SMT
In order to demonstrate the use of our morphologi-
cally annotated parallel treebank in SMT, we trained
and tested a model that learns to imitate the permu-
tation and leaf replacement operations performed by
human annotators in data generation. Our simple
model consists of a distribution over the permutations
for the children of an internal tree node and a distribu-
tion of Turkish words and morphemes for an English
word. Next we describe each training step in detail.
3.1 Permutation Training
Comparing parallel trees between Phase 0 and Phase-1-English,
we keep counts of the permutations for
each production rule in Phase 0. For example, comparing
the trees in Figure 1, we identify the
non-trivial permutations listed in Table 1.
At the end of training, each rule is associated with
a set of permutations and their counts. For each permutation
π_i, we calculate its probability as

p_{π_i} = c_{π_i} / Σ_i c_{π_i},

where c_{π_i} denotes the number of times the permutation
π_i is observed for the particular rule.
Figure 4: Morphological analyses of a permuted and leaf-replaced tree.
Table 1: Permutations for trees in Figure 1.

rule                      permutation
VP → VBD NP PP-LOC        (2,1,0)
PP-LOC → IN NP            (1,0)
S → NP-SBJ NP-TMP VP      (0,1,2)
NP-SBJ → DT NN NN NN      (0,1,2,3)
NP-TMP → JJ NN            (0,1)
NP → CD NNS               (0,1)
Going over all the pairs of trees in Phase 1 data,
we calculate the probabilities of all permutations of
all the rules.
In testing, given a candidate permutation of an English
parse tree, we calculate its probability as the
product of the probabilities for the permutations in all
its subtrees. For example, the probability of the overall
permutation from the English tree to its Turkish
translation in Figure 1 is the product of the probabilities
of the permutations for the rules in Table 1.
3.2 Replacement Training
In order to assign probabilities to possible replace-
ments of leaves in an English tree, we use a simple
relative count of occurrences. So, for Turkish word
t, and an English word e, the probability of t being a
replacement of e is calculated as
p_e(t) = c_e(t) / c_e,

where c_e(t) denotes the number of times t replaces
e in the training data and c_e denotes the total count of
e.
As training data, we use the leaves of the paral-
lel trees in Phase-1-English and Phase-3-Canonical.
So, not only do we generate probabilities for replace-
ments between whole words, we also generate prob-
abilities for replacing English function words with
their canonical Turkish morphemes.
In order to calculate the probability of the replace-
ment of the whole English sentence with a Turkish
sentence of the same length, we simply multiply the
probability of the individual replacements. Note that
deleting an English word corresponds to replacing it
with *NONE*. This replacement has a probability as
well.
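The replacement model above can be sketched directly from its definition. The observed pairs below are invented for illustration; in the paper the counts come from the leaves of the Phase-1-English and Phase-3-Canonical parallel trees.

```python
from collections import Counter, defaultdict

# Relative-count model: p_e(t) = c_e(t) / c_e. Deleting an English word
# is an ordinary replacement whose target is the token *NONE*.
repl_counts = defaultdict(Counter)

def observe(e, t):
    repl_counts[e][t] += 1

def repl_prob(e, t):
    total = sum(repl_counts[e].values())
    return repl_counts[e][t] / total if total else 0.0

def sentence_prob(pairs):
    # Probability of replacing a whole scrambled English sentence:
    # the product of the individual leaf-replacement probabilities.
    p = 1.0
    for e, t in pairs:
        p *= repl_prob(e, t)
    return p

# Illustrative counts (not from the paper's corpus):
observe("the", "*NONE*")
observe("the", "*NONE*")
observe("the", "bu")
observe("cars", "araba")
print(sentence_prob([("the", "*NONE*"), ("cars", "araba")]))  # 2/3
```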
3.3 Combined Tree Translator
Once we have the probabilities of permutations and
replacements, we can calculate the probability of a
translation so obtained by just multiplying the two
probabilities. Thus, for every translation candidate,
we have an associated probability. We could just out-
put the translation with the highest probability as the
designated translation among all possible translations.
However, generating all the translations is compu-
tationally expensive. Moreover, it ignores the con-
straints of language model statistics and morphotac-
tics of the Turkish word formation. In this work, we
ignore the language model mainly for the lack of a
Turkish LM that incorporates statistics with morpho-
tactics. Of course, using such model in ranking trans-
lation candidates would improve the results. We use
our FST based morphological analyzer to eliminate
impossible morpheme combinations.
For the initial exploration, we used a simplified
search for the best translation. We first choose the
most probable permuted tree and by reading off its
leaves, we obtain a scrambled English sentence. Second,
we trace this sentence left to right and replace
words with Turkish stems and morphemes. For the
initial word, we keep the list of its N most probable replacements.
For the full list of candidate replacements
of the second word, we rank their combinations with the
N candidates of the first word according to the product
of their probabilities. We prune the new product
list by keeping the first N combinations. Continuing
in this way, we maintain an N-best list of candidate partial
translations.
Some of the candidate replacements are mor-
phemes that need to combine with the previous stem
or morpheme. So, while going over the candidate re-
placements, whenever we encounter a non-morpheme
gloss, we check the morphotactical consistency of the
partial translation up to but not including the current
word. If it fails, we discard the candidate. For example,
suppose a partial translation candidate up to the
current candidate is "gel -PAST -NEG" and a candidate
for the current word is the non-morpheme word
"şimdi". We then check the consistency of "gel
-PAST -NEG" and, seeing that it fails, drop the
candidate "şimdi".
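The pruned search with the morphotactic filter can be sketched as below. This is our own minimal reconstruction: `is_consistent` is a hypothetical stand-in for the paper's FST-based analyzer that encodes just one real Turkish morphotactic fact (negation -NEG must precede the past-tense marker -PAST), and the candidate lists are invented for the example.

```python
def is_consistent(partial):
    # Stand-in for the FST morphotactic check: reject "... -PAST -NEG",
    # since negation must precede the past-tense marker in Turkish.
    tags = [w for w in partial if w.startswith("-")]
    if "-PAST" in tags and "-NEG" in tags:
        return tags.index("-NEG") < tags.index("-PAST")
    return True

def translate(scrambled, candidates, n=5):
    """candidates maps an English word to (gloss, prob) pairs; glosses
    starting with '-' are morphemes that attach to the previous stem."""
    beam = [((), 1.0)]  # N-best list of (partial translation, prob)
    for word in scrambled:
        expanded = []
        for partial, p in beam:
            for gloss, q in candidates.get(word, [("*NONE*", 1.0)]):
                # On a non-morpheme gloss, check the partial translation
                # up to but not including the current word.
                if not gloss.startswith("-") and not is_consistent(partial):
                    continue
                expanded.append((partial + (gloss,), p * q))
        beam = sorted(expanded, key=lambda x: -x[1])[:n]
    return beam

cands = {"come": [("gel", 1.0)], "did": [("-PAST", 1.0)],
         "not": [("-NEG", 1.0)], "now": [("şimdi", 1.0)]}
# "gel -PAST -NEG" fails when the non-morpheme "şimdi" is considered:
print(translate(["come", "did", "not", "now"], cands))  # []
print(translate(["come", "not", "did", "now"], cands))  # one candidate survives
```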
4 EXPERIMENTS
We tested our translation approach using 10-fold
cross-validation. We obtained BLEU scores for dif-
ferent N-best list sizes. Our corpus contains 5143
sentences. For each run, we use a random 90% of
them for training and the rest for testing. The mean
scores and standard deviations for different list sizes
are given in Table 2.
Table 2: BLEU scores for the best tree in the first step and
different list sizes N in the second step.

N      1          5          10         50
BLEU   9.5±0.4    11.5±0.2   12.1±0.3   12.8±0.5
In the second set of experiments, we try to isolate
the performance of the second step from the effect
of the first step, namely, the tree permutation. To
remove the effect of the first step, we start the
second step of the translation with the optimal tree obtained
in Phase-1-English (Section 2.2). In this way, if
the second step were optimal, we would obtain a BLEU
score of 100.
Table 3 shows BLEU scores for the overall trans-
lation when the optimal tree is used. The pattern of
results with respect to N is similar to that in Table 2. On the
other hand, the difference between the optimal tree
and the best tree is not large. It seems that we need
to improve the leaf-replacement step of our two-step
algorithm more than its tree-permutation step.
Table 3: BLEU scores for the optimal tree in the first step
and different list sizes N in the second step.

N      1          5          10         50
BLEU   10.8±0.1   13.7±0.3   14.5±0.2   16.3±0.5
In the third set of experiments, we compared our
translations with Google Translate. For each sentence
in the English test corpus, we compared its Google
translation with our correct Turkish translation. We
obtained a BLEU score of 11.6 ± 1.0. Despite having
a larger variance, this score falls within the
N = 5-10 range of our results. Of course, Google Translate
is a phrase-based engine that works with a general
lexicon, so the comparison might be a bit too
coarse. Still, we find it encouraging that even our simple
translation approach with our rather limited corpus
performs similarly to a popular translation engine.
5 CONCLUSION
In this paper, we gave the details of our efforts in
constructing a Turkish-English parallel treebank and
used the resulting data in a simple SMT algorithm.
The BLEU scores for this initial attempt are, not surprisingly,
low, but compared to a general phrase-based
tool like Google Translate, our approach fares comparatively well.
A limitation of our parallel treebank construction
is that it does not allow the detachment and attachment
of subtrees. Even a literal translation is
sometimes severely hindered under such a constraint.
However, in our corpus, the overall feel of the sentence
does not suffer much beyond sounding a bit too
formal. Of course, a general tree-based approach
would have to construct the constituent trees independently
in the two languages and then train a learner to
transform one tree into the other. Unfortunately, there
is no usable constituent parser for Turkish. Thus, we
had to rely on leveraging the already available Penn
Treebank and generate our Turkish trees starting from
there. The flip side is that this constrains our translation
effort to the English-to-Turkish direction only.
An obvious challenge for any automatic translation
task is posed by multiword expressions and idioms.
Our translation model relies on a literal approach to
sentence construction rather than a phrase-based approach
that could probably accommodate idiomatic expressions.
Still, in our parallel corpus, a single English
word might be translated as more than one Turkish
word. However, since this can happen only at the
leaves, the Turkish leaf might need to be further expanded
into a subtree. Instead, we treated this as
a translation-lexicon problem in which many-to-one and
one-to-many mappings are possible.
When replacing the leaf tags in permuted trees
with stems and morphemes, we usually move mor-
phemes from other leaves. We observed some limi-
tations of this approach in our three-phase annotation
process. In the manual morphological analysis phase
of Phase 2, the annotators use the context of the leaf
word in choosing an analysis. At a later stage, when
another annotator works with the same leaf in Phase 3,
the annotator sometimes realizes that, with the chosen
analysis, the morpheme movements are not so easy.
In these cases, the annotators usually go back to Phase
1, change the replacement, and have to redo the analysis.
This loop might benefit from some form of automatic
feedback across phases.
Exhaustive search over all of the translation trees
to find the most probable permutation and replace-
ment combination is computationally very expensive.
Thus in this exploratory work, we chose the best per-
muted tree and searched for the best replacement in-
dependently. Even then, optimizing over all possi-
ble replacements of all the tokens in the sentence is
prohibitively difficult. We had to rely on suboptimal
heuristics in searching for the best replacement. How-
ever, a dynamic programming approach with early
pruning of some branches using morphotactics at the
subtrees might yield a faster search.
REFERENCES
El-Kahlout, I. D. (2009). Statistical machine transla-
tion from English to Turkish (Ph.D. thesis).
El-Kahlout, I. D. and Oflazer, K. (2006). Initial ex-
plorations in English to Turkish statistical ma-
chine translation. In Proceedings of the Work-
shop on Statistical Machine Translation, StatMT
’06, pages 7–14, Stroudsburg, PA, USA. Associ-
ation for Computational Linguistics.
Hutchinson, J. (1994). The Georgetown-IBM demon-
stration. MT News International, no.8, pages 15–
18.
Jurafsky, D. and Martin, J. H. (2009). Speech and
Language Processing: An Introduction to Nat-
ural Language Processing, Computational Lin-
guistics and Speech Recognition (Prentice Hall
Series in Artificial Intelligence). Prentice Hall, 2
edition.
Koehn, P. (2010). Statistical Machine Translation.
Cambridge University Press, New York, NY,
USA, 1st edition.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini,
B. (1993). Building a large annotated corpus
of English: The Penn Treebank. Computational
Linguistics, 19(2):313–330.
Yeniterzi, R. and Oflazer, K. (2010). Syntax-to-
morphology mapping in factored phrase-based
statistical machine translation from English to
Turkish. In 48th Annual Meeting of the Asso-
ciation for Computational Linguistics.