For the initial exploration, we used a simplified search for the best translation. We first choose the most probable permuted tree and, by reading off its leaves, obtain a scrambled English sentence. We then trace this sentence left to right and replace its words with Turkish stems and morphemes. For the first word, we keep the list of its N most probable replacements. For the second word, we combine each of its candidate replacements with the N candidates of the first word and rank the combinations by the product of their probabilities, keeping only the top N. Continuing in this fashion, we maintain an N-best list of candidate partial translations.
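The pruning scheme just described is essentially a beam search over word-by-word replacements. The following Python sketch illustrates the idea; the function name and the candidates(word) lookup, which is assumed to return (gloss, probability) pairs for each scrambled-English word, are illustrative assumptions rather than the implementation used in this work.

def n_best_translations(scrambled_words, candidates, n):
    """Trace the scrambled sentence left to right, keeping the N most
    probable partial translations at every position (a simple beam search)."""
    beam = [([], 1.0)]  # (list of glosses so far, product of probabilities)
    for word in scrambled_words:
        expanded = []
        for glosses, prob in beam:
            for gloss, p in candidates(word):
                expanded.append((glosses + [gloss], prob * p))
        # Prune: keep only the N highest-scoring combinations.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:n]
    return beam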
Some of the candidate replacements are morphemes that must combine with the previous stem or morpheme. Therefore, while going over the candidate replacements, whenever we encounter a non-morpheme gloss, we check the morphotactical consistency of the partial translation up to but not including the current word; if the check fails, we discard the candidate. For example, suppose a partial translation candidate up to the current word is “gel -PAST -NEG” and a candidate for the current word is the non-morpheme word “şimdi”. We check the consistency of “gel -PAST -NEG” and, seeing that it fails, drop the candidate “şimdi”.
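A minimal sketch of such a consistency check is given below. It assumes, as a simplification, that consistency can be approximated by a fixed admissible ordering of verbal suffixes; the actual morphotactics of Turkish are richer, and the suffix inventory shown is illustrative only, not the checker used in our system.

VERBAL_SUFFIX_ORDER = ["-NEG", "-PAST"]  # illustrative subset of verbal suffixes

def is_morphotactically_consistent(glosses):
    """Return False if verbal suffixes appear out of the admissible order,
    e.g. 'gel -PAST -NEG' fails because -NEG must precede -PAST."""
    positions = [VERBAL_SUFFIX_ORDER.index(g)
                 for g in glosses if g in VERBAL_SUFFIX_ORDER]
    return all(a <= b for a, b in zip(positions, positions[1:]))

# 'gel -NEG -PAST' (gelmedi) is well formed; 'gel -PAST -NEG' is not.
assert is_morphotactically_consistent(["gel", "-NEG", "-PAST"])
assert not is_morphotactically_consistent(["gel", "-PAST", "-NEG"])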
4 EXPERIMENTS
We tested our translation approach using 10-fold cross-validation and obtained BLEU scores for different N-best list sizes. Our corpus contains 5143 sentences; for each run, we use a random 90% of them for training and the rest for testing. The mean scores and standard deviations for different list sizes are given in Table 2.
Table 2: BLEU scores for the best tree in the first step and different list sizes N in the second step.

N        1          5          10         50
BLEU     9.5±0.4    11.5±0.2   12.1±0.3   12.8±0.5
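The evaluation protocol (10-fold cross-validation with a per-fold corpus BLEU, reported as mean ± standard deviation) can be sketched as follows. This is only an illustration of the protocol, not the scripts used in our experiments; train_fn and translate_fn are hypothetical stand-ins for the two-step system, and sentences are assumed to be tokenized.

import statistics
from sklearn.model_selection import KFold
from nltk.translate.bleu_score import corpus_bleu

def cross_validated_bleu(sentence_pairs, train_fn, translate_fn, folds=10, seed=0):
    """sentence_pairs: list of (english_tokens, turkish_tokens) pairs.
    train_fn and translate_fn are stand-ins for the translation system."""
    scores = []
    kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(sentence_pairs):
        train = [sentence_pairs[i] for i in train_idx]
        test = [sentence_pairs[i] for i in test_idx]
        model = train_fn(train)                                    # train on 90% of the data
        hypotheses = [translate_fn(model, en) for en, _ in test]   # translate the held-out 10%
        references = [[tr] for _, tr in test]                      # one reference per sentence
        scores.append(corpus_bleu(references, hypotheses) * 100)
    return statistics.mean(scores), statistics.stdev(scores)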
In the second set of experiments, we measure the performance of the second step without the effect of the first step, namely the tree permutation. To remove this effect, we start the second step of the translation with the optimal tree obtained in Phase-1-English (Section 2.2). In this way, if the second step were optimal, we would obtain a BLEU score of 100.
Table 3 shows BLEU scores for the overall translation when the optimal tree is used. The pattern of results with respect to N is similar to that in Table 2. On the other hand, the difference between the optimal tree and the best tree is not large. It seems that we need to improve the leaf replacement step of our two-step algorithm more than its tree permutation step.
Table 3: BLEU scores for the optimal tree in the first step and different list sizes N in the second step.

N        1          5          10         50
BLEU     10.8±0.1   13.7±0.3   14.5±0.2   16.3±0.5
In the third set of experiments, we compared our translations with Google Translate. For each sentence in the English test corpus, we compared its Google translation against our reference Turkish translation. We obtained a BLEU score of 11.6 ± 1.0. Despite having a larger variance, this score falls within the range of our results for N = 5–10. Of course, Google Translate is a phrase-based engine that works with a general lexicon, so the comparison might be somewhat coarse. Still, we find it encouraging that even our simple translation approach, trained on a rather limited corpus, performs similarly to a popular translation engine.
5 CONCLUSION
In this paper, we gave the details of our efforts in constructing a Turkish-English parallel treebank and used the resulting data in a simple SMT algorithm. The BLEU scores for this initial attempt are, not surprisingly, low, but compared to a general phrase-based tool like Google Translate, our approach fares comparatively well.
A limitation of our parallel treebank construction is that it does not allow the detachment and attachment of subtrees. Even a literal translation is sometimes severely hindered under such a constraint. However, in our corpus, the overall feel of the sentences does not suffer much beyond sounding a bit too formal. Of course, a general tree-based approach would have to construct the constituent trees independently in the two languages and then train a learner to transform one tree into the other. Unfortunately, there is no usable constituent parser for Turkish. Thus, we had to leverage the already available Penn Treebank and generate our Turkish trees starting from there. The flip side is that this constrains our translation effort to the English-to-Turkish direction only.
An obvious challenge for any automatic translation task is posed by multiword expressions and idioms.