Unsupervised Evaluation of Human Translation Quality
Yi Zhou and Danushka Bollegala
Department of Computer Science, University of Liverpool, U.K.
Keywords:
Translation Quality, Evaluation of Human Translations, Cross-lingual Word Embeddings, Word Mover’s
Distance, Bidirectional Minimum Word Mover’s Distance.
Abstract:
Even though machine translation (MT) systems have reached impressive performance in cross-lingual translation tasks, the quality of MT still falls far behind that of professional human translations (HTs) due to the complexity of natural languages, especially for terminology in different domains. Therefore, HTs are still widely demanded in practice. However, the quality of HTs is also imperfect and varies significantly depending on the experience and knowledge of the translators. Evaluating the quality of HTs automatically faces many challenges: although bilingual speakers are able to assess translation quality, manually checking the accuracy of translations is expensive and time-consuming. In this paper, we propose an unsupervised method to evaluate the quality of HTs without requiring any labelled data. We compare a range of methods for automatically grading HTs and observe the Bidirectional Minimum Word Mover's Distance (BiMWMD) to produce gradings that correlate well with humans.
1 INTRODUCTION
With the rapid development of international business
and multinational companies, there is an increasing
demand for translations of user manuals, contracts
and various other documents. Even though MT sys-
tems have shown promise for automatic translation,
they require large parallel corpora for training, which
might not be available for resource-poor language pairs such as Hindi and Sinhalese. In addition, the performance of MT systems is still far behind that of professional human translators due to the complex nature of grammar and word usage in languages. The quality of translations generated by MT also depends on the distance between the source and target languages (Han, 2016). For example, the quality of an En-
glish to French translation would be better than an En-
glish to Chinese translation, even though both trans-
lations are generated from the same MT system (Xu
et al., 2018). As a result, HTs are still widely used
across numerous industries.
A person’s first language, L1, refers to the na-
tive language of that person, whereas L2 is a second
language spoken by that person. HTs created by L2
speakers can be erroneous due to the differing levels of experience and knowledge of the translators. Of-
ten, the quality of translations provided by L2 speak-
ers must be manually verified by professional trans-
lators before they can be accepted. A good transla-
tion must demonstrate six properties: intelligibility,
fidelity, fluency, adequacy, comprehension, and infor-
mativeness (Han, 2016). However, manually verify-
ing these properties in an HT is both time-consuming
and costly. In this paper, we propose an unsupervised
method for evaluating the quality of HTs, which ad-
dresses this challenging problem.
Translation quality evaluation is a much more complicated task than it might appear at first glance. Papineni et al. (2002) proposed the bilingual evaluation understudy (BLEU) method to automatically evaluate the quality of MT. They take professional HTs as gold references and consider a better MT output to be one that is closer to these gold references. In contrast, HT quality evaluation must be done manually because such gold references are not available. People who are familiar with both the source and the target languages are required to evaluate the quality of HTs. The number of such bilingual speakers is limited, and they might not exist at all for rare language pairs. Moreover, human evaluation is time-consuming and not re-usable. Because MT quality evaluation requires a reference translation, MT evaluation measures such as BLEU cannot be used for the purpose of evaluating HTs.
In this paper, we model HT quality evaluation as
an unsupervised graph matching problem. Specifi-
cally, given a source sentence $S$ and its target translation $T$, we compare the similarity between the set of words $\{s_1, s_2, \ldots, s_n\}$ in $S$ against the set of words $\{t_1, t_2, \ldots, t_m\}$ in $T$, using different distance metrics
such as the Euclidean distance. In this work, we take advantage of cross-lingual word embeddings between different languages and present a novel approach to automatically evaluate the quality of HTs without access to gold references. Our work is inspired by the Word Mover's Distance (Kusner et al., 2015), which measures the distance between documents by minimising the cost of transferring embedded words from the source language to the target language. We emphasise that our goal in this paper is not to propose a novel MT method nor an evaluation metric for MT. Instead, we consider the problem of automatically detecting high/low quality human translations, without having any access to reference translations.
Specifically, we report and evaluate different methods for the purpose of unsupervised HT evaluation and compare them, using Spearman rank and Pearson correlations, against grades given by judges, who are professional translators, for the quality of the HTs. As shown in the experiments, the Bidirectional Minimum Word Mover's Distance (BiMWMD) has the strongest correlation with the human-assigned grades, indicating that this method is able to accurately detect low- and high-quality HTs without requiring any human supervision.
2 RELATED WORK
An HT can be compared against the source text using
similarity and distance metrics through cross-lingual
word embeddings. Cosine similarity and Euclidean
distance have been popularly used for this purpose.
Semantic Textual Similarity (STS) systems evaluate
the degree of semantic similarity between two sen-
tences. Most of the early work on sentence similarity computes it as the average of word similarities over the two sentences (Corley and Mihalcea, 2005; Li et al., 2016; Islam and Inkpen, 2008). At SemEval 2012, supervised systems that combine different similarity measures such as lexico-semantic, syntactic and string similarity using regression models were proposed (Bär et al., 2012; Šarić et al., 2012). Later, Sultan et al. (2015) proposed an unsupervised system based on word alignment. Brychcín and Svoboda (2016) and Tian et al. (2017) model semantic similarity for multilingual and cross-lingual sentence pairs by first translating sentences into the target language using MT, then applying monolingual STS models. To address the problem that human-annotated data is limited for resource-poor languages, Brychcín (2018) studied linear transformations that map monolingual word embeddings into a common space using a bilingual dictionary for cross-lingual semantic similarity.
The distributional hypothesis (Harris, 1954) states that words occurring in the same contexts tend to have similar meanings. Following this hypothesis, Mikolov et al. (2013) proposed the distributed Skip-gram and Continuous Bag-of-Words (CBOW) models to learn robust word embeddings from large amounts of unstructured text data. Recent research creating a shared vector space for words across two (bilingual word embeddings) (Artetxe et al., 2017; Chandar A P et al., 2014; Zou et al., 2013) or more (multilingual word embeddings) (Hermann and Blunsom, 2014; Lauly et al., 2014) languages is referred to as cross-lingual word embedding learning. Words from different languages with similar meanings should be close to each other in the shared embedding space (Chen and Cardie, 2018).
Cross-lingual word representations are obtained by training embeddings in different languages independently using monolingual corpora, and then mapping them into a common space through a transformation (Artetxe et al., 2018). Ruder et al. (2017) introduced three different types of alignment used in learning cross-lingual word embeddings: word alignment, sentence alignment and document alignment. Word alignment uses bilingual dictionaries with word-by-word translations to learn cross-lingual embeddings (Vulić and Moens, 2015). Sentence alignment requires a parallel corpus (Hermann and Blunsom, 2014; Gouws et al., 2015), which is a collection of texts in one language and the corresponding translations into one or more languages. Document alignment requires documents in a comparable corpus across different languages. A comparable corpus contains documents that are not exact parallel translations but convey the same information in different languages (Faruqui and Dyer, 2014; Gouws and Søgaard, 2015).
Several approaches for learning cross-lingual word embeddings have been proposed, requiring different types of alignment as supervision. Luong et al. (2015) present the bilingual Skip-gram model (BiSkip), which learns cross-lingual word embeddings using a parallel corpus (sentence alignment) and can be seen as an extension of the monolingual Skip-gram model. The Bilingual Compositional Model (BiCVM) proposed by Hermann and Blunsom (2014) also learns cross-lingual word embeddings through sentence alignment. The model leverages the fact that the representations of aligned sentences should be
similar. Therefore, semantics can be learned from parallel data. Vulić and Moens (2015) proposed a model to learn cross-lingual word embeddings from non-parallel data. They extend the skip-gram with negative sampling (SGNS) model and generate cross-lingual word embeddings from a comparable corpus (document alignment).
The method proposed by Kusner et al. (2015) for measuring the distance between two documents is known as the Word Mover's Distance (WMD). It defines the distance between documents as the minimal cost of transforming the words of one document into those of another. However, it considers the alignment of each source word to all of the target words when computing the cost of a translation, which is expensive. In this paper, we focus on sentence alignment and propose the Bidirectional Minimum Word Mover's Distance (BiMWMD) method, where we define the distance between documents as the cumulative cost of the minimal cost of transferring each source word to the corresponding target word. In addition, our proposed method takes into account the translation flow in both directions (i.e. from the source to the target and from the target to the source).
3 TRANSLATION QUALITY EVALUATION
Our goal is to propose a method that can accurately evaluate the quality of cross-lingual translations without human supervision. Most translation quality evaluation approaches are based on gold references, which are manually created perfect translations of a source language text into the target language. Our work considers the scenario where no such gold references are available.
Let us denote the source language by $S$ and the target language by $T$. For example, when translating from Japanese to English, $S$ will be Japanese and $T$ will be English. Consider the sets of words $V_S$ and $V_T$ in the source and target languages, respectively. A cross-lingual word embedding $\mathbf{w} \in \mathbb{R}^d$ of a word $w \in V_S \cup V_T$ is an embedding that is shared between both $S$ and $T$. As already described in Section 2, several methods have been proposed for training accurate cross-lingual word embeddings that we can use for this purpose. Here, we assume the availability of such a set of cross-lingual word embeddings for the source and target languages.
Let us consider a source language text $S = s_1, s_2, \ldots, s_n$, which is translated into the target language as $T = t_1, t_2, \ldots, t_m$, where $\mathbf{s}_i \in \mathbb{R}^d$ is the embedding of the $i$-th word in the source sentence and $\mathbf{t}_j \in \mathbb{R}^d$ is the embedding of the $j$-th word in the target sentence. Here, $n$ and $m$ are the numbers of words in the source and target texts, respectively. Source and target texts could be single or multiple sentences. The methods that we discuss in this paper for evaluating HT quality do not require any sentence-level processing and can be applied to either single sentences or paragraphs that contain multiple sentences.
3.1 Averaged Vector (AV)
Prior work on unsupervised sentence embeddings has found averaging the embeddings of the words in a sentence to be a simple but accurate method for creating sentence embeddings (Arora et al., 2017). Motivated by these prior proposals, we represent both source and target language texts by averaging the cross-lingual word embeddings of the words that appear in each of the texts. We call this the Averaged Vector (AV) method. Specifically, given a source language text $S = s_1, s_2, \ldots, s_n$ and its HT $T = t_1, t_2, \ldots, t_m$, we represent the two texts respectively by the embeddings $\bar{\mathbf{s}}, \bar{\mathbf{t}} \in \mathbb{R}^d$ given by (1) and (2).
$$\bar{\mathbf{s}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{s}_i \quad (1)$$

$$\bar{\mathbf{t}} = \frac{1}{m} \sum_{j=1}^{m} \mathbf{t}_j \quad (2)$$
After obtaining the vectors for the source and target sentences, the similarity between them can be computed as the cosine similarity between the two vectors $\bar{\mathbf{s}}$ and $\bar{\mathbf{t}}$, as given by (3).

$$\text{sim}(S, T) = \cos(\bar{\mathbf{s}}, \bar{\mathbf{t}}) = \frac{\bar{\mathbf{s}}^\top \bar{\mathbf{t}}}{\|\bar{\mathbf{s}}\| \, \|\bar{\mathbf{t}}\|} \quad (3)$$
Here, we consider the similarity between $S$ and $T$ as a proxy for the semantic agreement between the source text and its translation, thereby providing a measure of quality. In addition to the simple averaging of word embeddings given in (1) and (2), in our preliminary experiments we implemented tf-idf (term frequency inverse document frequency) weighting and SIF (smooth inverse frequency) (Arora et al., 2017) methods for creating sentence embeddings. However, for our task of comparing sentences written in different languages, we did not observe any significant improvements from using those weighting methods. Therefore, we decided to use the simple (unweighted) averaging given in (1) and (2).
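The AV method reduces to a few lines of linear algebra. The following is a minimal sketch, assuming the rows of each array hold the cross-lingual embeddings of the words in one text; the function name and the random toy data are our own illustration, not part of any released implementation.

```python
import numpy as np

def av_similarity(src_vecs, tgt_vecs):
    """AV method: cosine similarity between the mean word embeddings
    of the source and target texts (Equations (1)-(3))."""
    s_bar = src_vecs.mean(axis=0)  # Equation (1)
    t_bar = tgt_vecs.mean(axis=0)  # Equation (2)
    # Equation (3): cosine similarity of the two mean vectors.
    return float(s_bar @ t_bar /
                 (np.linalg.norm(s_bar) * np.linalg.norm(t_bar)))

# Toy usage with random stand-ins for d = 300 cross-lingual embeddings.
rng = np.random.default_rng(0)
print(av_similarity(rng.normal(size=(6, 300)), rng.normal(size=(4, 300))))
```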
3.2 Source-centred Maximum Similarity (SMS)
The AV method described in Section 3.1 can be seen as comparing each word $s_i$ in the source text against every word $t_j$ in the target text. Moreover, it is symmetric in the sense that if we swapped the source and target texts, it would return the same similarity score. However, not all words in the source text are related to all the words in the target text. On the contrary, one word in the source text is often related to only a few words in the target translation. Therefore, we should compare each source word against the most related word in the target translation. For this purpose, we modify the AV method and propose the Source-centred Maximum Similarity (SMS) method, described next.

First, we compute the cosine similarity of each embedded word $\mathbf{s}_i$ in the source text against all the embedded words $\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_m$ in the target text. Next, the maximum similarity score between $\mathbf{s}_i$ and any of $\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_m$ is taken as the score for transferring $s_i$ from the source to the target. Finally, we report the average over all the maximal scores as the similarity between $S$ and $T$, as given by (4).
$$\text{sim}(S, T) = \frac{1}{n} \sum_{i=1}^{n} \max_{j=1,\ldots,m} \cos(\mathbf{s}_i, \mathbf{t}_j) \quad (4)$$
3.3 Target-centred Maximum Similarity (TMS)
TMS is the opposite of the SMS method and is target-centred. This method calculates the cosine similarity of each embedded target word $\mathbf{t}_j$ against all the embedded source words $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n$. Then, the maximal similarity score is taken as the score of translating each $t_j$ back to a word in the source text. Finally, we take the average over all the maximal similarity scores of the target words, as given by (5).
$$\text{sim}(S, T) = \frac{1}{m} \sum_{j=1}^{m} \max_{i=1,\ldots,n} \cos(\mathbf{s}_i, \mathbf{t}_j) \quad (5)$$
3.4 Word Mover’s Distance (WMD)
WMD is a measure of the distance between documents proposed by Kusner et al. (2015), inspired by the Earth Mover's Distance (EMD) (Rubner et al., 2000). WMD can be used to measure the dissimilarity between two text documents. Specifically, it measures the minimum cost that has to be paid for transforming the words of a source text S into the words of a target text T. Using this metric, we can estimate the similarity between a source document and a target document even when they share no common words.
Let us assume that the two text documents are represented as normalised bag-of-words vectors, and that the $j$-th target word $t_j$ appears $h(t_j)$ times in the target text $T$. The normalised frequency $f(t_j)$ of $t_j$ is then given by (6).

$$f(t_j) = \frac{h(t_j)}{\sum_{j'=1}^{m} h(t_{j'})} \quad (6)$$

Likewise, the normalised frequency $f(s_i)$ of a word $s_i$ in the source text $S$ is given by (7).

$$f(s_i) = \frac{h(s_i)}{\sum_{i'=1}^{n} h(s_{i'})} \quad (7)$$
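A minimal sketch of (6) and (7), assuming the text has already been tokenised into a list of words; the helper name is our own.

```python
from collections import Counter

def normalised_frequencies(words):
    """Normalised bag-of-words frequencies, as in Equations (6) and (7)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: h / total for w, h in counts.items()}

# e.g. normalised_frequencies(["i", "like", "school"]) maps each word to 1/3.
```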
Then, the transportation problem can be formally defined as the minimum cumulative cost of moving words from $S$ to $T$ under the constraints specified in the following linear programme (LP):

$$\text{minimise} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, c(i, j) \quad (8)$$

subject to:

$$\sum_{j=1}^{m} T_{ij} = f(s_i), \; \forall i \in \{1, \ldots, n\} \quad (9)$$

$$\sum_{i=1}^{n} T_{ij} = f(t_j), \; \forall j \in \{1, \ldots, m\} \quad (10)$$

$$\mathbf{T} \geq 0 \quad (11)$$
Here, $\mathbf{T} \in \mathbb{R}^{n \times m}$ is a nonnegative flow matrix that is learnt by the LP, and $c(i, j)$ is the cost of translating (transforming) the word $s_i$ into $t_j$. We measure this translation cost as the Euclidean distance between the embeddings of $s_i$ and $t_j$, as given by (12).

$$c(i, j) = \|\mathbf{s}_i - \mathbf{t}_j\|_2 \quad (12)$$

Intuitively, if $c(i, j)$ is high for translating $s_i$ into $t_j$, then the $(i, j)$ element $T_{ij}$ of $\mathbf{T}$ can be set to a small (possibly zero) value to reduce the objective given by (8). The equality constraints given in (9) and (10) specify the row and column sums of $\mathbf{T}$, respectively. In other words, these equality constraints ensure that the total weight transferred out of each source word and into each target word equals the corresponding normalised frequency. Note that each source word $s_i$ is allowed to match one or more target words $t_j$ under these constraints.
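The LP in (8)-(11) can be handed to any off-the-shelf solver. The sketch below flattens $\mathbf{T}$ row-major and uses scipy's linprog with the HiGHS backend for brevity (the experiments in Section 4.4 use an interior-point method); it is an illustration of the formulation, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(src_vecs, tgt_vecs, f_s, f_t):
    """Word Mover's Distance: solve the LP of Equations (8)-(11).
    f_s and f_t are the normalised frequencies of Equations (6)-(7)."""
    n, m = len(src_vecs), len(tgt_vecs)
    # c(i, j) = ||s_i - t_j||_2, flattened row-major to match vec(T).
    cost = np.linalg.norm(
        src_vecs[:, None, :] - tgt_vecs[None, :, :], axis=2).ravel()
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):              # Equation (9): sum_j T_ij = f(s_i)
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):              # Equation (10): sum_i T_ij = f(t_j)
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([f_s, f_t])
    # linprog's default variable bounds (0, None) encode (11): T >= 0.
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.fun  # minimal cumulative transport cost
```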
Figure 1: Translating a word in a Japanese source ($S$) text into an English target ($T$) text. The perfect alignment between $S$ and $T$ is $s_1 \to$ I, $s_2 \to$ null, $s_3 \to$ school, $s_4 \to$ null, $s_5 \to$ like, and $s_6 \to$ null. The thin arrow represents the minimum-cost translation to I. Correct translations are likely to have smaller distances (costs) associated with them.
3.5 Bidirectional Minimum WMD (BiMWMD)
The WMD method described in Section 3.4 is symmetric in the sense that even if we swap the source and target texts, we obtain the same score for the translation quality. On the other hand, the SMS and TMS methods described respectively in Sections 3.2 and 3.3 are both asymmetric translation quality evaluation methods. Following the SMS and TMS methods, we extend WMD such that it considers the translation quality from the point of view of the source text, which we refer to as the Source-centred Minimum WMD (SMWMD), and from the point of view of the target text, which we refer to as the Target-centred Minimum WMD (TMWMD). Finally, we combine the two extensions and propose the Bidirectional Minimum WMD (BiMWMD), which evaluates the translation quality from both points of view. Next, we describe the SMWMD, TMWMD and BiMWMD methods.
SMWMD: The Source-centred Minimum WMD (SMWMD) method considers the translation flow to be from the source sentence to the target sentence. Figure 1 shows an example of measuring the distance from a source text $S$ to a target text $T$. In SMWMD, we measure the minimum cost of translating each source word $s_i$ to any word in $T$, and consider the sum of these costs as the objective function for the LP. As in WMD, we denote by $T_{ij} \geq 0$ the flow for translating $s_i$ into $t_j$ according to the cost $c(i, j)$ given by (12). Because we found in our experiments that the normalised frequencies $f(s_i)$ and $f(t_j)$ have little effect on the results, we set both frequencies to 1 to simplify the objective function. Then, the optimisation problem can be written as follows:
$$\text{minimise} \sum_{i=1}^{n} \min_{j=1,\ldots,m} T_{ij} \, c(i, j) \quad (13)$$

subject to:

$$\sum_{j=1}^{m} T_{ij} = 1, \; \forall i \in \{1, \ldots, n\} \quad (14)$$

$$\sum_{i=1}^{n} T_{ij} = 1, \; \forall j \in \{1, \ldots, m\} \quad (15)$$

$$\mathbf{T} \geq 0 \quad (16)$$
To simplify the objective function in (13), we use $y_i$ to replace $T_{ij} \, c(i, j)$, where $T_{ij} \, c(i, j)$ is the actual cost of transforming a word from one document into another and $y_i$ is an upper bound on $T_{ij} \, c(i, j)$. Let us denote the actual objective by $TC$, given by (17), and its upper bound by $Y$, given by (18).

$$TC(S, T) = \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, c(i, j) \quad (17)$$

$$Y(S, T) = \sum_{i=1}^{n} y_i \quad (18)$$
Using $y_i$, we can rewrite the previous optimisation problem as an LP as follows:

$$\text{minimise} \sum_{i=1}^{n} y_i \quad (19)$$

subject to:

$$T_{ij} \, c(i, j) \leq y_i \quad (20)$$

$$\sum_{j=1}^{m} T_{ij} = 1, \; \forall i \in \{1, \ldots, n\} \quad (21)$$

$$\sum_{i=1}^{n} T_{ij} = 1, \; \forall j \in \{1, \ldots, m\} \quad (22)$$

$$\mathbf{T} \geq 0 \quad (23)$$
We collectively denote the minimum translation cost for translating a source text $S$ into a target text $T$, obtained by solving the LP above, as $\text{SMWMD}(S, T)$, which can be either $TC(S, T)$ or $Y(S, T)$. In our experiments, we study the difference between the actual objective ($TC$) and its upper bound ($Y$) for the purpose of predicting the quality of HTs.
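The reformulated LP in (19)-(23) can also be written down directly for a solver. The sketch below stacks $\text{vec}(\mathbf{T})$ and the auxiliary variables $y$ into one variable vector and, following the observation in Section 4.4 that imposing (21) and (22) together often makes the LP infeasible, keeps only the column stochasticity constraints (22); it is a sketch of one configuration, not the exact experimental code.

```python
import numpy as np
from scipy.optimize import linprog

def smwmd(src_vecs, tgt_vecs):
    """Source-centred minimum WMD: LP of Equations (19)-(23), with
    column stochasticity only (the Y+Column setting of Table 2)."""
    n, m = len(src_vecs), len(tgt_vecs)
    cost = np.linalg.norm(
        src_vecs[:, None, :] - tgt_vecs[None, :, :], axis=2)
    nT = n * m                     # variable vector x = [vec(T), y_1..y_n]
    c = np.concatenate([np.zeros(nT), np.ones(n)])   # Equation (19)
    A_ub = np.zeros((nT, nT + n))  # Equation (20): T_ij c(i,j) - y_i <= 0
    for i in range(n):
        for j in range(m):
            k = i * m + j
            A_ub[k, k] = cost[i, j]
            A_ub[k, nT + i] = -1.0
    b_ub = np.zeros(nT)
    A_eq = np.zeros((m, nT + n))   # Equation (22): sum_i T_ij = 1
    for j in range(m):
        A_eq[j, j:nT:m] = 1.0
    b_eq = np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  method="highs")
    return res.fun  # the upper bound Y(S, T) of Equation (18)
```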
TMWMD: An accurate translation of a given source text must not only correctly translate the information contained in the source text, but must also not add new information that was not present in the original source text. One simple way to verify this is to back-translate the target text to the source and measure their semantic distance. For this purpose, we modify the WMD objective in the same manner as for SMWMD, but pivoting on the target text instead of the source text. We refer to this approach as the Target-centred Minimum WMD (TMWMD).
Figure 2: Translating a word in a Japanese source ($S$) text into an English target ($T$) text. The perfect alignment between $S$ and $T$ is $s_1 \to$ I, $s_2 \to$ null, $s_3 \to$ school, $s_4 \to$ null, $s_5 \to$ like, and $s_6 \to$ null. The light arrow represents the minimum-cost alignment.
Specifically, we define the distance between $S$ and $T$ as the minimal cumulative distance required to move all words from the target text $T$ to the source text $S$. An example is given in Figure 2, where the target word I is compared against all the words in the source (indicated by arrows) and the closest Japanese translation $s_1$ is indicated by a thin arrow. TMWMD is the solution to the following LP:
TMWMD is the solution to the following LP:
minimise
m
X
j=1
y
j
(24)
subject to: T
ij
c
(
i, j) y
j
(25)
m
X
j=1
T
ij
= 1, i {1, . . . , n} (26)
n
X
i=1
T
ij
= 1, j {1 . . . , m} (27)
T 0 (28)
Note that TMWMD is the mirror image of SMWMD
in the sense that by swapping S and T we can obtain
the LP for SMWMD.
Like SMWMD, TMWMD can be computed using either the actual objective ($TC(S, T)$) or the upper bound ($Y(S, T)$). We collectively denote these two variants as $\text{TMWMD}(S, T)$.
BiMWMD: SMWMD and TMWMD evaluate the translation quality in one direction only. If the translation costs from source to target and from target to source are both small, this indicates a higher-quality translation. To quantitatively capture this idea, we propose the Bidirectional Minimum Word Mover's Distance (BiMWMD) as a translation quality evaluation measure. BiMWMD is defined by (29) and is the sum of the optimal translation costs returned individually by SMWMD and TMWMD.

$$\text{BiMWMD}(S, T) = \text{SMWMD}(S, T) + \text{TMWMD}(S, T) \quad (29)$$

From the definition (29), it follows that BiMWMD is a symmetric translation quality measure, similar to WMD. However, BiMWMD and WMD solve different LPs, hence returning different translation quality predictions. Specifically, WMD returns as its objective the minimal cumulative cost of translating each word in the source text $S$ to all the words in the target text $T$. On the other hand, BiMWMD solves two independent LPs, each considering only a single direction (SMWMD considers translating from $S$ to $T$, whereas TMWMD considers translating from $T$ to $S$). As we later see in Section 4.4, BiMWMD shows a higher degree of correlation with human ratings of translation quality than WMD.
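Given the directional solver sketched above and the symmetry of the Euclidean cost, BiMWMD follows directly: swapping the arguments of the smwmd sketch yields TMWMD, as noted above for the mirror-image LPs.

```python
def bimwmd(src_vecs, tgt_vecs):
    """Equation (29): sum of the two directional minimum WMD costs.
    Since c(i, j) is symmetric, TMWMD(S, T) equals SMWMD(T, S)."""
    return smwmd(src_vecs, tgt_vecs) + smwmd(tgt_vecs, src_vecs)
```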
4 EXPERIMENTS
In this section, we evaluate the different translation quality measures described in Section 3. For this purpose, we annotated a translation dataset as described in Section 4.1 and use correlation against human grades as the evaluation criterion. Experimental results are discussed in Section 4.4.
4.1 Dataset
To evaluate the different translation quality measures described in Section 3, we created a dataset by selecting 1030 sentences from Japanese user manuals on digital cameras. We then asked a group of 50 human translators to translate the selected Japanese sentences into English. The human translators are all native Japanese speakers who have studied English as a foreign language, and were recruited using a crowd-sourcing platform that is operational in Japan. They have different levels of experience in translating technical documents, ranging broadly from very experienced translators to beginners. We believe this gives us a broad spectrum of human translations for evaluation purposes. Each Japanese sentence was assigned to one translator from the pool, who was asked to write a single English translation.
Next, we randomly selected 130 Japanese-English translated sentence pairs and asked four humans, who are bilingual speakers of Japanese and English and professionally qualified translators with over 10 years of experience in translating technical documents, to rate the quality of each of the selected translation pairs. We refer to these four humans as judges, to distinguish them from the pool of human translators who wrote the translations. Specifically, we asked each of the four judges to grade a translated sentence pair by assigning one of the following four grades.

Grade 1 Quality Translations: A perfect translation. No further modifications are required. The translation pair is scored in the range 0.76–1.00.

Grade 2 Quality Translations: A good translation. Some words are incorrectly translated, but the overall meaning can be understood. The translation pair is scored in the range 0.51–0.75.

Grade 3 Quality Translations: A bad translation. There are more incorrectly translated words than correctly translated words. The translation pair is scored in the range 0.26–0.50.

Grade 4 Quality Translations: Requires re-translation. The translation cannot be comprehended or conveys a significantly different meaning from the source sentence. The translation pair is scored in the range 0.00–0.25.

The average of the grades assigned by the four judges to a translated sentence pair is taken as its final grade.
4.2 Cross-lingual Word Embeddings
All of the translation quality measures we proposed
in Section 3 require cross-lingual word embeddings.
To create cross-lingual word embeddings between Japanese and English in an unsupervised manner, we align publicly available monolingual word embeddings. Specifically, we start from the monolingual word embeddings trained on Wikipedia and Common Crawl using fastText (Grave et al., 2018). Because our dataset contains Japanese and English words, we use two separate monolingual word embedding sets, one for Japanese and one for English.
Next, we use the unsupervised adversarial training method proposed by Conneau et al. (2017) and implemented in MUSE (https://github.com/facebookresearch/MUSE) to align the Japanese and English word embedding spaces, without requiring any bilingual dictionaries or parallel/comparable corpora. Although it is possible to further improve the performance of this cross-lingual alignment using bilingual lexical resources, by not depending on any such resources we are able to realistically estimate the performance of the different methods proposed in Section 3 when such resources are not available.
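Once MUSE has learned a linear mapping between the two monolingual spaces, producing the shared space amounts to a single matrix multiplication. The sketch below assumes the mapping and the fastText vectors have been exported to NumPy arrays beforehand; the file names are placeholders of our own, not files shipped by MUSE.

```python
import numpy as np

# Hypothetical exports: a (300, 300) alignment matrix learned by MUSE
# and the monolingual fastText vectors for each language.
W = np.load("ja_to_en_mapping.npy")
ja_emb = np.load("fasttext_ja_vectors.npy")   # (|V_S|, 300) Japanese vectors
en_emb = np.load("fasttext_en_vectors.npy")   # (|V_T|, 300) English vectors

# Mapping the Japanese vectors into the English space yields a single
# cross-lingual space in which vectors of both languages can be compared.
ja_aligned = ja_emb @ W.T
```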
4.3 Evaluation Measures
Recall that our goal in this work is to predict the quality of human translations without having access to any reference translations. Therefore, we would like to know whether the translation quality scores returned by the different methods proposed in Section 3 correlate with the grades given by the human judges to the human translations in the dataset created in Section 4.1. To evaluate the level of agreement between the grades and the translation quality scores, we compute the Pearson and Spearman rank correlation coefficients between these two sets of numbers. The Pearson correlation coefficient measures the linear relationship between two variables, whereas the Spearman correlation coefficient considers only their relative ordering.
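Both coefficients are available in scipy. A small sketch with toy numbers (our own, not drawn from the dataset), including the distance-to-similarity conversion used in Section 4.4:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

judge_grades = np.array([0.9, 0.3, 0.7, 0.1])   # toy averaged judge grades
distances = np.array([2.1, 6.0, 3.0, 7.5])      # toy BiMWMD distances

# Convert distances to similarities: 1 - distance / maximum distance.
similarities = 1.0 - distances / distances.max()

print("Pearson: ", pearsonr(judge_grades, similarities)[0])
print("Spearman:", spearmanr(judge_grades, similarities)[0])
```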
4.4 Results
In Table 1, we compare the different HT quality evaluation measures described in Section 3. Recall that some methods return similarity scores (AV, SMS, TMS), whereas others return distances (WMD, SMWMD, TMWMD, BiMWMD). To compare similarities and distances on an equal footing, we convert the distances of each method into similarities by computing

$$1 - \frac{\text{distance}}{\text{maximum distance}}.$$
We use the interior-point method to solve the LPs in all cases. A higher degree of correlation with the grades assigned by the judges indicates a more reliable quality prediction measure. From Table 1, we see that averaging the word embeddings to create text/sentence embeddings and then measuring their cosine similarity (AV) provides a low level of correlation. Comparing the SMS and TMS methods, we see that centring on the target provides a higher degree of correlation than centring on the source. A similar trend can be observed when comparing SMWMD and TMWMD. In fact, SMWMD returns negative values for both the Spearman and Pearson correlation coefficients. BiMWMD returns the best correlation scores against the judges' grades among all the methods compared in Table 1. This result shows that it is important to consider both directions of a translation to more accurately estimate the quality of a human translation.
To study the effect of the various parameters and settings associated with the BiMWMD method, we evaluate it under different configurations. Specifically, to analyse the effect of normalising the word embeddings, we consider three settings: ℓ1 normalisation, ℓ2 normalisation and no normalisation (No).
Figure 3: Scores assigned by the BiMWMD method to translation pairs graded by the human judges. Both the BiMWMD scores and the judges' grades are scaled to the [0, 1] range for ease of comparison.
Table 1: Performance of the different HT quality evaluation methods. The best result in each column is achieved by BiMWMD.

Method     Spearman   Pearson
AV           0.2628    0.3076
SMS          0.1505    0.3224
TMS          0.4576    0.4851
WMD          0.3953    0.5003
SMWMD       -0.3928   -0.2328
TMWMD        0.4164    0.4199
BiMWMD       0.5895    0.5296
To decide between the actual objective $TC(S, T)$ (given by (17)) and its upper bound $Y(S, T)$ (given by (18)), we consider each of the two values separately as the value returned by BiMWMD and measure its correlation against the judges' grades. The row and column stochasticity constraints add a large number of equality constraints to the LPs, and adding both at the same time often makes the LP infeasible. To relax the constraints and to empirically study the significance of the row and column stochasticity constraints, we run BiMWMD with row stochasticity constraints only (denoted by Row) vs. column stochasticity constraints only (denoted by Column). All possible combinations of these configurations are evaluated in Table 2.
From Table 2, we see that the best performance is obtained with ℓ2-normalised cross-lingual word embeddings. We also see that the column stochasticity constraints are more important than the row stochasticity constraints.
Table 2: Different settings for the BiMWMD method. Normalisation of word embeddings: ℓ1, ℓ2 or unnormalised (No); Row and Column denote using only row or only column stochasticity constraints in the LP. The value of BiMWMD is either the actual objective (TC) or its upper bound (Y). The best results are obtained with the ℓ2+Y+Column setting.

Method          Spearman   Pearson
ℓ2+Y+Row          0.0033    0.1764
ℓ2+Y+Column       0.5895    0.5296
ℓ2+TC+Row        -0.2510   -0.0767
ℓ2+TC+Column     -0.2510   -0.0711
ℓ1+Y+Row          0.3136    0.1834
ℓ1+Y+Column       0.5721    0.5125
ℓ1+TC+Row        -0.2510   -0.0730
ℓ1+TC+Column     -0.2510   -0.0687
No+Y+Row         -0.0728    0.0284
No+Y+Column       0.5146    0.4878
No+TC+Row        -0.2073   -0.0395
No+TC+Column     -0.2053   -0.0414
Moreover, using the value of the upper bound $Y(S, T)$ as BiMWMD is more accurate than using the actual objective $TC(S, T)$. Recall that the flow matrix $\mathbf{T}$ has $nm$ parameters, and this number grows with the lengths of the source and target texts. It is therefore possible to set most of those $nm$ elements to zero to minimise the actual objective while still satisfying the inequality $T_{ij} \, c(i, j) \leq y_j$ in the LP. Consequently, the sum of upper bounds $\sum_j y_j$, which is the objective minimised by the reformulated LP, is a better proxy for BiMWMD.
A good measure for predicting the quality of HTs must be able to distinguish low-quality HTs from high-quality ones. If we can automatically decide that a particular HT is of low quality without another human having to read it, then it is possible to prioritise such low-quality HTs for retranslation or for verification by a human in charge of quality control. This is particularly useful when we have a large number of translations to verify and would like to check the ones that are most likely to be incorrect. To understand the scores assigned by BiMWMD to translations of different grades, in Figure 3 we randomly select translation pairs with different grades and show the scores predicted by BiMWMD, the best-performing method among those proposed in Section 3. We see that the BiMWMD method assigns high scores to translations that are also rated as high quality by the human judges, whereas it assigns low scores to translations that the judges consider to be of low quality.
5 CONCLUSION
We proposed different methods for automatically predicting the quality of human translations, without access to any gold standard reference translations. In particular, we proposed a broad range of measures covering both symmetric and asymmetric variants. Our experimental results show that the Bidirectional Minimum Word Mover's Distance (BiMWMD) method in particular demonstrates a high degree of correlation with the grades assigned by a group of judges to a collection of human translations. In future work, we plan to evaluate this method on other language pairs and to integrate it into a translation quality assurance system.
REFERENCES
Arora, S., Liang, Y., and Ma, T. (2017). A simple but tough-
to-beat baseline for sentence embeddings. In Proc. of
ICLR.
Artetxe, M., Labaka, G., and Agirre, E. (2017). Learning
bilingual word embeddings with (almost) no bilingual
data. In Proc. of ACL, pages 451–462.
Artetxe, M., Labaka, G., and Agirre, E. (2018). A robust
self-learning method for fully unsupervised cross-
lingual mappings of word embeddings. In Proc. of
ACL, pages 789–798.
Bär, D., Biemann, C., Gurevych, I., and Zesch, T. (2012). UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proc. of SemEval, pages 435–440.
Brychcín, T. (2018). Linear transformations for cross-lingual semantic textual similarity. arXiv:1807.04172.
Brychcín, T. and Svoboda, L. (2016). UWB at SemEval-2016 task 1: Semantic textual similarity using lexical, syntactic, and semantic information. In Proc. of SemEval, pages 588–594.
Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M.,
Ravindran, B., Raykar, V. C., and Saha, A. (2014).
An autoencoder approach to learning bilingual word
representations. In Proc. of NIPS, pages 1853–1861.
Chen, X. and Cardie, C. (2018). Unsupervised multilingual word embeddings. In Proc. of EMNLP, pages 261–270.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word translation without parallel data. arXiv:1710.04087v3.
Corley, C. and Mihalcea, R. (2005). Measuring the semantic
similarity of texts. In Proc. of ACL Workshop, pages
13–18.
Faruqui, M. and Dyer, C. (2014). Improving vector space
word representations using multilingual correlation.
In Proc. of EACL, pages 462–471.
Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. of ICML.
Gouws, S. and Søgaard, A. (2015). Simple task-specific
bilingual word embeddings. In Proc. of NAACL HLT,
pages 1386–1390.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and
Mikolov, T. (2018). Learning word vectors for 157
languages. In Proc. of LREC, pages 3483–3487.
Han, L. (2016). Machine translation evaluation resources
and methods: A survey. arXiv.
Harris, Z. S. (1954). Distributional structure. Word, pages
146–162.
Hermann, K. M. and Blunsom, P. (2014). Multilingual mod-
els for compositional distributed semantics. In Proc.
of ACL, pages 58–68.
Islam, A. and Inkpen, D. (2008). Semantic text similarity
using corpus-based word similarity and string similar-
ity. ACM Transactions on Knowledge Discovery from
Data (TKDD), 2(2):10.
Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015).
From word embeddings to document distances. In
Proc. of ICML, pages 957–966.
Lauly, S., Boulanger, A., and Larochelle, H. (2014). Learn-
ing multilingual word representations using a bag-of-
words autoencoder. arXiv.
Li, Y., McLean, D., Bandar, Z. A., Crockett, K., et al.
(2016). Sentence similarity based on semantic nets
and corpus statistics. IEEE Transactions on Knowl-
edge & Data Engineering, pages 1138–1150.
Luong, T., Pham, H., and Manning, C. D. (2015). Bilin-
gual word representations with monolingual quality in
mind. In Proc. of VSMNLP Workshop, pages 151–159.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. In Proc. of ICLR.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318.
Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, pages 99–121.
Ruder, S., Vulić, I., and Søgaard, A. (2017). A survey of cross-lingual word embedding models. arXiv.
Šarić, F., Glavaš, G., Karan, M., Šnajder, J., and Bašić, B. D. (2012). TakeLab: Systems for measuring semantic text similarity. In Proc. of SemEval, pages 441–448. Association for Computational Linguistics.
Sultan, M. A., Bethard, S., and Sumner, T. (2015). DLS@CU: Sentence similarity from word alignment and semantic vector composition. In Proc. of SemEval, pages 148–153.
Tian, J., Zhou, Z., Lan, M., and Wu, Y. (2017). ECNU at SemEval-2017 task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proc. of SemEval, pages 191–197.
Vulić, I. and Moens, M.-F. (2015). Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proc. of IJCNLP, pages 719–725.
Xu, R., Yang, Y., Otani, N., and Wu, Y. (2018). Un-
supervised cross-lingual transfer of word embedding
spaces. In Proc. of EMNLP, pages 2465–2474.
Zou, W. Y., Socher, R., Cer, D., and Manning, C. D. (2013).
Bilingual word embeddings for phrase-based machine
translation. In Proc. of EMNLP, pages 1393–1398.