Unsupervised Evaluation of Human Translation Quality
Yi Zhou and Danushka Bollegala
Department of Computer Science, University of Liverpool, U.K.
Keywords:
Translation Quality, Evaluation of Human Translations, Cross-lingual Word Embeddings, Word Mover’s
Distance, Bidirectional Minimum Word Mover’s Distance.
Abstract:
Even though machine translation (MT) systems have reached impressive performance in cross-lingual translation tasks, the quality of MT still falls far behind that of professional human translations (HTs) due to the complexity of natural languages, especially for terminology in different domains. Therefore, HTs are still widely demanded in practice. However, the quality of HTs is also imperfect and varies significantly depending on the experience and knowledge of the translators. Evaluating the quality of HTs automatically faces many challenges: although bilingual speakers are able to assess translation quality, manually checking the accuracy of translations is expensive and time-consuming. In this paper, we propose an unsupervised method to evaluate the quality of HTs without requiring any labelled data. We compare a range of methods for automatically grading HTs and observe the Bidirectional Minimum Word Mover's Distance (BiMWMD) to produce gradings that correlate well with humans.
1 INTRODUCTION
With the rapid development of international business
and multinational companies, there is an increasing
demand for translations of user manuals, contracts
and various other documents. Even though MT sys-
tems have shown promise for automatic translation,
they require large parallel corpora for training, which
might not be available for resource-poor language pairs such as Hindi and Sinhalese. In addition, the performance of MT systems is still far behind that of professional human translators due to the complex nature of grammar and word usage in languages. The quality of translations generated by MT also depends on the distance between the source and target languages (Han, 2016). For example, the quality of an En-
glish to French translation would be better than an En-
glish to Chinese translation, even though both trans-
lations are generated from the same MT system (Xu
et al., 2018). As a result, HTs are still widely used
across numerous industries.
A person’s first language, L1, refers to the na-
tive language of that person, whereas L2 is a second
language spoken by that person. HTs created by L2
speakers can be erroneous due to the differing levels of experience and knowledge of the translators. Of-
ten, the quality of translations provided by L2 speak-
ers must be manually verified by professional trans-
lators before they can be accepted. A good transla-
tion must demonstrate six properties: intelligibility,
fidelity, fluency, adequacy, comprehension, and infor-
mativeness (Han, 2016). However, manually verify-
ing these properties in an HT is both time-consuming
and costly. In this paper, we propose an unsupervised
method for evaluating the quality of HTs, which ad-
dresses this challenging problem.
Translation quality evaluation is a much more complicated task than it might appear at first glance. Papineni et al. (2002) proposed the bilingual evaluation understudy (BLEU) method to automatically evaluate the quality of MT. They take professional HTs as gold references and consider a better MT output to be one that is closer to these gold references. In contrast, HT quality evaluation must be done manually because such gold references are not available. People who are familiar with both the source and the target languages are required to evaluate the quality of HTs. The number of such bilingual speakers is limited, and they might not exist at all for rare language pairs. Moreover, human evaluation is time-consuming and not re-usable. Because MT quality evaluation requires a reference translation, MT evaluation measures such as BLEU cannot be used for the purpose of evaluating HTs.
In this paper, we model HT quality evaluation as
an unsupervised graph matching problem. Specifi-
cally, given a source sentence $S$ and its target translation $T$, we compare the similarity between the set of words $\{s_1, s_2, \ldots, s_n\}$ in $S$ against the set of words $\{t_1, t_2, \ldots, t_m\}$ in $T$, using different distance metrics
such as the Euclidean distance. In this work, we take advantage of cross-lingual word embeddings between different languages and present a novel approach to automatically evaluate the quality of HTs without access to gold references. Our work is inspired by the Word Mover's Distance (Kusner et al., 2015), which measures the distance between documents by minimising the cost of transferring embedded words from the source language to the target language. We emphasise that our goal in this paper is not to propose a novel MT method nor an evaluation metric for MT. Instead, we consider the problem of automatically detecting high/low quality human translations, without having any access to reference translations.
Specifically, we report and evaluate different methods for the purpose of unsupervised HT evaluation and compare them, using Spearman rank and Pearson correlations, against grades given by judges, who are professional translators, for the quality of the HTs. As shown in the experiments, the Bidirectional Minimum Word Mover's Distance (BiMWMD) has the strongest correlation with the human-assigned grades, indicating that this method is able to accurately detect low- and high-quality HTs without requiring any human supervision.
2 RELATED WORK
An HT can be compared against the source text using
similarity and distance metrics through cross-lingual
word embeddings. Cosine similarity and Euclidean
distance have been popularly used for this purpose.
Semantic Textual Similarity (STS) systems evaluate
the degree of semantic similarity between two sen-
tences. Most of the early work on sentence similarity computes it as the average of word similarities over the two sentences (Corley and Mihalcea, 2005; Li et al., 2016; Islam and Inkpen, 2008). At SemEval 2012, supervised systems that combine different similarity measures such as lexico-semantic, syntactic and string similarity using regression models were proposed (Bär et al., 2012; Šarić et al., 2012). Later, Sultan et al. (2015) proposed an unsupervised system based on word alignment. Brychcín and Svoboda (2016) and Tian et al. (2017) model semantic similarity for multilingual and cross-lingual sentence pairs by first translating sentences into the target language using MT, then applying monolingual STS models. To address the problem that human-annotated data is limited for resource-poor languages, Brychcín (2018) studied linear transformations that map monolingual word embeddings into a common space using a bilingual dictionary for cross-lingual semantic similarity.
The distributional hypothesis (Harris, 1954) states that words occurring in the same contexts tend to have similar meanings. Following this hypothesis, Mikolov et al. (2013) proposed the distributed Skip-gram and Continuous Bag-of-Words (CBOW) models to learn robust word embeddings from large amounts of unstructured text data. Recent research creating a shared vector space for words across two (bilingual word embeddings) (Artetxe et al., 2017; Chandar A P et al., 2014; Zou et al., 2013) or more (multilingual word embeddings) (Hermann and Blunsom, 2014; Lauly et al., 2014) languages is referred to as cross-lingual word embedding learning. Words from different languages with similar meanings should be close to each other in the shared embedding space (Chen and Cardie, 2018).
Cross-lingual word representations are obtained by training embeddings in different languages independently using monolingual corpora, and then mapping them into a common space through a transformation (Artetxe et al., 2018). Ruder et al. (2017) introduced three different types of alignment used in learning cross-lingual word embeddings: word alignment, sentence alignment and document alignment. Word alignment uses bilingual dictionaries with word-by-word translations to learn cross-lingual embeddings (Vulić and Moens, 2015). Sentence alignment requires a parallel corpus (Hermann and Blunsom, 2014; Gouws et al., 2015), which is a collection of texts in one language and the corresponding translations into one or more languages. Document alignment requires documents in a comparable corpus across different languages. A comparable corpus contains documents that are not exact parallel translations but convey the same information in different languages (Faruqui and Dyer, 2014; Gouws and Søgaard, 2015).
Several approaches for learning cross-lingual word embeddings have been proposed, requiring different types of alignment as supervision. Luong et al. (2015) present the bilingual Skip-gram model (BiSkip), which learns cross-lingual word embeddings using a parallel corpus (sentence alignment) and can be seen as an extension of the monolingual Skip-gram model. The Bilingual Compositional Model (BiCVM) proposed by Hermann and Blunsom (2014) also learns cross-lingual word embeddings through sentence alignment. The model leverages the fact that the representations of aligned sentences should be
similar. Therefore, semantics can be learned from parallel data. Vulić and Moens (2015) proposed a model to learn cross-lingual word embeddings from non-parallel data. They extend the skip-gram with negative sampling (SGNS) model and generate cross-lingual word embeddings from a comparable corpus (document alignment).
The method proposed by Kusner et al. (2015) for measuring the distance between two documents is known as the Word Mover's Distance (WMD). It defines the distance between documents as the minimal cost of transforming the words of one document into those of another. However, it considers the alignment of each source word to all of the target words when computing the cost of a translation, which is expensive. In this paper, we focus on sentence alignment and propose the Bidirectional Minimum Word Mover's Distance (BiMWMD) method, where we define the distance between documents as the cumulative cost of the minimal cost of transferring each source word to the corresponding target word. In addition, our proposed method takes into account the translation flow in both directions (i.e. from the source to the target and from the target to the source).
3 TRANSLATION QUALITY EVALUATION
Our goal is to propose a method that can accurately evaluate the quality of cross-lingual translations without human supervision. Most translation quality evaluation approaches are based on gold references, which are manually created perfect translations of a source language text into the target language. Our work considers the scenario where no such gold references are available.
Let us denote the source language by $S$ and the target language by $T$. For example, when translating from Japanese to English, $S$ will be Japanese and $T$ will be English. Consider the sets of words $V_S$ and $V_T$ in the source and target languages, respectively. A cross-lingual word embedding $\mathbf{w} \in \mathbb{R}^d$ of a word $w \in V_S \cup V_T$ is an embedding that is shared between both $S$ and $T$. As already described in Section 2, several methods have been proposed for training accurate cross-lingual word embeddings that we can use for this purpose. Here, we assume the availability of such a set of cross-lingual word embeddings for the source and target languages.
Let us consider a source language text $S = s_1, s_2, \ldots, s_n$, which is translated into the target language as $T = t_1, t_2, \ldots, t_m$, where $\mathbf{s}_i \in \mathbb{R}^d$ is the embedding of the $i$-th word in the source sentence and $\mathbf{t}_j \in \mathbb{R}^d$ is the embedding of the $j$-th word in the target sentence. Here, $n$ and $m$ are the numbers of words in the source and target texts, respectively. Source and target texts could be single or multiple sentences. The methods that we discuss in this paper for evaluating HT quality do not require any sentence-level processing and can be applied to either single sentences or paragraphs that contain multiple sentences.
3.1 Averaged Vector (AV)
Prior work on unsupervised sentence embeddings has found averaging the embeddings of the words in a sentence to be a simple but accurate method for creating sentence embeddings (Arora et al., 2017). Motivated by these prior proposals, we represent both source and target language texts by averaging the cross-lingual word embeddings of the words that appear in each of the texts. We call this the Averaged Vector (AV) method. Specifically, given a source language text $S = s_1, s_2, \ldots, s_n$ and its HT $T = t_1, t_2, \ldots, t_m$, we represent the two texts respectively by the embeddings $\bar{\mathbf{s}}, \bar{\mathbf{t}} \in \mathbb{R}^d$ given by (1) and (2).
$$\bar{\mathbf{s}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{s}_i \quad (1)$$

$$\bar{\mathbf{t}} = \frac{1}{m} \sum_{j=1}^{m} \mathbf{t}_j \quad (2)$$
After obtaining the vectors for the source and target sentences, the similarity between them can be computed as the cosine similarity between the two vectors $\bar{\mathbf{s}}$ and $\bar{\mathbf{t}}$, as given by (3).

$$\text{sim}(S, T) = \cos(\bar{\mathbf{s}}, \bar{\mathbf{t}}) = \frac{\bar{\mathbf{s}}^\top \bar{\mathbf{t}}}{\|\bar{\mathbf{s}}\| \, \|\bar{\mathbf{t}}\|} \quad (3)$$
Here, we consider the similarity between $S$ and $T$ as a proxy for the semantic agreement between the source text and its translation, thereby providing a measure of quality. In addition to the simple averaging of word embeddings given in (1) and (2), in our preliminary experiments we implemented tf-idf (term frequency inverse document frequency) weighting and SIF (smooth inverse frequency) (Arora et al., 2017) methods for creating sentence embeddings. However, for our task of comparing sentences written in different languages, we did not observe any significant improvements from using those weighting methods. Therefore, we decided to use the simple (unweighted) averaging given in (1) and (2).
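The AV method reduces to a few lines of linear algebra. The following is a minimal sketch, assuming the rows of each array hold the cross-lingual embeddings of the words in one text; the function name and the random toy data are our own illustration, not part of any released implementation.

```python
import numpy as np

def av_similarity(src_vecs, tgt_vecs):
    """AV method: cosine similarity between the mean word embeddings
    of the source and target texts (Equations (1)-(3))."""
    s_bar = src_vecs.mean(axis=0)  # Equation (1)
    t_bar = tgt_vecs.mean(axis=0)  # Equation (2)
    # Equation (3): cosine similarity of the two mean vectors.
    return float(s_bar @ t_bar /
                 (np.linalg.norm(s_bar) * np.linalg.norm(t_bar)))

# Toy usage with random stand-ins for d = 300 cross-lingual embeddings.
rng = np.random.default_rng(0)
print(av_similarity(rng.normal(size=(6, 300)), rng.normal(size=(4, 300))))
```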
3.2 Source-centred Maximum Similarity (SMS)
The AV method described in Section 3.1 can be seen as comparing each word $s_i$ in the source text against every word $t_j$ in the target text. Moreover, it is symmetric in the sense that if we swapped the source and target texts, it would return the same similarity score. However, not all words in the source text are related to all the words in the target text. On the contrary, one word in the source text is often related to only a few words in the target translation. Therefore, we should compare each source word against the most related word in the target translation. For this purpose, we modify the AV method and propose the Source-centred Maximum Similarity (SMS) method, described next.

First, we compute the cosine similarity of each embedded word $\mathbf{s}_i$ in the source text against all the embedded words $\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_m$ in the target text. Next, the maximum similarity score between $\mathbf{s}_i$ and any of $\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_m$ is taken as the score for transferring $s_i$ from the source to the target. Finally, we report the average over all the maximal scores as the similarity between $S$ and $T$, as given by (4).
$$\text{sim}(S, T) = \frac{1}{n} \sum_{i=1}^{n} \max_{j=1,\ldots,m} \cos(\mathbf{s}_i, \mathbf{t}_j) \quad (4)$$
3.3 Target-centred Maximum Similarity (TMS)
TMS is the opposite of the SMS method and is target-centred. This method calculates the cosine similarity of each embedded target word $\mathbf{t}_j$ against all the embedded source words $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n$. Then, the maximal similarity score is taken as the score of translating each $t_j$ back to a word in the source text. Finally, we take the average over all the maximal similarity scores of the target words, as given by (5).
$$\text{sim}(S, T) = \frac{1}{m} \sum_{j=1}^{m} \max_{i=1,\ldots,n} \cos(\mathbf{s}_i, \mathbf{t}_j) \quad (5)$$
3.4 Word Mover’s Distance (WMD)
WMD is a measure of the distance between documents proposed by Kusner et al. (2015), inspired by the Earth Mover's Distance (EMD) (Rubner et al., 2000). WMD can be used to measure the dissimilarity between two text documents. Specifically, it measures the minimum cost that has to be paid for transforming the words of a source text S into the words of a target text T. Using this metric, we can estimate the similarity between a source document and a target document even when they share no common words.
Let us assume that the two text documents are represented as normalised bag-of-words vectors, and that the $j$-th target word $t_j$ appears $h(t_j)$ times in the target text $T$. The normalised frequency $f(t_j)$ of $t_j$ is then given by (6).

$$f(t_j) = \frac{h(t_j)}{\sum_{j'=1}^{m} h(t_{j'})} \quad (6)$$

Likewise, the normalised frequency $f(s_i)$ of a word $s_i$ in the source text $S$ is given by (7).

$$f(s_i) = \frac{h(s_i)}{\sum_{i'=1}^{n} h(s_{i'})} \quad (7)$$
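A minimal sketch of (6) and (7), assuming the text has already been tokenised into a list of words; the helper name is our own.

```python
from collections import Counter

def normalised_frequencies(words):
    """Normalised bag-of-words frequencies, as in Equations (6) and (7)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: h / total for w, h in counts.items()}

# e.g. normalised_frequencies(["i", "like", "school"]) maps each word to 1/3.
```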
Then, the transportation problem can be formally defined as the minimum cumulative cost of moving words from $S$ to $T$ under the constraints specified in the following linear programme (LP):

$$\text{minimise} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, c(i, j) \quad (8)$$

subject to:

$$\sum_{j=1}^{m} T_{ij} = f(s_i), \; \forall i \in \{1, \ldots, n\} \quad (9)$$

$$\sum_{i=1}^{n} T_{ij} = f(t_j), \; \forall j \in \{1, \ldots, m\} \quad (10)$$

$$\mathbf{T} \geq 0 \quad (11)$$
Here, $\mathbf{T} \in \mathbb{R}^{n \times m}$ is a nonnegative flow matrix that is learnt by the LP, and $c(i, j)$ is the cost of translating (transforming) the word $s_i$ into $t_j$. We measure this translation cost as the Euclidean distance between the embeddings of $s_i$ and $t_j$, as given by (12).

$$c(i, j) = \|\mathbf{s}_i - \mathbf{t}_j\|_2 \quad (12)$$

Intuitively, if $c(i, j)$ is high for translating $s_i$ into $t_j$, then the $(i, j)$ element $T_{ij}$ of $\mathbf{T}$ can be set to a small (possibly zero) value to reduce the objective given by (8). The equality constraints given in (9) and (10) specify the row and column sums of $\mathbf{T}$, respectively. In other words, these equality constraints ensure that the total weight transferred out of each source word and into each target word equals the corresponding normalised frequency. Note that each source word $s_i$ is allowed to match one or more target words $t_j$ under these constraints.
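The LP in (8)-(11) can be handed to any off-the-shelf solver. The sketch below flattens $\mathbf{T}$ row-major and uses scipy's linprog with the HiGHS backend for brevity (the experiments in Section 4.4 use an interior-point method); it is an illustration of the formulation, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(src_vecs, tgt_vecs, f_s, f_t):
    """Word Mover's Distance: solve the LP of Equations (8)-(11).
    f_s and f_t are the normalised frequencies of Equations (6)-(7)."""
    n, m = len(src_vecs), len(tgt_vecs)
    # c(i, j) = ||s_i - t_j||_2, flattened row-major to match vec(T).
    cost = np.linalg.norm(
        src_vecs[:, None, :] - tgt_vecs[None, :, :], axis=2).ravel()
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):              # Equation (9): sum_j T_ij = f(s_i)
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):              # Equation (10): sum_i T_ij = f(t_j)
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([f_s, f_t])
    # linprog's default variable bounds (0, None) encode (11): T >= 0.
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.fun  # minimal cumulative transport cost
```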
Figure 1: Translating a word in a Japanese source ($S$) text into an English target ($T$) text. The perfect alignment between $S$ and $T$ is $s_1 \to$ I, $s_2 \to$ null, $s_3 \to$ school, $s_4 \to$ null, $s_5 \to$ like, and $s_6 \to$ null. The thin arrow represents the minimum-cost translation to I. Correct translations are likely to have smaller distances (costs) associated with them.
3.5 Bidirectional Minimum WMD (BiMWMD)
The WMD method described in Section 3.4 is symmetric in the sense that even if we swap the source and target texts, we obtain the same score for the translation quality. On the other hand, the SMS and TMS methods described respectively in Sections 3.2 and 3.3 are both asymmetric translation quality evaluation methods. Following the SMS and TMS methods, we extend WMD such that it considers the translation quality from the point of view of the source text, which we refer to as the Source-centred Minimum WMD (SMWMD), and from the point of view of the target text, which we refer to as the Target-centred Minimum WMD (TMWMD). Finally, we combine the two extensions and propose the Bidirectional Minimum WMD (BiMWMD), which evaluates the translation quality from both points of view. Next, we describe the SMWMD, TMWMD and BiMWMD methods.
SMWMD: The Source-centred Minimum WMD (SMWMD) method considers the translation flow to be from the source sentence to the target sentence. Figure 1 shows an example of measuring the distance from a source text $S$ to a target text $T$. In SMWMD, we measure the minimum cost of translating each source word $s_i$ to any word in $T$, and consider the sum of these costs as the objective function for the LP. As in WMD, we denote by $T_{ij} \geq 0$ the flow for translating $s_i$ into $t_j$ according to the cost $c(i, j)$ given by (12). Because we found in our experiments that the normalised frequencies $f(s_i)$ and $f(t_j)$ have little effect on the results, we set both frequencies to 1 to simplify the objective function. Then, the optimisation problem can be written as follows:
$$\text{minimise} \sum_{i=1}^{n} \min_{j=1,\ldots,m} T_{ij} \, c(i, j) \quad (13)$$

subject to:

$$\sum_{j=1}^{m} T_{ij} = 1, \; \forall i \in \{1, \ldots, n\} \quad (14)$$

$$\sum_{i=1}^{n} T_{ij} = 1, \; \forall j \in \{1, \ldots, m\} \quad (15)$$

$$\mathbf{T} \geq 0 \quad (16)$$
To simplify the objective function in (13), we use $y_i$ to replace $T_{ij} \, c(i, j)$, where $T_{ij} \, c(i, j)$ is the actual cost of transforming a word from one document into another and $y_i$ is an upper bound on $T_{ij} \, c(i, j)$. Let us denote the actual objective by $TC$, given by (17), and its upper bound by $Y$, given by (18).

$$TC(S, T) = \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, c(i, j) \quad (17)$$

$$Y(S, T) = \sum_{i=1}^{n} y_i \quad (18)$$
Using $y_i$, we can rewrite the previous optimisation problem as an LP as follows:

$$\text{minimise} \sum_{i=1}^{n} y_i \quad (19)$$

subject to:

$$T_{ij} \, c(i, j) \leq y_i \quad (20)$$

$$\sum_{j=1}^{m} T_{ij} = 1, \; \forall i \in \{1, \ldots, n\} \quad (21)$$

$$\sum_{i=1}^{n} T_{ij} = 1, \; \forall j \in \{1, \ldots, m\} \quad (22)$$

$$\mathbf{T} \geq 0 \quad (23)$$
We collectively denote the minimum translation cost for translating a source text $S$ into a target text $T$, obtained by solving the LP above, as $\text{SMWMD}(S, T)$, which can be either $TC(S, T)$ or $Y(S, T)$. In our experiments, we study the difference between the actual objective ($TC$) and its upper bound ($Y$) for the purpose of predicting the quality of HTs.
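The reformulated LP in (19)-(23) can also be written down directly for a solver. The sketch below stacks $\text{vec}(\mathbf{T})$ and the auxiliary variables $y$ into one variable vector and, following the observation in Section 4.4 that imposing (21) and (22) together often makes the LP infeasible, keeps only the column stochasticity constraints (22); it is a sketch of one configuration, not the exact experimental code.

```python
import numpy as np
from scipy.optimize import linprog

def smwmd(src_vecs, tgt_vecs):
    """Source-centred minimum WMD: LP of Equations (19)-(23), with
    column stochasticity only (the Y+Column setting of Table 2)."""
    n, m = len(src_vecs), len(tgt_vecs)
    cost = np.linalg.norm(
        src_vecs[:, None, :] - tgt_vecs[None, :, :], axis=2)
    nT = n * m                     # variable vector x = [vec(T), y_1..y_n]
    c = np.concatenate([np.zeros(nT), np.ones(n)])   # Equation (19)
    A_ub = np.zeros((nT, nT + n))  # Equation (20): T_ij c(i,j) - y_i <= 0
    for i in range(n):
        for j in range(m):
            k = i * m + j
            A_ub[k, k] = cost[i, j]
            A_ub[k, nT + i] = -1.0
    b_ub = np.zeros(nT)
    A_eq = np.zeros((m, nT + n))   # Equation (22): sum_i T_ij = 1
    for j in range(m):
        A_eq[j, j:nT:m] = 1.0
    b_eq = np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  method="highs")
    return res.fun  # the upper bound Y(S, T) of Equation (18)
```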
TMWMD: An accurate translation of a given source text must not only correctly translate the information contained in the source text, but must also not add new information that was not present in the original source text. One simple way to verify this is to back-translate the target text to the source and measure their semantic distance. For this purpose, we modify the WMD objective in the same manner as for SMWMD, but pivoting on the target text instead of the source text. We refer to this approach as the Target-centred Minimum WMD (TMWMD).
Figure 2: Translating a word in a Japanese source ($S$) text into an English target ($T$) text. The perfect alignment between $S$ and $T$ is $s_1 \to$ I, $s_2 \to$ null, $s_3 \to$ school, $s_4 \to$ null, $s_5 \to$ like, and $s_6 \to$ null. The light arrow represents the minimum-cost alignment.
Specifically, we define the distance between $S$ and $T$ as the minimal cumulative distance required to move all words from the target text $T$ to the source text $S$. An example is given in Figure 2, where the target word I is compared against all the words in the source (indicated by arrows) and the closest Japanese translation $s_1$ is indicated by a thin arrow. TMWMD is the solution to the following LP:
TMWMD is the solution to the following LP:
minimise
m
X
j=1
y
j
(24)
subject to: T
ij
c
(
i, j) y
j
(25)
m
X
j=1
T
ij
= 1, i {1, . . . , n} (26)
n
X
i=1
T
ij
= 1, j {1 . . . , m} (27)
T 0 (28)
Note that TMWMD is the mirror image of SMWMD
in the sense that by swapping S and T we can obtain
the LP for SMWMD.
Like SMWMD, TMWMD can be computed using either the actual objective ($TC(S, T)$) or the upper bound ($Y(S, T)$). We collectively denote these two variants as $\text{TMWMD}(S, T)$.
BiMWMD: SMWMD and TMWMD evaluate the translation quality in one direction only. If the translation costs from source to target and from target to source are both small, this indicates a higher-quality translation. To quantitatively capture this idea, we propose the Bidirectional Minimum Word Mover's Distance (BiMWMD) as a translation quality evaluation measure. BiMWMD is defined by (29) and is the sum of the optimal translation costs returned individually by SMWMD and TMWMD.

$$\text{BiMWMD}(S, T) = \text{SMWMD}(S, T) + \text{TMWMD}(S, T) \quad (29)$$

From the definition (29), it follows that BiMWMD is a symmetric translation quality measure, similar to WMD. However, BiMWMD and WMD solve different LPs, hence returning different translation quality predictions. Specifically, WMD returns as its objective the minimal cumulative cost of translating each word in the source text $S$ to all the words in the target text $T$. On the other hand, BiMWMD solves two independent LPs, each considering only a single direction (SMWMD considers translating from $S$ to $T$, whereas TMWMD considers translating from $T$ to $S$). As we later see in Section 4.4, BiMWMD shows a higher degree of correlation with human ratings of translation quality than WMD.
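Given the directional solver sketched above and the symmetry of the Euclidean cost, BiMWMD follows directly: swapping the arguments of the smwmd sketch yields TMWMD, as noted above for the mirror-image LPs.

```python
def bimwmd(src_vecs, tgt_vecs):
    """Equation (29): sum of the two directional minimum WMD costs.
    Since c(i, j) is symmetric, TMWMD(S, T) equals SMWMD(T, S)."""
    return smwmd(src_vecs, tgt_vecs) + smwmd(tgt_vecs, src_vecs)
```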
4 EXPERIMENTS
In this section, we evaluate the different translation quality measures described in Section 3. For this purpose, we annotated a translation dataset as described in Section 4.1 and use correlation against human grades as the evaluation criterion. Experimental results are discussed in Section 4.4.
4.1 Dataset
To evaluate the different translation quality measures described in Section 3, we created a dataset by selecting 1030 sentences from Japanese user manuals on digital cameras. We then asked a group of 50 human translators to translate the selected Japanese sentences into English. The human translators are all native Japanese speakers who have studied English as a foreign language, and were recruited using a crowd-sourcing platform that is operational in Japan. They have different levels of experience in translating technical documents, ranging broadly from very experienced translators to beginners. We believe this gives us a broad spectrum of human translations for evaluation purposes. Each Japanese sentence was assigned to one translator from the pool, who was asked to write a single English translation.
Next, we randomly selected 130 Japanese-English translated sentence pairs and asked four humans, who are bilingual speakers of Japanese and English and professionally qualified translators with over 10 years of experience in translating technical documents, to rate the quality of each of the selected translation pairs. We refer to these four humans as judges, to distinguish them from the pool of human translators who wrote the translations. Specifically, we asked each of the four judges to grade a translated sentence pair by assigning one of the following four grades.

Grade 1 Quality Translations: A perfect translation. No further modifications are required. The translation pair is scored in the range 0.76–1.00.

Grade 2 Quality Translations: A good translation. Some words are incorrectly translated, but the overall meaning can be understood. The translation pair is scored in the range 0.51–0.75.

Grade 3 Quality Translations: A bad translation. There are more incorrectly translated words than correctly translated words. The translation pair is scored in the range 0.26–0.50.

Grade 4 Quality Translations: Requires re-translation. The translation cannot be comprehended or conveys a significantly different meaning from the source sentence. The translation pair is scored in the range 0.00–0.25.

The average of the grades assigned by the four judges to a translated sentence pair is taken as its final grade.
4.2 Cross-lingual Word Embeddings
All of the translation quality measures we proposed
in Section 3 require cross-lingual word embeddings.
To create cross-lingual word embeddings between Japanese and English in an unsupervised manner, we align publicly available monolingual word embeddings. Specifically, we start from the monolingual word embeddings trained on Wikipedia and Common Crawl using fastText (Grave et al., 2018). Because our dataset contains Japanese and English words, we use two separate monolingual word embedding sets, one for Japanese and one for English.
Next, we use the unsupervised adversarial training method proposed by Conneau et al. (2017) and implemented in MUSE (https://github.com/facebookresearch/MUSE) to align the Japanese and English word embedding spaces, without requiring any bilingual dictionaries or parallel/comparable corpora. Although it is possible to further improve the performance of this cross-lingual alignment using bilingual lexical resources, by not depending on any such resources we are able to realistically estimate the performance of the different methods proposed in Section 3 when such resources are not available.
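Once MUSE has learned a linear mapping between the two monolingual spaces, producing the shared space amounts to a single matrix multiplication. The sketch below assumes the mapping and the fastText vectors have been exported to NumPy arrays beforehand; the file names are placeholders of our own, not files shipped by MUSE.

```python
import numpy as np

# Hypothetical exports: a (300, 300) alignment matrix learned by MUSE
# and the monolingual fastText vectors for each language.
W = np.load("ja_to_en_mapping.npy")
ja_emb = np.load("fasttext_ja_vectors.npy")   # (|V_S|, 300) Japanese vectors
en_emb = np.load("fasttext_en_vectors.npy")   # (|V_T|, 300) English vectors

# Mapping the Japanese vectors into the English space yields a single
# cross-lingual space in which vectors of both languages can be compared.
ja_aligned = ja_emb @ W.T
```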
4.3 Evaluation Measures
Recall that our goal in this work is to predict the quality of human translations without having access to any reference translations. Therefore, we would like to know whether the translation quality scores returned by the different methods proposed in Section 3 correlate with the grades given by the human judges to the human translations in the dataset created in Section 4.1. To evaluate the level of agreement between the grades and the translation quality scores, we compute the Pearson and Spearman rank correlation coefficients between these two sets of numbers. The Pearson correlation coefficient measures the linear relationship between two variables, whereas the Spearman correlation coefficient considers only their relative ordering.
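Both coefficients are available in scipy. A small sketch with toy numbers (our own, not drawn from the dataset), including the distance-to-similarity conversion used in Section 4.4:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

judge_grades = np.array([0.9, 0.3, 0.7, 0.1])   # toy averaged judge grades
distances = np.array([2.1, 6.0, 3.0, 7.5])      # toy BiMWMD distances

# Convert distances to similarities: 1 - distance / maximum distance.
similarities = 1.0 - distances / distances.max()

print("Pearson: ", pearsonr(judge_grades, similarities)[0])
print("Spearman:", spearmanr(judge_grades, similarities)[0])
```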
4.4 Results
In Table 1, we compare the different HT quality evaluation measures described in Section 3. Recall that some methods return similarity scores (AV, SMS, TMS), whereas others return distances (WMD, SMWMD, TMWMD, BiMWMD). To compare similarities and distances on an equal footing, we convert the distances of each method into similarities by computing

$$1 - \frac{\text{distance}}{\text{maximum distance}}.$$
We use the interior-point method to solve the LPs in all cases. A higher degree of correlation with the grades assigned by the judges indicates a more reliable quality prediction measure. From Table 1, we see that averaging the word embeddings to create text/sentence embeddings and then measuring their cosine similarity (AV) provides a low level of correlation. Comparing the SMS and TMS methods, we see that centring on the target provides a higher degree of correlation than centring on the source. A similar trend can be observed when comparing SMWMD and TMWMD. In fact, SMWMD returns negative values for both the Spearman and Pearson correlation coefficients. BiMWMD returns the best correlation scores against the judges' grades among all the methods compared in Table 1. This result shows that it is important to consider both directions of a translation to more accurately estimate the quality of a human translation.
To study the effect of the various parameters and settings associated with the BiMWMD method, we evaluate it under different configurations. Specifically, to analyse the effect of normalising the word embeddings, we consider three settings: ℓ1 normalisation, ℓ2 normalisation and no normalisation (No).
Figure 3: Scores assigned by the BiMWMD method to translation pairs graded by the human judges. Both the BiMWMD scores and the judges' grades are scaled to the [0, 1] range for ease of comparison.
Table 1: Performance of the different HT quality evaluation methods. The best result in each column is achieved by BiMWMD.

Method     Spearman   Pearson
AV           0.2628    0.3076
SMS          0.1505    0.3224
TMS          0.4576    0.4851
WMD          0.3953    0.5003
SMWMD       -0.3928   -0.2328
TMWMD        0.4164    0.4199
BiMWMD       0.5895    0.5296
To decide between the actual objective $TC(S, T)$ (given by (17)) and its upper bound $Y(S, T)$ (given by (18)), we consider each of the two values separately as the value returned by BiMWMD and measure its correlation against the judges' grades. The row and column stochasticity constraints add a large number of equality constraints to the LPs, and adding both at the same time often makes the LP infeasible. To relax the constraints and to empirically study the significance of the row and column stochasticity constraints, we run BiMWMD with row stochasticity constraints only (denoted by Row) vs. column stochasticity constraints only (denoted by Column). All possible combinations of these configurations are evaluated in Table 2.
From Table 2, we see that the best performance is obtained with ℓ2-normalised cross-lingual word embeddings. We also see that the column stochasticity constraints are more important than the row stochasticity constraints.
Table 2: Different settings for the BiMWMD method. Normalisation of word embeddings: ℓ1, ℓ2 or unnormalised (No); Row and Column denote using only row or only column stochasticity constraints in the LP. The value of BiMWMD is either the actual objective (TC) or its upper bound (Y). The best results are obtained with the ℓ2+Y+Column setting.

Method          Spearman   Pearson
ℓ2+Y+Row          0.0033    0.1764
ℓ2+Y+Column       0.5895    0.5296
ℓ2+TC+Row        -0.2510   -0.0767
ℓ2+TC+Column     -0.2510   -0.0711
ℓ1+Y+Row          0.3136    0.1834
ℓ1+Y+Column       0.5721    0.5125
ℓ1+TC+Row        -0.2510   -0.0730
ℓ1+TC+Column     -0.2510   -0.0687
No+Y+Row         -0.0728    0.0284
No+Y+Column       0.5146    0.4878
No+TC+Row        -0.2073   -0.0395
No+TC+Column     -0.2053   -0.0414
Moreover, using the value of the upper bound $Y(S, T)$ as BiMWMD is more accurate than using the actual objective $TC(S, T)$. Recall that the flow matrix $\mathbf{T}$ has $nm$ parameters, and this number grows with the lengths of the source and target texts. It is therefore possible to set most of those $nm$ elements to zero to minimise the actual objective while still satisfying the inequality $T_{ij} \, c(i, j) \leq y_j$ in the LP. Consequently, the sum of upper bounds $\sum_j y_j$, which is the objective minimised by the reformulated LP, is a better proxy for BiMWMD.
A good measure for predicting the quality of HTs must be able to distinguish low-quality HTs from high-quality ones. If we can automatically decide that a particular HT is of low quality without another human having to read it, then it is possible to prioritise such low-quality HTs for retranslation or for verification by a human in charge of quality control. This is particularly useful when we have a large number of translations to verify and would like to check the ones that are most likely to be incorrect. To understand the scores assigned by BiMWMD to translations of different grades, in Figure 3 we randomly select translation pairs with different grades and show the scores predicted by BiMWMD, the best-performing method among those proposed in Section 3. We see that the BiMWMD method assigns high scores to translations that are also rated as high quality by the human judges, whereas it assigns low scores to translations that the judges consider to be of low quality.
5 CONCLUSION
We proposed different methods for automatically predicting the quality of human translations, without access to any gold standard reference translations. In particular, we proposed a broad range of measures covering both symmetric and asymmetric variants. Our experimental results show that the Bidirectional Minimum Word Mover's Distance (BiMWMD) method in particular demonstrates a high degree of correlation with the grades assigned by a group of judges to a collection of human translations. In future work, we plan to evaluate this method on other language pairs and to integrate it into a translation quality assurance system.
REFERENCES
Arora, S., Liang, Y., and Ma, T. (2017). A simple but tough-
to-beat baseline for sentence embeddings. In Proc. of
ICLR.
Artetxe, M., Labaka, G., and Agirre, E. (2017). Learning
bilingual word embeddings with (almost) no bilingual
data. In Proc. of ACL, pages 451–462.
Artetxe, M., Labaka, G., and Agirre, E. (2018). A robust
self-learning method for fully unsupervised cross-
lingual mappings of word embeddings. In Proc. of
ACL, pages 789–798.
Bär, D., Biemann, C., Gurevych, I., and Zesch, T. (2012). UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proc. of SemEval, pages 435–440.
Brychcín, T. (2018). Linear transformations for cross-lingual semantic textual similarity. arXiv:1807.04172.
Brychcín, T. and Svoboda, L. (2016). UWB at SemEval-2016 task 1: Semantic textual similarity using lexical, syntactic, and semantic information. In Proc. of SemEval, pages 588–594.
Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M.,
Ravindran, B., Raykar, V. C., and Saha, A. (2014).
An autoencoder approach to learning bilingual word
representations. In Proc. of NIPS, pages 1853–1861.
Chen, X. and Cardie, C. (2018). Unsupervised multilingual word embeddings. In Proc. of EMNLP, pages 261–270.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word translation without parallel data. arXiv:1710.04087v3.
Corley, C. and Mihalcea, R. (2005). Measuring the semantic
similarity of texts. In Proc. of ACL Workshop, pages
13–18.
Faruqui, M. and Dyer, C. (2014). Improving vector space
word representations using multilingual correlation.
In Proc. of EACL, pages 462–471.
Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. of ICML.
Gouws, S. and Søgaard, A. (2015). Simple task-specific
bilingual word embeddings. In Proc. of NAACL HLT,
pages 1386–1390.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and
Mikolov, T. (2018). Learning word vectors for 157
languages. In Proc. of LREC, pages 3483–3487.
Han, L. (2016). Machine translation evaluation resources
and methods: A survey. arXiv.
Harris, Z. S. (1954). Distributional structure. Word, pages
146–162.
Hermann, K. M. and Blunsom, P. (2014). Multilingual mod-
els for compositional distributed semantics. In Proc.
of ACL, pages 58–68.
Islam, A. and Inkpen, D. (2008). Semantic text similarity
using corpus-based word similarity and string similar-
ity. ACM Transactions on Knowledge Discovery from
Data (TKDD), 2(2):10.
Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015).
From word embeddings to document distances. In
Proc. of ICML, pages 957–966.
Lauly, S., Boulanger, A., and Larochelle, H. (2014). Learn-
ing multilingual word representations using a bag-of-
words autoencoder. arXiv.
Li, Y., McLean, D., Bandar, Z. A., Crockett, K., et al.
(2016). Sentence similarity based on semantic nets
and corpus statistics. IEEE Transactions on Knowl-
edge & Data Engineering, pages 1138–1150.
Luong, T., Pham, H., and Manning, C. D. (2015). Bilin-
gual word representations with monolingual quality in
mind. In Proc. of VSMNLP Workshop, pages 151–159.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. In Proc. of ICLR.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318.
Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, pages 99–121.
Ruder, S., Vulić, I., and Søgaard, A. (2017). A survey of cross-lingual word embedding models. arXiv.
Šarić, F., Glavaš, G., Karan, M., Šnajder, J., and Bašić, B. D. (2012). TakeLab: Systems for measuring semantic text similarity. In Proc. of SemEval, pages 441–448. Association for Computational Linguistics.
Sultan, M. A., Bethard, S., and Sumner, T. (2015). DLS@CU: Sentence similarity from word alignment and semantic vector composition. In Proc. of SemEval, pages 148–153.
Tian, J., Zhou, Z., Lan, M., and Wu, Y. (2017). ECNU at SemEval-2017 task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proc. of SemEval, pages 191–197.
Vulić, I. and Moens, M.-F. (2015). Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proc. of IJCNLP, pages 719–725.
Xu, R., Yang, Y., Otani, N., and Wu, Y. (2018). Un-
supervised cross-lingual transfer of word embedding
spaces. In Proc. of EMNLP, pages 2465–2474.
Zou, W. Y., Socher, R., Cer, D., and Manning, C. D. (2013).
Bilingual word embeddings for phrase-based machine
translation. In Proc. of EMNLP, pages 1393–1398.