Enhancing Bilingual Lexicon Induction with Dynamic Translation

Michaela Denisová¹ (https://orcid.org/0009-0001-8402-504X) and Pavel Rychlý¹,² (https://orcid.org/0000-0001-5097-4610)

¹ Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
² Lexical Computing, Brno, Czech Republic
Keywords:
Cross-Lingual Embedding Models, Parameter K, Classification Neural Network.
Abstract:
Bilingual lexicon induction (BLI) has been a popular task for evaluating cross-lingual word embeddings
(CWEs). The prevalent metric employed in the evaluation is precision at k, where k represents the num-
ber of target words retrieved for each source word. However, establishing a fixed k for the entire evaluation
dataset proves challenging due to varying target word counts for each source word. This leads to limited re-
sults, compromising either precision or recall. In this paper, we present a novel classification-based approach
with dynamic k for bilingual lexicon induction that aims to identify all relevant target words for each source
word by exploiting the information derived from the aligned embeddings while offering a balanced trade-off
between precision and recall. On top of that, it enables the evaluation of the existing CWEs using dynamic k.
Compared to the standard baseline systems and evaluation procedures, it provides competitive results.
1 INTRODUCTION
Retrieving translations of individual words is an in-
trinsic evaluation task referred to as bilingual lexicon
induction (BLI). This task has been a commonly used
method for evaluating cross-lingual word embeddings
(CWEs), which aim to align two (or more) sets of
individually trained monolingual word embeddings
(MWEs) into a shared cross-lingual space where sim-
ilar words are represented by similar vectors (Ruder
et al., 2019).
Owing to this property, they have shown to be ben-
eficial in many NLP applications, e.g., machine trans-
lation (Artetxe et al., 2018c; Duan et al., 2020; Zhou
et al., 2021; Wang et al., 2022), cross-lingual information retrieval (Vulić and Moens, 2015), and language acquisition and learning (Yuan et al., 2020).
In the BLI task, the method aims to generate a list
of target words for each source word, ranking them
based on the cosine similarities between their respec-
tive embeddings. Afterwards, top-k target words for
each source word are selected, and the word pairs
are compared to the evaluation dataset (Ruder et al.,
2019).
However, the k is not determined by the method
and often is set by the evaluation metrics or derived
from the evaluation data. This limitation makes the
approach less reflective of real-world translation scenarios, where the number of target words corresponding to a source word cannot be known in advance and must be determined by the model to successfully fulfil the BLI task's objective.
In many papers, the preferred metric is precision
at k (P@k), where k is fixed, usually k = {1, 5, 10}
(e.g., (Mikolov et al., 2013; Conneau et al., 2017; Li
et al., 2022; Tian et al., 2022)). What the papers actu-
ally report is HitRatio@k, where HitRatio@1 = P@1 and P@k_1 > P@k_2 as long as k_1 > k_2 (Conneau et al., 2017). This is problematic for two reasons.
Firstly, the majority of the source words are likely
to have more than one target word, and the number of
target words differs for each source word. For exam-
ple, the most popular evaluation datasets, MUSE (Conneau et al., 2017), consist of word-to-many lists: the English-French evaluation dataset contains 2943 word pairs, of which only 1.5K are unique English words. As Table 1 shows, the English source words exhibit various numbers of target words (e.g., compact - compact, compacte, compactes, compacts, compresser, pacte; admit - admet, admets, admettre; subway - métro).
Secondly, since HitRatio@k assumes that every
source word has only one target word, in cases where
k > 1, the metric may yield results exceeding 100%,
which leads to distortions in the results.
Table 1: The number of target words (TGW) in four MUSE evaluation datasets and the dynamic k values (NN k) predicted by the VM-S+NN model trained on English-Spanish.
TGW en-fr en-cs en-ko en-es NN k
1 698 820 1085 663 801
2 409 398 310 435 465
3 210 207 63 211 175
4 123 61 7 146 52
5 55 13 0 45 6
6+ 5 1 0 0 1
While prior work suggested replacing P@k with Mean Average Precision (MAP) to address this issue, they evaluated their models with one-to-one datasets (Glavaš et al., 2019). We argue that in a real-language scenario, a source word is unlikely to have only one target word, and even less frequent words bear multiple meanings. For example, specific domain-related words also occur in regular texts (string - a sequence of characters vs a piece of rope, a series of events, etc.).
Another attempt to advocate the MAP metric ap-
peared at the BUCC 2022 conference (Adjali et al.,
2022; Laville et al., 2022). While MAP is a valuable
metric for assessing a model based on the ranking of
target words and the quality of the embeddings’ align-
ment, it fails to consider the parameter k, which is set
in advance according to the evaluation dataset.
To relax the constraint of having a fixed
number of target words, dynamic translation was in-
troduced in the shared task of the BUCC 2020 con-
ference together with alternative evaluation metrics,
such as recall and F1 scores (Rapp et al., 2020). While
the participants advocated computing a threshold for
similarity scores between the source and target word
embeddings instead, they had to tailor it for each lan-
guage pair individually.
Other existing works propose classification-
based approaches (Heyman et al., 2017; Severini
et al., 2020a). In line with the previous research,
framing the BLI task as a classification problem not
only allows for dynamic k but also leads to additional
improvements in the models’ performance (Irvine and
Callison-Burch, 2017; Karan et al., 2020). However,
these methods suffer from computational inefficiencies, applying a deep neural network to each word pair being classified.
Motivated by these insights, we implement a
novel, simple classification-based approach, allowing
for a dynamic k while exploiting various features de-
rived from the aligned embeddings. The aim is to
identify as many relevant target words as possible for
each source word and report P, recall, and F1 scores
not constrained by a predefined set of k nearest neigh-
bours while balancing P and recall and maintaining
high performance.
Different from previous endeavours introducing
classification-based approaches (Heyman et al., 2017;
Severini et al., 2020a) and similar approaches es-
tablishing dynamic k and alternative evaluation met-
rics (Rapp et al., 2020), our method is more straight-
forward to implement and more computationally ef-
ficient, as we demonstrate in this paper. It also provides a solution for existing CWEs to relax the constraint of having a fixed k, making them comparable with methods using dynamic k and improving their performance.
We evaluate our approach on the widely used eval-
uation datasets MUSE for various languages: English
(en) to German (de), French (fr), Spanish (es), Rus-
sian (ru), Czech (cs), Dutch (nl), Finnish (fi), and Ko-
rean (ko), and on manually annotated data for Esto-
nian (et) to Slovak (sk).
Our contribution is manifold.
1. We present a classification-based framework for
bilingual lexicon induction that dynamically de-
termines k, addressing the limitations of fixed k in
traditional methods.
2. We propose a new solution that enables a more
accurate evaluation of existing CWEs by balanc-
ing P and recall without predefining the number
of nearest neighbours.
3. We provide a rigorous evaluation across di-
verse language pairs, demonstrating consistent
improvements over state-of-the-art baselines.
4. To encourage reproducibility and further research,
we make our datasets, code, and models publicly
available.¹

¹ https://github.com/x-mia/Word pair classifier
2 RELATED WORK
The pioneering work introducing the embedding-
based method evaluated on the BLI task was proposed
by (Mikolov et al., 2013). In their work, the authors
reported results using metrics P@1 and P@5. Since
then, the BLI task has enjoyed popularity among re-
searchers as the mainstream task for the CWE eval-
uation and precision as the main reported metric,
including the most cited baseline methods such as
MUSE (Conneau et al., 2017) and VECMAP (Artetxe
et al., 2018a; Artetxe et al., 2018b).
A more comprehensive evaluation study suggesting an alternative evaluation metric was proposed in (Glavaš et al., 2019). They criticised the lack of consistency and statistical significance testing in
existing evaluations, hampering thorough comparisons. (Glavaš et al., 2019) recommended using MAP, claiming this metric to be more informative since it does not treat all models that rank the correct translation below k equally. The same argument was brought up later at the BUCC 2022 conference (Adjali et al., 2022; Laville et al., 2022), which employed the MAP metric in the shared task. In this paper, we concentrate on two submissions: CUNI (Požár et al., 2022) and IJS (Repar et al., 2022).
In these papers, the CUNI team implemented three approaches: static embeddings with posthoc alignment (CUNI_muse), unsupervised phrase-based machine translation using the Monoses pipeline (CUNI_mono), and contextualized multilingual embeddings from pre-trained models like BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) (CUNI_comb). In contrast, the IJS approach integrated linguistic, neural, and sentence-transformer features into an SVM binary classifier consisting of three settings (IJS_1, IJS_2, IJS_3). While CUNI focuses on multiple alignment techniques, IJS prioritizes comprehensive feature integration for precise term alignment.
Another attempt to bring different metrics into
BLI evaluation appeared in the shared task of the
BUCC 2020 conference (Rapp et al., 2020). The participants were instructed to report on recall and F1 scores,
in addition to traditional precision, without setting
a fixed k. In this paper, we focus on two models:
LMU (Severini et al., 2020b) and LS2N (Laville et al.,
2020).
In both papers, distinct methodologies were em-
ployed to ascertain a dynamic k. (Severini et al.,
2020b) calculated a local threshold value for each
source word instead of using a global threshold. The
score of each candidate word T for a given source
word S is determined by a function that considers the
margin between the similarity of S and T and the av-
erage similarity of S with its most similar candidates.
Each target candidate is considered a translation if
its score exceeds the threshold value. The threshold
value is tuned individually for each language pair.
(Laville et al., 2020) exploited scores from co-
sine similarity-based measure CSLS (Conneau et al.,
2017). Then, they employed two criteria to limit the
scores: i) setting a maximum number of candidates
to retain for each source word and ii) establishing a
minimum CSLS value to validate candidates. Each
language pair had its specific threshold value.
A new line of the BLI research introduced
classification-based approaches (e.g., (Irvine and
Callison-Burch, 2017; Heyman et al., 2017; Severini
et al., 2020a)), which relax the constraint of having a fixed k and offer alternative evaluation metrics, such as recall and F1 score, demonstrating the balance in the performance.
(Irvine and Callison-Burch, 2017) leveraged tem-
poral word variation, normalised edit distance, and
word burstiness, among other inputs, to train a lin-
ear classifier using a set of training translation pairs.
In contrast, (Heyman et al., 2017) suggested incorporating word-level and character-level representations within a deep neural network architecture. They provided experiments with various models. In this paper, we refer to the model exploiting word-level representations from the SGNS model (Mikolov et al., 2013) (CLASS SGNS), the model exploiting character-level representations (CHAR-LSTM_joint), and the combined model.
Additionally, they set a threshold t for the classification scores instead of a fixed k, which they further
fine-tuned on a validation set. This enabled them to
enhance the model’s performance evaluated with F1
scores.
Finally, (Severini et al., 2020a) proposed a novel approach enabling languages with different scripts to exploit orthographic features via transliteration. They integrated semantic and orthographic information using a transliteration system, seq2seqTr (m+BOEs). In contrast to (Heyman et al., 2017), the reported metric was HitRatio@k with a fixed number of retrieved target words for each source word.
3 METHODOLOGY
We introduce a classification neural network, leverag-
ing its ability to enable dynamic k. Each source word $w_s$ is processed by the network separately. Let $V_s$ and $V_t$ be the sets of all source and target words, respectively. Given a list of target candidates $C \subseteq V_t$, i.e., the list of the top 10 most similar target words, which can be denoted as $C = \{w_{t_i} \mid i = 1 \ldots 10\}$, and letting $w_{t_1}$ be the most similar target candidate, the aim is to learn a function:

$\{0,1\} \leftarrow f(C)$,   (1)
where the input C is not a target candidate (or its embedding) directly but a vector of features derived from the similarity of the candidate vector and the similarities of the other candidates, and it can be formulated as follows:

$\{0,1\} \leftarrow f(sim, S_d, S_r, R_t, R_s, F_s)$   (2)
The classification neural network produces an output of either 0 or 1. When the neural network identifies a target candidate $w_{t_i}$ as the corresponding translation, it assigns a value of 1. The count of 1s associated with a source word $w_{s_n}$ is equal to the value of k.
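As a minimal illustration of how k emerges from the classifier's decisions (our own sketch, assuming the network outputs one probability per candidate; the 0.5 threshold is an assumption, not stated in the paper), the dynamic k for a source word is simply the number of its candidates classified as 1:

```python
def dynamic_k(candidate_probs, threshold=0.5):
    """candidate_probs: classifier outputs for the top-10 candidates of one source word."""
    # Each probability above the threshold corresponds to a predicted label of 1.
    decisions = [1 if p > threshold else 0 for p in candidate_probs]
    # k equals the number of candidates accepted as translations.
    return sum(decisions)

# Example: three candidates accepted -> k = 3.
# dynamic_k([0.92, 0.71, 0.64, 0.31, 0.12])  -> 3
```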
The training of the classification neural network requires sets of positive and negative examples to make a correct prediction. For that purpose, we exploit the evaluation datasets MUSE and the baseline CWEs. The data is described in Section 4.1 in further detail. Then, the neural network is trained by minimising the binary cross-entropy loss, defined as follows:

$-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)\right]$,   (3)

where N is the length of the training data, $y_i$ is the true label for the i-th instance (either 0 or 1), and $\hat{y}_i$ is the predicted probability that the i-th instance belongs to class 1.
The key component of the classification neural network is the input layer, consisting of a vector of features representing a target candidate. This vector contains six features, i.e., the cosine similarity score, the absolute difference score, the ratio score, and the normalised log values of the target word's position, the source word's rank, and the source word's corpus frequency. The first four are computed from the CWEs using cosine similarity.
Let $X_s$ and $X_t$ be the aligned word embedding matrices of $V_s$ and $V_t$, respectively. The cosine similarity score between the word embeddings $(x_s, x_i)$ corresponding to the word pair $(w_s, w_i)$ is defined as:

$sim_i = sim(x_s, x_i)$,   (4)

where the $sim()$ function represents the dot product between the source and target word embeddings.
The absolute difference score $S_d$ is then computed as:

$S_{d_i} = sim_1 - sim_i$,   (5)

where $sim_1$ denotes the similarity between the closest target word embedding and the source word embedding.
Similarly, the ratio score $S_r$ is calculated by the following formula:

$S_{r_i} = sim_i / sim_1$   (6)
Furthermore, we can derive the normalised log value of the target word's position as $R_{t_i} = \mathrm{Norm}(|\{j \mid sim_j < sim_i\}|)$. $F_{s_i}$ is the normalised frequency of the source word $w_s$ in a corpus, and $R_{s_i}$ is the normalised rank of the source word $w_s$ in the MWE, while the $\mathrm{Norm}()$ function is defined as:

$\mathrm{Norm}(x, C) = (\log(x) - \min_C) / (\max_C - \min_C)$   (7)
These features are then combined into an input vector $Z \in \mathbb{R}^6$, which is fed into the neural network $c_h$:

$c_{h_0} = \tanh(W_{h_0} \cdot Z + b_{h_0})$   (8)

$c_{h_i} = \tanh(W_{h_i} \cdot c_{h_{i-1}} + b_{h_i})$   (9)

$prediction = \sigma(W_p \cdot c_{h_T} + b_p)$,   (10)

where $\tanh$ and $\sigma$ represent the tanh and sigmoid activation functions, respectively, and $T$ expresses the number of hidden layers implemented in the neural network.

The weight matrices $W_{h_i}, W_p$ and bias terms $b_{h_i}, b_p$ are learned during the training process through back-
are learned during the training process through back-
propagation. The activation functions tanh and σ are
applied after each transformation to introduce non-
linearity, which is essential for capturing complex re-
lationships in the data.
4 EXPERIMENTAL SETUP
In this section, we outline the key components of the
experiments that were conducted.
4.1 Data
To train the classification neural network for all lan-
guages in combination with English, we utilised
the widely used evaluation datasets MUSE (Conneau
et al., 2017) and treated their word pairs as positive examples, all labelled as 1.
Afterwards, we generated negative examples by
retrieving the ten most similar target candidates C and
their vectors of features for each source word from the
evaluation dataset using different CWE models de-
scribed in Section 4.2. All retrieved word pairs that
did not occur in the evaluation dataset received label
0.
Each MUSE dataset consists of 1.5K source
words, meaning we obtained 15K word pairs for each
CWE model. We randomly sampled 8K word pairs
from each dataset and split them into 5K training data,
1.5K testing data, and 1.5K validation data. We trained our model using the train and test data and report our results on the validation data.
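As an illustration of this labelling and splitting scheme, the sketch below (our own simplified code with hypothetical helper names; the released pipeline may differ) marks each retrieved (source, candidate) pair as positive when it occurs in the MUSE dictionary and negative otherwise, then draws the 8K sample and splits it into the portions stated above.

```python
import random

def label_pairs(retrieved, gold_dict):
    """retrieved: list of (source, candidate) pairs, i.e. the top-10 candidates per source word.
    gold_dict:  dict mapping each source word to the set of its MUSE target words."""
    return [(src, cand, 1 if cand in gold_dict.get(src, set()) else 0)
            for src, cand in retrieved]

def split_pairs(labelled, seed=0):
    # Randomly sample 8K pairs and split them into 5K train, 1.5K test, 1.5K validation.
    random.seed(seed)
    sample = random.sample(labelled, 8000)
    return sample[:5000], sample[5000:6500], sample[6500:8000]
```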
For the Estonian-Slovak language pair, we ex-
ploited the manually compiled and annotated data
from (Denisov
´
a, 2022). These datasets were much
smaller than the ones obtained from MUSE, resulting
in 600 training and 100 testing word pairs for each
model.
4.2 Implementation Details
4.2.1 CWE
To retrieve aligned monolingual word embeddings, we utilised two state-of-the-art CWE frameworks, MUSE and VECMAP (VM), in supervised (MUSE-S, VM-S) and unsupervised (MUSE-U, VM-U) modes and a mode that relies on identical strings (MUSE-I, VM-I).
The default settings closely followed the MUSE training described in (Conneau et al., 2017), the VM-S and VM-I settings in (Artetxe et al., 2018a), and the VM-U settings in (Artetxe et al., 2018b). We used fastText embeddings (Grave et al., 2018) pre-trained on Wikipedia with dimension 300 and induced the first 200K aligned embeddings. To train the supervised systems (MUSE-S, VM-S), we utilised the MUSE training datasets.
4.2.2 Classification Neural Network
The classification neural network was implemented in
Python using TensorFlow (Abadi et al., 2016). We utilised three hidden layers with 24, 12, and 8 nodes. We used the Adam optimizer for training with a 0.001 learning rate. The training for each language pair ran for 500 epochs.
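The following is a minimal TensorFlow/Keras sketch of such a network, based on the settings stated above (three hidden layers with 24, 12, and 8 nodes, tanh activations as in Eqs. 8-9, a sigmoid output as in Eq. 10, the Adam optimizer with a 0.001 learning rate, binary cross-entropy as in Eq. 3, and 500 epochs); the batch size and the 0.5 decision threshold are our assumptions, not taken from the paper.

```python
import tensorflow as tf

def build_classifier():
    # Input: the six-dimensional feature vector Z described in Section 3.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(6,)),
        tf.keras.layers.Dense(24, activation="tanh"),
        tf.keras.layers.Dense(12, activation="tanh"),
        tf.keras.layers.Dense(8, activation="tanh"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # prediction, Eq. (10)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",                        # Eq. (3)
        metrics=["accuracy"],
    )
    return model

# Usage sketch (batch size is an assumption):
# model = build_classifier()
# model.fit(X_train, y_train, epochs=500, batch_size=32, validation_data=(X_test, y_test))
# labels = (model.predict(X_candidates) > 0.5).astype(int)  # per source word, k = labels.sum()
```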
4.3 Baselines
We compare our model with the results from classification-based approaches presented in (Heyman et al., 2017) (CLASS SGNS, CHAR-LSTM_joint, combined) and (Severini et al., 2020a) (m+BOEs), the best outcomes submitted by CUNI (CUNI_muse) (Požár et al., 2022) and IJS (IJS_2) (Repar et al., 2022) at the BUCC 2022 conference in a shared task (Adjali et al., 2022), and the results obtained by the models LMU (Severini et al., 2020b) and LS2N (Laville et al., 2020) at the BUCC 2020 conference in a shared task using dynamic k (Rapp et al., 2020).
Since the code for these models is not publicly available, we directly juxtapose our system's performance against the outcomes reported in the papers.
4.4 Metrics
Precision at k (P@k): computes the ratio of true positives (TP) to the sum of true positives and false positives (FP). In other words, it is the ratio of the positive target words to the number of all target words that the model found (positive and negative). In this case, k represents the number of the source word's nearest neighbours that were extracted.

Recall (R): is calculated using the standard formula.

F1 score: summarises the model's performance by capturing both metrics, P and R, showing the balance between them, and it is computed in a standard way as well.
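For clarity, a small sketch of how these scores can be computed when the number of predicted target words varies per source word (our own hedged implementation; the function name and data layout are assumptions):

```python
def precision_recall_f1(predictions, gold):
    """predictions: dict mapping each source word to its set of predicted target words
                    (dynamic k, so the set size differs per source word).
    gold:        dict mapping each source word to its set of reference target words."""
    tp = fp = fn = 0
    for src, gold_targets in gold.items():
        predicted = predictions.get(src, set())
        tp += len(predicted & gold_targets)      # true positives
        fp += len(predicted - gold_targets)      # false positives
        fn += len(gold_targets - predicted)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```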
5 EVALUATION
This section reports the main results obtained with
our classification neural network. It is split into two
parts to distinguish when our model is being used as
a novel approach for BLI and as an extension for ex-
isting baseline CWE models, enabling the dynamic k.
In the first one, we assess the outcomes concerning
our model and discuss them against baseline models
stated in Section 4.3.
In the second one, we compare the results of the
state-of-the-art CWE models described in Section 4.2,
evaluated using fixed k and classification neural net-
work with dynamic k.
5.1 Classification Neural Network
5.1.1 Efficiency
We implemented a minimalist 3-hidden-layer classification neural network. This network efficiently predicts 1.5K source words within a short span of approximately 1.01 seconds. According to (Heyman et al., 2017), their method has a time complexity of $O(|V_S| \times |V_T|)$, where $V_S$ and $V_T$ denote the source and target words, multiplied by the complexity of the classifier g, making it a computationally intensive process, especially for extensive vocabularies, where it becomes impractically costly. Our approach offers greater efficiency when compared to the RNN-based network presented in (Heyman et al., 2017).
5.1.2 Overall Results
The results across four language pairs (English to French, Spanish, Russian, and German) compared to the baselines CUNI_muse, IJS_2, LMU, LS2N, and m+BOEs are provided in Table 2. The comparison of the F1 scores with the models presented in (Heyman et al., 2017) trained on the English-Dutch language pair is displayed in Table 3.
Tables 2 and 3 indicate that our classification approach outperforms almost all baselines by a margin of approximately 1% to 20%. In particular, the model MUSE-S+NN stands out, achieving the highest results for English-Russian and English-Dutch and performing well across English-Spanish and English-German.
Table 2: P, R, and F1 score using dynamic k (CWE+NN) compared to the baselines CUNI_muse (Požár et al., 2022), IJS_2 (Repar et al., 2022), LMU (Severini et al., 2020b), LS2N (Laville et al., 2020), and m+BOEs (Severini et al., 2020a).
* In the article, the authors present their findings on HitRatio@k, which cannot be directly compared to our results. Consequently, we only make comparisons using the reported HitRatio@1, equivalent to P@1.

              en-de             en-fr             en-es             en-ru
              P    R    F1      P    R    F1      P    R    F1      P    R    F1
CUNI_muse     -    -    -       39.8 31.7 35.3    -    -    -       -    -    -
IJS_2         -    -    -       2.87 80.0 5.55    -    -    -       -    -    -
LMU           40.2 59.8 48.1    -    -    -       -    -    -       33.9 37.8 35.8
LS2N          54.3 54.8 54.5    61.2 69.7 65.1    63.8 61.4 62.6    32.6 38.7 35.4
m+BOEs*       -    -    -       -    -    -       -    -    -       36.0 -    -
MUSE-S+NN     56.8 73.8 64.2    63.4 61.1 62.2    77.6 59.0 67.0    66.4 75.8 70.8
MUSE-I+NN     49.3 58.1 53.3    62.7 61.3 62.0    67.6 63.6 65.5    43.2 56.4 48.9
MUSE-U+NN     57.8 56.8 57.3    52.7 63.1 57.4    73.9 63.5 68.3    37.3 40.7 39.0
VM-S+NN       60.7 67.1 63.7    64.7 64.7 64.7    72.5 62.2 67.0    42.3 63.9 50.9
VM-I+NN       54.5 63.7 58.8    66.4 70.3 68.3    57.3 71.1 63.5    39.3 59.7 47.4
VM-U+NN       55.3 62.1 58.5    68.8 68.5 68.6    71.8 61.3 66.1    41.6 50.0 45.4
The only exception is the English-French language pair, where the baseline IJS_2 surpassed our best model VM-I+NN by almost 10%.
Table 3: Comparison of F1 using dynamic k (CWE+NN) to the three models presented in (Heyman et al., 2017) evaluated on English-Dutch.

                   F1
CLASS SGNS         19.8
CHAR-LSTM_joint    34.9
combined           36.6
MUSE-S+NN          71.5
MUSE-I+NN          58.4
MUSE-U+NN          62.4
VM-S+NN            61.6
VM-I+NN            67.3
VM-U+NN            63.1
5.1.3 Setting the Dynamic K
Table 4 provides a sample of the analysed source word admit along with target candidates and their vectors of features derived from the VM-S model trained using English-Spanish. Fig. 1 visualises the correlation between the features sim and $R_t$ derived from the same model across the entire English-Spanish validation data and the assigned values of 1 or 0.
The English word admit has four target words in the English-Spanish MUSE evaluation dataset, i.e., admita, admite, admiten, admitir. The classification neural network assigned the value of 1 to three of them, found an additional target word admitirlo, and did not include admiten, which was found at rank 1326. Thus, k was set to 4 for the source word admit.
Figure 1: Correlation between sim (cosine similarity score) and $R_t$ (target candidate rank) features across the VM-S+NN model trained using English-Spanish, labelled as 0 or 1.
The analysis of the feature vectors of target candidates displayed in Table 4 suggests a strong correlation between the scores' magnitudes, the various ranks, and the assigned labels, i.e., a higher sim value and a lower $R_t$ value increase the probability of a target candidate being labelled as 1. Since the classification neural network learns patterns from the information derived from the CWEs and from the ranks in the MWEs and corpus data, this information plays a more crucial role than the linguistic aspects of the language, highlighting the significance of the quality of the MWEs and of the CWEs' alignment method.
Additionally, Table 1 compares the number of tar-
get words in the evaluation dataset and the values of k
set by the classification neural network. For example,
in the evaluation dataset, 435 source words have 2 tar-
get words and the classification neural network set k
= 2 for 465 source words.
5.2 Extension
In the second part of the evaluation, we assessed the performance of the CWE models MUSE-S, MUSE-I,
Table 4: Example from the VM-S+NN model trained on the English-Spanish language pair. SRC = source word, TGT = target word, ED = evaluation dataset, C = correct.

SRC     rank  TGT         NN  ED  C   sim    S_d    S_r    R_t    R_s    F_s
admit   0     admitir     1   ✓   ✓   0.799  0.0    1.0    0.0    0.922  0.692
k = 4   1     admitirlo   1   ×   ✓   0.724  0.074  0.906  0.758  0.922  0.692
        2     admita      1   ✓   ✓   0.720  0.078  0.901  0.834  0.922  0.692
        3     admite      1   ✓   ✓   0.710  0.089  0.888  0.879  0.922  0.692
        4     admito      0   ×   ✓   0.709  0.090  0.886  0.910  0.922  0.692
        5     admitiendo  0   ×   ✓   0.703  0.096  0.879  0.935  0.922  0.692
        6     entenderla  0   ×   ×   0.688  0.110  0.861  0.955  0.922  0.692
        7     creer       0   ×   ×   0.683  0.115  0.855  0.972  0.922  0.692
        8     admitan     0   ×   ✓   0.668  0.131  0.835  0.987  0.922  0.692
        9     ignorarla   0   ×   ×   0.666  0.133  0.833  1.0    0.922  0.692
        1326  admiten     -   ✓   ✓   -      -      -      -      -      -
MUSE-U, VM-S, VM-I, and VM-U by employing the P and F1 scores with fixed and dynamic k. (While HitRatio@k is often favoured, in this paper, we opt to utilise P@k for the reasons outlined in the Introduction.) For the fixed k, we chose values in {1, 3, 5}. The dynamic k is set by the classification neural network acting as an extension for the CWEs (model-X+NN means that the classification neural network was trained and evaluated on the output from model-X), and we denote P and F1 scores using dynamic k as P@NN and F1@NN, respectively.
The overall results across all language pairs displaying F1 scores are provided in Table 7, and P metrics in Table 8, both placed in the Appendix. We can observe that although in nearly all cases the P@1 evaluation yields better results, almost all models offer a significantly better balance between P and R when dynamic k is employed, improving F1 scores by a margin of up to almost 58%.
Table 5: F1 score of the best model when evaluated with
fixed k (F1@1) vs. the best model when evaluated with
dynamic k (F1@NN).
Best F1@1 Best F1@NN
en-cs VM-S 39.0 VM-I 51.4
en-fi VM-I 32.8 VM-U 48.3
et-sk VM-S 26.7 VM-U 68.0
On top of that, Table 5 shows how the models’
ranking changes when evaluated using fixed and dy-
namic k across three language pairs. For example,
when evaluating models on the Estonian-Slovak lan-
guage pair with a fixed k, the VM-S model achieves
the highest performance. However, when using a dy-
namic k metric, the VM-U model outperforms all oth-
ers by more than 41%. This can be illustrated using
the example of the Estonian word sõdur (soldier) in Table 6. Using k = 1 for the evaluation would yield poorer performance, as the top-1 induced target word is absent from the evaluation dataset despite its correctness. In contrast, the classification neural network identified the correct target word from the evaluation dataset, as well as an additional correct word not present in the dataset. This approach not only enables efficient selection of the optimal model but also accurately identifies target words despite biases in the evaluation datasets (see Section 6).
Table 6: Example sõdur - soldier from the VM-U+NN model trained on the Estonian-Slovak language pair. SRC = source word, TGT = target word, ED = evaluation dataset, C = correct.

SRC     rank  TGT          NN  ED  C
sõdur   0     vojaka       1   ×   ✓
k = 2   1     vojak        1   ✓   ✓
        2     bojovník     0   ×   ×
        3     voj          0   ×   ×
        4     bojovníka    0   ×   ×
        5     lukostrelec  0   ×   ×
        6     civilista    0   ×   ×
        7     voják        0   ×   ×
        8     delostrelec  0   ×   ×
        9     pechota      0   ×   ×
6 LIMITATIONS
Over the years, the MUSE datasets have been fre-
quently used for the BLI evaluation. Despite their
popularity, several concerns have emerged. (Ke-
mentchedjhieva et al., 2019) revealed that a signifi-
cant portion of the word pairs consists of proper nouns, which does not reflect the performance reliably.
Later, (Laville et al., 2022) pointed out more serious
issues, such as the fact that the datasets contain over
30% identical word pairs and around 40% graphically
close word pairs.
Another problem is the bias that occurs when cre-
ating positive and negative examples for the evalua-
tion. For example, the Spanish word admitir can have
over 40 valid translations for the English word ad-
mit, depending on the context. Moreover, the verb
admit could also be translated by reconocer or confe-
sar, which convey similar meanings. As a result, the
top 10 words identified by the proposed method might
include some of these over 100 suitable translations,
which are classified as negative pairs. This has a neg-
ative impact on the accuracy of the labels.
As demonstrated by our model, Table 4 shows that there were only four forms of the verb admitir in the evaluation dataset, whereas the model generated seven viable word forms, four of which are missing from the evaluation dataset (admitirlo, admito, admitiendo, admitan).
like reconocer or confesar were not captured, indicat-
ing areas for improvement in contextual understand-
ing.
7 CONCLUSION
In this paper, we have presented a novel classification-
based approach to BLI, addressing the limitations of
traditional evaluation metrics by introducing dynamic
k for enhanced P, R, and F1 scores. We evaluated our
approach across diverse language pairs, showing its
benefits as a new approach for the BLI task and as
an extension for existing CWE approaches, enabling
dynamic k.
To summarise, the evaluation of CWE models us-
ing P@1 yields seemingly impressive results. How-
ever, it only assesses a small part of the evaluation
dataset. Therefore, employing dynamic k provides
a more accurate picture of the model’s performance
while balancing P and R. Additionally, the results suggest that for determining the correct target candidates, not only the absolute scores are important, but also the ranks and the scores relative to the highest score achieved.
Moreover, we have demonstrated that our ap-
proach is computationally efficient and produces
competitive results when compared to the current
baseline systems.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving,
G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals,
O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y.,
and Zheng, X. (2016). Tensorflow: Large-scale ma-
chine learning on heterogeneous distributed systems.
ArXiv, abs/1603.04467.
Adjali, O., Morin, E., Sharoff, S., Rapp, R., and Zweigen-
baum, P. (2022). Overview of the 2022 BUCC Shared
Task: Bilingual Term Alignment in Comparable Spe-
cialized Corpora. In BUCC, 15th Workshop on Build-
ing and Using Comparable Corpora, pages 67–76.
Artetxe, M., Labaka, G., and Agirre, E. (2018a). General-
izing and improving bilingual word embedding map-
pings with a multi-step framework of linear transfor-
mations. In Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence, pages 5012–
5019.
Artetxe, M., Labaka, G., and Agirre, E. (2018b). A robust
self-learning method for fully unsupervised cross-
lingual mappings of word embeddings. In Proceed-
ings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 789–798. Association for Computational Lin-
guistics.
Artetxe, M., Labaka, G., and Agirre, E. (2018c). Unsuper-
vised statistical machine translation. In Proceedings
of the 2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 3632–3642. Associ-
ation for Computational Linguistics.
Conneau, A. and Lample, G. (2019). Cross-lingual lan-
guage model pretraining. In Wallach, H., Larochelle,
H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and
Garnett, R., editors, Advances in Neural Information
Processing Systems, volume 32, pages 7059–7069.
Curran Associates, Inc.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and
J’egou, H. (2017). Word translation without parallel
data. ArXiv, abs/1710.04087.
Denisová, M. (2022). Parallel, or comparable? that is the
question: The comparison of parallel and comparable
data-based methods for bilingual lexicon induction.
In Proceedings of the Sixteenth Workshop on Recent
Advances in Slavonic Natural Languages Processing,
RASLAN 2022, pages 4–13. Tribun EU.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Burstein,
J., Doran, C., and Solorio, T., editors, Proceedings of
the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186. Association for
Computational Linguistics.
Duan, X., Ji, B., Jia, H., Tan, M., Zhang, M., Chen, B., Luo,
W., and Zhang, Y. (2020). Bilingual dictionary based
neural machine translation without using parallel sen-
tences. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
1570–1579. Association for Computational Linguis-
tics.
Glavaš, G., Litschko, R., Ruder, S., and Vulić, I. (2019).
How to (properly) evaluate cross-lingual word embed-
dings: On strong baselines, comparative analyses, and
some misconceptions. In Korhonen, A., Traum, D.,
and Màrquez, L., editors, Proceedings of the 57th An-
nual Meeting of the Association for Computational
Linguistics, pages 710–721. Association for Compu-
tational Linguistics.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and
Mikolov, T. (2018). Learning word vectors for
157 languages. In Proceedings of the International
Conference on Language Resources and Evaluation
(LREC 2018).
Heyman, G., Vulić, I., and Moens, M.-F. (2017). Bilin-
gual lexicon induction by learning to combine word-
level and character-level representations. In Lapata,
M., Blunsom, P., and Koller, A., editors, Proceedings
of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 1,
Long Papers, pages 1085–1095. Association for Com-
putational Linguistics.
Irvine, A. and Callison-Burch, C. (2017). A comprehen-
sive analysis of bilingual lexicon induction. Compu-
tational Linguistics, 43(2):273–310.
Karan, M., Vulić, I., Korhonen, A., and Glavaš, G. (2020).
Classification-based self-learning for weakly super-
vised bilingual lexicon induction. In Jurafsky, D.,
Chai, J., Schluter, N., and Tetreault, J., editors, Pro-
ceedings of the 58th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 6915–6922.
Association for Computational Linguistics.
Kementchedjhieva, Y., Hartmann, M., and Søgaard, A.
(2019). Lost in evaluation: Misleading benchmarks
for bilingual dictionary induction. In Proceedings of
the 2019 Conference on Empirical Methods in Nat-
ural Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3336–3341. Association for
Computational Linguistics.
Laville, M., Hazem, A., and Morin, E. (2020). TALN/LS2N
participation at the BUCC shared task: Bilingual dic-
tionary induction from comparable corpora. In Rapp,
R., Zweigenbaum, P., and Sharoff, S., editors, Pro-
ceedings of the 13th Workshop on Building and Using
Comparable Corpora, pages 56–60. European Lan-
guage Resources Association.
Laville, M., Morin, E., and Langlais, P. (2022). About
evaluating bilingual lexicon induction. In Rapp, R.,
Zweigenbaum, P., and Sharoff, S., editors, Proceed-
ings of the BUCC Workshop within LREC 2022, pages
8–14. European Language Resources Association.
Li, Y., Liu, F., Vulić, I., and Korhonen, A. (2022). Improv-
ing bilingual lexicon induction with cross-encoder
reranking. In Goldberg, Y., Kozareva, Z., and Zhang,
Y., editors, Findings of the Association for Computa-
tional Linguistics: EMNLP 2022, pages 4100–4116.
Association for Computational Linguistics.
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting
similarities among languages for machine translation.
arXiv preprint arXiv:1309.4168.
Požár, B., Tauchmanová, K., Neumannová, K., Kvapilíková, I., and Bojar, O. (2022). CUNI submission
to the BUCC 2022 shared task on bilingual term align-
ment. In Rapp, R., Zweigenbaum, P., and Sharoff, S.,
editors, Proceedings of the BUCC Workshop within
LREC 2022, pages 43–49. European Language Re-
sources Association.
Rapp, R., Zweigenbaum, P., and Sharoff, S. (2020).
Overview of the fourth BUCC shared task: Bilin-
gual dictionary induction from comparable corpora.
In Rapp, R., Zweigenbaum, P., and Sharoff, S., edi-
tors, Proceedings of the 13th Workshop on Building
and Using Comparable Corpora, pages 6–13. Euro-
pean Language Resources Association.
Repar, A., Pollak, S., Ulčar, M., and Koloski, B. (2022).
Fusion of linguistic, neural and sentence-transformer
features for improved term alignment. In Rapp, R.,
Zweigenbaum, P., and Sharoff, S., editors, Proceed-
ings of the BUCC Workshop within LREC 2022, pages
61–66. European Language Resources Association.
Ruder, S., Vulić, I., and Søgaard, A. (2019). A survey of
cross-lingual word embedding models. The Journal
of Artificial Intelligence Research, 65:569–631.
Severini, S., Hangya, V., Fraser, A., and Schütze, H.
(2020a). Combining word embeddings with bilingual
orthography embeddings for bilingual dictionary in-
duction. In Scott, D., Bel, N., and Zong, C., editors,
Proceedings of the 28th International Conference on
Computational Linguistics, pages 6044–6055. Inter-
national Committee on Computational Linguistics.
Severini, S., Hangya, V., Fraser, A., and Schütze, H.
(2020b). LMU bilingual dictionary induction system
with word surface similarity scores for BUCC 2020.
In Rapp, R., Zweigenbaum, P., and Sharoff, S., edi-
tors, Proceedings of the 13th Workshop on Building
and Using Comparable Corpora, pages 49–55. Euro-
pean Language Resources Association.
Tian, Z., Li, C., Ren, S., Zuo, Z., Wen, Z., Hu, X., Han, X.,
Huang, H., Deng, D., Zhang, Q., and Xie, X. (2022).
RAPO: An adaptive ranking paradigm for bilingual
lexicon induction. ArXiv, abs/2210.09926.
Vulić, I. and Moens, M.-F. (2015). Monolingual and cross-
lingual information retrieval models based on (bilin-
gual) word embeddings. Proceedings of the 38th In-
ternational ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 363–
372.
Wang, X., Ruder, S., and Neubig, G. (2022). Expanding
pretrained models to thousands more languages via
lexicon-based adaptation. In Proceedings of the 60th
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 863–877.
Association for Computational Linguistics.
Yuan, M., Zhang, M., Van Durme, B., Findlater, L., and
Boyd-Graber, J. (2020). Interactive refinement of
cross-lingual word embeddings. In Proceedings of
the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 5984–5996.
Association for Computational Linguistics.
Zhou, Y., Geng, X., Shen, T., Zhang, W., and Jiang, D.
(2021). Improving zero-shot cross-lingual transfer
for multilingual question answering over knowledge
graph. In Proceedings of the 2021 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, pages 5822–5834. Association for Compu-
tational Linguistics.
APPENDIX
Tables 7 and 8 present the outcomes of the models
MUSE-S+NN, MUSE-I+NN, MUSE-U+NN, VM-
S+NN, VM-I+NN, and VM-U+NN evaluated using F1 scores (F1@5, 3, 1, NN) and P metrics (P@5, 3, 1, NN) across all language pairs, respectively.
Table 7: Reported F1 score (F1@5, F1@3, F1@1, and F1@NN (k selected by the classification neural network)) for the
MUSE-S, MUSE-I, MUSE-U, VM-S, VM-I, and VM-U models evaluated with the MUSE evaluation datasets.
MUSE-S MUSE-I MUSE-U
F1@ 5 3 1 NN 5 3 1 NN 5 3 1 NN
en-de 42.1 50.4 48.8 64.2 34.5 40.7 41.1 53.3 34.8 40.8 41.6 57.3
en-fr 37.2 46.5 52.9 62.2 37.1 46.5 52.9 62.0 37.1 46.2 53.0 57.4
en-es 37.7 47.5 53.0 67.0 37.6 47.3 53.2 65.5 37.8 47.8 53.1 68.3
en-ru 39.7 51.9 60.2 70.8 25.5 30.4 31.8 48.9 23.7 28.1 27.2 39.0
en-cs 27.5 33.2 34.8 45.2 26.1 31.7 34.2 41.9 25.1 30.1 32.3 46.8
en-nl 33.2 40.6 42.9 71.5 24.7 30.4 32.4 58.4 31.2 40.5 51.0 64.6
en-fi 24.3 29.5 29.3 38.8 21.2 25.7 27.7 41.6 19.2 23.0 23.5 31.7
en-ko 12.1 14.3 15.1 22.9 11.4 13.6 15.2 20.0 9.6 11.3 12.2 17.7
et-sk 9.8 11.2 12.4 66.0 9.2 10.4 11.6 50.0 7.3 8.7 9.5 61.0
VM-S VM-I VM-U
F1@ 5 3 1 NN 5 3 1 NN 5 3 1 NN
en-de 36.9 43.4 42.6 63.7 36.3 42.6 42.9 58.8 36.3 42.5 42.8 58.5
en-fr 38.9 48.4 53.7 64.7 38.7 48.0 54.4 68.3 38.5 48.1 54.4 68.6
en-es 39.8 49.7 53.3 67.0 38.9 48.9 54.0 63.5 38.8 49.0 54.1 66.1
en-ru 29.8 37.3 38.4 50.9 28.4 34.8 34.8 47.4 25.0 30.5 29.2 45.4
en-cs 31.3 38.7 39.0 47.0 29.5 36.4 36.7 51.4 29.2 35.7 36.6 46.4
en-nl 34.1 44.1 54.2 61.6 33.6 43.7 55.5 67.3 33.6 43.6 55.5 63.1
en-fi 28.2 34.4 32.6 45.0 26.0 31.6 32.8 46.2 25.7 31.6 32.6 48.3
en-ko 19.8 24.7 30.4 42.2 13.6 16.2 18.8 22.6 11.1 13.3 14.2 9.3
et-sk 13.4 15.2 26.7 67.5 10.8 12.6 14.9 53.0 6.3 7.7 10.4 68.0
Table 8: Reported P (@5, @3, @1, and @NN (k selected by the classification neural network)) for the MUSE-S, MUSE-I,
MUSE-U, VM-S, VM-I, and VM-U models evaluated with the MUSE evaluation dataset.
MUSE-S MUSE-I MUSE-U
P@ 5 3 1 NN 5 3 1 NN 5 3 1 NN
en-de 31.3 45.7 83.9 56.8 25.7 36.9 70.7 49.3 25.9 37.0 71.5 57.8
en-fr 25.9 38.5 78.4 63.4 25.9 38.5 78.3 62.7 25.8 38.2 78.5 52.7
en-es 26.4 39.4 79.1 77.6 26.3 39.3 79.3 67.6 26.6 39.7 79.1 73.9
en-ru 26.3 40.1 79.2 66.4 16.9 23.5 41.9 43.2 15.7 21.7 35.8 37.3
en-cs 18.5 26.0 47.1 41.3 17.5 24.9 46.1 34.8 16.8 23.6 43.6 46.8
en-nl 26.8 41.0 87.1 70.3 19.8 31.3 68.4 63.0 21.2 31.4 67.7 62.4
en-fi 16.2 23.0 39.2 32.6 14.18 20.0 37.1 42.2 12.9 18.0 31.5 31.4
en-ko 7.6 10.3 17.5 28.9 7.2 9.8 17.6 50.0 6.0 8.1 14.1 28
et-sk 6.1 9.3 14.1 52.4 5.2 7.9 13.9 36.5 5.2 7.8 12.2 47.8
VM-S VM-I VM-U
P@ 5 3 1 NN 5 3 1 NN 5 3 1 NN
en-de 27.5 39.4 73.3 60.7 27.0 38.6 73.7 54.5 27.0 38.5 73.7 55.3
en-fr 27.1 40.0 79.5 64.7 27.0 39.7 80.6 66.4 26.8 39.8 80.5 68.8
en-es 27.8 41.3 79.5 72.5 27.2 40.6 80.6 57.3 27.1 40.7 80.7 71.8
en-ru 19.8 28.8 50.5 42.3 18.8 26.8 45.8 39.3 16.6 23.6 38.5 41.6
en-cs 21.0 30.3 52.7 44.4 19.8 28.5 49.6 46.3 19.6 28.0 49.5 36.8
en-nl 22.7 34.2 71.9 57.5 22.8 33.9 73.6 61.6 22.4 33.8 73.6 62.8
en-fi 18.8 26.8 43.7 39.5 17.4 24.6 43.9 42.6 17.2 24.6 43.7 43.5
en-ko 12.5 17.8 35.2 46.2 8.6 11.7 21.8 24.6 7.0 9.6 16.5 13.9
et-sk 8.0 11.7 24.3 73.0 7.7 9.9 18.2 48.9 7.2 9.4 12.7 53.2