
ties. The second shows that sentence transformers can
be used to discover NE translations. We discuss differences that can occur in translations of NEs, especially changes of part of speech and entity type.
Section 7 discusses the need for new datasets with NE
annotation.
2 RELATED WORK
Named entity translation was described early on, e.g., in (Al-Onaizan and Knight, 2002), which proposes an algorithm for translating NEs between English and Arabic. Later, the authors of (Awadallah et al., 2007) proposed a system for NE translation for the same language pair.
The authors of (Fu et al., 2014) propose a general framework to generate large-scale NER training data from parallel corpora, using an English-Chinese parallel corpus. The aim is to improve Chinese NER using English data. A chunk-symmetry strategy and an English-Chinese transliteration model are used in (Li et al., 2021b).
Named entity translation is closely related to ter-
minology extraction and translation as explored, e.g.,
in (Deléger et al., 2006). The paper describes using word alignment in parallel corpora to extract
new term translations automatically. The work fo-
cuses on translating medical terminology from En-
glish to French. Terminology extraction in multilin-
gual data has also been proposed as a CLEF-ER chal-
lenge (Rebholz-Schuhmann et al., 2013).
In our first experiment, the translation candidates
were found in previous work, so the task is only to es-
tablish alignment between appropriate candidates. A
similar (but harder) task is aligning English NE an-
notations with other languages. Transformer models
are used, e.g., in (Li et al., 2021a), for alignment of
NEs in German, Spanish, Dutch, and Chinese, with
F1 ranging from 0.71 to 0.81.
Named entity recognition and linking are also re-
lated to ontologies and knowledge graphs. In the pa-
per (Stanković et al., 2024), the authors prepared a NER-annotated Italian-Serbian corpus comprising translations of literary works. The paper focuses on semantic interoperability as one of the key aspects of linked data and digital humanities.
3 THE PARALLEL DATASET
Parallel Global Voices (PGV (Prokopidis et al.,
2016)) is a massively parallel (756 language pairs),
automatically aligned corpus of citizen media stories
translated by volunteers. The Global Voices commu-
nity blog contains several guides, including the Translators’ guide². It contains recommendations to “localize” whenever possible. Also, it mentions English
as the most significant source language. However,
according to the authors of PGV (Prokopidis et al., 2016), the source language of a translation cannot
always be reliably identified.
PGV contains texts crawled in 2015, reporting “on
trending issues and stories published on social media
and independent blogs in 167 countries” (Prokopidis
et al., 2016).
The corpus contains the Global Voices (GV) top-
ics about politics and elections; civil, sexual, and
socio-economic rights; disasters and the environment;
demonstrations and police reaction; labor; and spe-
cific geographic regions. In addition, the corpus con-
tains articles about the organization of the GV net-
work, culture, and online media.
The sentence-level alignment has been done au-
tomatically. Sometimes, sentence boundaries are in-
correctly detected, e.g., on initials inside people’s
names. The Czech-English pairs (450 documents) are aligned 1:1 in 86% of cases; the rest are 1:2, 2:1, 1:0, and 0:1 alignments.
We used existing NER models for pre-annotation.
They performed well in precision under the MUC-5 strict evaluation scheme (Chinchor and Sundheim, 1993): a BERT-based (Devlin et al., 2018) English model achieved 0.70 precision, and the Czech Czert-B model achieved 0.73. On the other hand, the recall of the English and especially the Czech model was low: 0.41 and 0.18, respectively.
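In the strict evaluation scheme, a predicted entity counts as a true positive only if both its span boundaries and its entity type match a gold annotation exactly. A minimal sketch of such scoring (the tuple representation and the toy data below are illustrative assumptions, not the paper's evaluation code):

```python
def strict_prf(gold, pred):
    """Strict precision, recall, and F1 over entity mentions.

    Entities are (start, end, type) tuples; a prediction is correct
    only on an exact match of all three fields, as in the strict
    matching regime.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact span+type matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative example: one exact match, one span mismatch,
# one missed entity.
gold = [(0, 2, "PER"), (5, 7, "LOC"), (9, 10, "ORG")]
pred = [(0, 2, "PER"), (5, 8, "LOC")]
p, r, f = strict_prf(gold, pred)  # p = 0.5, r ≈ 0.33
```

This illustrates why high-precision/low-recall pre-annotation is plausible: a conservative model that predicts few but exact spans scores well on precision while missing many gold mentions.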
In (Nevěřilová and Žižková, 2024), we set up
the annotation task with detailed instructions follow-
ing the UniversalNER (Mayhew et al., 2024) annota-
tion scheme. We have shown that manual annotation
could be performed relatively efficiently using high-
precision/low-recall pre-annotations.
When set up wisely, the annotation environment
allows the production of high-quality annotations
quickly: annotation median time was around 4.5 sec-
onds, and the inter-annotator agreement was Cohen’s κ = 0.91. We used Label Studio³ for annotation.
The screenshots in Figures 2 and 1 are from this tool.
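Cohen’s κ corrects raw agreement for the agreement expected by chance given each annotator’s label distribution. A minimal sketch of the statistic for two annotators’ label sequences (the toy sequences are illustrative, not the paper’s data):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' aligned label sequences."""
    assert len(a) == len(b) and a, "sequences must be aligned and non-empty"
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two marginal label distributions.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative toy sequences: 5 of 6 items agree.
ann1 = ["PER", "LOC", "O", "O", "ORG", "O"]
ann2 = ["PER", "LOC", "O", "ORG", "ORG", "O"]
kappa = cohen_kappa(ann1, ann2)  # ≈ 0.769
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why κ = 0.91 indicates high-quality annotations.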
² https://community.globalvoices.org/guide/lingua-guides/lingua-translators-guide/
³ https://labelstud.io/
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence