The paper is structured as follows. Section 2 reviews related work, Section 3 describes the proposed method, Section 4 presents the experimental studies, and Section 5 draws conclusions.
2 RELATED WORK
With the advent of machine learning, IR models have
evolved from classic methods to learning-based rank-
ing functions. One of the critical factors for designing
effective IR models is how to learn text representa-
tions and model relevance matching. With the recent
advancements in Pretrained Large Language Models
(LLMs), such as BERT and GPT, dense representa-
tions of queries and texts can be learnt effectively in a latent space and used to construct a semantic matching function for relevance modeling. This approach is known
as dense retrieval, as it employs dense vectors or em-
beddings to represent the texts (Zhao et al., 2022).
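A minimal sketch of this paradigm is shown below. It assumes the sentence-transformers library; the multilingual encoder checkpoint and the toy corpus are illustrative placeholders rather than the setup used in this paper. The query and the documents are encoded into a shared latent space, and documents are ranked by cosine similarity to the query vector.

# Dense retrieval in a nutshell: encode texts as dense vectors and rank
# documents by cosine similarity to the query embedding.
# Assumes the sentence-transformers library; the checkpoint and the toy
# corpus are placeholders, not the configuration used in this paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "Gallia est omnis divisa in partes tres.",
    "Arma virumque cano, Troiae qui primus ab oris.",
    "Carthago delenda est.",
]
query = "the destruction of Carthage"

# Queries and documents are embedded into the same latent space.
doc_emb = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# The semantic matching function is cosine similarity between embeddings.
scores = util.cos_sim(query_emb, doc_emb)[0]
for idx in scores.argsort(descending=True):
    print(f"{scores[idx]:.3f}  {documents[idx]}")

In practice the document embeddings are precomputed and stored in a nearest-neighbour index, so only the query needs to be encoded at search time.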
BERT (Bidirectional Encoder Representations
from Transformers) (Devlin et al., 2018) is a state-
of-the-art language representation model that has
achieved very good results on a variety of natural lan-
guage processing tasks. It is a deep learning model
that uses a transformer architecture to process se-
quences of text and generate high-quality representa-
tions. The key innovation of BERT is its use of bidi-
rectional processing, which allows it to capture both
forward and backward contextual information about
a given word. Rather than reading the text in a single direction, the transformer's self-attention layers attend jointly to the tokens on both sides of every position in the input sequence. This allows BERT to capture information about the context in which a word appears, including the words that come before and after it. In
addition to bidirectional processing, BERT also uses
several other techniques to improve its performance.
These include: (i) multi-head self-attention, which allows the model to selectively focus on different parts of the input text; (ii) a masked language modeling objective, which encourages the model to predict missing words based on the context in which they appear; and (iii) a next sentence prediction task, which encourages the model to understand the relationships between different sentences in a document.
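As a concrete illustration of the masked language modeling objective, the sketch below uses the fill-mask pipeline of the Hugging Face transformers library with the original English BERT checkpoint; the checkpoint name and the example sentence are illustrative choices, not part of the cited work.

# Masked language modeling: BERT predicts a masked token from its
# bidirectional context. Assumes the Hugging Face transformers library;
# "bert-base-uncased" is the original English checkpoint, used here
# purely as an example.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Both the left context ("Rome") and the right context ("the capital of
# the Roman Empire") are visible to the model when it fills the mask.
for prediction in unmasker("Rome [MASK] the capital of the Roman Empire."):
    print(f"{prediction['score']:.3f}  {prediction['token_str']}")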
Bamman and Burns (2020) introduced Latin BERT, a contextual language model
for the Latin language that was trained on a large
corpus of 642.7 million words from various sources
spanning the Classical era to the 21st century. The au-
thors demonstrated the capabilities of this language-
specific model through several case studies, including
its use for part-of-speech tagging, where Latin BERT
achieves new state-of-the-art performance on three Universal Dependencies Latin datasets. The model is
also used for predicting missing text, including criti-
cal emendations, and outperforms static word embed-
dings for word sense disambiguation. Furthermore,
the study shows that Latin BERT can be used for
semantically-informed search by querying contextual
nearest neighbors.
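The semantically-informed search described above amounts to nearest-neighbour comparison of contextual token embeddings. The sketch below illustrates the general idea; because Latin BERT is distributed with its own tokenizer and loading code, the multilingual BERT checkpoint is used here as a stand-in, and the sentences and target words are purely illustrative assumptions.

# Nearest-neighbour search over contextual token embeddings.
# Multilingual BERT is a stand-in for Latin BERT, which ships with its
# own tokenizer; sentences and target words are illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Mean of the contextual subword vectors of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    target = sentence.split().index(word)
    positions = [i for i, w in enumerate(enc.word_ids(0)) if w == target]
    return hidden[positions].mean(dim=0)

corpus = [
    ("arma virumque cano", "arma"),
    ("carmina nulla cano", "carmina"),
    ("bella horrida bella", "bella"),
]
query_vec = word_vector("arma antiqua gero", "arma")

# Rank corpus occurrences by cosine similarity to the query occurrence.
for sentence, word in corpus:
    sim = torch.cosine_similarity(query_vec, word_vector(sentence, word), dim=0)
    print(f"{sim:.3f}  {word!r} in {sentence!r}")

Unlike static word embeddings, each occurrence of a word receives its own vector, so occurrences used in similar contexts end up close together in the embedding space.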
LaBSE is a multilingual sentence embedding
model that is based on the BERT architecture (Feng
et al., 2022). The authors systematically investi-
gated methods for learning cross-lingual sentence em-
beddings by combining the best methods for learn-
ing monolingual and cross-lingual representations, in-
cluding masked language modeling (MLM), transla-
tion language modeling (TLM), dual encoder transla-
tion ranking, and additive margin softmax. The au-
thors showed that introducing a pre-trained multilin-
gual language model dramatically reduces the amount
of parallel training data required to achieve good
performance. Composing the best of these meth-
ods produced a model that achieves 83.7% bi-text
retrieval accuracy across 112 languages on the Tatoeba dataset, compared with the 65.5% accuracy of previous state-of-the-art models, while performing competitively on monolingual transfer learning bench-
marks. The authors also demonstrated the effective-
ness of the LaBSE model by mining parallel data from
the CommonCrawl repository and using it to train com-
petitive Neural Machine Translation (NMT) models
for English-Chinese and English-German.
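The dual-encoder bi-text retrieval setting can be made concrete with the sentence-transformers release of LaBSE; the English-German sentence pairs below are toy examples, not Tatoeba or CommonCrawl data. Source and candidate sentences are embedded independently, and each source is matched to the candidate with the highest cosine similarity.

# Bi-text retrieval with LaBSE: the dual encoder embeds source and
# candidate sentences independently; each English sentence is matched
# to the German candidate with the highest cosine similarity.
# Assumes the sentence-transformers release of LaBSE; the sentence
# pairs are toy examples, not Tatoeba or CommonCrawl data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = [
    "The senate met at dawn.",
    "The library holds many manuscripts.",
]
german = [
    "Die Bibliothek besitzt viele Handschriften.",
    "Der Senat trat im Morgengrauen zusammen.",
]

src_emb = model.encode(english, convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = model.encode(german, convert_to_tensor=True, normalize_embeddings=True)

# Similarity matrix over all source/candidate pairs; the argmax of each
# row is the retrieved translation for that English sentence.
sims = util.cos_sim(src_emb, tgt_emb)
for i, row in enumerate(sims):
    j = int(row.argmax())
    print(f"{english[i]!r} -> {german[j]!r} ({float(row[j]):.3f})")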
One recent work in language understanding that
leverages contextualized features is Semantics-aware
BERT (SemBERT) (Zhang et al., 2020). SemBERT
incorporates explicit contextual semantics from pre-
trained semantic role labeling, improving BERT’s
language representation capabilities. SemBERT is ca-
pable of absorbing contextual semantics without sub-
stantial task-specific changes, with a more powerful
and simple design compared to BERT. It has achieved
new state-of-the-art results in various machine read-
ing comprehension and natural language inference
tasks.
For Latin-based IR, Lendvai and Wick (2022) fine-tuned Latin BERT for Word Sense Disambiguation on the Thesaurus Linguae Latinae. Their work uses Latin BERT to cre-
ate a new dataset based on a subset of representations
in the Thesaurus Linguae Latinae. The results of the
study showed that the contextualized BERT represen-
tations fine-tuned on TLL data perform better than
static embeddings used in a bidirectional LSTM clas-
sifier on the same dataset. Moreover, the per-lemma