The paper is structured as follows. Section 2 reviews related work, Section 3 describes the proposed method, Section 4 presents the experimental studies, and Section 5 draws conclusions.
2 RELATED WORK
With the advent of machine learning, IR models have
evolved from classic methods to learning-based rank-
ing functions. One of the critical factors for designing
effective IR models is how to learn text representa-
tions and model relevance matching. With the recent
advancements in Pretrained Large Language Models
(LLMs), such as BERT and GPT, dense representa-
tions of queries and texts can be learnt effectively in a latent space and used to construct a semantic matching function for relevance modeling. This approach is known
as dense retrieval, as it employs dense vectors or em-
beddings to represent the texts (Zhao et al., 2022).
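A minimal sketch of this paradigm is shown below. It assumes the sentence-transformers library; the multilingual encoder checkpoint and the toy corpus are illustrative placeholders rather than the setup used in this paper. The query and the documents are encoded into a shared latent space, and documents are ranked by cosine similarity to the query vector.

# Dense retrieval in a nutshell: encode texts as dense vectors and rank
# documents by cosine similarity to the query embedding.
# Assumes the sentence-transformers library; the checkpoint and the toy
# corpus are placeholders, not the configuration used in this paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "Gallia est omnis divisa in partes tres.",
    "Arma virumque cano, Troiae qui primus ab oris.",
    "Carthago delenda est.",
]
query = "the destruction of Carthage"

# Queries and documents are embedded into the same latent space.
doc_emb = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# The semantic matching function is cosine similarity between embeddings.
scores = util.cos_sim(query_emb, doc_emb)[0]
for idx in scores.argsort(descending=True):
    print(f"{scores[idx]:.3f}  {documents[idx]}")

In practice the document embeddings are precomputed and stored in a nearest-neighbour index, so only the query needs to be encoded at search time.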
BERT (Bidirectional Encoder Representations
from Transformers) (Devlin et al., 2018) is a state-
of-the-art language representation model that has
achieved very good results on a variety of natural lan-
guage processing tasks. It is a deep learning model
that uses a transformer architecture to process se-
quences of text and generate high-quality representa-
tions. The key innovation of BERT is its use of bidi-
rectional processing, which allows it to capture both
forward and backward contextual information about
a given word. Rather than reading the text in a single direction, the transformer's self-attention layers attend jointly to the tokens on both sides of every position in the input sequence. This allows BERT to capture information about the context in which a word appears, including the words that come before and after it. In
addition to bidirectional processing, BERT also uses
several other techniques to improve its performance.
These include: (i) multi-head self-attention, which allows the model to selectively focus on different parts of the input text; (ii) a masked language modeling objective, which encourages the model to predict missing words based on the context in which they appear; and (iii) a next sentence prediction task, which encourages the model to understand the relationships between different sentences in a document.
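As a concrete illustration of the masked language modeling objective, the sketch below uses the fill-mask pipeline of the Hugging Face transformers library with the original English BERT checkpoint; the checkpoint name and the example sentence are illustrative choices, not part of the cited work.

# Masked language modeling: BERT predicts a masked token from its
# bidirectional context. Assumes the Hugging Face transformers library;
# "bert-base-uncased" is the original English checkpoint, used here
# purely as an example.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Both the left context ("Rome") and the right context ("the capital of
# the Roman Empire") are visible to the model when it fills the mask.
for prediction in unmasker("Rome [MASK] the capital of the Roman Empire."):
    print(f"{prediction['score']:.3f}  {prediction['token_str']}")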
Bamman and Burns (2020) introduced Latin BERT, a contextual language model
for the Latin language that was trained on a large
corpus of 642.7 million words from various sources
spanning the Classical era to the 21st century. The au-
thors demonstrated the capabilities of this language-
specific model through several case studies, including
its use for part-of-speech tagging, where Latin BERT
achieves new state-of-the-art performance on three Universal Dependencies Latin datasets. The model is
also used for predicting missing text, including criti-
cal emendations, and outperforms static word embed-
dings for word sense disambiguation. Furthermore,
the study shows that Latin BERT can be used for
semantically-informed search by querying contextual
nearest neighbors.
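The semantically-informed search described above amounts to nearest-neighbour comparison of contextual token embeddings. The sketch below illustrates the general idea; because Latin BERT is distributed with its own tokenizer and loading code, the multilingual BERT checkpoint is used here as a stand-in, and the sentences and target words are purely illustrative assumptions.

# Nearest-neighbour search over contextual token embeddings.
# Multilingual BERT is a stand-in for Latin BERT, which ships with its
# own tokenizer; sentences and target words are illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Mean of the contextual subword vectors of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    target = sentence.split().index(word)
    positions = [i for i, w in enumerate(enc.word_ids(0)) if w == target]
    return hidden[positions].mean(dim=0)

corpus = [
    ("arma virumque cano", "arma"),
    ("carmina nulla cano", "carmina"),
    ("bella horrida bella", "bella"),
]
query_vec = word_vector("arma antiqua gero", "arma")

# Rank corpus occurrences by cosine similarity to the query occurrence.
for sentence, word in corpus:
    sim = torch.cosine_similarity(query_vec, word_vector(sentence, word), dim=0)
    print(f"{sim:.3f}  {word!r} in {sentence!r}")

Unlike static word embeddings, each occurrence of a word receives its own vector, so occurrences used in similar contexts end up close together in the embedding space.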
LaBSE is a multilingual sentence embedding
model that is based on the BERT architecture (Feng
et al., 2022). The authors systematically investi-
gated methods for learning cross-lingual sentence em-
beddings by combining the best methods for learn-
ing monolingual and cross-lingual representations, in-
cluding masked language modeling (MLM), transla-
tion language modeling (TLM), dual encoder transla-
tion ranking, and additive margin softmax. The au-
thors showed that introducing a pre-trained multilin-
gual language model dramatically reduces the amount
of parallel training data required to achieve good
performance. Composing the best of these meth-
ods produced a model that achieves 83.7% bi-text
retrieval accuracy across 112 languages on the Tatoeba dataset, compared with the 65.5% accuracy of previous state-of-the-art models, while performing competitively on monolingual transfer learning bench-
marks. The authors also demonstrated the effective-
ness of the LaBSE model by mining parallel data from
the CommonCrawl repository and using it to train com-
petitive Neural Machine Translation (NMT) models
for English-Chinese and English-German.
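The dual-encoder bi-text retrieval setting can be made concrete with the sentence-transformers release of LaBSE; the English-German sentence pairs below are toy examples, not Tatoeba or CommonCrawl data. Source and candidate sentences are embedded independently, and each source is matched to the candidate with the highest cosine similarity.

# Bi-text retrieval with LaBSE: the dual encoder embeds source and
# candidate sentences independently; each English sentence is matched
# to the German candidate with the highest cosine similarity.
# Assumes the sentence-transformers release of LaBSE; the sentence
# pairs are toy examples, not Tatoeba or CommonCrawl data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = [
    "The senate met at dawn.",
    "The library holds many manuscripts.",
]
german = [
    "Die Bibliothek besitzt viele Handschriften.",
    "Der Senat trat im Morgengrauen zusammen.",
]

src_emb = model.encode(english, convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = model.encode(german, convert_to_tensor=True, normalize_embeddings=True)

# Similarity matrix over all source/candidate pairs; the argmax of each
# row is the retrieved translation for that English sentence.
sims = util.cos_sim(src_emb, tgt_emb)
for i, row in enumerate(sims):
    j = int(row.argmax())
    print(f"{english[i]!r} -> {german[j]!r} ({float(row[j]):.3f})")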
One recent work in language understanding that
leverages contextualized features is Semantics-aware
BERT (SemBERT) (Zhang et al., 2020). SemBERT
incorporates explicit contextual semantics from pre-
trained semantic role labeling, improving BERT’s
language representation capabilities. SemBERT is ca-
pable of absorbing contextual semantics without sub-
stantial task-specific changes, with a more powerful
and simple design compared to BERT. It has achieved
new state-of-the-art results in various machine read-
ing comprehension and natural language inference
tasks.
For Latin-based IR, Lendvai and Wick (2022) fine-tuned Latin BERT for Word Sense Disambiguation on the Thesaurus Linguae Latinae. Their work uses Latin BERT to cre-
ate a new dataset based on a subset of representations
in the Thesaurus Linguae Latinae. The results of the
study showed that the contextualized BERT represen-
tations fine-tuned on TLL data perform better than
static embeddings used in a bidirectional LSTM clas-
sifier on the same dataset. Moreover, the per-lemma