Large Language Models for Summarizing Czech Historical Documents and Beyond

Václav Tran¹, Jakub Šmíd¹,², Jiří Martínek¹,², Ladislav Lenc¹,² and Pavel Král¹,²

¹ Department of Computer Science and Engineering, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic
² NTIS - New Technologies for the Information Society, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic

nuva@students.zcu.cz, {jaksmid, llenc, jimar, pkral}@kiv.zcu.cz
Keywords: Czech Text Summarization, Deep Neural Networks, Mistral, mT5, Posel od Čerchova, SumeCzech, Transformer Models.
Abstract:
Text summarization is the task of shortening a larger body of text into a concise version while retaining
its essential meaning and key information. While summarization has been significantly explored in English
and other high-resource languages, Czech text summarization, particularly for historical documents, remains
underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models
such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and
languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions:
(1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these
advanced models, and (2) introducing a novel dataset called Posel od Čerchova for the summarization of historical Czech documents, with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.
1 INTRODUCTION
The rapid evolution of Natural Language Processing
(NLP) techniques has elevated the performance of
text summarization systems. While most advances
focus on high-resource languages like English, the
Czech language, particularly historical variations, re-
mains underrepresented. Historical Czech documents
pose unique challenges due to linguistic shifts, out-
dated vocabulary, and inconsistent syntax. These nu-
ances create a significant gap in the development of
automated summarization systems capable of han-
dling this domain effectively.
Therefore, this paper addresses two interlinked
challenges. First, it seeks to establish new state-of-
the-art benchmarks on SumeCzech, the most com-
prehensive dataset for modern Czech text summariza-
tion using modern Large Language Models (LLMs),
namely Mistral (Jiang et al., 2023) and mT5 (Xue et al., 2021b). Second, recognizing the lack of resources tailored for historical Czech, we introduce a newly created dataset derived from the historical journal Posel od Čerchova. The dataset is specifically designed to facilitate summarization tasks in historical contexts, enabling future researchers to address the linguistic complexities inherent in this domain. This corpus is freely available for research purposes at https://corpora.kiv.zcu.cz/posel od cerchova/.
By combining model advancements and dataset
innovation, this research aims to drive progress in the
Czech summarization field and open avenues for ap-
plications in cultural preservation, historical research,
and digital humanities.
2 RELATED WORK
Text summarization methods can be categorized into
abstractive and extractive ones. Extractive sum-
marization selects the most representative sentences
from the source document, while abstractive summa-
rization generates summaries composed of newly cre-
ated sentences.
Early summarization methods were extractive and relied on statistical and graph-based techniques such as TF-IDF (Term Frequency-Inverse Document
Frequency) (Christian et al., 2016), which scores sen-
tence importance based on term frequency relative to
rarity across a corpus. Similarly, TextRank (Mihal-
cea and Tarau, 2004) represents sentences as nodes
in a graph and ranks them using the PageRank algo-
rithm (Page et al., 1999).
Neural networks advanced both extractive
and abstractive summarization by model-
ing sequences with Recurrent Neural Networks
(RNNs) (Elman, 1990). One extractive approach
involves sequence-to-sequence architectures where
LSTM models capture the contextual importance of
each sentence within a document (Nallapati et al.,
2017). Hierarchical attention networks combine
sentence-level and word-level attention to better
capture document structure and relevance for sum-
marization (Yang et al., 2016). This approach
has proven effective in summarizing longer and
more complex documents. Hybrid approaches
combining BERT embeddings (Devlin et al., 2019)
with K-Means clustering (Lloyd, 1982) to identify
key sentences (Miller, 2019) have shown excellent
performance for extractive summarization.
Advances in sequence-to-sequence Transformer-
based models (Vaswani et al., 2017) have revolution-
ized abstractive summarization. Recent models like
T5 (Raffel et al., 2020a) adopt a text-to-text frame-
work and excel in various tasks, including summa-
rization, due to pre-training on the C4 dataset. PE-
GASUS (Zhang et al., 2019) introduces gap-sentence generation, which masks key sentences during pre-training, achieving strong performance on 12
datasets. Similarly, BART (Lewis et al., 2019) uses
denoising objectives for robust text summary gener-
ation. Multilingual models such as mT5 (Xue et al.,
2021b) and mBART (Liu et al., 2020) extend these
capabilities to multiple languages, including Czech,
through datasets like mC4 (Xue et al., 2021a) and the multilingual Common Crawl (http://commoncrawl.org/).
However, these models often underperform on
non-English corpora without fine-tuning.
3 DATASETS
The following section provides a brief review of the
primary existing summarization datasets. Moreover, the created Posel od Čerchova corpus is also detailed at the end of this section.
3.1 English Datasets
CNN/Daily Mail (Hermann et al., 2015): consists of over 300,000 English news articles, each
paired with highlights written by the article authors. It
has been widely used in summarization and question-
answering tasks, evolving through several versions
tailored for specific NLP tasks.
XSum (Narayan et al., 2018): contains 226,000
single-sentence summaries paired with BBC articles
covering diverse domains such as news, sports, and
science. Its focus on single-sentence summarization
makes it less biased toward extractive methods.
Arxiv Dataset (Cohan et al., 2018): includes 215,000
pairs of scientific papers and their abstracts sourced
from arXiv. It has been cleaned and formatted to en-
sure standardization, with sections like figures and ta-
bles removed.
BOOKSUM (Kryscinski et al., 2022): a dataset
tailored for summarizing long texts like novels, plays,
and stories, with summaries provided at paragraph,
chapter, and book levels. Texts and summaries
were sourced from Project Gutenberg and other web
archives, supporting both extractive and abstractive
summarization.
3.2 Multilingual Datasets
XLSum (Hasan et al., 2021): provides over one mil-
lion article-summary pairs across 44 languages, rang-
ing from low-resource languages like Bengali and
Swahili to high-resource languages such as English
and Russian. Extracted from various BBC sites, this
dataset is a valuable resource for multilingual summa-
rization research.
MLSUM (Scialom et al., 2020): consists of 1.5 mil-
lion article-summary pairs in five languages: German,
Russian, French, Spanish, and Turkish. The dataset
was created by archiving news articles from well-
known newspapers, including Le Monde and El Pais,
with a focus on ensuring broad topic coverage.
The above-mentioned datasets are for English
summarization, and some are multilingual; however,
Czech resources remain very limited.
3.3 SumeCzech
The large-scale SumeCzech dataset (Straka et al., 2018) is
a notable exception to the scarcity of Czech-specific
resources. This dataset was created at the Institute
of Formal and Applied Linguistics at Charles Uni-
versity and is tailored for summarization tasks in the
Czech language. It comprises one million Czech news
articles. These articles are sourced from five ma-
jor Czech news sites: České Noviny, Deník, iDNES, Lidovky, and Novinky.cz. Each document is struc-
tured in JSONLines format, with fields for the URL,
headline, abstract, text, subdomain, section, and pub-
lication date. The preprocessing includes language
recognition, duplicate removal, and filtering out en-
tries with empty or excessively short headlines, ab-
stracts, or texts.
This dataset supports multiple summarization
tasks, such as headline generation and multi-sentence
abstract generation. The training, development, and
testing splits follow a roughly 86.5/4.5/4.5 ratio. The
average word count is 409 for full texts and 38 for
abstracts.
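For illustration, a minimal sketch of reading such JSONLines records and forming the two task targets is shown below; the exact JSON key names (e.g., "headline", "abstract", "text") and the file name are assumptions inferred from the field list above, not specifics confirmed by the dataset documentation.

```python
import json

def read_sumeczech(path: str):
    """Yield one article per line from a SumeCzech JSONLines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Key names are assumed from the fields listed above (URL, headline, abstract,
# text, subdomain, section, publication date); the file name is hypothetical.
for article in read_sumeczech("sumeczech-train.jsonl"):
    headline_pair = (article["text"], article["headline"])  # headline generation task
    abstract_pair = (article["text"], article["abstract"])   # multi-sentence abstract task
```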
Nevertheless, this dataset caters exclusively to
modern Czech and fails to address the needs of his-
torical text processing.
3.4 Posel od Čerchova
To construct the dataset, we used data from the historical journal Posel od Čerchova (POC), which is available on the archival portal Porta fontium (https://www.portafontium.eu).
The construction of the dataset involved address-
ing the challenge of creating summaries for the pro-
vided texts, which were composed in historical Czech
and, in some rare cases, even German. The texts also
covered a variety of different topics, from local news
surrounding Domažlice (a historic town in the Czech Republic), opinion pieces, and various local advertisements to domestic and international politics and feuil-
letons. Furthermore, it was important to construct a
dataset of sufficient size to ensure the accuracy and re-
liability of the evaluation. These aspects added com-
plexity to the summarization task.
To overcome the mentioned issues, we employed
state-of-the-art (SOTA) LLMs GPT-4 (OpenAI, 2024)
and Claude 3 Opus (Anthropic, 2024), hereafter Opus (specifically the claude-3-opus-20240229 version), for initial text summary creation. These models were se-
lected based on their SOTA performance in many
NLP tasks and excellent performance in some prelim-
inary summarization experiments.
While generating the summaries, it was essential
to ensure conciseness. Since most of the implemented
methods were fine-tuned on the SumeCzech dataset,
we aimed to maintain consistency by creating sum-
maries in a journalistic style, reflecting the dataset’s
characteristics. To achieve this, the prompts for gen-
erating the summaries included explicit instructions,
as shown below:
Vytvoř shrnutí následujícího textu ve stylu novináře. Počet vět <= 5; (EN: Create a summary of the following text in the style of a journalist. Number of sentences <= 5)
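For illustration, a minimal sketch of issuing this prompt to the claude-3-opus-20240229 model through the Anthropic Python client follows; the generation parameters (e.g., max_tokens) are assumed values rather than settings reported in this work, and an analogous request can be made to GPT-4.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

PROMPT = ("Vytvoř shrnutí následujícího textu ve stylu novináře. "
          "Počet vět <= 5\n\n{text}")

def summarize_page(page_text: str) -> str:
    """Request a journalistic summary of at most five sentences for one page."""
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,  # assumed limit; the exact value used is not reported
        messages=[{"role": "user", "content": PROMPT.format(text=page_text)}],
    )
    return response.content[0].text
```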
During the summarization task, we observed that
while both models produced summaries of very good
quality, Opus tended to create more succinct and
stylistically appropriate ones, closely aligning with
the news reporter format. However, there were in-
stances where summaries generated by Opus exhib-
ited an excessive focus on a single topic.
On the other hand, GPT-4 aimed to incorporate
a greater level of detail within the five-sentence con-
straint but occasionally deviated from the specified
stylistic prompt.
If the model-generated summary exhibited signifi-
cant stylistic deviations or excessive focus on a single
topic, we either modified or regenerated it until a cor-
rect version was achieved.
Summaries were created at two levels: the first at the page level, and the second for a whole issue, which is usually composed of several pages. We thus summarized 432 pages, resulting in 100 issue summaries. The
subset containing page summaries is hereafter re-
ferred to as POC-P, while the issue summaries are
referred to as POC-I. Note that all created summaries
were checked and corrected manually by two native
Czech speakers.
The dataset is in the .json format and contains the
following information:
text. Text extracted from the given page, a digital
rendition of the original printed content;
summary. Summary of the page, which is no
more than 5 sentences long;
year. Publication year of the journal;
journal. Specification of the source journal: the
day, month, and the number of the issue is con-
tained within this identifier;
page src. Name of the source image file con-
verted into the text;
page num. Page number.
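To make the schema concrete, a hypothetical record and a minimal loading sketch are given below; the field values are invented, and the underscored key spellings (page_src, page_num) as well as the assumption of one JSON array per file are illustrative guesses, not specifics taken from the corpus.

```python
import json

# Hypothetical record; values are invented, and the key spellings
# page_src / page_num are assumptions based on the field list above.
example_record = {
    "text": "Plné znění stránky časopisu Posel od Čerchova ...",
    "summary": "Novinářské shrnutí stránky o délce nejvýše pěti vět.",
    "year": 1890,
    "journal": "identifier with day, month, and issue number",
    "page_src": "scan_0001.jpg",
    "page_num": 1,
}

def load_poc(path: str) -> list:
    """Load the POC corpus, assuming one JSON array of page records per file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```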
This dataset is designed to support summarization
tasks within Czech historical contexts, providing re-
searchers with the tools to tackle the linguistic chal-
lenges unique to this domain. The corpus is freely
accessible for research purposes at https://corpora.kiv.zcu.cz/posel od cerchova/.
4 MODELS
The experiments employ two advanced Transformer-
based models, Multilingual Text-to-Text Transfer
Transformer (mT5) (Xue et al., 2021b) and Mistral
7B (Jiang et al., 2023).
4.1 Multilingual Text-to-Text Transfer
Transformer
The Multilingual Text-to-Text Transfer Transformer
(mT5) is a variant of the T5 model designed for mul-
tilingual tasks. This model is trained on the multi-
lingual mC4 dataset (Xue et al., 2021a), which in-
cludes Czech, and effectively handles a wide range
of languages. The model is based on Transformer
encoder-decoder architecture and uses a Sentence-
Piece tokenizer (Kudo and Richardson, 2018) to pro-
cess complex language structures, including Czech
morphology. Pre-trained using a span corruption ob-
jective (Raffel et al., 2020b), mT5 predicts masked
spans of text, enabling it to learn semantic and con-
textual relationships.
The mT5 model is available in various sizes, from
small with 300 million parameters to XXL with 13
billion parameters, and can therefore be adapted to different computational needs. The base variant of mT5, which contains 580 million parameters, is used in our experiments.
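As a minimal sketch, the base checkpoint can be loaded from the HuggingFace hub as follows; note that the raw google/mt5-base model has to be fine-tuned on SumeCzech before its output is a usable summary, and the generation parameters below are illustrative only.

```python
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

# google/mt5-base is the public 580M-parameter checkpoint; in this setting it is
# fine-tuned on SumeCzech before being used for summary generation.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def summarize(text: str, max_new_tokens: int = 64) -> str:
    """Generate an abstract-style summary with beam search (illustrative settings)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```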
4.2 Mistral Language Model
The Mistral Language Model (Mistral LM) is a highly
efficient large language model known for its robust
performance across diverse natural language process-
ing tasks. It is designed to combine high accuracy
with computational efficiency, achieving state-of-the-
art results in reasoning, text generation, summariza-
tion, and other NLP applications. Mistral 7B, with its
7 billion parameters, strikes a balance between com-
putational efficiency and task performance, surpassing larger models such as Llama 2 13B and Llama 1 34B on several benchmarks.
This model utilizes advanced attention mecha-
nisms like Grouped-Query Attention (GQA) (Ainslie
et al., 2023) and Sliding Window Attention
(SWA) (Beltagy et al., 2020). GQA speeds up inference by letting groups of query heads share the same key and value heads, while SWA reduces computa-
tional costs by limiting token attention to nearby to-
kens. The model supports techniques such as quanti-
zation (Gholami et al., 2021) and Low-Rank Adapta-
tion (LoRA) (Hu et al., 2021) for efficient fine-tuning on limited hardware, while its sliding window attention enables it to handle longer inputs effectively.
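To illustrate the effect of SWA, a small didactic sketch of the corresponding banded attention mask is shown below; this is not the actual Mistral implementation, and the window size of 4 is chosen only for readability (Mistral 7B uses a much larger window).

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where entry [q, k] is True if query position q may attend
    to key position k, i.e. k is one of the `window` most recent positions."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions as rows
    k = torch.arange(seq_len).unsqueeze(0)  # key positions as columns
    return (q - k >= 0) & (q - k < window)

# Each row shows which earlier tokens the corresponding query position can attend to.
print(sliding_window_mask(6, 4).int())
```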
5 EXPERIMENTS
5.1 Evaluation Metrics
The following evaluation metrics are used.
ROUGE (Recall-Oriented Understudy for Gisting
Evaluation) (Lin, 2004) is a set of metrics used to
evaluate the quality of summaries by comparing n-
gram overlaps between a system-generated summary
and reference texts. Key ROUGE metrics include
ROUGE-N (for n-gram overlap) and ROUGE-L (for
the longest common subsequence).
ROUGE_RAW (Straka and Straková, 2018) is a
variant of ROUGE that evaluates raw token-level
overlaps between predicted and reference texts with-
out any preprocessing like stemming or lemmatiza-
tion. It measures exact matches of tokens, making
it suitable for tasks where precise token alignment is
important.
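As an illustration of the preprocessing-free, token-level matching that ROUGE_RAW performs, a minimal re-implementation of the unigram variant is sketched below; whitespace tokenization is a simplification, and the official SumeCzech evaluation script should be used to obtain numbers comparable to the tables that follow.

```python
from collections import Counter

def rouge_raw_1(prediction: str, reference: str):
    """Unigram precision, recall, and F1 on raw (un-stemmed, un-lemmatized) tokens.

    Whitespace tokenization is a simplification of the official ROUGE_RAW tokenizer.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```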
5.2 Set-up
We used the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 0.001, as suggested by the authors of mT5 (Xue et al., 2021b), to train this model. For Mistral 7B, we utilized
QLoRA (Dettmers et al., 2024), a method that inte-
grates a 4-bit quantized model with a small, newly
introduced set of learnable parameters. During fine-
tuning, only these additional parameters are updated
while the original model remains frozen, thereby sub-
stantially reducing memory requirements. We em-
ploy the models from the HuggingFace Transformers
library (Wolf et al., 2020). For training both mod-
els, we used a single NVIDIA A40 GPU with 45 GB
VRAM.
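A minimal sketch of this QLoRA setup with the HuggingFace Transformers and PEFT libraries is given below; the checkpoint name, LoRA rank, target modules, and other hyperparameters are assumptions for illustration, as the exact values are not reported here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base model (the "Q" in QLoRA); requires bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed public checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Small set of trainable low-rank adapters; rank and target modules are
# illustrative choices, not values reported in this paper.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated during fine-tuning
```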
5.3 Model Variants
We use three variants of the models in our experi-
ments:
M7B-SC: The Mistral 7B model fine-tuned on the
SumeCzech dataset;
M7B-POC: The Mistral 7B model further fine-
tuned on the POC dataset;
mT5-SC: The mT5 model fine-tuned on the
SumeCzech dataset.
Table 1: Results of various methods on the SumeCzech dataset with precision (P), recall (R), and F1-score (F).

Method                               ROUGE_RAW-1         ROUGE_RAW-2         ROUGE_RAW-L
                                     P     R     F       P     R     F       P     R     F
M7B-SC                               24.4  19.7  21.2    6.5   5.3   5.7     17.8  14.5  15.5
mT5-SC                               22.0  17.9  19.2    5.3   4.3   4.6     16.1  13.2  14.1
HT2A-S (Krotil, 2022)                22.9  16.0  18.2    5.7   4.0   4.6     16.9  11.9  13.5
First (Straka et al., 2018)          13.1  17.9  14.4    0.1   9.8   0.2     1.1   8.8   0.9
Random (Straka et al., 2018)         11.7  15.5  12.7    0.1   2.0   0.1     0.7   10.3  0.8
Textrank (Straka et al., 2018)       11.1  20.8  13.8    0.1   6.0   0.3     0.7   13.4  0.8
Tensor2Tensor (Straka et al., 2018)  13.2  10.5  11.3    0.1   2.0   0.1     0.2   8.1   0.8
Table 2: Results of implemented methods on the POC-P subset from the Posel od Čerchova dataset with precision (P), recall (R), and F1-score (F).

Method     ROUGE_RAW-1         ROUGE_RAW-2         ROUGE_RAW-L
           P     R     F       P     R     F       P     R     F
M7B-POC    23.5  17.4  19.6    4.8   3.5   4.0     16.6  12.2  13.8
mT5-SC     20.2  8.2   11.1    1.4   0.5   0.7     14.9  6.1   8.2
Table 3: Results of implemented methods on the POC-I subset from the Posel od Čerchova dataset with precision (P), recall (R), and F1-score (F).

Method     ROUGE_RAW-1         ROUGE_RAW-2         ROUGE_RAW-L
           P     R     F       P     R     F       P     R     F
M7B-POC    19.3  17.6  18.0    3.2   2.8   2.9     13.7  12.4  12.8
mT5-SC     18.2  5.9   8.6     1.0   0.3   0.4     14.0  4.5   6.5
5.4 Results on the SumeCzech Dataset
This experiment compares the results of the proposed mT5-SC and M7B-SC models with related work on the SumeCzech dataset (see Table 1).
The first comparative method, HT2A-S (Krotil,
2022), is based on the mBART model, which is fur-
ther fine-tuned on the SumeCzech dataset. The other
methods provided by the authors of the SumeCzech
dataset (Straka et al., 2018) are as follows: First, Ran-
dom, Textrank and Tensor2Tensor (Vaswani et al.,
2018).
Table 1 demonstrates that the proposed M7B-SC
method is very efficient, outperforming all other base-
lines and achieving new state-of-the-art results on this
dataset. Furthermore, the second proposed approach,
mT5-SC, also performs remarkably well, consistently
obtaining the second-best results.
5.5 Results on the Posel od Čerchova Dataset
This section evaluates the proposed methods on the
Posel od Čerchova dataset. Table 2 shows the results on the POC-P subset, which contains summaries for individual pages (106 pages), while Table 3 reports the results on the POC-I subset, which is composed of summaries of whole issues (25 issues).
These tables clearly show that, as in the previous case, the M7B-POC model gives significantly better results than the mT5-SC model, by a very large margin.
6 CONCLUSIONS
This paper explored the application of state-of-the-
art large language models, specifically Mistral 7B
and mT5, for summarization of Czech texts, ad-
dressing both modern and historical contexts. Our
experiments demonstrated that the proposed M7B-
SC model establishes a new benchmark for the
SumeCzech dataset, achieving state-of-the-art per-
formance, while the mT5-SC model also performed
strongly, consistently ranking second.
Furthermore, we introduced a novel dataset, Posel od Čerchova, dedicated to the summarization of historical Czech documents. By leveraging this dataset,
we provided baseline results and highlighted the
unique challenges posed by historical Czech texts.
These contributions not only advance the field of
Czech text summarization but also pave the way for
future research in processing historical documents,
offering significant opportunities in cultural preserva-
tion and digital humanities. Future work could focus
on further enhancing summarization quality, explor-
ing hybrid modeling approaches, and extending the
dataset for multilingual and cross-temporal studies.
ACKNOWLEDGEMENTS
This work was created with the partial support of
the project R&D of Technologies for Advanced Dig-
italization in the Pilsen Metropolitan Area (Dig-
iTech) No. CZ.02.01.01/00/23 021/0008436 and by
the Grant No. SGS-2022-016 Advanced methods
of data processing and analysis. Computational re-
sources were provided by the e-INFRA CZ project
(ID:90254), supported by the Ministry of Education,
Youth and Sports of the Czech Republic.
REFERENCES
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y.,
Lebrón, F., and Sanghai, S. (2023). GQA: Train-
ing generalized multi-query transformer models from
multi-head checkpoints.
Anthropic (2024). The Claude 3 Model Family: Opus, Son-
net, Haiku.
Beltagy, I., Peters, M. E., and Cohan, A. (2020). Long-
former: The long-document transformer.
Christian, H., Agus, M., and Suhartono, D. (2016). Sin-
gle document automatic text summarization using
term frequency-inverse document frequency (tf-idf).
ComTech: Computer, Mathematics and Engineering
Applications, 7:285.
Cohan, A., Dernoncourt, F., Kim, D. S., Bui, T., Kim, S.,
Chang, W., and Goharian, N. (2018). A discourse-
aware attention model for abstractive summarization
of long documents. In Walker, M., Ji, H., and Stent,
A., editors, Proceedings of the 2018 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 2 (Short Papers), pages 615–621,
New Orleans, Louisiana. Association for Computa-
tional Linguistics.
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer,
L. (2024). Qlora: efficient finetuning of quantized
llms. In Proceedings of the 37th International Con-
ference on Neural Information Processing Systems,
NIPS ’23, Red Hook, NY, USA. Curran Associates
Inc.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding.
Elman, J. L. (1990). Finding structure in time. Cognitive
Science, 14(2):179–211.
Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W.,
and Keutzer, K. (2021). A survey of quantization
methods for efficient neural network inference.
Hasan, T. et al. (2021). Xlsum: A multilingual dataset
for summarization. In Findings of the Association
for Computational Linguistics: EMNLP 2021, pages
2133–2149.
Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt,
L., Kay, W., Suleyman, M., and Blunsom, P. (2015).
Teaching machines to read and comprehend. In
Advances in Neural Information Processing Systems
(NeurIPS), pages 1693–1701.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. (2021). Lora: Low-rank
adaptation of large language models.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford,
C., Chaplot, D. S., de las Casas, D., Bressand, F.,
Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R.,
Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T.,
Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mis-
tral 7b.
Krotil, M. (2022). Text summarization methods in czech.
Bachelor’s thesis, Czech Technical University in
Prague, Faculty of Electrical Engineering, Depart-
ment of Cybernetics.
Kryscinski, W., Rajani, N., Agarwal, D., Xiong, C., and
Radev, D. (2022). BOOKSUM: A collection of
datasets for long-form narrative summarization. In
Goldberg, Y., Kozareva, Z., and Zhang, Y., edi-
tors, Findings of the Association for Computational
Linguistics: EMNLP 2022, pages 6536–6558, Abu
Dhabi, United Arab Emirates. Association for Com-
putational Linguistics.
Kudo, T. and Richardson, J. (2018). SentencePiece: A sim-
ple and language independent subword tokenizer and
detokenizer for neural text processing. In Blanco, E.
and Lu, W., editors, Proceedings of the 2018 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing: System Demonstrations, pages 66–71, Brus-
sels, Belgium. Association for Computational Lin-
guistics.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mo-
hamed, A., Levy, O., Stoyanov, V., and Zettlemoyer,
L. (2019). Bart: Denoising sequence-to-sequence pre-
training for natural language generation, translation,
and comprehension.
Lin, C.-Y. (2004). ROUGE: A package for automatic evalu-
ation of summaries. In Text Summarization Branches
Out, pages 74–81, Barcelona, Spain. Association for
Computational Linguistics.
Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvinine-
jad, M., Lewis, M., and Zettlemoyer, L. (2020). Mul-
tilingual denoising pre-training for neural machine
translation. Transactions of the Association for Com-
putational Linguistics, 8:726–742.
Lloyd, S. (1982). Least squares quantization in pcm. IEEE
Transactions on Information Theory, 28(2):129–137.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight de-
cay regularization. arXiv preprint arXiv:1711.05101.
Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing or-
der into text. In Lin, D. and Wu, D., editors, Pro-
ceedings of the 2004 Conference on Empirical Meth-
ods in Natural Language Processing, pages 404–411,
Barcelona, Spain. Association for Computational Lin-
guistics.
Miller, D. (2019). Leveraging bert for extractive text sum-
marization on lectures.
Nallapati, R., Zhai, F., and Zhou, B. (2017). Summarunner:
A recurrent neural network based sequence model for
extractive summarization of documents. In Proceed-
ings of the Thirty-First AAAI Conference on Artificial
Intelligence (AAAI), pages 3075–3081.
Narayan, S., Cohen, S. B., and Lapata, M. (2018). Ex-
treme summarization (xsum). In Proceedings of the
2018 Conference on Empirical Methods in Natural
Language Processing, pages 931–936.
OpenAI (2024). Gpt-4 technical report.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999).
The pagerank citation ranking : Bringing order to the
web. In The Web Conference.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020a).
Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of Machine Learning
Research, 21(140):1–67.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020b).
Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of Machine Learning
Research, 21(140):1–67.
Scialom, T. et al. (2020). Mlsum: Multilingual summariza-
tion dataset. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Process-
ing, pages 2146–2161.
Straka, M., Mediankin, N., Kocmi, T., Žabokrtský, Z., Hudeček, V., and Hajič, J. (2018). SumeCzech: Large
Czech news-based summarization dataset. In Pro-
ceedings of the Eleventh International Conference on
Language Resources and Evaluation (LREC 2018),
Miyazaki, Japan. European Language Resources As-
sociation (ELRA).
Straka, M. and Straková, J. (2018). Rougeraw: Language-
agnostic evaluation for summarization. Proceedings
of the International Conference on Computational
Linguistics.
Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez,
A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner,
N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkor-
eit, J. (2018). Tensor2tensor for neural machine trans-
lation. CoRR, abs/1803.07416.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L. u., and Polosukhin,
I. (2017). Attention is all you need. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue,
C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtow-
icz, M., Davison, J., Shleifer, S., von Platen, P., Ma,
C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger,
S., Drame, M., Lhoest, Q., and Rush, A. M. (2020).
Transformers: State-of-the-art natural language pro-
cessing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing:
System Demonstrations, pages 38–45, Online. Asso-
ciation for Computational Linguistics.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou,
R., Siddhant, A., Barua, A., and Raffel, C. (2021a).
mC4: A massively multilingual cleaned crawl corpus.
In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 7517–7532, Online and Punta Cana, Dominican
Republic. Association for Computational Linguistics.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou,
R., Siddhant, A., Barua, A., and Raffel, C. (2021b).
mT5: A massively multilingual pre-trained text-to-
text transformer. In Toutanova, K., Rumshisky,
A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I.,
Bethard, S., Cotterell, R., Chakraborty, T., and Zhou,
Y., editors, Proceedings of the 2021 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 483–498, Online. Association for Compu-
tational Linguistics.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy,
E. (2016). Hierarchical attention networks for docu-
ment classification. In Proceedings of the 2016 Con-
ference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language
Technologies, pages 1480–1489.
Zhang, J., Zhao, Y., Saleh, M., and Liu, P. J. (2019). Pe-
gasus: Pre-training with extracted gap-sentences for
abstractive summarization.