Comparative Study of Large Language Models Applied to the
Classification of Accountability Documents

Pedro Vinnícius Bernhard¹, João Dallyson Sousa de Almeida², Anselmo Cardoso de Paiva²,
Geraldo Braz Junior², Renan Coelho de Oliveira³, Luís Jorge Enrique Rivero Cabrejos²
and Darlan Bruno Pontes Quintanilha¹

¹Programa de Pós-Graduação em Ciência da Computação (PPGCC), Federal University of Maranhão - UFMA, Brazil
²Federal University of Maranhão - UFMA, Applied Computing Group - NCA/UFMA,
Av. dos Portugueses, SN, Campus Bacanga, Bacanga, CEP: 65085-584, São Luís, MA, Brazil
³Tribunal de Contas do Estado do Maranhão, Av. Carlos Cunha S/Nº, Jaracaty, São Luís, 65076-820, MA, Brazil
{pedro.bernhard, jdallyson, paiva, geraldo, luisrivero, dquintanilha}@nca.ufma.br, rcoliveira@tcema.tc.br
Keywords:
Large Language Models, Natural Language Processing, Document Classification, Accountability, Public
Accountability.
Abstract:
Public account oversight is crucial and is facilitated by electronic accountability systems. Through these systems,
audited entities submit electronic documents related to government and management accounts, categorized
according to regulatory guidelines. Accurate document classification is vital for adhering to court standards.
Advanced technologies, including Large Language Models (LLMs), offer promise in optimizing this process.
This study examines the use of LLMs to classify documents pertaining to annual accounts received by regulatory bodies. Three LLMs were examined: mBERT, XLM-RoBERTa, and mT5. They were applied to a dataset of extracted texts compiled specifically for this research, based on documents provided by the Tribunal de Contas do Estado do Maranhão (TCE/MA), and evaluated using the F1-score. XLM-RoBERTa achieved an F1-score of 98.99% ± 0.12%, while mBERT achieved 98.65% ± 0.29% and mT5 98.71% ± 0.75%. These results highlight the effectiveness of LLMs in classifying accountability documents and contribute to advances in natural language processing. Such approaches can be exploited to improve automation and accuracy in document classification.
1 INTRODUCTION
A Brazilian Court of Accounts oversees the auditing
of administrators responsible for public finances at the
state and municipal levels (Lenza, 2020). These
administrators submit documents via an electronic
Annual Accountability System, following Normative
Instructions that categorize submissions (TCE/MA,
2023c,b,a). Typically, the entity’s holder, technical
manager, or an accredited third party manages this
process.
Automating document classification can enhance
efficiency and optimize human resources in account-
ability procedures (Stites et al., 2023). Extracting in-
sights from large textual datasets is crucial for orga-
nizations, researchers, and professionals (Wan et al.,
2019). In this context, Natural Language Processing
(NLP) and neural networks have demonstrated sig-
nificant potential (Khurana et al., 2023). An auto-
mated classification model can streamline workflows,
improving resource allocation and processing speed
(Stites et al., 2023).
Neural networks have excelled in solving com-
plex problems because they can learn patterns and
represent non-linear information. Combining this ap-
proach with natural language processing makes it pos-
sible to extract relevant characteristics from docu-
ments and use this information to classify them ef-
ficiently and accurately (Khurana et al., 2023).
Recently, there has been a considerable increase
in public interest in artificial intelligence models for
natural language processing, such as Large Language
Models (LLMs) (Naveed et al., 2023). In natural lan-
guage processing, text classification is an area with
few works in Portuguese focusing on real, multi-page
documents. The task is often applied to short, well-
formatted texts, such as classifying short comments
and summaries of academic articles or emails.
This study seeks to assess and compare different Large Language Models (LLMs) with respect to their effectiveness in categorizing annual accountability documents,
particularly those received electronically. The objec-
tive is to identify the most suitable LLM for this task
while also aiming to improve natural language pro-
cessing systems for applications in the organization
and classification of texts in Portuguese.
1.1 Contributions
Research on using large language models (LLMs)
for text classification has primarily focused on En-
glish, with limited resources and studies available for
Portuguese, especially when dealing with long docu-
ments. Portuguese, as a Romance language with rich
morphology and syntactic complexity, poses unique
challenges for natural language processing (NLP).
These challenges, combined with the relative scarcity
of annotated datasets and tools for Portuguese com-
pared to English, make this research particularly valu-
able.
From an academic perspective, this study bridges
a significant gap in NLP research by exploring the ef-
fective use of LLMs for classifying long-form doc-
uments in Portuguese. This contribution broadens
the applicability of state-of-the-art NLP techniques
to less studied languages, facilitating advancements
in linguistic resource utilization and model adapta-
tion. Furthermore, the focus on the Brazilian con-
text strengthens NLP research in a region with dis-
tinct linguistic and cultural characteristics, paving the
way for applications ranging from sentiment analysis
to information retrieval in Portuguese-speaking envi-
ronments.
From a practical standpoint, this research en-
hances the efficiency of annual accountability pro-
cedures for governmental bodies and managers in
Brazil. By automating the classification of documents
submitted for accountability purposes, it reduces the
workload associated with manual reviews and pro-
motes transparency and reliability in public adminis-
tration.
The remainder of this paper is organized into five
sections. Section 2 reviews related work on docu-
ment classification. Section 3 details the methodol-
ogy, covering document acquisition, model selection,
and evaluation processes. Section 4 presents and dis-
cusses the performance of the models based on key
metrics. Finally, Section 5 offers concluding remarks
and suggestions for future research.
2 RELATED WORK
Document classification has advanced significantly
with the use of transformer-based models. Adhikari
et al. (2019a) pioneered BERT for this task, propos-
ing KD-LSTM_reg, a Knowledge Distillation LSTM model, which achieved an F1-score of 88.9% ± 0.5% on the Reuters dataset. This outperformed the previous state-of-the-art LSTM_reg model by Adhikari et al. (2019b), which scored 87% ± 0.5%.
studies primarily focused on short documents, aver-
aging 175 words.
For longer documents, Wan et al. (2019) proposed
dividing documents into segments before classifica-
tion, achieving an F1-score of up to 98.2% in multi-
label classification. In the legal domain, Song et al.
(2022) introduced POSTURE50K, a legal multi-label
dataset, and a domain-specific pre-trained model with
a label attention mechanism, achieving a micro F1-
score of 81.2% and a macro F1-score of 27.6%.
Peña et al. (2023) explored multi-label classifica-
tion of Spanish public documents using RoBERTa,
training separate models for each class. The highest
sensitivity achieved was 93.07% using an SVM clas-
sifier. While these works demonstrate progress, they
differ from this study, which focuses on single-label
classification of long Portuguese accountability doc-
uments using multilingual models. Despite the lack
of direct comparability, this work achieves competi-
tive results, surpassing previous metrics in document
classification tasks.
3 MATERIALS AND METHOD
This section outlines the methodology for evaluating
the performance of LLMs in this study, summarized
in Figure 1. Each step is detailed below.
3.1 Document Acquisition
A total of 19,853 documents were collected from the Tribunal de Contas do Estado do Maranhão (TCE-MA) database via the e-PCA system. These PDF documents were converted to text using the pypdfium2 library (https://pypi.org/project/pypdfium2/). A sanitation phase removed corrupted files, duplicates, and documents with extraction issues to ensure dataset quality.
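For illustration, the conversion step can be sketched with the public pypdfium2 API as follows; the file name and the page loop are assumptions, since the paper does not show its extraction code.

import pypdfium2 as pdfium

def extract_text(pdf_path: str) -> str:
    """Concatenate the raw text of all pages of a PDF document."""
    pdf = pdfium.PdfDocument(pdf_path)
    parts = []
    for page in pdf:
        textpage = page.get_textpage()
        parts.append(textpage.get_text_range())  # full text of one page
    return "\n".join(parts)

raw_text = extract_text("prestacao_de_contas.pdf")  # hypothetical file name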
Figure 1: Methodology steps.
3.2 Preprocessing
Text extracted from PDFs underwent preprocessing
to reduce noise from the conversion process. Algo-
rithm 1 outlines the steps: removing irrelevant spe-
cial tokens, repeated characters, and excessive blank
or special characters. Parameters were empirically set
to balance noise reduction and information preserva-
tion.
Input: Text to be preprocessed.
Output: Preprocessed text.
1 Remove special tokens with no relevance to
the text;
2 Remove repeated characters;
3 Shorten blank character sequences;
4 Shorten special character sequences;
Algorithm 1: Text preprocessing.
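As an illustration only, Algorithm 1 can be sketched with regular expressions as below; the concrete patterns and the repetition threshold are assumptions, since the paper states only that its parameters were set empirically.

import re

def preprocess(text: str, max_repeat: int = 3) -> str:
    # Step 1: remove special tokens with no relevance (pattern is an assumption).
    text = re.sub(r"\(cid:\d+\)|\x0c", " ", text)
    # Step 2: collapse long runs of a repeated character (e.g. '---------').
    text = re.sub(rf"(.)\1{{{max_repeat},}}", lambda m: m.group(1) * max_repeat, text)
    # Step 3: shorten blank-character sequences.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Step 4: shorten sequences of special characters.
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    return text.strip()

print(preprocess("Saldo ------------ 1.234,56\n\n\n\nTotal"))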
Figure 2 illustrates the preprocessing impact on a
budget balance sheet document. Noise, such as re-
peated dashes, was removed to ensure the model re-
ceives relevant information.
Figure 2: Budget balance sheet document. (a) PDF document. (b) Extracted text. (c) Preprocessed extracted text.

The final document types for classification comprised: (DCASP) Budget balance sheet, (DCASP) Financial statement, (DCASP) Balance sheet, (DCASP) Statement of changes in equity, (DCASP) Statement of changes in assets, (DCASP) Cash flow statement, (DCASP) Explanatory notes, Audit report and certificate (including the opinion of the head of the internal control body), Detailed management report, Bank statements and reconciliations, and Letter to TCE/MA.
3.3 LLMs
Three LLMs were selected for their multilingual ca-
pabilities and performance on Portuguese tasks:
mBERT. Uses WordPiece tokenization, known
for strong contextual embeddings (Devlin et al.,
2019).
XLM-RoBERTa. Employs SentencePiece tok-
enization, excelling in cross-lingual tasks (Con-
neau et al., 2020).
mT5. Utilizes SentencePiece tokenization, ideal
for sequence-to-sequence tasks (Xue et al., 2021).
The three models, mBERT, XLM-RoBERTa, and mT5, were selected based on their proven perfor-
mance in multilingual NLP tasks, their effectiveness
in handling Portuguese, and their accessibility for
this study. Each model represents a distinct archi-
tecture (e.g., encoder-only, encoder-decoder) and to-
kenization approach, offering a diverse set of meth-
ods for comparison. While other models, such as
GPT-based architectures, were considered, they were
excluded due to computational constraints, licensing
limitations, or their focus on tasks outside the scope
of document classification.
To optimize the training and validation processes,
all documents were preprocessed and tokenized be-
fore model training. The preprocessed and tokenized
documents were saved to disk, reducing computa-
tional overhead by eliminating the need for tokeniza-
tion during training.
Figure 3 illustrates the classification process. The
text is first extracted from PDF documents, followed
by preprocessing to remove noise. The preprocessed
text is then tokenized, and the first 512 tokens are fed
into the LLM. The LLM extracts latent information
and outputs a vector representation of the document.
This output is passed to a classification module, which
consists of a Multilayer Perceptron (MLP). The MLP,
combined with a softmax function, generates 11 prob-
abilities, each corresponding to one of the final doc-
ument classes. This step-by-step process forms the
core of the document classification methodology in
this study.
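A minimal sketch of this inference path using the Hugging Face transformers library is shown below; the checkpoint name is an assumption, and the library's standard sequence-classification head stands in for the MLP described above.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "xlm-roberta-base"  # assumption: the paper does not name checkpoint sizes

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=11)

text = "BALANÇO ORÇAMENTÁRIO ..."  # stand-in for a preprocessed document
inputs = tokenizer(text, truncation=True, max_length=512,  # keep the first 512 tokens
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, 11)
probs = torch.softmax(logits, dim=-1)      # one probability per document class
predicted_class = int(probs.argmax(dim=-1))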
3.4 Data Division
Tokenized documents were split into development
(80%) and test (20%) sets, maintaining class propor-
tions. The development set was used for hyperparam-
eter optimization and cross-validation, while the test
set was reserved for final validation. Figure 4 illus-
trates the data division process.
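A minimal sketch of this stratified split with scikit-learn, using toy stand-ins for the documents and labels; the random seed is an assumption, as the paper does not report one.

from sklearn.model_selection import train_test_split

texts = ["doc a", "doc b", "doc c", "doc d"] * 10  # toy stand-ins for tokenized documents
labels = [0, 1, 0, 1] * 10                         # toy stand-ins for the class ids

dev_texts, test_texts, dev_labels, test_labels = train_test_split(
    texts, labels,
    test_size=0.20,     # 20% reserved for final testing
    stratify=labels,    # preserve class proportions in both splits
    random_state=42,    # assumed seed
)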
Figure 3: Model execution.

Figure 4: Data division.

3.5 Hyperparameter Optimization

The development set was further split (20% training, 20% validation) for hyperparameter optimization using the Tree-structured Parzen Estimator (TPE) algorithm and the Optuna library from Akiba et al. (2019). The macro F1-score was optimized over 10 trials of 5 epochs each, tuning:

1. Learning rate (1 × 10⁻⁵ to 1 × 10⁻⁴),
2. Weight decay (0.0 to 0.1),
3. Warm-up ratio (0.0 to 0.1).
The AdamW optimizer parameters β₁ and β₂ were fixed at the values recommended by Loshchilov and Hutter (2019), since tuning them yielded poor results.
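The search can be expressed with Optuna's TPE sampler as in the sketch below; train_and_validate is a hypothetical placeholder for five epochs of fine-tuning, and the log-uniform scale for the learning rate is an assumption.

import optuna

def train_and_validate(lr: float, weight_decay: float, warmup_ratio: float) -> float:
    """Hypothetical stand-in: fine-tune for 5 epochs, return validation macro F1."""
    return 0.0  # placeholder; replace with the real training loop

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.1)
    return train_and_validate(lr, weight_decay, warmup_ratio)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=10)  # 10 trials, as in the paper
print(study.best_params)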
3.6 Cross-Validation
Using the optimized hyperparameters and the devel-
opment dataset (80% of the total data), stratified 5-
fold cross-validation was performed. Each fold di-
vided the development set into 80% training and 20%
validation. Training ran for up to 15 epochs per fold,
with early stopping triggered if the macro F1-score
on the validation set did not improve for 5 consecu-
tive epochs. The best-performing model weights from
each fold, based on F1-score, were saved for further
evaluation.
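A sketch of this protocol with scikit-learn's StratifiedKFold is given below, reusing the dev_texts and dev_labels stand-ins from the split sketch above; train_epoch_and_score is a hypothetical placeholder for one epoch of fine-tuning plus validation.

from sklearn.model_selection import StratifiedKFold

def train_epoch_and_score(train_idx, val_idx) -> float:
    """Hypothetical stand-in: one epoch of fine-tuning, returns validation macro F1."""
    return 0.0  # placeholder; replace with the real training loop

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # assumed seed
for fold, (train_idx, val_idx) in enumerate(skf.split(dev_texts, dev_labels)):
    best_f1, stale_epochs = 0.0, 0
    for epoch in range(15):                 # at most 15 epochs per fold
        f1 = train_epoch_and_score(train_idx, val_idx)
        if f1 > best_f1:
            best_f1, stale_epochs = f1, 0   # new best: these weights would be saved
        else:
            stale_epochs += 1
            if stale_epochs >= 5:           # early stopping after 5 stale epochs
                break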
3.7 Evaluation of Results
Five model configurations from cross-validation were
tested on the test set. Metrics included loss, accu-
racy, macro F1-score, ROC AUC, precision, and re-
call, chosen for their relevance in classification tasks
with imbalanced data according to Opitz (2022). Re-
sults are presented as mean ± standard deviation.
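All six metrics are available in scikit-learn; the sketch below computes them on toy three-class predictions, using macro averaging and one-vs-rest ROC AUC as assumed readings of the evaluation setup.

import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score, roc_auc_score)

# Toy stand-ins: gold labels and per-class probabilities from a 3-class model.
y_true = np.array([0, 1, 2, 1, 0, 2])
y_prob = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7],
                   [0.3, 0.5, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
y_pred = y_prob.argmax(axis=1)

results = {
    "loss":      log_loss(y_true, y_prob),
    "accuracy":  accuracy_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred, average="macro"),
    "roc_auc":   roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    "precision": precision_score(y_true, y_pred, average="macro"),
    "recall":    recall_score(y_true, y_pred, average="macro"),
}
print(results)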
4 RESULTS AND DISCUSSION
This section presents the experimental results and
analysis of the models evaluated in this study. The
dataset, preprocessing, and model performance are
discussed, with a focus on key metrics and findings.
4.1 Dataset Analysis
From the original 19,853 documents retrieved from
the TCE/MA database, 11,747 documents across 11
categories were retained after preprocessing. Table 1
shows the distribution of document types, with classes
such as budget balance sheets, financial statements,
and audit reports. The Fisher-Pearson skewness co-
efficient (0.023) indicates minimal class imbalance.
Table 2 provides statistics on sequence counts, sizes,
and document pages, while Table 3 lists the most
frequent sequences, reflecting accountability-related
terms. Figure 5 visualizes these sequences in a word
cloud.
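As a sketch, the coefficient can be computed with scipy.stats.skew; the paper does not state which variable it was computed over, so taking the skew of the per-document class ids, as below, is an assumption.

import numpy as np
from scipy.stats import skew

counts = [1225, 1218, 1215, 1206, 1218, 963, 694, 798, 943, 1135, 1132]
labels = np.repeat(np.arange(11), counts)  # one class id per retained document
print(round(float(skew(labels)), 3))       # Fisher-Pearson skewness coefficient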
4.2 Model Performance
Hyperparameter optimization was performed using
the TPE algorithm, with results shown in Figure 6.
Table 4 lists the optimal hyperparameters for each
model. Five-fold cross-validation was employed, and
the best-performing weights from each fold were used
for final evaluation.
Table 1: Document classes.

#    Name                                                                                   Qt.      %
1    (DCASP) Budget balance sheet                                                           1,225    10.43%
2    (DCASP) Financial statement                                                            1,218    10.37%
3    (DCASP) Balance sheet                                                                  1,215    10.34%
4    (DCASP) Statement of changes in equity                                                 1,206    10.27%
5    (DCASP) Statement of changes in assets                                                 1,218    10.37%
6    (DCASP) Cash flow statement                                                            963      8.20%
7    (DCASP) Explanatory notes                                                              694      5.91%
8    Audit report and certificate, with opinion of the head of the internal control body   798      6.79%
9    Detailed management report                                                             943      8.03%
10   Bank statements and reconciliations                                                    1,135    9.66%
11   Letter to TCE/MA                                                                       1,132    9.64%
-    Total                                                                                  11,747   100%
Table 2: Document statistics.

Statistic            Sequences per document   Size of a sequence   Pages per document
Lowest               14                       1                    1
Median               398                      7                    3
Greatest             1,218,778                128                  9,554
Mode                 424                      5                    1
Average              7,321.88                 7.76                 67.40
Standard deviation   36,412.81                3.38                 337.91
Figures 7, 8, 9, 10, 11, and 12 show the train-
ing and validation metrics for loss, F1-score, accu-
racy, ROC AUC, precision, and recall, respectively.
XLM-RoBERTa consistently outperformed mBERT
and mT5 across most metrics, achieving the highest
F1-score (99.21% ± 0.16%) and accuracy (99.22% ±
0.20%) on the validation set. mT5 achieved the high-
est ROC AUC (99.93% ± 0.04%), but the difference
with XLM-RoBERTa was minimal.
Tables 5 and 6 summarize the validation and test
set metrics. XLM-RoBERTa achieved the best results
across most metrics, with an F1-score of 98.99% ±
0.12% on the test set. The confusion matrix in Figure
13 highlights that class 9 (Bank statements and rec-
onciliations) had the highest error rate (2.64%), but
overall performance remained strong.
Table 3: Most frequent sequences (size 3).

Sequence        Quantity    Sequence    Quantity
saldo           1,912,656   enviada     472,248
conta           1,053,841   aplicação   431,267
valor           768,908     atual       429,039
com             741,795     municipal   416,304
extrato         569,844     mês         366,361
ted             569,396     cota        346,847
banco           563,153     por         326,218
transferência   548,584     transf      318,534
para            492,926     referente   307,035
anterior        472,681     ano         305,332
Figure 5: Word cloud of most frequent sequences.
Table 4: Optimal hyperparameters.

Model         Learning rate   Weight decay   Warm-up ratio
mBERT         4.95 × 10⁻⁵     0.063          0.045
XLM-RoBERTa   5.42 × 10⁻⁵     0.097          0.081
mT5           7.46 × 10⁻⁵     0.034          0.028
In conclusion, XLM-RoBERTa demonstrated su-
perior performance, achieving the highest F1-score
and accuracy, while mT5 excelled in ROC AUC. All
models performed exceptionally well, with mBERT
achieving an F1-score of 98.65% ± 0.29% on the test
set, underscoring their effectiveness in document clas-
sification.
5 CONCLUSIONS
This study presented an in-depth analysis of the effectiveness and performance of advanced language models, known as Large Language Models (LLMs), on the specific task of classifying documents related to the accountability of managers linked to the Tribunal de Contas do Estado do Maranhão. The proposal sought to evaluate how these models, trained at large scale to understand and generate natural text, can optimize and improve the automation of the analysis and categorization of accounting and financial documents submitted to the court.
After analyzing the data and implementing the
strategies outlined in the research project, the study
identified that the XLM-RoBERTa model is the most
suitable for the document classification task since this
model achieved an F1-score of 98.99% ± 0.12% on
the test dataset.
When validating and testing the models discussed in this research, all of them showed favorable results on the metrics explored, underscoring the suitability of LLMs for document classification.
Figure 6: Hyperparameter optimization.

Figure 7: Loss over epochs.

Throughout this research, its first specific objective was achieved. The collection and
organization of this information provided a solid ba-
sis for subsequent analyses, allowing for a systematic
approach to applying LLMs for classifying these doc-
uments. The availability of a representative data set
proved crucial to achieving the objectives set out in
this research, giving validity and solidity to the pro-
posed evaluation processes.
Continuing with the other specific objectives of
the research, the outlined goal of using these models
in the document categorization process was achieved,
demonstrating the ability of these advanced technolo-
gies to deal with the complexity inherent in the data
contained in the documents analyzed.
Experiments to evaluate the performance of each
LLM in classifying documents yielded significant re-
sults. The analysis covered several metrics, including
accuracy, F1-score, ROC AUC, precision, and sensi-
tivity, providing a broad and accurate assessment of
each model’s performance.
The analysis of the results obtained, the central
target of this research, corroborates the achievement
of the established objectives. This crucial stage vali-
dated the approaches adopted, contributing to an un-
derstanding of the role and potential of LLMs in docu-
ment analysis and categorization. This analytical pro-
cess represents not only a conclusive closure to this
research but also opens doors for future investigations
and the practical application of this knowledge in the
wider context of document management and natural language processing technology.
Table 5: Validation set metrics.
Model Loss F1-score Accuracy ROC AUC Precision Recall
mBERT 7.31% ± 2.41% 98.81% ± 0.35% 98.89% ± 0.35% 99.88% ± 0.06% 98.83% ± 0.34% 98.80% ± 0.36%
mT5 8.51% ± 2.46% 98.63% ± 0.72% 98.71% ± 0.61% 99.93% ± 0.04% 98.73% ± 0.56% 98.58% ± 0.82%
XLM-RoBERTa 5.48% ± 1.64% 99.21% ± 0.16% 99.22% ± 0.20% 99.91% ± 0.07% 99.24% ± 0.16% 99.18% ± 0.17%
Table 6: Test set metrics.
Model Loss F1-score Accuracy ROC AUC Precision Recall
mBERT 9.08% ± 1.80% 98.65% ± 0.29% 98.69% ± 0.29% 99.90% ± 0.05% 98.67% ± 0.27% 98.65% ± 0.30%
mT5 8.12% ± 2.13% 98.71% ± 0.75% 98.75% ± 0.67% 99.90% ± 0.01% 98.75% ± 0.67% 98.70% ± 0.80%
XLM-RoBERTa 6.53% ± 0.77% 98.99% ± 0.12% 98.99% ± 0.12% 99.94% ± 0.02% 98.98% ± 0.12% 99.01% ± 0.12%
Figure 8: F1-scores over epochs.
Figure 9: Accuracies over epochs.
Figure 10: ROC AUC over epochs.
Figure 11: Precisions over epochs.
The research is crucial to understanding the po-
tential of these models in the area of auditing and in-
spection, contributing to the efficiency and effective-
ness of the procedures for evaluating accountability in
public management.
Figure 12: Recalls over epochs.
Figure 13: XLM-RoBERTa confusion matrix.
For further research, this study suggests several
approaches to improve the understanding and applica-
tion of LLMs in document classification. Firstly, we
recommend evaluating more robust and recent LLMs
with billions of parameters to explore the potential of
these models on an even broader scale. Furthermore,
contemporary techniques such as Document Image
Classification (DIC) should be incorporated, which
goes beyond textual analysis by classifying document
pages as images, thus broadening the analysis per-
spectives. In addition, the research suggests consider-
ing not only the textual content but also the structure
of the text, including elements such as location and
style, as criteria for classifying documents. This more
comprehensive approach seeks to enrich the under-
standing of the performance and capabilities of LLMs
in more diverse and challenging contexts.
ACKNOWLEDGMENTS
The authors acknowledge the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil - Finance Code 001, the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil, the Fundação de Amparo à Pesquisa e ao Desenvolvimento Científico e Tecnológico do Maranhão (FAPEMA), Brazil, and the Tribunal de Contas do Estado do Maranhão (TCE-MA) for the financial support.
During the preparation of this work the authors
used ChatGPT in order to enhance the flow of the text
and DeepL as a translation assistant to improve flu-
ency. After using these tools, the authors reviewed
and edited the content as needed and take full respon-
sibility for the content of the publication.
REFERENCES
Adhikari, A., Ram, A., Tang, R., and Lin, J. (2019a).
DocBERT: BERT for document classification.
Adhikari, A., Ram, A., Tang, R., and Lin, J. (2019b).
Rethinking complex neural network architectures for
document classification. In Proceedings of the 2019
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Pa-
pers), pages 4046–4051, Minneapolis, Minnesota. As-
sociation for Computational Linguistics.
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M.
(2019). Optuna: A next-generation hyperparameter
optimization framework.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V.,
Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettle-
moyer, L., and Stoyanov, V. (2020). Unsupervised
cross-lingual representation learning at scale. In Juraf-
sky, D., Chai, J., Schluter, N., and Tetreault, J., editors,
Proceedings of the 58th Annual Meeting of the As-
sociation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Linguis-
tics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Burstein,
J., Doran, C., and Solorio, T., editors, Proceedings
of the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and
Short Papers), pages 4171–4186, Minneapolis, Min-
nesota. Association for Computational Linguistics.
Khurana, D., Koli, A., Khatter, K., and Singh, S. (2023).
Natural language processing: state of the art, current
trends and challenges. Multimedia Tools and Applica-
tions, 82(3):3713–3744.
Lenza, P. (2020). Direito constitucional esquematizado. Saraiva, São Paulo, 15. ed. rev. atual. ampl. edition.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight
decay regularization. In International Conference on
Learning Representations.
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Us-
man, M., Akhtar, N., Barnes, N., and Mian, A. (2023).
A comprehensive overview of large language models.
Opitz, J. (2022). From bias and prevalence to macro F1, kappa, and MCC: A structured overview of metrics for multi-class evaluation.
Peña, A., Morales, A., Fierrez, J., Serna, I., Ortega-Garcia, J., Puente, I., Córdova, J., and Córdova, G. (2023).
Leveraging large language models for topic classifica-
tion in the domain of public affairs.
Song, D., Vold, A., Madan, K., and Schilder, F. (2022).
Multi-label legal document classification: A deep
learning-based approach with label-attention and
domain-specific pre-training. Inf. Syst., 106(C).
Stites, M. C., Howell, B. C., and Baxley, P. A. (2023).
Assessing the impact of automated document classi-
fication decisions on human decision-making. Tech-
nical report, Sandia National Lab.(SNL-NM), Albu-
querque, NM (United States).
TCE/MA (2023a). e-PCA - Sistema de prestação de contas anual eletrônica.
TCE/MA (2023b). Instrução Normativa TCE/MA 52, de 25 de outubro de 2017.
TCE/MA (2023c). Sistema de prestação de contas anual eletrônica (e-PCA) já está disponível aos usuários.
Wan, L., Papageorgiou, G., Seddon, M., and Bernardoni, M.
(2019). Long-length legal document classification.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou,
R., Siddhant, A., Barua, A., and Raffel, C. (2021).
mT5: A massively multilingual pre-trained text-to-
text transformer. In Toutanova, K., Rumshisky,
A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I.,
Bethard, S., Cotterell, R., Chakraborty, T., and Zhou,
Y., editors, Proceedings of the 2021 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 483–498, Online. Association for Compu-
tational Linguistics.