Using Large Language Models to Support the Audit Process in the
Accountability of Interim Managers in Notary Offices
Myke Valadão, Natalia Freire, Mateus de Paula, Lucas Almeida and Leonardo Marques
SiDi Intelligence & Innovation Center, Manaus, Brazil
{m.douglas, natalia.f, mp.silva, l.dasilva, lc.marques}@sidi.org.br
Keywords:
Auditing, Large Language Models, Fraud Detection, Notary Offices, Resource Management, AI.
Abstract:
The auditing process in notary offices in Brazil is hindered by inefficiencies, high costs, and the complexity
of manual procedures. To address these challenges, we propose a system that leverages the capabilities of
Large Language Models (LLMs), specifically LLaMA2-7B and Falcon-7B, to automate critical information
extraction from diverse document types. The system detects anomalous monetary values and unauthorized
services, linking them to corresponding dates and beneficiaries to provide a detailed overview of financial
discrepancies. Integrating advanced Natural Language Processing (NLP) techniques into auditing workflows
enhances fraud detection, reduces operational costs, and improves accuracy. With a BLEU score above
0.67, the proposed system demonstrates significant potential to streamline auditing operations. Key benefits
include assisting court analysts in identifying fraud cases, optimizing public resource management by elimi-
nating unjustified expenses, and potentially increasing court revenues to reinvest in public services.
1 INTRODUCTION
The audit process in notary offices presents chal-
lenges that significantly impact efficiency, financial
sustainability, and accountability. One of the pri-
mary issues is the high financial cost associated with
manual auditing procedures. Auditors must manu-
ally examine and cross-reference large volumes of
documents, which is both time-consuming and labor-
intensive. Furthermore, the diversity of document
types, including contracts, deeds, invoices, and re-
ceipts, adds complexity to the process. Each docu-
ment type may have a unique structure, format, and
set of relevant data points, making it challenging to
apply a standardized approach. Additionally, these
manual processes make it harder to detect and pre-
vent fraudulent activities, such as falsified reports, su-
perfluous charges, and unauthorized or prohibited ser-
vices (Santana et al., 2024; Simunic, 1980). This re-
liance on manual methods increases the likelihood of
human error and limits the scalability of auditing op-
erations, ultimately undermining the transparency and
effectiveness of interim managers in notary offices.
To address these issues, integrating LLMs offers
a transformative solution. With their advanced NLP
capabilities, LLMs can automate and streamline the
extraction and analysis of critical information from
diverse document types. By doing so, they can sig-
nificantly reduce the time and cost associated with
manual audits. Moreover, LLMs are adept at iden-
tifying anomalies, such as inconsistencies in financial
data or suspicious activities, which can aid in fraud
detection and prevention. These models also ensure
greater accuracy and consistency in data processing,
enabling interim managers to generate more reliable
reports and make informed decisions. Ultimately, us-
ing LLMs enhances the overall audit process, improv-
ing efficiency and accountability in notary offices.
LLMs are advanced neural networks designed to
understand, generate, and analyze human language
(Karanikolas et al., 2023). They are trained on vast
amounts of text data, ranging from books and arti-
cles to legal and financial documents (Kumar, 2024).
Through this training, LLMs learn to recognize pat-
terns, relationships, and contextual nuances in lan-
guage, enabling them to perform various tasks such
as text summarization, information extraction, and
anomaly detection. Examples of widely known LLMs
include OpenAI’s GPT series, Meta’s LLaMA, and
Google’s Gemini. These models are particularly valu-
able in notary office settings, where they can rapidly
process large datasets with high precision. LLMs en-
able real-time auditing and reporting by extracting
key information from complex documents, such as
dates, monetary values, and descriptions. Addition-
ally, they assist in generating summaries, translating
legal terminology, and flagging potential discrepan-
cies, thus providing comprehensive support through-
out the audit process.
Building on these capabilities, we propose a sys-
tem leveraging the LLaMA2-7B and Falcon-7B mod-
els to automate the critical information extraction
from notary office documents. This system is de-
signed to identify anomalous monetary values and
unauthorized services by analyzing various printed
documents. Once anomalies are detected, the system
associates these irregularities with the corresponding
dates and beneficiaries, providing a comprehensive
overview of financial discrepancies. By integrating
these LLMs into the auditing workflow, our proposal
aims to enhance the detection of fraudulent activities
and streamline the accountability of interim managers
in notary offices, ultimately improving transparency
and operational efficiency. The proposed method
achieved a BLEU score above 0.67, demonstrat-
ing the system’s effectiveness in accurately extracting
critical financial information and its potential to en-
hance the auditing process in notary offices. Addi-
tionally, the system enhances auditing by automating
financial analysis, enabling analysts to focus on fraud
detection, optimizing public resource allocation, and
increasing judicial revenue for reinvestment in public
services.
The remainder of this paper is organized as fol-
lows: Section 2 presents the fundamental concepts
necessary for understanding this research, along with
related work. Section 3 outlines the AI methods em-
ployed to support the audit process discussed in this
paper. Section 4 details the experiments we carried
out, and the results obtained from the AI methods.
Finally, Section 5 concludes the paper with final re-
marks and directions for future work.
2 BACKGROUND AND RELATED
WORK
2.1 Notary Offices and Managers
Extrajudicial notary offices play a vital role in the le-
gal system, handling civil registration, deeds, real es-
tate documentation, and signature notarization. These
services are provided by tenured notaries, but vacan-
cies may arise due to unforeseen events, requiring
a temporary appointment to ensure continuity. The
interim notary, designated by a competent authority
such as the General Inspectorate of Justice or a dis-
trict judge, assumes responsibility for managing the
office during the vacancy.
The interim officer’s duties include overseeing ad-
ministrative operations, executing notarial acts, en-
suring compliance with legal norms, reporting finan-
cial accountability, and fulfilling judicial orders. Our
focus is specifically on accountability, where we ex-
plore how LLMs can assist court analysts in detecting
anomalies in financial reports submitted by interim
officers. By leveraging LLM-based anomaly detec-
tion, we aim to enhance oversight efficiency and ac-
curacy.
2.2 Related Work
Devlin (2018) introduced BERT (Bidirectional En-
coder Representations from Transformers), a power-
ful transformer-based model that revolutionized the
field of NLP. BERT’s architecture leverages bidirec-
tional context, enabling it to understand a word’s
meaning based on its preceding and following con-
text. This approach has been successfully applied to
various NLP tasks, including document classification
and information extraction. It is highly relevant to au-
diting processes where extracting specific data points
from diverse document types is crucial. Despite its
impressive performance, BERT’s application in finan-
cial auditing and anomaly detection remains underex-
plored, particularly in adapting the model for domain-
specific challenges like notary office document anal-
ysis.
Rudžionis et al. (2022) present a sophisticated
approach to detecting irregular financial transactions
by applying NLP techniques. The researchers fo-
cus on accountant comments within ledger entries,
which often contain informal yet insightful annotations.
By leveraging cosine similarity and term frequency-
inverse document frequency (TF-IDF) methods, they
analyze the semantic content of these comments to
identify outlier financial operations. Their experi-
ments, conducted on data from Dutch companies, re-
vealed that only 0.3% of transactions were flagged for
further review as potentially suspicious, substantially
reducing the workload for human auditors. However,
the study notes that the model’s performance may
vary depending on the quality and availability of com-
ments, highlighting a need for further refinement and
application across diverse datasets.
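Their pipeline is not released as code; a minimal sketch of the core idea, assuming scikit-learn's TF-IDF vectorizer and a hypothetical similarity threshold, could look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_outlier_comments(comments, threshold=0.1):
    """Flag accountant comments whose average cosine similarity to the
    rest of the corpus is low, i.e., semantically unusual annotations.
    The threshold value is a hypothetical placeholder."""
    tfidf = TfidfVectorizer().fit_transform(comments)
    sims = cosine_similarity(tfidf)
    # Exclude self-similarity (the diagonal of ones) from the average.
    mean_sim = (sims.sum(axis=1) - 1.0) / (len(comments) - 1)
    return [c for c, s in zip(comments, mean_sim) if s < threshold]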
Beltran (2023) explores how NLP can enhance the
extraction of critical fiscal data from audit reports,
particularly in the context of subnational supreme au-
dit institutions (SAIs) in Mexico. The research fo-
cuses on audit reports from Sinaloa’s SAI, systemat-
ically converting scanned documents into machine-
readable text using Optical Character Recognition
(OCR). It then employs a text classification model
to filter relevant content and a named entity recog-
nition (NER) system to extract monetary values asso-
ciated with budget discrepancies. The paper under-
scores the importance of leveraging machine learning
for large-scale document analysis, addressing ineffi-
ciencies and accessibility issues in traditional audit-
ing. While the approach is promising, potential areas
for further research include expanding the methodol-
ogy to audits from different regions and improving the
handling of OCR errors for non-English languages.
Fisher et al. (2016) provide an extensive review of NLP
applications in accounting, auditing, and finance, highlighting
its use for classification, information retrieval, fraud
detection, and financial predictions. The authors em-
phasize the interdisciplinary nature of NLP and its
potential to transform traditional financial and audit
processes. They discuss various computational meth-
ods to extract insights from unstructured textual data,
including text mining, sentiment analysis, and ma-
chine learning. Despite progress, the paper identifies
gaps in the literature, such as the need for more re-
fined tools for taxonomy generation and integrating
NLP with continuous auditing systems. These areas
present opportunities for further exploration, partic-
ularly in enhancing the automation and accuracy of
financial document analysis.
3 METHODS
This section describes our approach to using LLMs to extract
crucial information from documents, assisting interim managers
in notary offices.
3.1 System Model
The role of interim managers in notary offices is com-
plex, requiring them to process large volumes of legal,
administrative, and financial documents. Manual data
extraction from these documents is time-consuming
and error-prone, particularly when dealing with un-
structured or semi-structured formats like contracts,
deeds, and invoices. Ensuring accuracy and consis-
tency is critical, as errors may lead to legal or financial
discrepancies. Additionally, time-sensitive deadlines
demand an efficient system for handling large-scale
document processing.
To address these challenges, we propose leverag-
ing LLMs for automated information extraction from
printed document images. Advanced NLP models
like GPT and BERT (Zheng et al., 2021) can effi-
ciently process structured text, mitigating the com-
plexities of handwriting recognition, which suffers
from high variability, poor resolution, and segmenta-
tion issues (Alhamad et al., 2024; Ingle et al., 2019).
By focusing on printed documents, we enhance ac-
curacy and efficiency, enabling the identification of
outlier values along with relevant dates, descriptions,
and creditors.
As illustrated in Figure 1, our system consists of
four key components: preprocessing, OCR, LLM-
based extraction, and performance evaluation. Pre-
processing enhances document images, while OCR
converts them into machine-readable text. The LLM
then extracts and organizes critical data points, sup-
porting legal and financial reporting. Finally, an eval-
uation step ensures accuracy and consistency, stream-
lining document analysis in notary offices.
3.2 Preprocessing
In the preprocessing stage, the following steps are per-
formed:
1. Binarization: Converts the input image to a bi-
nary format, enhancing contrast and eliminating
noise for improved OCR accuracy. This step is
crucial for unstructured or poorly scanned docu-
ments (Saini, 2015).
2. Alignment: Corrects misaligned text by evaluating skew
angles θ within a predefined range. A scoring function S(θ)
determines the best alignment, and the document is rotated
accordingly to produce a corrected version I′, improving OCR
and LLM performance (de Elias et al., 2019). A minimal deskew
sketch follows this list.
3. Printed or Handwritten Classification: A CNN-based
classifier distinguishes between printed and handwritten text.
The model extracts spatial features and assigns a probability p
of a document being printed. If p ≥ τ, it follows the OCR
pipeline; otherwise, it is flagged for specialized processing or
manual review (Alhamad et al., 2024; Ingle et al., 2019).
4. Text Line Detection: The CRAFT model detects
and segments text lines by identifying charac-
ter regions and linking them into coherent struc-
tures. Given an input image I′, CRAFT generates bounding
boxes B = {b_1, b_2, ..., b_n} representing detected text
lines, refining overlapping regions and filtering noise. This
ensures accurate text extraction while preserving document
structure, making it essential for handling diverse formats and
layouts (Baek et al., 2019).
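The paper does not specify the scoring function S(θ); a minimal deskew sketch for the alignment step, assuming a projection-profile score (the variance of row sums peaks when text lines are horizontal) and the ±5° search range listed in Table 2:

import numpy as np
from scipy.ndimage import rotate

def alignment_score(binary_img, theta):
    """S(theta): variance of the horizontal projection profile,
    which is maximal when text lines are horizontal."""
    rotated = rotate(binary_img, theta, reshape=False, order=0)
    return np.var(rotated.sum(axis=1))

def deskew(binary_img, max_angle=5.0, step=0.25):
    """Search theta in [-max_angle, max_angle] and rotate the page
    by the best-scoring angle to obtain the corrected image I'."""
    angles = np.arange(-max_angle, max_angle + step, step)
    best = max(angles, key=lambda t: alignment_score(binary_img, t))
    return rotate(binary_img, best, reshape=False, order=0)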
Figure 1: Complete system for extracting valuable information from documents.
3.3 OCR
OCR technology converts images of text into
machine-readable text, enabling automated document
processing and analysis. Tesseract OCR, an open-
source tool developed by Google, is used to extract
text from images, employing connected component
analysis to identify text regions and segment the im-
age into lines and words. It utilizes pattern matching
and statistical modeling, along with language-specific
dictionaries, to improve accuracy and correct errors.
Tesseract supports multiple languages and scripts and
can handle complex layouts, such as tables or multi-
column text, making it highly versatile for document
digitization (Hegghammer, 2022).
Tesseract works best with preprocessed images,
where noise, skew, and distortions are minimized.
The CRAFT model, which highlights text regions,
improves Tesseract’s focus on text, enhancing both
speed and accuracy (Hegghammer, 2022). The out-
put can be plain text or structured formats like HOCR
or ALTO XML, which preserve the spatial layout and
provide additional details such as word coordinates,
font size, and confidence scores. This flexibility al-
lows Tesseract to cater to a range of use cases, from
simple text extraction to complex document analysis
(Hegghammer, 2022).
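A minimal sketch of this OCR stage using the pytesseract bindings; the input file name is hypothetical, and Portuguese language data ('por') is assumed for Brazilian notary documents:

import pytesseract
from PIL import Image

image = Image.open("notary_receipt.png")  # hypothetical input file

# Plain-text extraction.
text = pytesseract.image_to_string(image, lang="por")

# HOCR output preserving the spatial layout, word coordinates, and
# per-word confidence scores.
hocr = pytesseract.image_to_pdf_or_hocr(image, lang="por",
                                        extension="hocr")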
3.3.1 Filter of Quality
Ensuring the reliability of recognized text is crucial,
especially when dealing with noisy or low-resolution
documents. To address this, our system implements a
confidence-based Filter of Quality, leveraging Tesser-
act OCR’s confidence scores, which range from 0 to
100. For each recognized token, Tesseract assigns a
confidence value, and the system calculates an aver-
age confidence score ¯c across all detected tokens in a
document segment:
$$\bar{c} = \frac{1}{N} \sum_{i=1}^{N} c_i \quad (1)$$
where $c_i$ represents the confidence of the i-th token, and $N$
is the total number of tokens. If $\bar{c} \geq 40$, the
text is considered reliable for downstream processing,
including entity recognition and monetary value ex-
traction. Otherwise, it is flagged as low confidence
and either rerouted for manual verification or sub-
jected to additional reprocessing, such as enhanced
image preprocessing or alternative OCR models.
Figure 2: Graphical representation of the confidence-based
Filter of Quality mechanism. This diagram illustrates the
workflow for calculating average confidence scores and de-
termining their suitability for further processing.
As illustrated in Figure 2, this filtering mech-
anism prevents unreliable text from contaminating
later analysis stages, such as anomaly detection or
LLM processing. The threshold τ = 40 was deter-
mined through empirical tuning, balancing the need
to retain useful text while minimizing recognition er-
rors. This approach improves workflow efficiency by
allowing a more targeted document auditing process,
ensuring that only high-confidence text proceeds for
further analysis.
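A sketch of the Filter of Quality built on pytesseract's per-token confidences, implementing the aggregation of Eq. (1); reprocessing and manual-review routing are omitted:

import pytesseract

TAU = 40  # empirically tuned threshold reported above

def mean_confidence(image):
    """Average Tesseract confidence over recognized tokens (Eq. 1).
    Tesseract reports conf = -1 for non-text boxes; those are skipped."""
    data = pytesseract.image_to_data(image,
                                     output_type=pytesseract.Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confs) / len(confs) if confs else 0.0

def passes_quality_filter(image):
    """True if the text is reliable enough for downstream LLM extraction."""
    return mean_confidence(image) >= TAU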
3.4 LLM for Extracting Information
3.4.1 Prompt Engineering
In the context of extracting critical information from
documents, the quality of the OCR output plays a piv-
otal role. However, OCR often produces outputs that
lack clarity and coherence, particularly when deal-
ing with unstructured or semi-structured documents.
These outputs can include incomplete sentences, mis-
placed characters, and inconsistent formatting, which
pose challenges for downstream tasks like informa-
tion extraction. We leverage prompt engineering to
optimize the interaction between the extracted text
and the LLM to address this (Grabb, 2023).
The designed prompt serves as a structured query
framework that enhances the LLM’s ability to com-
prehend and process the noisy OCR output (Wang
et al., 2024). Specifically, the prompt includes contex-
tual instructions and specific task directives to guide
the model in identifying and extracting key pieces
of information. This includes critical details such as
dates, monetary values, creditor names, and descrip-
tive content related to transactions or agreements. We
improve the model’s accuracy and reliability in pars-
ing unclear or fragmented text by framing the prompt
to include examples, clarifications, and explicit ex-
traction goals.
Moreover, the prompt dynamically adapts to the
input text’s structure by instructing the LLM to in-
fer missing details or correct minor inconsistencies.
This methodology enhances the precision of extracted
data and ensures that essential information is con-
sistently retrieved across diverse document formats
(Wang et al., 2024). In doing so, the prompt becomes
a critical bridge between the raw OCR output and
the structured, actionable insights required by Interim
Managers in notary offices.
Example of a possible prompt:
"You are a highly intelligent assistant
tasked with extracting specific
information from a text document. The
text may contain noise, be incomplete,
or lack formatting. Your goal is to
extract the following details:
Date: (e.g., 2023-05-14, May 14, 2023)
Monetary Value: (e.g., $1,250.00
or R$ 1.250,00)
Creditor Name: (e.g., ABC Corporation,
John Doe)
Description: (e.g., Payment for services,
Invoice for shipment)
This is the text: {Output of the OCR}"
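A sketch of how such a prompt could be submitted to the selected models through the Hugging Face transformers library; the decoding parameters below are illustrative assumptions, not values reported in the paper:

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "tiiuae/falcon-7b"  # or "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

PROMPT = ("You are a highly intelligent assistant tasked with extracting "
          "specific information from a text document. ... "
          "This is the text: {ocr_text}")

def extract_fields(ocr_text, max_new_tokens=256):
    """Fill the prompt with OCR output and return only the generated
    continuation, i.e., the extracted fields."""
    inputs = tokenizer(PROMPT.format(ocr_text=ocr_text),
                       return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)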
3.4.2 Proposed LLMs
The Falcon-7B and LLaMA2-7B models have been
selected for this study due to their advanced architec-
tures and suitability for handling unstructured data.
Falcon-7B, developed by the Technology Innovation
Institute, leverages a multi-query attention mecha-
nism and pre-normalization to ensure computational
efficiency and faster inference (HuggingFace, 2024).
In contrast, LLaMA2-7B, designed by Meta AI, in-
corporates multi-head attention and task-specific op-
timizations, offering enhanced contextual understand-
ing at the cost of higher memory usage (MetaAI,
2024). Both models utilize rotary positional em-
beddings (Almazrouei et al., 2023; Touvron et al.,
2023), ensuring stability and adaptability across di-
verse tasks. Additionally, both models were trained
using mixed precision techniques, which allow for ef-
ficient utilization of modern graphics processing unit
(GPU) resources. This complementary use of Falcon-
7B and LLaMA2-7B ensures robust performance in
extracting critical information from complex docu-
ment structures. The key architectural differences be-
tween these two models are summarized in Table 1.
Table 1: Comparison of Falcon-7B and LLaMA2-7B Architectures.

Aspect | Falcon-7B | LLaMA2-7B
Transformer Block Design | Standard transformer with RoPE (Almazrouei et al., 2023) | Enhanced transformer with RoPE and task-specific optimizations (Touvron et al., 2023)
Attention Mechanism | Multi-query attention (MQA) for efficiency (Almazrouei et al., 2023) | Grouped-query attention (GQA) with multi-head baseline (Touvron et al., 2023)
LayerNorm Placement | Pre-normalization for stability (Almazrouei et al., 2023) | Pre-normalization for efficiency (Touvron et al., 2023)
Feedforward Network (FFN) | GELU activation (Almazrouei et al., 2023) | SwiGLU activation (Touvron et al., 2023)
Parameter Efficiency | Lower computational cost (Almazrouei et al., 2023) | Task-specific adaptability (Touvron et al., 2023)
Training Precision | Mixed precision (BF16) (Almazrouei et al., 2023) | Mixed precision for GPU compatibility (Touvron et al., 2023)
Inference Optimization | Optimized for memory efficiency (Almazrouei et al., 2023) | Focuses on precision with higher memory usage (Touvron et al., 2023)
3.5 Evaluation
The classification model was evaluated using Preci-
sion, Recall, and F1-Score:
Precision: Proportion of correctly predicted positives among all positive predictions:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

Recall: Proportion of correctly predicted positives among actual positive cases:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$

F1-Score: Harmonic mean of Precision and Recall:

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$
For text similarity evaluation, we employed
BLEU and Cosine Similarity:
BLEU: A precision-based metric comparing n-grams of generated and reference texts, with a brevity penalty to prevent bias towards shorter sentences (Papineni et al., 2002):

$$\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \quad (5)$$

where $p_n$ is the n-gram precision, $w_n$ are the weights, and $BP$ is:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases} \quad (6)$$
Cosine Similarity: Measures the cosine of the angle between text embeddings to assess semantic similarity (Huang et al., 2008):

$$\text{Cosine Similarity} = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \, \|\vec{B}\|} \quad (7)$$

where $\vec{A}$ and $\vec{B}$ are the vector embeddings of the texts.
4 EXPERIMENTS AND RESULTS
This section details the experiments carried out and presents
the obtained results.
4.1 Setup
The system setup features an Intel® Xeon® Sil-
ver 4208 CPU, operating at a base clock speed
of 2.10GHz, providing robust multi-threading ca-
pabilities ideal for handling computationally inten-
sive tasks. Complementing the processor is 62GB
of RAM, ensuring smooth multitasking and efficient
handling of large datasets. For graphical and paral-
lel computations, the setup includes an NVIDIA RTX
5000 GPU with 16GB of dedicated RAM, which is
well-suited for machine learning, deep learning, and
high-performance computing applications. The sys-
tem runs on Ubuntu 20.04.6 LTS, a stable and reli-
able Linux distribution that offers a secure and ver-
satile environment for development and deployment.
The Hugging Face API was employed to access the
selected language models (Falcon-7B and LLaMA 2-
7B).
4.2 Validation Dataset
Two datasets were used: one for document classifica-
tion (Dataset 1) and another to evaluate the entire sys-
tem model (Dataset 2). For training and testing the
document classification model (Dataset 1), we used
a subset of The RVL-CDIP Dataset (Harley et al.,
2015). Although RVL-CDIP originally includes mul-
tiple categories, only the “Handwritten” and “Printed”
classes were selected. Downsampling was applied
to balance both classes, and the data was split into
training (75%), validation (15%), and test (10%) sets,
each containing the same number of images per class:
4650 images per class for training, 930 for validation,
and 621 for testing. For the system model evaluation
(Dataset 2), the dataset contains 112 documents of
different types (e.g., payment slips, receipts, coupons,
etc.). Of these 112 documents, 48 are images of hand-
written documents, and 65 are printed. Among the 65
printed documents, 47 are of good quality, while 18
are of poor quality. This dataset cannot be made available
because it is confidential.
4.3 Experiments
The parameters utilized in the experiments are sum-
marized in Table 2. The table outlines the essential
components and configurations of the system, detail-
ing the models, hardware, preprocessing steps, and
inference criteria used throughout the process. The
experiments were conducted using two state-of-the-
art LLMs: LLaMA2-7B and Falcon-7B. Inference
was performed on an NVIDIA RTX 5000 GPU with
16GB of dedicated VRAM to ensure efficient pro-
cessing. Preprocessing involved alignment correc-
tion with a threshold of ±5°, improving the accuracy
of text detection. The OCR stage employed the ro-
bust Tesseract OCR engine for text extraction from
document images. The system was designed to pro-
cess textual data effectively, with a maximum prompt
length of 500 tokens for the LLMs. An ONNX-based
image classification model was used to distinguish
between handwritten and printed documents, with a
confidence threshold of 0.7 to ensure reliability. Ad-
ditionally, a quality filter was applied during the OCR
process, requiring a minimum confidence score of 40
for extracted text to proceed to the following stages.
To handle potential inference issues, a maximum of
three retries was permitted for each document. The
final extracted information was presented in a stan-
dardized JSON structure, providing consistent and in-
terpretable results across all experiments.
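The exact schema of this JSON structure is not published; a hypothetical instance, reusing the field names and example values from the prompt in Section 3.4.1, might look like:

# Hypothetical example of the standardized JSON output; the actual
# schema used in the experiments is not published in the paper.
extraction_result = {
    "date": "2023-05-14",
    "monetary_value": "R$ 1.250,00",
    "creditor_name": "ABC Corporation",
    "description": "Payment for services",
}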
4.4 Results
To evaluate the proposed system’s effectiveness, a se-
ries of experiments were conducted using two dis-
tinct datasets: one for document classification and
another for system-level evaluation. These experi-
ments assessed the model’s performance in classify-
ing document types, extracting critical information,
Table 2: Parameters Used in the Experiments.

Parameter | Value / Description
Models | LLaMA2-7B, Falcon-7B
Device | GPU (NVIDIA RTX 5000, 16GB VRAM)
Preprocessing Steps | Alignment correction, threshold ±5°
OCR Engine | Tesseract OCR
Prompt Length | 500 tokens
Image Classification | ONNX model, confidence threshold = 0.7
Quality Threshold | Minimum OCR confidence = 40
Inference Attempts | Maximum retries = 3
Output Format | Standardized JSON structure
and identifying anomalies in financial and adminis-
trative records. The metrics used for evaluation, in-
cluding precision, recall, F1-score, BLEU, and co-
sine similarity, were chosen to provide a comprehensive
picture of the system's accuracy, reliability, and contextual
understanding.
4.4.1 Document Classification Results
Since our dataset contains various types of documents
(Dataset 2) and our proposed method is only ap-
plied to printed documents, we trained a classification
model to recognize whether a document is printed or
handwritten (Dataset 1). Table 3 presents the preci-
sion, recall, and F1-score.
Table 3: Evaluation Metrics for Document Classification (Dataset 1).

Label | Precision (%) | Recall (%) | F1-Score (%)
Printed (0) | 95.84 | 96.46 | 96.15
Handwriting (1) | 96.43 | 95.81 | 96.12
In Table 4, we present the confusion matrix for document
classification. On this dataset, the model correctly recognizes
595 handwritten documents, while 26 are misclassified as
printed.
Table 4: Confusion Matrix for Document Classification (Dataset 1).

Actual \ Predicted | Printed (0) | Handwriting (1)
Printed (0) | 599 | 22
Handwriting (1) | 26 | 595
4.4.2 LLM Results
In our experiments, LLaMA2-7B achieved a BLEU
score of 0.673 and a cosine similarity of 0.707,
demonstrating its ability to generate outputs closely
aligned with reference data in semantic relevance and
linguistic accuracy. Falcon-7B performed slightly
better, with a BLEU score of 0.691 and a cosine sim-
ilarity of 0.734, highlighting its robustness in produc-
ing text that is both syntactically precise and semanti-
cally meaningful, making it suitable for high-quality
document understanding and summarization tasks.
To contextualize these results, we compare them
to values from the literature, such as Yuan and Färber (2023),
where BLEU scores ranged from 0.505 to 0.802, and
de Vos et al. (2022), where cosine simi-
larity scores reached 0.738 and 0.703 in the ’Vehicles’
dataset. However, direct comparisons should be made
cautiously due to differences in dataset characteris-
tics, evaluation paradigms, and task objectives. While
BLEU is traditionally used for machine translation,
our evaluation involves different linguistic and con-
textual challenges, making absolute numerical com-
parisons less straightforward.
5 CONCLUSIONS AND FUTURE
WORK
This paper introduced a system leveraging LLMs,
specifically LLaMA2-7B and Falcon-7B, to enhance
audit processes in notary offices by automating the
extraction and analysis of financial data from var-
ious document types. The proposed approach im-
proved transparency, accuracy, and efficiency in au-
diting by addressing inefficiencies, high costs, and the
complexity of unstructured data. The system deliv-
ered strong performance in BLEU and cosine similar-
ity metrics, demonstrating its effectiveness in infor-
mation extraction and anomaly detection. Key ben-
efits include assisting court analysts in identifying
fraud cases, optimizing public resource management
by eliminating unjustified expenses, and potentially
increasing court revenues to reinvest in public ser-
vices, further reinforcing the system’s impact on fi-
nancial oversight and accountability.
Future work aims to expand the capabilities of
the system, particularly in the processing of hand-
written documents through handwriting recognition
or specialized training. Integrating multimodal learn-
ing to analyze text alongside visual elements like
stamps and signatures could further enhance its ro-
bustness. Additionally, developing multilingual and
cross-jurisdictional models would improve the sys-
tem’s adaptability to different languages and regu-
latory environments, ensuring broader usability and
compliance with international standards. Further-
more, domain-specific fine-tuning for legal and fi-
nancial contexts, real-time auditing features, and im-
proved interpretability would make the system more
precise and user-friendly. These advancements will
contribute to greater accountability, resource manage-
ment, and public trust in legal and financial oversight.
ACKNOWLEDGMENTS
The results presented in this paper have been devel-
oped as part of a project at SiDi, financed by Samsung
Eletrônica da Amazônia Ltda., under the auspices of
the Brazilian Federal Law of Informatics no. 8248/91.
REFERENCES
Alhamad, H. A., Shehab, M., Shambour, M. K. Y., Abu-
Hashem, M. A., Abuthawabeh, A., Al-Aqrabi, H.,
Daoud, M. S., and Shannaq, F. B. (2024). Handwrit-
ten recognition techniques: A comprehensive review.
Symmetry, 16(6):681.
Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A.,
Cojocaru, R., Debbah, M., Goffinet, É., Hesslow,
D., Launay, J., Malartic, Q., et al. (2023). The fal-
con series of open language models. arXiv preprint
arXiv:2311.16867.
Baek, Y., Lee, B., Han, D., Yun, S., and Lee, H. (2019).
Character region awareness for text detection. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 9365–9374.
Beltran, A. (2023). Fiscal data in text: Information ex-
traction from audit reports using natural language pro-
cessing. Data & Policy, 5:e7.
de Elias, E. M., Tasinaffo, P. M., and Hirata, R. (2019).
Alignment, scale and skew correction for optical mark
recognition documents based. In 2019 XV Workshop
de Visão Computacional (WVC), pages 26–31. IEEE.
de Vos, I. M. A., Boogerd, G. L., Fennema, M. D., and Cor-
reia, A. D. (2022). Comparing in context: Improving
cosine similarity measures with a metric tensor. arXiv
preprint arXiv:2203.14996.
Devlin, J. (2018). Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv
preprint arXiv:1810.04805.
Fisher, I. E., Garnsey, M. R., and Hughes, M. E. (2016).
Natural language processing in accounting, auditing
and finance: A synthesis of the literature with a
roadmap for future research. Intelligent Systems in Ac-
counting, Finance and Management, 23(3):157–214.
Grabb, D. (2023). The impact of prompt engineering in
large language model performance: a psychiatric ex-
ample. Journal of Medical Artificial Intelligence, 6.
Harley, A. W., Ufkes, A., and Derpanis, K. G. (2015). Eval-
uation of deep convolutional nets for document image
classification and retrieval. CoRR, abs/1502.07058.
Hegghammer, T. (2022). Ocr with tesseract, amazon tex-
tract, and google document ai: a benchmarking ex-
periment. Journal of Computational Social Science,
5(1):861–882.
Huang, A. et al. (2008). Similarity measures for text doc-
ument clustering. In Proceedings of the sixth new
zealand computer science research student conference
(NZCSRSC2008), Christchurch, New Zealand, vol-
ume 4, pages 9–56.
HuggingFace (2024). Falcon-7b model card. https://huggingface.co/tiiuae/falcon-7b. Accessed: 2024-11-08.
Ingle, R. R., Fujii, Y., Deselaers, T., Baccash, J., and Popat,
A. C. (2019). A scalable handwritten text recognition
system. In 2019 International conference on docu-
ment analysis and recognition (ICDAR), pages 17–24.
IEEE.
Karanikolas, N., Manga, E., Samaridi, N., Tousidou, E.,
and Vassilakopoulos, M. (2023). Large language mod-
els versus natural language understanding and genera-
tion. In Proceedings of the 27th Pan-Hellenic Con-
ference on Progress in Computing and Informatics,
pages 278–290.
Kumar, P. (2024). Large language models (llms): survey,
technical frameworks, and future challenges. Artificial
Intelligence Review, 57(10):260.
MetaAI (2024). Llama 2-7b model card. https://huggingface.co/meta-llama/Llama-2-7b. Accessed: 2024-11-08.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meet-
ing of the Association for Computational Linguistics,
pages 311–318.
Rudžionis, V., Lopata, A., Gudas, S., Butleris, R., Veitaitė, I.,
Dilijonas, D., Grišius, E., Zwitserloot, M., and Rudzioniene, K.
(2022). Identifying irregular financial operations using
accountant comments and natural language processing techniques.
Applied Sciences, 12(17):8558.
Saini, R. (2015). Document image binarization tech-
niques, developments and related issues: a re-
view. International Journal of Computer Applica-
tions, 116(7):0975–8887.
Santana, A. F. B., de Faria, J. A., and Sena, T. R. (2024).
Editorial volume 05, número 02, 2024: Auditoria e seus desafios
(ainda) atuais! Revista Controladoria e Gestão, 5(2):1–3.
Simunic, D. A. (1980). The pricing of audit services: The-
ory and evidence. Journal of Accounting Research,
pages 161–190.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava,
P., Bhosale, S., et al. (2023). Llama 2: Open foun-
dation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.
Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W.,
Li, Q., and Li, J. (2024). Prompt engineering in con-
sistency and reliability with the evidence-based guide-
line for llms. npj Digital Medicine, 7(1):41.
Yuan, S. and Färber, M. (2023). Evaluating generative
models for graph-to-text generation. arXiv preprint
arXiv:2307.14712.
Zheng, X., Zhang, C., and Woodland, P. C. (2021). Adapt-
ing gpt, gpt-2 and bert language models for speech
recognition. In 2021 IEEE Automatic speech recogni-
tion and understanding workshop (ASRU), pages 162–
168. IEEE.