gAIus: Combining GenAI with Legal Clauses Retrieval for Knowledge-Based Assistant

Michał Matak

and Jarosław A. Chudziak

Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland

Keywords:

Legal Technologies, Knowledge-Based Systems, Explainable Artiﬁcial Intelligence, Cognitive Systems.

Abstract:

In this paper we discuss the capability of large language models to ground their answers and provide proper references when dealing with legal matters of a non-English-speaking and non-Chinese-speaking country. We discuss the history of legal information retrieval, the difference between case law and statute law and its impact on legal tasks, and analyze the latest research in this field. Based on that background, we introduce gAIus, the architecture of a cognitive LLM-based agent whose responses are based on knowledge retrieved from a specific legal act, the Polish Civil Code. We propose a retrieval mechanism that is more explainable, more human-friendly, and achieves better results than embedding-based approaches. To evaluate our method, we create a dedicated dataset based on single-choice questions from entrance exams for law apprenticeships conducted in Poland. The proposed architecture substantially improved the abilities of the underlying large language models, improving the joint score of gpt-3.5-turbo-0125 by 419%, allowing it to beat gpt-4o, and lifting the gpt-4o-mini score from 31% to 86%. At the end of the paper we outline possible paths for future research and potential applications of our findings.

1 INTRODUCTION

The appearance of large language models in recent

years has unlocked new possibilities in natural lan-

guage processing, transforming the ﬁeld. These ad-

vances have signiﬁcantly improved many NLP tasks

and have allowed one to perform them with one single model (Raffel et al., 2023). As more advanced models appeared, they entered the mainstream and became a subject of public discussion due to their near human-like capabilities.

One of the most promising ﬁelds for recent ad-

vancements is the legal industry, where tasks often in-

volve reading large volumes of documents, extracting

key information from them, and writing substantial,

repetitive amounts of text in speciﬁc jargon.

This fact, along with the aforementioned progress,

raises hopes for the automation of certain tasks. The

demand for such a solution remains high, as govern-

ments constantly struggle with lengthy court proceed-

ings (Biuro Rzecznika Praw Obywatelskich, 2022;

Committee of Ministers, 2000).

https://orcid.org/0009-0001-3755-7158

https://orcid.org/0000-0003-4534-8652

A common process in the legal industry is legal research, which involves identifying and obtaining the information necessary to resolve a legal issue. Depending on the country, this process will differ. Two

main legal systems in the world may be distinguished.

The ﬁrst, the common law system, is characterized

by the stability of the law. Court rulings in previ-

ous similar cases play a signiﬁcant role. It is mainly

present in Anglo-Saxon countries. The second is the

statutory law system, where current legal provisions,

which may frequently change, are of greater impor-

tance. This system dominates in countries of con-

tinental Europe and countries whose legal customs

originated from them, such as Mexico or Japan.

In countries with a statutory law system, it is im-

portant to address the search for relevant regulations

in documents that are key sources of law (as opposed

to case law), such as statutes, regulations, constitu-

tions, and similar legal texts, due to their signiﬁcant

role.

Automation in this area may not seem beneﬁcial

to experienced lawyers, but it will serve two important

purposes.

The ﬁrst, signiﬁcant from a societal perspective,

is the democratization of law, ensuring better under-

standing and access to legal information for all inter-

ested individuals who are not experts in the ﬁeld and

do not have time to read hundreds of pages that are

difﬁcult to understand.

Matak, M. and Chudziak, J.

gAIus: Combining GenAI with Legal Clauses Retrieval for Knowledge-Based Assistant.

DOI: 10.5220/0013191800003890

In Proceedings of the 17th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2025) - Volume 3, pages 868-875

ISBN: 978-989-758-737-5; ISSN: 2184-433X

The second, more technical purpose, is the poten-

tial to integrate such a system into more advanced

architectures (e.g., multi-agent systems) capable of

providing basic legal advice or even issuing rulings

in cases of low complexity. Such systems have been successfully implemented for agile software engineering (Chudziak and Cinkusz, 2024; Qian et al., 2024) and software project management (Cinkusz and Chudziak, 2024).

Our goal in this work is to develop a system that

effectively enriches an AI agent with knowledge of

the Polish Civil Code and to establish an evaluation

process in this ﬁeld. The resulting assistant will be

capable of answering legal questions by referencing

the appropriate provisions of the Code.

The key contributions of this paper are the architecture of a system that significantly boosts the legal knowledge of a large language model, a method for segmenting and representing a legal act, and a retrieval approach that dispenses with embeddings. Additionally, we created a benchmark based on state exams conducted in Poland, which enables a direct comparison between the performance of large language models and human capabilities, and we tested currently available LLMs on it.

In the second section of the article, we review the current state of research, concentrating on different methods of augmenting large language models with domain-specific knowledge and on the idea of an AI agent. The third section details our proposed solution. In the fourth section, we describe the experimental dataset and methodology along with the results. Finally, the fifth section offers our conclusions, summarizes the paper, and explores potential directions for future work.

2 BACKGROUND

To outline the background for our discussion, we fo-

cus on three key areas: methods of augmenting large

language models, challenges and improvements in le-

gal information retrieval, and legal assistants evalua-

tion methods.

2.1 Augmenting Large Language Models

There are several methods of augmenting LLMs with domain-specific knowledge so that they provide better responses. One of the most traditional methods is fine-tuning, which involves training an LLM on data specific to a given field of knowledge. This method has proved effective in several domains, such as finance (Jeong, 2024).

The use of ﬁne-tuning to teach a model legal reg-

ulations can be quite problematic. Legal regulations

are often subject to frequent changes, including the

addition of new rules, modiﬁcations, and the removal

of old ones. Due to the large number of legislative

changes, if a large number of legal acts were incor-

porated, ﬁne-tuning would need to occur frequently

— potentially every few days — to ensure up-to-date

performance. This requirement makes ﬁne-tuning

both resource-intensive and difﬁcult to maintain over

time.

An alternative to fine-tuning is Retrieval-Augmented Generation (RAG), proposed in (Lewis et al., 2021). The authors introduced a model with two types of memory: a parametric memory, implemented as a pre-trained seq2seq transformer, and a non-parametric memory, which encodes documents and queries using pre-trained encoder models. The retrieval process selects the top k documents by maximizing the inner product between their encoded forms and the encoded query. The method achieved state-of-the-art results on open Natural Questions, WebQuestions, and CuratedTrec, and strongly outperformed recent approaches that use specialized pre-training objectives on TriviaQA.
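As a rough illustration of the retrieval step described above, the following minimal sketch ranks documents by the inner product between a query vector and document vectors and keeps the top k. The toy vectors and the function names are assumptions made for illustration only; in RAG the vectors are produced by pre-trained encoders.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner (dot) product of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest inner products."""
    scored = [(i, inner_product(query_vec, d)) for i, d in enumerate(doc_vecs)]
    # Highest score first; Python's sort is stable, so ties keep document order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical 3-dimensional "embeddings" standing in for encoder outputs.
    docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
    query = [0.9, 0.1, 0.0]
    for index, score in top_k_by_inner_product(query, docs, k=2):
        print(index, score)
```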

Research shows that RAG leads to improvements

that are comparable to those achieved by ﬁne-tuning

(Balaguer et al., 2024). Additionally, RAG requires

lower costs at the initial stage; however, it might incur

higher costs in the long run due to increased token

usage (Balaguer et al., 2024).

To solve more complex tasks, LLMs have been integrated with additional components such as internet search, memory, planning, and external tools. In these architectures, the role of the LLM is not limited to returning the proper answer; it also breaks the task into subtasks and orchestrates the use of different tools to achieve the desired result.

A straightforward example of this concept is ReAct, introduced in (Yao et al., 2023). It is a reasoning framework in which the LLM can either perform actions on an external environment or perform actions in the language space. The latter generate thoughts that compose useful information and support reasoning in the current context. On HotpotQA and FEVER it achieved better results than a standard LLM, the Act framework, and chain-of-thought prompting.

The landscape of agents and large language mod-

els was structured in (Sumers et al., 2024), where

three uses of large language models from agent sys-

tem perspective were distinguished: producing output

from input (raw LLM), a language agent in a direct

feedback loop with the environment, and a cognitive

language agent that manages its internal state through

learning and reasoning. The paper further explores

different architectures of cognitive agents and orga-

nizes them in three dimensions: information storage,

action space, and decision-making procedure.

2.2 Legal Information Retrieval

When retrieving information for legal issues, there

are several challenges worth considering to under-

stand the speciﬁc requirements of this ﬁeld. Some of

them are exceptionally relevant in the context of using

LLMs for this task.

First, when obtaining provisions from legal acts, it

is crucial to accurately convey their content. In many

cases, subtle details - such as punctuation marks, as

well as the difference between "among others" and

"in particular" - can be extremely important. In cer-

tain cases, even the placement of a provision within

a statute can be signiﬁcant. When using ﬁne-tuned

LLMs, there is a substantial risk of returning im-

perfect answers. This risk arises from the potential

for hallucinations (i.e., generating plausible but incor-

rect content) and the difﬁculty of verifying responses

without specialized legal knowledge.

Second, the goal in legal research is to ﬁnd all rele-

vant information on a given topic, not just a portion of

it. Considering that in most legal systems, a speciﬁc

provision overrides a general one, omitting it may re-

sult in a completely erroneous legal interpretation.

Finally, it is also important to consider the cus-

toms adopted in different countries when using base

models. Unlike in ﬁelds such as medicine or ﬁnance,

there is no universal "objective truth" in law. Legal

regulations can vary widely across jurisdictions, and

in some cases, regulatory approaches may even be

completely opposite.

One of the first approaches to aligning LLMs with the legal domain was LegalBert, introduced in (Chalkidis et al., 2020). In this study, two variants based on the BERT architecture were proposed: one was a BERT model further pre-trained on legal corpora, and the other was a BERT model pre-trained from scratch on these corpora. LegalBert achieved state-of-the-art results on three tasks (multi-label classification in ECHR-CASES, and contract header and lease details in CONTRACTS-NER).

A similar approach was used in LawGPT (Zhou et al., 2024), which was designed for Chinese legal applications. In addition to legal domain pre-training on a corpus consisting of 500K legal documents, the authors proposed a Legal-supervised fine-tuning (LFT) step to enhance the results. The model achieved better results than open-source models in 5 out of 8 cases; however, it still falls largely behind the GPT-3.5 Turbo and GPT-4 models.

A platform that fully integrated large language models and retrieval-augmented generation was CaseGPT (Yang, 2024), a system designed for case-based reasoning in the legal and medical domains. It consisted of three modules: a query processing module that transformed user queries for retrieval, a case retrieval engine based on a dense vector index with a semantic search algorithm, and an insight generation module that analyzed retrieved cases within the user’s query context.

A different approach to case retrieval was pre-

sented in (Wiratunga et al., 2024). In this approach,

each case consists of a question, an answer, support-

ing evidence for the answer, and the named entities

extracted from this support. For retrieval, the au-

thors proposed dual embeddings: intra-embeddings

for case query encoding and inter-embeddings for en-

coding support and entities. The evaluation demon-

strated that this architecture improves performance by

1.94% compared to the baseline without RAG.

A notable solution in legal information retrieval

for statute law was the winning solution for Task

3 in the Competition on Legal Information Extrac-

tion/Entailment 2023 (COLIEE 2023), developed by

the CAPTAIN team (Nguyen et al., 2024a). Its pur-

pose was to extract a subset of articles of the Japanese

Civil Code from the entire Civil Code to answer

questions from Japanese legal bar exams. The solu-

tion was based on negative sampling, multiple model

checkpoints (with the assumption that each check-

point is biased toward speciﬁc article categories),

and a speciﬁc training process that included a data-

ﬁltering approach.

Another study conducted on the COLIEE 2023

dataset proposes the use of prompting techniques

in the ﬁnal stage of the retrieval system, preceded

by BM25 pre-ranking and BERT-based re-ranking

(Nguyen et al., 2024b). The solution in this study

achieved a score that outperformed the aforemen-

tioned approach (the solution of the winning team) by

4%.

2.3 Legal Assistants Evaluation Methods

There are several datasets for evaluating legal agents.

One of the most notable is LegalBench (Guha et al.,

2023), which consists of 162 tasks designed and hand-

crafted by legal professionals. The tasks cover six

distinct types of legal reasoning. This benchmark in-

spired LegalBench-RAG (Pipitone and Alami, 2024),

a dataset tailored speciﬁcally for retrieval-augmented

generation in the legal domain. LegalBench-RAG in-

cludes 6,858 query-answer pairs over a corpus con-

taining more than 79 million characters, created from

documents sourced from four distinct datasets.

For the task of Extreme Multi-Label Legal Text Classification (Chalkidis et al., 2019), a dataset was created from 57,000 legislative documents from EUR-Lex, the public document database of the European Union. These documents were annotated with EUROVOC, a multidisciplinary thesaurus used to standardize legal terminology across various domains.

Finally, an interesting evaluation benchmark was

used for Task 3 of the 2023 Competition on Legal

Information Extraction/Entailment (COLIEE 2023).

The dataset consists of questions that require the eval-

uated agent to answer yes/no and to provide the rele-

vant articles of the Japanese Civil Code. Notably, this

dataset is one of the very few examples that focuses

speciﬁcally on the Statute Law Retrieval Task.

3 ASSISTANT ARCHITECTURE

The architecture of gAIus consists of two key compo-

nents. The ﬁrst is the query ﬂow, which deﬁnes how

the agent interacts with human input and the available

tools. The second is the retrieval mechanism, respon-

sible for selecting documents relevant to the given

query.

3.1 Query Flow

The flow in our architecture is illustrated in Figure 1.

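Figure 1: Visualisation of query flow. The user asks the AI assistant "Who can be incapacitated?"; the assistant sends the query "Incapacitation" to the retriever, receives k retrieved documents (e.g., art. 13), and replies with an answer beginning "According to article 16 of Polish Civil Code incapacitated can be".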

The architecture consists of a single agent that re-

ceives a query from the user and interacts with a sin-

gle tool: the retriever. Upon receiving the request,

the agent reformulates the query to retrieve relevant

documents. The purpose of this architecture is to en-

able the LLM to extract the semantically important

part of the question and generate a more general re-

trieval query.

For example, a speciﬁc question such as "Who

can be incapacitated?" would be transformed into the

more general query "Incapacitation." By generalizing

the query, the system can increase the likelihood of

retrieving the correct documents.

The agent can use the retriever multiple times, al-

lowing it to attempt different queries if the relevant

documents are not found on the ﬁrst attempt. This it-

erative approach increases the chance of successfully

retrieving the correct information.
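The iterative use of the retriever can be sketched as a simple loop. This is only an illustration under stated assumptions: the three callables (`reformulate`, `retrieve`, `is_sufficient`) and the attempt limit are hypothetical stand-ins, since in the described system the LLM agent itself decides how to rephrase the query and whether the retrieved documents are useful.

```python
from typing import Callable, List, Optional


def answer_with_retries(
    question: str,
    reformulate: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    max_attempts: int = 3,
) -> Optional[List[str]]:
    """Try successive retrieval queries until the documents look sufficient.

    All callables are hypothetical placeholders for decisions that the
    LLM agent makes on its own in the described architecture.
    """
    tried: List[str] = []
    for _ in range(max_attempts):
        # Turn the specific question into a more general retrieval query,
        # e.g. "Who can be incapacitated?" -> "Incapacitation".
        query = reformulate(question, tried)
        tried.append(query)
        documents = retrieve(query)
        if is_sufficient(question, documents):
            return documents
    return None  # no attempt produced useful documents


if __name__ == "__main__":
    # Toy stand-ins: a one-entry "database" and a fixed list of query attempts.
    database = {"Incapacitation": ["art. 13 (placeholder text)"]}
    attempts = iter(["Who can be incapacitated?", "Incapacitation"])
    found = answer_with_retries(
        "Who can be incapacitated?",
        reformulate=lambda q, tried: next(attempts),
        retrieve=lambda q: database.get(q, []),
        is_sufficient=lambda q, docs: bool(docs),
    )
    print(found)
```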

3.2 Retrieval

To provide context for a large language model (LLM),

one approach is to include the entire document (e.g.,

a legal act) in the model’s context. However, there

are several challenges associated with this approach.

First, the document may be too long to fit within the LLM’s context window.

Second, LLMs tend to perform worse when process-

ing long contexts, often failing to extract the neces-

sary information. Third, most commercial LLMs are

billed based on token usage, so including the entire

document in the context would result in signiﬁcantly

increased costs.

For this reason, we decided to chunk the Polish

Civil Code before storing it in the document database.

The chunking method can vary and signiﬁcantly inﬂu-

ence the quality of the retrieval process.

The Polish Civil Code is already divided by the

lawmaker into various entities, such as books, titles,

chapters, and sections. However, the main editorial

unit, which consists of one or more sentences, is the

article. We chose to split the entire Code into individual articles, which we will further refer to as documents. The chunking method is illustrated in Figure 2.

Figure 2: Visualisation of Civil Code chunking for retrieval. A fragment of the Civil Code (art. 8 to art. 14) is split per article, and each article is stored as a separate document in the documents database.
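A per-article split of the preprocessed text could look roughly like the sketch below. It assumes that every article starts on its own line with a heading of the form "Art. <number>."; this heading pattern, the function name, and the placeholder texts are assumptions for illustration, not a description of the actual preprocessing.

```python
import re
from typing import Dict

# Assumed heading format: a line starting with "Art. <number>." (e.g. "Art. 13.").
# Articles with an index (superscript) in their number are not handled here.
ARTICLE_HEADING = re.compile(r"^Art\.\s*(\d+)\.", re.MULTILINE)


def split_into_articles(code_text: str) -> Dict[str, str]:
    """Split the text of a legal act into {article number: article text}."""
    matches = list(ARTICLE_HEADING.finditer(code_text))
    articles: Dict[str, str] = {}
    # Each article runs from its heading to the start of the next heading.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following is not None else len(code_text)
        articles[current.group(1)] = code_text[current.start():end].strip()
    return articles


if __name__ == "__main__":
    # Placeholder text, not actual provisions of the Civil Code.
    sample = (
        "Art. 8. (placeholder text of article 8)\n"
        "Art. 9. (placeholder text of article 9)\n"
        "continued on a second line\n"
        "Art. 10. (placeholder text of article 10)\n"
    )
    for number, text in split_into_articles(sample).items():
        print(number, "->", repr(text))
```

Keeping the article number as the key makes it straightforward to cite the source provision when a document is later retrieved.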

To assess the relevance of the document to the

query, we score the documents using a custom func-

tion and retrieve the top-k documents with the highest

scores (Algorithms 1 and 2).

Data: document, query
Result: score of the document
max_score ← 0;
for i ← 0 to |document| − |query| do
    part ← document[i : i + |query|];
    part_score ← SCOREPART(part, query);
    if part_score > max_score then
        max_score ← part_score;
    end
end
return max_score;

Algorithm 1: Document scoring function.

Data: part, query
Result: score of the part
score ← 0;
for i ← 0 to |part| − 1 do
    if part[i] = query[i] then
        score ← score + 1;
    end
end
return score;

Algorithm 2: Part scoring function.

The algorithm’s behavior can be described as cal-

culating the number of matching letters between the

query and each part of the document, constrained by

the query length. This fuzzy text search approach is

inspired by the behavior of law students when search-

ing for regulations. Typically, they visually scan the

text, focusing on target words to identify the required

regulation. Additionally, this method leverages exist-

ing LLM knowledge, as models often have some un-

derstanding of legal texts and are able to cite relevant

portions of regulations.
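Algorithms 1 and 2 can be transcribed into Python roughly as follows. This is a minimal sketch rather than the authors' implementation: the function names, the top-k helper, and the toy documents are assumptions, and the comparison is case-sensitive because the pseudocode does not mention any normalization.

```python
from typing import List, Tuple


def score_part(part: str, query: str) -> int:
    """Algorithm 2: count positions where part and query hold the same character."""
    return sum(1 for p, q in zip(part, query) if p == q)


def score_document(document: str, query: str) -> int:
    """Algorithm 1: best score over all query-length windows of the document."""
    max_score = 0
    # The range is empty when the document is shorter than the query,
    # in which case the score stays 0, as in the pseudocode.
    for i in range(len(document) - len(query) + 1):
        part = document[i:i + len(query)]
        max_score = max(max_score, score_part(part, query))
    return max_score


def retrieve_top_k(documents: List[str], query: str, k: int = 50) -> List[Tuple[int, str]]:
    """Return the k highest-scoring (score, document) pairs, best first."""
    scored = [(score_document(d, query), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy documents (placeholders, not provisions of the Civil Code).
    docs = ["rules on incapacitation apply", "rules on ownership apply"]
    for score, doc in retrieve_top_k(docs, "incapacitation", k=2):
        print(score, doc)
```

Because each window is compared position by position, a query with a small typo still scores close to the maximum against the matching fragment, which is what makes the search fuzzy.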

To fully realize the potential of the retrieval pro-

cess, we utilize the earlier decision to split the legal

code into articles. All chunks are relatively small,

with an average length of approximately 140 tokens.

For this reason, we chose to retrieve a relatively

large number of chunks (k = 50). The cognitive ca-

pabilities of the large language model allow it to ﬁl-

ter out irrelevant articles, ensuring that the pipeline

functions correctly as long as the retrieval achieves

satisfactory recall. In the worst-case scenario, where

the longest chunks are retrieved, no more than 30

chunks ﬁt within the context window of smaller mod-

els (e.g., GPT-3.5-turbo). Although this could the-

oretically cause the pipeline to fail, our experiments

indicate that this risk is relatively low.

To differentiate the impact of the document scor-

ing method and retrieval parameters (including code

segmentation), we also implemented the architecture

using embeddings. We refer to this variant as gAIus-

RAG.

3.3 Used Technologies

The entire project is written in Python. To create

the assistant, we used the LangChain framework with

its agents API (LangChain, 2024). We implemented

a custom retriever based on the LangChain retriever

interface. For the vectorstore, we used ChromaDB,

and for Polish language embeddings, we employed

mmlw-roberta-large encodings from (Dadas, 2019).

The Polish Civil Code was downloaded from the

ofﬁcial government website in PDF format (Sejm

Rzeczypospolitej Polskiej, 1964). The document was

subsequently converted into a TXT ﬁle and prepro-

cessed by removing footnotes, publication metadata,

and annotations.

4 EXPERIMENTS

In this section we describe the dataset we created for evaluation purposes, the methodology, and the results obtained.

4.1 Dataset

We created an evaluation dataset similar to the one used in the COLIEE 2023 competition. It consists of questions from entrance exams for law apprenticeships from the years 2021-2023. These are state exams conducted every year in Poland. Law graduates have to pass them to enroll in a three-year apprenticeship program. After completing it and passing another exam (the counterpart of the bar exam), they become law professionals. Every year three such exams are conducted: for the notarial apprenticeship, for the bailiff apprenticeship, and for the attorney and legal advisers’ apprenticeship.

Each of the exams consists of 150 closed questions with three possible answers: a, b, or c. To pass the test, one has to answer at least 100 correctly. The questions are usually quite simple if one knows the proper regulation, and they do not require extensive reasoning. For this reason we decided to use them as an evaluation framework for our assistant. The questions were downloaded from the government website (Ministerstwo Sprawiedliwości, 2023) along with the answers. The answers file also contained the indices of the legal provisions needed to answer each question. After filtering out all questions that did not refer to the Civil Code, the dataset consisted of 146 different questions.

4.2 Methodology

We compare our architecture, powered by the GPT-3.5-turbo-0125 and GPT-4o-mini models, with other OpenAI LLMs without any additional enhancements. All models were configured with a temperature of 0 to ensure reproducibility. The evaluation was also performed on the RAG variants (gAIusRAG) to compare them with our retrieval method.

The primary goal of the proposed law assistant

is to provide accurate and relevant information about

various legal issues. For this reason, during the eval-

uation process, we do not only expect the assistant

to return the correct answer but also to justify its re-

sponse and cite the relevant legal article.

To achieve this, each evaluated assistant was set

up with the following system prompt:

You are a helpful assistant specializing

in Polish law. You will receive questions from

an exam, each consisting of a question or an

incomplete sentence followed by three possible

answers labeled a, b, and c.

Your task is to:

1. Choose the correct answer.

2. Provide a detailed explanation for

your choice.

3. Refer to the relevant article(s)

in the one of polish regulations.

Please ensure your responses are precise and

informative. Respond in polish.

When assessing the performance of the assistant,

we check if the correct answer was chosen and if ap-

propriate regulation was referred to. We measure the

number of successfully answered questions (answer

score), the number of correctly referred contexts (con-

text score), and the number of situations in which both

answer and context are correct (joint score). If the as-

sistant refers to more regulations than necessary, we

still consider it a good answer, as long as the extra

references are minimal (typically no more than two

articles).

To automate the evaluation, we developed a sim-

ple evaluation agent that extracts the chosen answer

and the cited articles into a structured JSON for-

mat. Based on these extracted values, the assistant

response is evaluated.
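The three scores could be computed from the extracted JSON roughly as sketched below. The field names `answer` and `articles`, the article identifiers, and the exact tolerance rule are assumptions for illustration; the text only states that a few extra references (typically no more than two articles) are still accepted.

```python
import json
from typing import Dict, List, Set

MAX_EXTRA_REFERENCES = 2  # assumed from "typically no more than two articles"


def evaluate_response(
    extracted_json: str,
    correct_answer: str,
    correct_articles: Set[str],
) -> Dict[str, bool]:
    """Check one response; the JSON layout is a hypothetical example."""
    extracted = json.loads(extracted_json)
    answer_ok = extracted["answer"].strip().lower() == correct_answer.strip().lower()
    cited = set(extracted["articles"])
    extra = cited - correct_articles
    # Context is correct if every required article is cited and only
    # a small number of additional articles is referenced.
    context_ok = correct_articles <= cited and len(extra) <= MAX_EXTRA_REFERENCES
    return {"answer": answer_ok, "context": context_ok, "joint": answer_ok and context_ok}


def aggregate(results: List[Dict[str, bool]]) -> Dict[str, int]:
    """Sum per-question outcomes into answer, context and joint scores."""
    return {key: sum(r[key] for r in results) for key in ("answer", "context", "joint")}


if __name__ == "__main__":
    # Hypothetical extraction outputs for two responses to the same question.
    good = evaluate_response('{"answer": "c", "articles": ["art-A"]}', "C", {"art-A"})
    bad = evaluate_response('{"answer": "b", "articles": ["art-B"]}', "C", {"art-A"})
    print(aggregate([good, bad]))
```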

4.3 Results

To present the detailed behavior of the assistants, we show and discuss a sample question from the dataset with responses from gpt-4o and gAIus based on gpt-3.5-turbo in Table 1.

In this table, a sample question from the evalua-

tion dataset is shown, alongside the correct answer,

the correct context, and the responses from the two

evaluated assistants. The original source information

was in Polish, but was translated into English for the

reader’s convenience. The correct context clearly in-

dicates the proper answer to the question.

Our assistant, gAIus, correctly identifies and references the relevant article (although the index number became part of the article ID due to preprocessing). As a result, gAIus provides the correct answer. In contrast, GPT-4o chooses the wrong answer and hallucinates a context that appears similar to the desired one. In particular, in the Polish Civil Code there is no Article 109 § 2, and the referenced text lacks the key information necessary to answer the question.

The evaluation results for all questions from the

dataset are presented in Table 2, with several notable

outcomes:

First, all raw commercial large language models performed better in the answer score than in the context score. For all of these models except GPT-4o, the disparity between the two scores is significant. In contrast, the gAIus assistants excelled in providing the correct context. GPT-4o emerged as the best-performing commercial model; however, its answer score ranked only third among all evaluated models.

Secondly, the gAIus architecture significantly enhanced the performance of the underlying LLMs. In the case of gpt-3.5-turbo-0125, the joint score improved by 419% (from 21 to 109), outperforming the best commercial model, GPT-4o. The joint score of gpt-4o-mini improved by 173% (from 46 to 126), with this gAIus variant achieving the highest joint score and context score and the second-best answer score.

Furthermore, architectures based on our scoring

function outperformed those using RAG. However,

the RAG variant based on GPT-4o-mini achieved the

best answer score, which was slightly higher than that

of the corresponding gAIus variant using search func-

tions.
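The reported improvements can be recomputed from the joint scores in Table 2. The short sketch below does this arithmetic; the helper names are illustrative, and the values are copied from the table.

```python
TOTAL_QUESTIONS = 146  # size of the evaluation dataset

# Joint scores copied from Table 2.
JOINT_SCORES = {
    "gpt-3.5-turbo-0125": 21,
    "gpt-4o-mini": 46,
    "gpt-4o": 98,
    "gAIus based on gpt-3.5-turbo-0125": 109,
    "gAIus based on gpt-4o-mini": 126,
}


def relative_improvement(before: int, after: int) -> float:
    """Percentage gain of `after` over `before`."""
    return (after - before) / before * 100


def share_of_questions(score: int, total: int = TOTAL_QUESTIONS) -> float:
    """Score expressed as a percentage of all questions."""
    return score / total * 100


if __name__ == "__main__":
    for base in ("gpt-3.5-turbo-0125", "gpt-4o-mini"):
        before = JOINT_SCORES[base]
        after = JOINT_SCORES[f"gAIus based on {base}"]
        print(
            f"{base}: {before} -> {after} "
            f"(+{relative_improvement(before, after):.1f}%, "
            f"{share_of_questions(before):.1f}% -> "
            f"{share_of_questions(after):.1f}% of questions)"
        )
```

With these definitions the gains come out at roughly 419% and 174% (the latter is reported as 173% when truncated), and the gpt-4o-mini joint score moves from about 31.5% to about 86.3% of the 146 questions.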

5 CONCLUSIONS, SUMMARY AND FURTHER WORK

Several possible developments can be identiﬁed for

our research:

First, the legal system in Poland has sources be-

yond the Civil Code. A natural path for further de-

velopment would be to expand the assistant to cover

Table 1: Example question from evaluation dataset and assistants' responses (originally in Polish, translated to English).

Question: According to the Civil Code, the commercial power of attorney (prokura):
a) can always be limited with effect against third parties;
b) can never be limited with effect against third parties;
c) cannot be limited with effect against third parties, unless a specific provision states otherwise.

Correct answer: C

Correct context: Art. 109¹. § 2. The commercial power of attorney (prokura) cannot be limited with effect against third parties, unless a specific provision states otherwise.

gAIus answer: The correct answer is:
c) cannot be limited with effect against third parties, unless a specific provision states otherwise.
According to Article 1091 § 2 of the Civil Code, a commercial power of attorney (prokura) cannot be limited with effect against third parties unless a specific legal provision allows such a limitation. This means that a power of attorney granted by an entrepreneur cannot be restricted with respect to third parties, unless there is a specific legal regulation permitting such a restriction. This is crucial for the protection of third parties who may rely on the agent’s authority in dealings with the entrepreneur.
Legal provision: Article 1091 § 2 of the Civil Code.

gpt-4o answer: Answer: b) can never be limited with effect against third parties.
Explanation: According to the Civil Code, the commercial power of attorney (prokura) is a special type of power of attorney granted by an entrepreneur who is subject to registration in the business register. Prokura includes authorization for judicial and extrajudicial actions related to running the business. According to Article 109 § 2 of the Civil Code, the limitation of prokura is not effective against third parties. This means that any limitations on the scope of prokura are effective only internally, between the entrepreneur and the agent, but not against third parties.
Legal basis: Article 109 § 2 of the Civil Code: "The commercial power of attorney cannot be limited with effect against third parties."

Table 2: Results of evaluation on 146 questions.

| Assistant | Answer score | Context score | Joint score |
|---|---|---|---|
| gpt-3.5-turbo-0125 | 77 | 26 | 21 |
| gpt-4o-mini | 101 | 58 | 46 |
| gpt-4 | 115 | 53 | 44 |
| gpt-4o | 127 | 106 | 98 |
| gAIusRAG based on gpt-3.5-turbo-0125 | 101 | 116 | 91 |
| gAIusRAG based on gpt-4o-mini | 130 | 128 | 117 |
| gAIus based on gpt-3.5-turbo-0125 | 112 | 126 | 109 |
| gAIus based on gpt-4o-mini | 128 | 139 | 126 |

additional legal acts. This could be achieved by either

integrating an additional retrieval tool for the single-

agent system or designing a multi-agent architecture

with a routing agent to manage queries across multi-

ple sources.

Second, another signiﬁcant enhancement would

be the addition of case retrieval relevant to the legal

issue at hand. Although case law plays a limited role

in statutory law systems, it still serves as a powerful

instrument for argumentation, particularly when cases

originate from renowned courts.

Finally, our solution can serve as the foundation

for an AI system capable of solving more complex

legal cases. The questions in our current evaluation

framework were relatively simple and focused on di-

rect queries. It would be valuable to examine how

large language models perform in a more realistic

legal environment, where reasoning and multi-step

analysis are required.

ACKNOWLEDGEMENTS

The translation (from Polish to English) of the ex-

ample dataset question with the answers and other

cited regulations was performed using the ChatGPT-

4o model (OpenAI, 2024).

REFERENCES

Balaguer, A., Benara, V., de Freitas Cunha, R. L., de M. Es-

tevão Filho, R., Hendry, T., Holstein, D., Marsman, J.,

Mecklenburg, N., Malvar, S., Nunes, L. O., Padilha,

R., Sharp, M., Silva, B., Sharma, S., Aski, V., and

Chandra, R. (2024). Rag vs ﬁne-tuning: Pipelines,

tradeoffs, and a case study on agriculture.

Biuro Rzecznika Praw Obywatelskich (2022). 2 lata czekania na termin pierwszej rozprawy sądowej. rpo interweniuje w ministerstwie sprawiedliwości. jest odpowiedź. Accessed: 2024-09-29.

Chalkidis, I., Fergadiotis, E., Malakasiotis, P., Aletras, N.,

and Androutsopoulos, I. (2019). Extreme multi-label

legal text classification: A case study in EU legisla-

tion. In Aletras, N., Ash, E., Barrett, L., Chen, D.,

Meyers, A., Preotiuc-Pietro, D., Rosenberg, D., and

Stent, A., editors, Proceedings of the Natural Legal

Language Processing Workshop 2019, pages 78–87,

Minneapolis, Minnesota. Association for Computa-

tional Linguistics.

Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N.,

and Androutsopoulos, I. (2020). LEGAL-BERT: The mup-

pets straight out of law school.

Chudziak, J. A. and Cinkusz, K. (2024). Towards LLM-

augmented multiagent systems for agile software en-

gineering. In Proceedings of the 39th IEEE/ACM In-

ternational Conference on Automated Software Engi-

neering (ASE 2024).

Cinkusz, K. and Chudziak, J. A. (2024). Communicative

agents for software project management and system

development. In Proceedings of the 21st International

Conference on Modeling Decisions for Artificial Intel-

ligence (MDAI 2024).

Committee of Ministers (2000). Excessive length of judi-

cial proceedings in Italy. Adopted by the Committee

of Ministers on 25 October 2000 at the 727th meet-

ing of the Ministers’ Deputies. Interim Resolution

ResDH(2000)135.

Dadas, S. (2019). A repository of Polish NLP resources.

GitHub.

Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A.,

Narayana, A., Chohlas-Wood, A., Peters, A., Waldon,

B., Rockmore, D. N., Zambrano, D., Talisman, D.,

Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickin-

son, G. M., Porat, H., Hegland, J., Wu, J., Nudell, J.,

Niklaus, J., Nay, J., Choi, J. H., Tobia, K., Hagan, M.,

Ma, M., Livermore, M., Rasumov-Rahe, N., Holzen-

berger, N., Kolt, N., Henderson, P., Rehaag, S., Goel,

S., Gao, S., Williams, S., Gandhi, S., Zur, T., Iyer, V.,

and Li, Z. (2023). LegalBench: A collaboratively built

benchmark for measuring legal reasoning in large lan-

guage models.

Jeong, C. (2024). Domain-specialized LLM: Financial

fine-tuning and utilization method using Mistral 7B.

Journal of Intelligence and Information Systems,

30(1):93–120.

LangChain (2024). Agents documentation. Accessed:

2024-09-29.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V.,

Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rock-

täschel, T., Riedel, S., and Kiela, D. (2021). Retrieval-

augmented generation for knowledge-intensive NLP

tasks.

Ministerstwo Sprawiedliwości (2023). Zestawy pytań

testowych wraz z wykazami prawidłowych

odpowiedzi, w oparciu o które przeprowadzone

zostały w dniu 30 września 2023 r. egzaminy wstępne

na aplikacje adwokacką, radcowską, notarialną i

komorniczą. Accessed: 2024-09-01.

Nguyen, C., Nguyen, P., Tran, T., Nguyen, D., Trieu, A.,

Pham, T., Dang, A., and Nguyen, L.-M. (2024a). CAP-

TAIN at COLIEE 2023: Efficient methods for legal infor-

mation retrieval and entailment tasks.

Nguyen, H.-L., Nguyen, D.-M., Nguyen, T.-M., Nguyen,

H.-T., Vuong, T.-H.-Y., and Satoh, K. (2024b). En-

hancing legal document retrieval: A multi-phase ap-

proach with large language models.

OpenAI (2024). ChatGPT. Accessed: 2024-10-16.

Pipitone, N. and Alami, G. H. (2024). LegalBench-RAG: A

benchmark for retrieval-augmented generation in the

legal domain.

Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang,

C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z.,

and Sun, M. (2024). ChatDev: Communicative agents

for software development.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,

Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2023).

Exploring the limits of transfer learning with a unified

text-to-text transformer.

Sejm Rzeczypospolitej Polskiej (1964). Ustawa z dnia 23

kwietnia 1964 r. - Kodeks cywilny. Accessed: 2024-

09-01.

Sumers, T. R., Yao, S., Narasimhan, K., and Griffiths, T. L.

(2024). Cognitive architectures for language agents.

Wiratunga, N., Abeyratne, R., Jayawardena, L., Martin, K.,

Massie, S., Nkisi-Orji, I., Weerasinghe, R., Liret, A.,

and Fleisch, B. (2024). CBR-RAG: Case-based reason-

ing for retrieval augmented generation in LLMs for legal

question answering.

Yang, R. (2024). CaseGPT: a case reasoning framework

based on language models and retrieval-augmented

generation.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,

K., and Cao, Y. (2023). ReAct: Synergizing reasoning

and acting in language models.

Zhou, Z., Shi, J.-X., Song, P.-X., Yang, X.-W., Jin, Y.-X.,

Guo, L.-Z., and Li, Y.-F. (2024). LawGPT: A Chinese

legal knowledge-enhanced large language model.
