GOLLUM: Guiding cOnfiguration of firewaLL Through aUgmented Large Language Models

Roberto Lorusso, Antonio Maci and Antonio Coscia

Cybersecurity Laboratory, BV TECH S.p.A., Milan, Italy

Keywords:

Computer Networks, Conversational Agent, Cybersecurity, Firewall Conﬁguration, Large Language Model,

RAGAS, Retrieval Augmented Generation.

Abstract:

Artiﬁcial intelligence (AI) tools offer signiﬁcant potential in network security, particularly for addressing is-

sues like ﬁrewall misconﬁguration, which can lead to security ﬂaws. Conﬁguration support services can help

prevent errors by providing clear general-purpose language instructions, thus minimizing the need for manual

references. Large language models (LLMs) are AI-based agents that use deep neural networks to understand

and generate human language. However, LLMs are generalists by construction and may lack the knowledge

needed in speciﬁc ﬁelds, thereby requiring links to external sources to perform highly specialized tasks. To

meet these needs, this paper proposes GOLLUM, a conversational agent designed to guide ﬁrewall conﬁg-

urations using augmented LLMs. GOLLUM integrates the pfSense ﬁrewall documentation via a retrieval

augmented generation approach, providing an example of actual use. The generative models used in GOL-

LUM were selected based on their performance on the state-of-the-art NetConfEval and CyberMetric datasets.

Additionally, to assess the effectiveness of the proposed application, an automated evaluation pipeline, involv-

ing RAGAS as test dataset generator and a panel of LLMs for judgment, was implemented. The experimental

results indicate that GOLLUM, powered by Llama3-8B, provides accurate and faithful support in three out of four cases, while achieving more than 80% answer correctness on configuration-related queries.

1 INTRODUCTION

Modern technologies that leverage artiﬁcial intelli-

gence (AI) can positively contribute to the imple-

mentation of tactical cybersecurity services. Accord-

ingly, several AI-based methods are being used to

tackle the challenges posed by such a domain, includ-

ing deep learning (DL) and natural language process-

ing (NLP). Bridging them resulted in breakthrough

advances in the performance achieved by computer-

based agents in terms of their efﬁciency in handling

challenging tasks related to human language. Cur-

rently, a clear example is given by the large language

models (LLMs), representing the most widely dis-

tributed tools capable of understanding and generat-

ing general-purpose languages. The underlying de-

sign of such models is based on the use of trans-

formers, i.e., artiﬁcial neural networks that implement

the self-attention mechanism to extract linguistic con-

tent and to infer the relationship between words and

sentences contained within. As it learns the patterns

by which words are sequenced, the model can make

predictions about how sentences should probably be

structured (Min et al., 2023). In the cybersecurity do-

main LLMs can be mainly adopted for (Sarker, 2024):

threat analysis, incident response, training and aware-

ness, phishing detection, penetration testing, and im-

plementation of conversational agents to provide real-

time assistance to users. For the last use case, it

has been observed that interaction with conversational

agents can provide concrete support to humans in re-

lation to the task at hand (Ross et al., 2023). Mitigat-

ing human error through assistance tools is crucial,

especially when these errors occur in large and inter-

connected contexts, such as in the case of misconﬁgu-

rations generated by network administrators on tacti-

cal devices, like ﬁrewalls (Alicea and Alsmadi, 2021).

This represents a primary concern as it can result in serious performance and security problems, creating the need for anomaly resolution and optimization algorithms that act on firewall security policies (Coscia et al., 2023; Coscia et al., 2025). In any case,

it is paramount to avoid in advance the existence of

Lorusso, R., Maci, A. and Coscia, A.

GOLLUM: Guiding cOnﬁguration of ﬁrewaLL Through aUgmented Large Language Models.

DOI: 10.5220/0013221900003890

In Proceedings of the 17th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2025) - Volume 1, pages 489-496

ISBN: 978-989-758-737-5; ISSN: 2184-433X

these abnormal setups, as they would hinder the de-

vice from being used in safety-critical contexts gov-

erned by strict standards (Anwar et al., 2021). In this

setting, LLMs can be placed as powerful engines to

assist operators with network conﬁgurations (Huang

et al., 2024). Guiding the setup of speciﬁc ﬁrewall

functionalities can be accomplished by employing the

LLM as an agent that replies to speciﬁc questions

posed by the operator, i.e., performing a question an-

swering (QA) task (Ouyang et al., 2022). As a gen-

eral rule, LLMs embody the knowledge acquired dur-

ing training, thus limiting their reliability when em-

ployed in unknown contexts, potentially leading to

a phenomenon called hallucination (Ji et al., 2023).

This problem can be mitigated by expanding the

model knowledge through ﬁne-tuning, i.e., updating

its weights using task-speciﬁc data. However, it re-

quires more time and computational resources when

scaled up with larger models. A viable alternative is

represented by the so-called retrieval augmented gen-

eration (RAG) technique (Lewis et al., 2020), con-

sisting of using dynamically retrieved external knowl-

edge from custom documents in response to an in-

put query. This approach mitigates hallucinations by

providing more accurate models, which are also tol-

erant of ﬂuctuations in context-speciﬁc information,

ensuring consistency. In response to the challenge

posed, leveraging the capabilities of the increasingly

disruptive AI paradigm, this article proposes an ap-

plication to assist network administrators in conﬁgur-

ing ﬁrewall functionalities, namely, guiding conﬁg-

uration of ﬁrewall using augmented large language

models (GOLLUM). To achieve this, the prior knowl-

edge of small-sized LLMs was evaluated in the corre-

sponding instruct version to assess their expertise in

suggesting network conﬁgurations and answering cy-

bersecurity questions, using the NetConfEval and Cy-

berMetric datasets, respectively (Wang et al., 2024;

Tihanyi et al., 2024). Through an ad hoc RAG pipeline, the most accurate and fastest models on the two datasets were equipped with external knowledge provided by the pfSense documentation. In summary, the

main contributions of this paper are:

• The proposal of a conversational agent, namely

GOLLUM, which can assist network administra-

tors in ﬁrewall conﬁgurations.

• The implementation of an automated evaluation

pipeline for the RAG chain that exploits LLMs

in both the test case generation and the judgment

phases.

2 LITERATURE REVIEW

Due to the radical diffusion of LLMs, recent stud-

ies have evaluated their deployment in emerging

telecommunication technologies, such as 6G (Xu

et al., 2024). In such a scenario, several collab-

orative LLMs, each with different scopes, are dis-

tributed among the network infrastructure to per-

form user-agent interactive tasks. In (Wu et al.,

2024), a low-rank data-driven network context adap-

tation method was proposed to signiﬁcantly mini-

mize the ﬁne-tuning effort of LLM employed in com-

plex network-related tasks. This framework, called

NetLLM, turned out to be useful in three specific networking use cases. Likewise, the study by (Chen et al., 2024) proposes the application of language models to the management of distributed models in cloud-edge computing. This proposal, called NetGPT (named after the LLM it leverages), aims to release a collaborative framework for effective and adaptive network resource management and orchestration. The comprehensive investigations conducted

by (Huang et al., 2024) discuss how LLMs impact and can enhance networking tasks, such as computer network diagnosis, design and configuration, and security. In (Ferrag et al., 2024), a Bidirectional Encoder Representations from Transformers (BERT)-based architecture is leveraged as a threat detector. It was fine-tuned on Internet-of-Things (IoT) data generated through a combination of novel encoding and tokenization strategies. The AI-based agent presented in (Loevenich et al., 2024) is equipped with a chatbot that leverages an LLM and a knowledge graph-based RAG technique to provide a human-machine interface capable of performing the QA task according to the findings of the autonomous agent. In (Paduraru et al., 2024), an augmented Llama2 model powers a chatbot that supports security analysts with the latest trends in information and network security. This was achieved by combining RAG and safeguarding with the LLM capabilities. According to the review

we conducted, there appears to be a strong need to fo-

cus research activities on testing LLMs across various

networking contexts. In addition, the same tools can

greatly stimulate support activities, thereby fostering

the achievement of a better cyber posture by users.

To the best of our knowledge, no previous study has

investigated the use of LLMs for assisting in the configuration of critical network security devices, such as firewalls. In such a scenario, a user could ask an AI-based agent for support in: (i) configuring network policies; (ii) identifying the steps to follow to configure a specific functionality according to the manual of a specific product (e.g., pfSense); (iii) understanding the attack vectors related to a line of defense (e.g., what SQL injection (SQLi) attacks are and how to prevent them by enabling a proxy-based web application firewall (WAF) plugin such as the one in (Coscia et al., 2024)). Depending on the operative context and

the proposed purpose, it is essential to develop safety-

oriented agents, preferably through local LLMs, since

these can be deployed in closed private scenarios.

3 GOLLUM DESIGN

GOLLUM consists of a RAG pipeline equipped with

a parent document retriever that augments the con-

text understanding capabilities of a language model.

The criterion used to select the language model adopted in GOLLUM is discussed in Section 4.1.2.

3.1 Knowledge Base

The knowledge base is the main source of information

during retrieval. It complements the implicit knowl-

edge encoded by the model parameters with struc-

tured or unstructured information, such as textual data

or even images. To provide a real-world use case of GOLLUM, we refer to the pfSense firewall. In particular, we retained the instruction-oriented text of interest based on its clarity, simplicity, and conciseness, and stripped it of any reference not relevant to pure firewall topics. Moreover, the content of each book considered has been pre-processed so that only chapter content is retained, thereby removing prefaces, frontispieces, and any outlines. As a consequence of the overall content analysis and to avoid redundancies, the book (Zientara, 2018) was chosen as the final knowledge base in view of its substantial textual content and the limited presence of figures and tables. A descriptive analysis of the topics covered in this book is presented in Table 1.

Table 1: Descriptive analysis of knowledge base.

Topic                   No. pages   No. tokens
Captive Portal          35          49536
Configuration           38          50648
DNS                     31          48465
Firewall/NAT            51          72184
Installation            40          62118
Multiple WANs           29          45267
NTP/SNMP                12          16578
Routing and Bridging    43          62523
Traffic Shaping         27          53073
Troubleshooting         74          101840
VPN/IPsec               49          68859

3.2 Document Chunker

Chunking documents is a standard practice that aims

to improve the accuracy of information retrieval (IR)

systems and ensures that the augmentation process

does not saturate the length of the context window of

LLMs. However, ﬁne-grained chunks may lose im-

portant contextual information; therefore, it is impor-

tant to balance the trade-off in chunk length. A com-

mon approach involves linking smaller fragments to

the original full documents or larger chunks, with the

goal of maximizing the accuracy of the retrieval pro-

cess while preserving the broader context to be used

in the generation process. According to this proce-

dure, we employ a recursive character splitter with a

chunk size of 256 for child nodes and 1700 for par-

ent nodes with an overlap of 100 characters to pre-

serve continuity between adjacent chunks. The length

of the parent nodes was chosen to be close to the

mean value of the lengths of documents. In addi-

tion, the chunk length was set so as to take advantage

of as much as possible of the context window of the

adopted language model, while avoiding saturating it

with the number of chunks that can be retrieved.
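The parent-child chunking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain fixed-size character window instead of a recursive character splitter, it assumes the 100-character overlap applies to both parent and child chunks, and the names `Chunk`, `split_with_overlap`, and `build_parent_child_chunks` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A child chunk together with the index of the parent it was cut from."""
    text: str
    parent_id: int


def split_with_overlap(text, size, overlap):
    """Cut text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters. This is a simplification
    of a recursive character splitter, which would also try to break on
    separators such as paragraphs or sentences.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces


def build_parent_child_chunks(document, parent_size=1700, child_size=256, overlap=100):
    """Return (parents, children); each child remembers its parent index."""
    parents = split_with_overlap(document, parent_size, overlap)
    children = []
    for parent_id, parent in enumerate(parents):
        for piece in split_with_overlap(parent, child_size, overlap):
            children.append(Chunk(text=piece, parent_id=parent_id))
    return parents, children
```

In this sketch only the child chunks would be embedded for similarity search, while the `parent_id` link lets the retriever hand the larger parent text to the language model.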

3.3 Embedding Model

In a RAG pipeline, the embedding model plays a

crucial role, as it transforms text into a searchable

format that allows efﬁcient and relevant information

retrieval. Such an encoded structure is a vector (a

high-dimensional numerical representation) that cap-

tures the semantic meaning of a sequence of text

(e.g., a query), ensuring that similar concepts are

close to each other in the vector space, even if they

are phrased differently. The embedding model se-

lected for GOLLUM was the so-called mxbai-embed-

large-335M (d = 1024) as it stably appears in the top

lightweight performers on the massive text embed-

ding benchmark (MTEB) leaderboard (Muennighoff

et al., 2023). In addition, as stated by Mixedbread, such an embedding model is appropriate for RAG-related use cases. The same embedding model used for user queries also encodes the knowledge base (this phase happens offline); thus, all external knowledge documents are transformed into vectors that are stored in a vector database for fast lookup.

3.4 Vector Store

Vector stores represent a fundamental component in

RAG applications, as they are needed for storing, in-

dexing and managing any type of data in the form

of high-dimensional, dense numerical vectors, com-

monly produced by an embedding model. These vec-

tors convey a semantic representation of the data, al-

lowing for similarity-based retrieval through metrics

such as cosine similarity or maximum marginal rel-

evance. Embeddings are stored and retrieved using

Chroma, an open-source, lightweight vector database

with integrated quantization, compression, and the

ability to dynamically adjust the database’s size in re-

sponse to changing needs.

3.5 Retriever

The vector stores can store multiple representations of

the same documents. We employ a parent-document

retrieval strategy that generates and indexes two distinct embeddings: one for the child chunks and another for the parent documents obtained as described in Section 3.2. Then, cosine similarity is computed between the embeddings of the input query and the child nodes to retrieve the most relevant matches; these are then used to retrieve the larger information segments from the parent nodes, which are finally used in the augmentation process. The number and size of the retrieved documents must be balanced according to the LLM's context window and available hardware resources, since a larger context window increases the computational load. For

our purposes, we retrieve the top four relevant par-

ent chunks, each consisting of 1700 characters, to be

fed into a context window of size 8192, i.e., the lower

bound of maximum context window sizes supported

by the employed models. Hence, we ensure that the

context window is not saturated, thus leaving room

for generation. As shown in Figure 1, embeddings of

chunks related to the same topic exhibit spatial prox-

imity.

Figure 1: Parent embeddings per topic computed using the

model outlined in section 3.3.
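The retrieval step of Section 3.5 can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the GOLLUM code base: embeddings are plain lists of floats standing in for the 1024-dimensional vectors of the embedding model, the way duplicate parents are skipped is an assumption, and `retrieve_parents`, `build_prompt`, and the prompt wording are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for degenerate (all-zero) vectors
    return dot / (norm_u * norm_v)


def retrieve_parents(query_vec, child_index, parents, top_k=4):
    """Rank child embeddings against the query and return their parent texts.

    child_index is a list of (child_embedding, parent_id) pairs. Each parent
    is returned once, in order of its best-matching child (an assumption).
    """
    ranked = sorted(
        child_index,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    selected = []
    for _, parent_id in ranked:
        if parent_id not in selected:
            selected.append(parent_id)
        if len(selected) == top_k:
            break
    return [parents[pid] for pid in selected]


def build_prompt(question, contexts):
    """Concatenate the retrieved parent chunks into an augmented prompt."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )
```

With `top_k=4` and parents of 1700 characters, the retrieved context amounts to at most 6800 characters, which is how the pipeline keeps the 8192-token window from being saturated.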

4 EXPERIMENTAL EVALUATION

4.1 Materials and Methods

4.1.1 Datasets

To evaluate stand-alone LLM capabilities in both computer networks and cybersecurity domains, the following two datasets have been used: (i) NetConfEval (Wang et al., 2024); (ii) CyberMetric (Tihanyi et al., 2024). Then, to assess the responses of the entire RAG pipeline against the golden answers, a custom test set comprising 330 question-golden answer pairs, 30 for each topic in Table 1, was constructed using the appropriate RAGAS (Es et al., 2024) utility. Due to the uneven distribution of content lengths across topics, as depicted in Table 1, it was necessary to develop an appropriately balanced test set. Taking into account possible noise, duplicates, and errors

while generating the test set using the RAGAS util-

ity, we started from a target number of 50 question-

answer pairs for each topic. Then, we pre-processed

the test set by removing duplicates, empty, and invalid

golden answers. Finally, each question was manu-

ally inspected, reducing the number of QA pairs to 30

samples per topic. This was achieved by selecting the

test samples based on the following criteria: (i) topic

coherence; (ii) speciﬁcity, i.e., we retained the most

speciﬁc and accurate QA pairs by discarding samples

for which the related question does not require external knowledge to be answered. The synthetic test set was generated from the same pfSense documentation used by the RAG chain, but with LLMs different from, yet comparable in size to, those in Section 4.1.2: (i) Gemma-7B was used as the generator; (ii) Gemma2-9B was used as a critic model to validate the generation process; (iii) Mxbai-embed-large-335M was used to produce the embeddings, as in Section 3.3. The generator-critic model pair is set so that the former is an older release than the latter, as recommended by the RAGAS utility.
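The automatic part of the test-set clean-up (removing duplicates and empty golden answers, then capping each topic at 30 pairs) can be sketched as below. The dictionary schema and the function name `clean_and_balance` are hypothetical, RAGAS itself is not called, and the simple truncation to the first 30 valid pairs only stands in for the manual inspection described above.

```python
from collections import defaultdict


def clean_and_balance(pairs, per_topic=30):
    """Drop duplicate or empty QA pairs, then keep at most `per_topic` per topic.

    `pairs` is an iterable of dicts with 'topic', 'question' and 'answer'
    keys (a hypothetical schema). Duplicates are detected on the
    lower-cased question text.
    """
    seen_questions = set()
    by_topic = defaultdict(list)
    for pair in pairs:
        question = (pair.get("question") or "").strip()
        answer = (pair.get("answer") or "").strip()
        key = question.lower()
        if not question or not answer or key in seen_questions:
            continue  # skip empty or already-seen entries
        seen_questions.add(key)
        by_topic[pair["topic"]].append(
            {"topic": pair["topic"], "question": question, "answer": answer}
        )
    # Truncation is a placeholder for the manual selection step.
    return {topic: items[:per_topic] for topic, items in by_topic.items()}
```

Starting from 50 generated pairs per topic leaves a margin for the entries discarded here and during manual review.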

4.1.2 Large Language Models Evaluated

From the models evaluated in (Tihanyi et al., 2024),

we selected the latest versions with available instruct

models to translate requirements into formal speci-

ﬁcations for reachability policies, emulating ﬁrewall

traffic management. Consequently, the lightweight local LLMs assessed for the text generation component of GOLLUM are: (i) Llama3-8B and Llama3.1-8B; (ii) Mistral-7B and Neural-Chat-7B.

https://github.com/NetConfEval/NetConfEval

https://github.com/cybermetric/CyberMetric

4.1.3 Metrics

First, to evaluate the effectiveness of LLMs in dealing with the translation of network specifications, the accuracy is calculated by dividing the number of correctly translated requirements by the expected ones, given a fixed batch size b (number of specifications). In particular, each test case comprised $m = b_{MAX}/b$ samples, where, in our case, $b_{MAX} = 100$ and b ∈ [1, 5, 10, 20, 50, 100]. On the other hand, because CyberMetric consists of a multiple-choice QA test, the accuracy is the average LLM capability of selecting the correct answer among the four different options (each experiment was repeated five times to ensure good statistical significance, as done by the authors of the dataset in their experiments). In addition, the mean inference time was recorded for both tasks because we were interested in models that were accurate and quick to respond in their practical employment. Then, according to (Adlakha et al., 2024), QA

instruction-following and context-specialized models

should be evaluated mainly along two aspects, which

aim to point out the correctness of the provided an-

swer and how the information conveyed by the agent

is sourced from the external knowledge provided. The

paradigm adopted for such an evaluation is the so-

called LLM-as-a-judge (Zheng et al., 2023). How-

ever, to ensure a more impartial judgment, we formed

a panel of judges, consisting of an ensemble of three

local medium-sized LLMs not involved in any setup

adopted so far, i.e.: Phi3-14B; Vicuna-13B; Qwen2.5-14B. We chose three models because the judgment requested from each LLM is binary: an odd number of judges allows a majority voting scheme to infer definitive decisions, since at least two models always agree. For evaluation purposes, the following metrics were considered:

• Answer correctness (AC), that is, an indicator of how well the generated answer compares to the ground truth. First, to compute such a measure, the factual correctness (FC) is derived as the F1 score calculated on the number of: (i) statements shared by the answer and the ground truth (true positives (TPs)); (ii) facts in the generated answer that do not belong to the expected answer (false positives (FPs)); (iii) statements found in the ground truth but not in the generated answer (false negatives (FNs)). These measures are derived according to the decisions inferred by the aforementioned majority voting scheme. It should be noted that TPs and FPs are mutually exclusive, which makes the decision binary. Second, the cosine similarity between the answer ($e_{a}$) and the ground truth ($e_{g}$) encoded as embeddings is calculated and then averaged with the FC:

$$AC = \frac{1}{2}\left(\frac{|TP|}{|TP| + \frac{1}{2}\times(|FP| + |FN|)} + \frac{e_{a} \cdot e_{g}}{\|e_{a}\|\,\|e_{g}\|}\right) \qquad (1)$$

• Faithfulness (FF), i.e., how consistent the generated answer is with respect to the given context. To realize this, the generated answer is initially split into $N_{s}$ single statements. Then, the panel of LLM judges determines how many statements ($N_{c}$) are effectively retrieved from the context:

$$FF = \frac{N_{c}}{N_{s}} \qquad (2)$$
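The two metrics and the majority vote of the judge panel can be sketched in a few lines. The sketch assumes an unweighted mean of factual correctness and cosine similarity (following the word "averaged" in the text), represents each judge verdict as a Boolean, and uses hypothetical function names; it does not reproduce the RAGAS implementation.

```python
import math


def majority_vote(votes):
    """Return True when more than half of the binary judge votes are True."""
    return sum(votes) > len(votes) / 2


def factual_correctness(tp, fp, fn):
    """F1 score computed from statement-level counts."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_correctness(tp, fp, fn, answer_vec, truth_vec):
    """Unweighted mean of factual correctness and embedding similarity.

    The equal weighting is an assumption of this sketch.
    """
    return 0.5 * (factual_correctness(tp, fp, fn)
                  + cosine_similarity(answer_vec, truth_vec))


def faithfulness(statement_votes):
    """Fraction of answer statements the panel judges as context-supported.

    statement_votes holds one list of judge votes per statement.
    """
    if not statement_votes:
        return 0.0
    supported = sum(1 for votes in statement_votes if majority_vote(votes))
    return supported / len(statement_votes)
```

For instance, an answer with three true positives, one false positive, and one false negative has a factual correctness of 3 / (3 + 1) = 0.75; with three judges, a statement counts as supported as soon as two of them agree.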

4.2 Results and Discussion

4.2.1 LLMs Knowledge of Computer Networks

and Cybersecurity Analysis

Figure 2: LLM accuracy and inference time (in seconds)

achieved on NetConfEval per different weight quantization

and batch size.

Figure 2 displays the average accuracy and inference time trends achieved by the evaluated LLMs for different b values. For $b = b_{MAX}$, m = 1, which explains the absence of the confidence interval at b = 100. Moreover, in accordance with the results obtained by the authors of NetConfEval, for b = 100 we noticed that the LLM outputs are always truncated, which leads to performance degradation. Examining the performance of each model, the following findings are derived:

• Llama3-8B achieved high accuracy at smaller batch sizes without quantization. As the batch size increases (beyond 20), there is a rapid drop in accuracy. The quantized model follows a similar pattern but starts slightly lower. The inference time is relatively low at smaller batch sizes and increases gradually as the batch size increases. The fp16 weight precision requires slightly more time compared to int4 quantization, which indicates that the latter offers faster inference while retaining similar accuracy.

• Llama3.1-8B demonstrated relatively high accu-

racy at smaller batch sizes for both the fp16 and

int4 conﬁgurations. This improvement is signiﬁ-

cant compared to Llama3-8B for b = 50. As pre-

viously observed, the accuracy decreased as the

batch size increased, with a signiﬁcant drop-off

beyond batch size 50. The int4 version demon-

strated better accuracy than fp16 (for b ∈ [1, 5, 20])

and produced the highest average accuracy among

all compared models (higher than 60%) for b =

50. The inference time remained low and stable

for all batch sizes, demonstrating that this lan-

guage model was more efﬁcient in handling larger

batches than the other models. The int4 conﬁgu-

ration remained slightly faster than the fp16.

• Except for small batch sizes, Mistral-7B achieved low accuracy scores compared with the other models (from b = 10), regardless of whether model quantization is adopted or not. The model fails completely already at b = 20. In addition, it requires a longer inference time than the Llama models for both the fp16 and int4 setups.

• Neural-Chat-7B showed a sharp increase in infer-

ence time with increasing batch sizes, peaking at

batch size 50 (around one minute) and stabilizing

afterward. The int4 conﬁguration demonstrated a

faster inference time (except for b ≥ 50) than the

fp16 conﬁguration, although the speed gains were

accompanied by low ﬂuctuating accuracy. This

measure is the major drawback of this model be-

cause it is under 20% for b ≥ 20.

As a general overview, the accuracy decreases and the inference time increases with increasing b. Considering the trade-off between accuracy and inference time, the models belonging to the Meta Llama3 family appear to be the best among those evaluated in providing network policy translations. This result is critical because it opens up the possibility of adapting these models to the network policy analysis task, which is primary in firewall applications. Furthermore, adopting quantization does not lead to average performance degradation, while ensuring faster inferences and, inevitably, lower GPU memory consumption.

Figure 3: LLM accuracy and inference time (in seconds)

achieved on CyberMetric per different weight quantization.

Figure 3 indicates that the accuracy achieved by Mistral-7B is close to ε = 73.65 (human accuracy) if quantization is enabled and is the lowest among those achieved by all compared models in the case of the fp16 weight precision. Therefore, considering also its inference time, which is the longest among competitors for each quantization setting, this model is the worst performer on the 80 questions of the CyberMetric dataset. The remaining three

is notable for Neural-Chat quantized, which achieved

the highest average accuracy score (with a negligible

standard error) in these experiments. Despite this, the

average inference time was longer than that achieved

by the Llama models in both quantization settings.

Finally, Llama3 and Llama3.1 achieve comparable

performance in terms of accuracy and inference time

(equal for this metric). Models without quantization require an average inference time of approximately 10 s to answer the 80 questions, i.e., about 8 questions per second, which indicates that they can answer ∼480 questions in just a minute. This throughput almost doubles when quantization is adopted, while maintaining acceptable accuracy. Combining the results in Figures 2 and 3, the Llama3 models appear to be more suitable for understanding contexts related to computer networks and security, producing accurate results in a very reactive manner.

4.2.2 RAG-Based Analysis

Equipping GOLLUM with Llama3-8B or Llama3.1-8B results in different AC and FF trends, as shown in Figure 4. To be specific:

• Llama3-8B achieves ∼ 76% of AC with a standard error of approximately ±0.02. With regard to the alignment between the generated content and the context retrieved, this language model results in FF ∼ 75%.

• Llama3.1-8B yields ∼ 78% of AC, i.e., the upper bound of the aforementioned model AC confidence interval, again with a standard deviation of ±2%. In this case, despite the increase in accuracy compared to Llama3-8B, a drop in FF is evident, which appeared to be close to 70%. Similar to Llama3-8B, the standard deviation of FF was wider than that of AC.

According to the two above points, GOLLUM achieves a more balanced AC-FF trade-off if Llama3-8B is used as the language model rather than Llama3.1-8B. Establishing a balance between AC and FF helps ensure that the response not only provides factually correct information but also accurately represents the intended message, reasoning, or data. The achievement of 76% correctness implies that the model is accurate in roughly three out of four answers, which is a promising starting point. This shows that the retrieval pipeline is generally effective; however, it could still be improved to capture more precise and relevant answers. A 75% faithfulness score means that, on average, one of four responses may include unfaithful or hallucinated content.

Figure 4: RAG metric scores per LLM leveraged by

GOLLUM.

Figure 5: RAG metric scores per LLM achieved by

GOLLUM on different topics.

Figure 5 depicts the average performance achieved by GOLLUM per topic. Specifically, Llama3.1-8B outperforms Llama3-8B in answer correctness on topics such as Captive Portal, Installation, and Multiple WANs, while the opposite trend is observed for VPN/IPsec. Similarly, the FF achieved by GOLLUM equipped with Llama3.1-8B is better for the Firewall/NAT and DNS topics than the same score obtained by Llama3-8B, which provides more faithful answers in all the remaining topics. A remarkable result concerns the AC achieved on questions related to the Configuration topic, which exceeds 80% in both cases, denoting how GOLLUM can provide accurate guidelines on configuration tasks. Similarly, very promising results are obtained for questions concerning the Routing and Bridging topic (which exhibits the top FF score), validating, in a more specialized context, the efficiency in understanding network reachability requirements previously assessed with the NetConfEval benchmark. The consistency in model performance on various topics highlights the potential of GOLLUM to serve as a robust, context-sensitive support tool in network security.

5 CONCLUSION

Conﬁguring ﬁrewalls is a challenging and error-prone

task that can compromise network security. Given

the complexity, especially in large-scale and dynamic

networks, human operators can beneﬁt from intelli-

gent, context-aware tools for accurate ﬁrewall setup

guidance. In this paper, we introduced GOLLUM,

an AI-powered assistant designed to mitigate conﬁg-

uration challenges using a proper RAG pipeline. The

proposed method exploits generative models that in-

corporate adequate prior knowledge of computer net-

works and cybersecurity. By combining LLMs with

an extensive structured knowledge base, GOLLUM

provides reliable and context-speciﬁc support to net-

work administrators. Based on the experiments,

GOLLUM equipped with Llama3-8B demonstrated

more balanced performance, with considerable thor-

oughness in providing support on the topic of pfSense

configurations. Future developments may include expanding the knowledge base with additional network security resources and further optimizing the retrieval mechanism.

ACKNOWLEDGEMENTS

This work was supported in part by the Fondo

Europeo di Sviluppo Regionale Puglia Programma

Operativo Regionale (POR) Puglia 2014-2020-Axis

I-Speciﬁc Objective 1a-Action 1.1 (Research and

Development)-Project Title: CyberSecurity and

Security Operation Center (SOC) Product Suite

by BV TECH S.p.A., under Grant CUP/CIG

B93G18000040007.

REFERENCES

Adlakha, V., BehnamGhader, P., Lu, X. H., Meade, N., and

Reddy, S. (2024). Evaluating correctness and faith-

fulness of instruction-following models for question

answering. Transactions of the Association for Com-

putational Linguistics, 12:681–699.

Alicea, M. and Alsmadi, I. (2021). Misconﬁguration in ﬁre-

walls and network access controls: Literature review.

Future Internet, 13(11).

Anwar, R. W., Abdullah, T., and Pastore, F. (2021). Fire-

wall best practices for securing smart healthcare envi-

ronment: A review. Applied Sciences, 11(19).

Chen, Y., Li, R., Zhao, Z., Peng, C., Wu, J., Hossain, E., and Zhang, H. (2024). NetGPT: An AI-native network architecture for provisioning beyond personalized generative services. IEEE Network.

Coscia, A., Dentamaro, V., Galantucci, S., Maci, A., and

Pirlo, G. (2023). An innovative two-stage algorithm

to optimize ﬁrewall rule ordering. Computers & Secu-

rity, 134:103423.

Coscia, A., Dentamaro, V., Galantucci, S., Maci, A., and Pirlo, G. (2024). Progesi: A proxy grammar to enhance web application firewall for SQL injection prevention. IEEE Access, 12:107689–107703.

Coscia, A., Maci, A., and Tamma, N. (2025). Frog: A

ﬁrewall rule order generator for faster packet ﬁltering.

Computer Networks, 257:110962.

Es, S., James, J., Espinosa Anke, L., and Schockaert, S.

(2024). RAGAs: Automated evaluation of retrieval

augmented generation. In 18th Conference of the Eu-

ropean Chapter of the Association for Computational

Linguistics: System Demonstrations, pages 150–158.

Ferrag, M. A., Ndhlovu, M., Tihanyi, N., Cordeiro, L. C., et al. (2024). Revolutionizing cyber threat detection with large language models: A privacy-preserving BERT-based lightweight model for IoT/IIoT devices. IEEE Access, 12:23733–23750.

Huang, Y., Du, H., Zhang, X., Niyato, D., Kang, J., Xiong,

Z., Wang, S., and Huang, T. (2024). Large language

models for networking: Applications, enabling tech-

niques, and challenges. IEEE Network.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,

Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey

of hallucination in natural language generation. ACM

Computing Surveys, 55(12):1–38.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.

Loevenich, J. F., Adler, E., Mercier, R., Velazquez, A., and Lopes, R. R. F. (2024). Design of an autonomous cyber defence agent using hybrid AI models. In 2024 International Conference on Military Communication and Information Systems (ICMCIS), pages 1–10.

Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen,

T. H., Sainz, O., Agirre, E., Heintz, I., and Roth, D.

(2023). Recent advances in natural language process-

ing via large pre-trained language models: A survey.

ACM Computing Surveys, 56(2).

Muennighoff, N., Tazi, N., Magne, L., and Reimers, N.

(2023). MTEB: Massive text embedding benchmark.

In 17th Conference of the European Chapter of the As-

sociation for Computational Linguistics, pages 2014–

2037.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,

C., Mishkin, P., et al. (2022). Training language mod-

els to follow instructions with human feedback. In

Advances in Neural Information Processing Systems,

volume 35, pages 27730–27744.

Paduraru, C., Patilea, C., and Stefanescu, A. (2024). Cy-

berguardian: An interactive assistant for cybersecu-

rity specialists using large language models. In 19th

International Conference on Software Technologies -

Volume 1: ICSOFT, pages 442–449.

Ross, S. I., Martinez, F., Houde, S., Muller, M., and Weisz,

J. D. (2023). The programmer’s assistant: Conversa-

tional interaction with a large language model for soft-

ware development. In 28th International Conference

on Intelligent User Interfaces, pages 491–514.

Sarker, I. H. (2024). Generative AI and Large Language

Modeling in Cybersecurity, pages 79–99. Springer

Nature Switzerland.

Tihanyi, N., Ferrag, M. A., Jain, R., Bisztray, T., and Debbah, M. (2024). CyberMetric: A benchmark dataset based on retrieval-augmented generation for evaluating LLMs in cybersecurity knowledge. In 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pages 296–302.

Wang, C., Scazzariello, M., Farshin, A., Ferlin, S., Kostić, D., and Chiesa, M. (2024). NetConfEval: Can LLMs facilitate network configuration? Proceedings of the ACM on Networking, 2(CoNEXT2).

Wu, D., Wang, X., Qiao, Y., Wang, Z., Jiang, J., Cui, S., and Wang, F. (2024). NetLLM: Adapting large language models for networking. In ACM SIGCOMM 2024 Conference, pages 661–678.

Xu, M., Niyato, D., Kang, J., Xiong, Z., Mao, S., Han, Z., Kim, D. I., and Letaief, K. B. (2024). When large language model agents meet 6G networks: Perception, grounding, and alignment. IEEE Wireless Communications, pages 1–9.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623.

Zientara, D. (2018). Learn pfSense 2.4: Get up and running

with Pfsense and all the core concepts to build ﬁrewall

and routing solutions. Packt Publishing.

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence
