CyberGuardian 2: Integrating LLMs and Agentic AI Assistants for Securing Distributed Networks

Ciprian Paduraru¹, Catalina Camelia Patilea¹ and Alin Stefanescu¹,²

¹Department of Computer Science, University of Bucharest, Romania
²Institute for Logic and Data Science, Romania
Keywords: Large Language Models, Cybersecurity Assistant, Security Officers, Agentic AI, Fine-Tuning, Retrieval Augmented Generation.
Abstract: Robust cybersecurity measures are essential to protect complex information systems from a variety of cyber threats, which requires sophisticated security solutions. This paper explores the integration of Large Language Models (LLMs) to improve cybersecurity operations within Security Operations Centers (SOCs). The proposed framework has a modular plugin architecture in which Agentic AI controls the information flow; it includes Retrieval Augmented Generation (RAG), protection methods for human-chatbot interactions, and tools for managing tasks such as database interactions, code generation, and execution. By utilizing these techniques, the framework aims to streamline the workflows of SOC analysts, allowing them to focus on critical tasks rather than redundant activities. The study also explores the dynamic customization of LLMs based on client data, user experience, potential risks, and language preferences to ensure a user-centric approach. The results show improvements in efficiency and effectiveness and highlight the potential of LLMs in cybersecurity applications.
1 INTRODUCTION
Large Language Models (LLMs) have shown remarkable potential in areas such as natural language processing, machine learning, and, more recently, cybersecurity: (Motlagh et al., 2024), (Shafee et al., 2024), (Paduraru et al., 2024). Our research explores the application of these advances in the development of interactive assistants that support security professionals by automating and optimizing data collection and understanding for problem identification, and by providing informed guidance for problem remediation. Organizations and institutions have set up specialized teams with different areas of expertise to combat threats in real-time. One notable example is the Security Operations Center (SOC) (Mughal, 2022), where SOC analysts work around the clock. Their tasks include monitoring and detecting threats in real-time, investigating and escalating incidents, and managing security tools and information. They work with network engineers and architects to improve infrastructure protection against potential future attacks.
Contributions. Our research aims to leverage the latest advances in LLMs to support security professionals in real-time. The goal is to streamline their workflows so that they can focus on critical tasks rather than redundant and time-consuming activities. Customers should be able to interact with systems, logs, databases, code generation and execution, user interface components, and internal tools via natural language queries, with the assistant acting as an intermediary. The contributions of this work, which improve on our previous work (Paduraru et al., 2024), can be summarized as follows:
Improving the process of fine-tuning the LLM model and datasets compared to our previous work. We further fine-tune the model from two different angles: a) doubling the original size of the cybersecurity-related dataset with new sources, and b) matching the answers generated by the assistant with the answers annotated by humans, using the Direct Preference Optimization (DPO) (Guo et al., 2024) technique and a human-in-the-loop. On the technical level, we fine-tune the base model from scratch, moving from Llama-2 (Touvron et al., 2023) to the latest improvements in Llama-3.1 (Dubey et al., 2024).
The problems reported in our previous work were solved by adding the new mechanisms of Adaptive Retrieval Augmented Generation (RAG) (Jeong et al., 2024), (Asai et al., 2023), (Yan et al., 2024),
to allow the model to self-check whether the retrieved data is relevant to the user's prompt, to evaluate the quality of the generated response (including a hallucination validator), and to iterate to improve it. In addition, the reiteration of the response now includes mechanisms that allow for variety and novelty in the data retrieved from queries, instead of a greedy selection.
Our work architecturally reworks the previous project to be more independent and customizable by using a plugin architecture: clients are now able to inject their fine-tuning data, data for RAG, tools, and validators into an Agentic AI (Khanda, 2024) flow.
Reusable open-source code and a dataset collection are provided at https://github.com/unibuc-cs/CyberGuardian.
This article is structured as follows. The next section presents works that have inspired our project or that we have adapted to our specific context and use cases. Section 3 describes the methods of data collection and processing. Section 4 details the proposed architecture, methods, components, and their interactions. The evaluation and discussions are presented in Section 5. Finally, the last section concludes the paper and outlines future work plans.
2 RELATED WORK
Recent research has investigated the use of LLMs in both defensive and adversarial cybersecurity contexts, offering a thorough review of contemporary practices and identifying research gaps. The extensive evaluation by (Motlagh et al., 2024) provides valuable insights into the potential risks and advantages of incorporating LLMs into cybersecurity strategies. This section will outline the primary sources of inspiration for the present study.
In their study, (Al-Hawawreh et al., 2023) explore the potential applications of ChatGPT in the field of cybersecurity, especially in penetration testing and threat defense. They present a case study in which ChatGPT is used to create and execute false data injection attacks on critical infrastructure, such as industrial control systems, and discuss the challenges and future directions for its integration into cybersecurity. Other studies, such as those by authors referenced in (Franco et al., 2020) and (Shaqiri, 2021), utilize conventional NLP techniques to aid in high-level cybersecurity planning and management. These studies focus on identifying logs of past cyberattacks, proposing solutions, and providing insights for decision making through user interactions that lead to the extraction and deployment of various solutions.
In (Arora et al., 2023), the authors develop a chatbot that uses AI and sentiment analysis of Twitter data to predict and assess cyber threats on social media. (Tejonath Reddy, 2024) examines chatbots that use deep learning as a proactive solution for detecting persistent threats and phishing attacks in real-time, and concludes that deep learning algorithms can adapt to new paradigms and detect subtle phishing indicators, creating a stronger defense against evolving threats. The work by (Abdelhamid et al., 2023) proposes chatbots integrated with social networks as collaborative agents for continuous cybersecurity awareness, education, and training. Similarly, another study in (Fung et al., 2022) presents a cybersecurity chatbot built on Google Dialogflow that educates users about cyber risks and provides a knowledge base, self-quizzes, and advice, effectively improving cybersecurity awareness.
Conversely, ensuring data security in chatbot interactions is an important issue. For example, the research and demonstration applications we propose contain sensitive user data and give users access to important system functions, and must do so without violating current best practices. As chatbots become increasingly popular in various fields, they pose significant security risks and vulnerabilities. (Yang et al., 2023) conducted a systematic literature review to identify potential threats, propose effective solutions, and outline promising future research directions.
3 DATASET COLLECTION AND INDEXING
This section shows how the data sources for fine-tuning the LLM model were collected, post-processed, and indexed for faster retrieval. An overview of the process is shown in Figure 1 and explained below. The principles in (Paduraru et al., 2024) are followed to collect and index the data, aligning the base model with foundational knowledge from the latest data and research in the field of cybersecurity. A formalized definition of this dataset can be found in Eq. (1); it consists of documents, video transcripts, and markdown post-processed data.
D = {PDF_k, Transcript(Video_i), Markdown_j}    (1)
We follow the principles of Direct Preference Optimization (DPO) (Guo et al., 2024) to improve the performance of the model by aligning the responses of the smaller trained model with both a teacher model and a human-in-the-loop.
Figure 1: The data collection process. First, the dataset D is obtained from various sources of information that can be exported in text format (Step 1.A). Then, the GPT-4o model is used as a teacher to generate pairs of questions, answers, and ideal answers. People randomly sample from this dataset to annotate the ideal answers from their point of view, resulting in the dataset D_qa (Step 1.B). In addition, the metadata for both datasets is stored on a MongoDB server to simplify the update process and avoid duplication (Step 4). Finally, the data is split into smaller parts, embedded, and persistently stored in a vector database (Steps 3 and 5). The red colored blocks represent the output of this process.
A teacher model (OpenAI et al., 2024), in particular GPT-4o (https://openai.com/index/hello-gpt-4o/), is used to collect cybersecurity data by extracting an initial set of N_qdoc questions from a subset of the documents in D (especially from foundational books and recent tutorials or blogs), using prompts as described in (Ivison et al., 2024). The same model is also asked to create N_qbase questions on various basic sub-areas of cybersecurity (see Section 5) that were not part of the first set, resulting in a total of N_q = N_qbase + N_qdoc questions. The teacher model is then asked to respond to each of these questions with n_choices = 4 different versions, using a temperature of 0.8. Finally, we ask the teacher again to look at each question and its answers and choose the best one from its point of view. We further denote this set of questions and answers by D_qa.
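As an illustration of this collection step, the sketch below queries a teacher model for n_choices = 4 answer variants at temperature 0.8 and then asks it to pick the best one. It assumes the OpenAI Python client; the helper names and the exact prompt wording are hypothetical, not taken from our repository.

# Sketch of the teacher-based Q&A collection step (assumption: OpenAI Python
# client; helper names and prompt wording are hypothetical).
from openai import OpenAI

client = OpenAI()
N_CHOICES = 4  # n_choices in the text above

def answer_variants(question: str) -> list[str]:
    """Ask the teacher (GPT-4o) for n_choices diverse answers at temperature 0.8."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0.8,
        n=N_CHOICES,
    )
    return [c.message.content for c in resp.choices]

def pick_best(question: str, answers: list[str]) -> int:
    """Ask the teacher to select the best of its own answers (returns an index)."""
    listing = "\n".join(f"Version {i + 1}: {a}" for i, a in enumerate(answers))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Question: {question}\n{listing}\n"
                              "Reply only with the number of the best version."}],
        temperature=0.0,
    )
    return int(resp.choices[0].message.content.strip().split()[0]) - 1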
Following the work in (Tihanyi et al., 2024), we then use three human skill levels (beginner, intermediate, senior) to check N_qhumansP percent of the set D_qa and select which version out of the n_choices answers they would prefer. Intuitively, a senior person might prefer shorter, concise answers without too much detail, as opposed to beginners, who might want more information. Each person was given a random subset of the original dataset and had access to a random source of information when answering. The results are summarized in the set D_qa^H ⊆ D_qa, sorted by expertise level into D_qa^{H,level} ⊆ D_qa^H ⊆ D_qa. Including the human in the loop ensures that the dataset incorporates human feedback into the training loop, refining the model's understanding of what constitutes a preferred response.
For efficiency reasons, the datasets are stored in an embedded format, which we refer to as D_emb and D_qa_emb, respectively. In our implementation, we use the sentence transformer embedding model (Reimers and Gurevych, 2019); specifically, we use the open source variant all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2). Its purpose is to transform text data into vectors of float values, a format required for subsequent processing by Faiss' indexing and querying system.
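A minimal sketch of this embedding and indexing step is shown below, assuming the chunk texts are already extracted; the variable names are illustrative rather than the repository's actual identifiers.

# Minimal sketch: embed post-processed chunks and index them with Faiss.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
chunks = ["...chunked document text...", "...another chunk..."]  # chunks of D

# Encode to float32 vectors; normalizing makes inner product equal cosine similarity.
emb = model.encode(chunks, normalize_embeddings=True, convert_to_numpy=True)

index = faiss.IndexFlatIP(int(emb.shape[1]))  # exact inner-product index
index.add(emb)                                # conceptually, this is D_emb

# Querying: embed the user prompt and retrieve the top-k closest chunks.
q = model.encode(["How do I configure an IDS?"],
                 normalize_embeddings=True, convert_to_numpy=True)
scores, ids = index.search(q, 5)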
4 ARCHITECTURE AND IMPLEMENTATION
The architecture and components are implemented using a plugin pattern, allowing higher-capacity models, alternative fine-tuning methods, and various functionalities and components to be toggled on or off based on specific use cases, as shown in Figure 2. This approach, known in software engineering as separation of concerns, is a core concept applied in the proposed methods.
Client-End Data Customization. The client can add its own specific data to control the assistant according to the use case. For example, the model can be fine-tuned with SFT to more easily adapt to the specific terminology, language, and culture of the organization for a given use case. RAG data can be specified so that the assistant can retrieve information that needs to be exact or quick to obtain, e.g., product manuals, the architecture of the infrastructure, and the high-level APIs and frameworks available inside the system. Data can be organized in groups, each with its own metadata, following the trend of hierarchical information retrieval (Goel and Chandak, 2024). Tools are specified so that the assistant can connect reasoning with actions performed on the client's infrastructure. A concrete example: suggesting SQL source code to blacklist an IP range of addresses in the firewall's
database, provision new virtual machines, add new users to a group, etc. In general, these are operations that are implemented in the client's source code, and the assistant has information about all of them at the description or prototype level. This concept, discussed further in the paper, is known as Agentic AI.
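The sketch below illustrates how such client-side customization could look in code. The Client class and its method names are hypothetical illustrations of the plugin idea in Figure 2, not the framework's actual API.

# Hypothetical sketch of client-side customization (Figure 2). The Client class
# and method names are illustrative assumptions, not the repository's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Client:
    sft_data: list[str] = field(default_factory=list)        # fine-tuning samples
    rag_groups: dict[str, list[str]] = field(default_factory=dict)  # grouped docs
    tools: dict[str, Callable] = field(default_factory=dict)        # infrastructure actions
    validators: list[Callable] = field(default_factory=list)        # response checkers

    def register_tool(self, name: str, fn: Callable, description: str):
        """Expose an infrastructure operation to the agent at description level."""
        fn.description = description
        self.tools[name] = fn

client = Client()
client.register_tool(
    "blacklist_ip_range",
    lambda ip_range: f"UPDATE firewall SET blocked=1 WHERE ip LIKE '{ip_range}%'",
    "Suggest SQL to blacklist an IP range in the firewall database.",
)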
RAG Support. To efficiently summarize or extract content from local system infrastructures such as logs, databases, and source code, a retrieval augmented generation (RAG) functionality is used. This reduces human effort and is important for users, particularly SOC teams, who face constantly updated internal knowledge. Various types of information and formats can be indexed and queried during the LLM reasoning process, including internal documents (e.g., whitepapers, communication, architecture, and infrastructure documentation), source code APIs and configuration files, and data sources like databases, JSON, or CSV files. To ensure diverse information retrieval between iterations, the framework uses the MMR (Carbonell and Goldstein, 1998) sampling algorithm, as formalized below. Introducing noise into the RAG process helps address issues reported in previous studies (Paduraru et al., 2024), where users might prefer different data being retrieved after being dissatisfied with the initial response.
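For reference, MMR greedily picks the next chunk d_i from the retrieved candidate set R that is similar to the query q but dissimilar from the already selected set S, with λ trading off relevance against diversity (Carbonell and Goldstein, 1998):

\mathrm{MMR} = \arg\max_{d_i \in R \setminus S} \Big[ \lambda \, \mathrm{sim}(d_i, q) \; - \; (1 - \lambda) \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \Big]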
User Preference and Conversation Safeguarding. In general, organizations have specific cultural communication rules. Moreover, users may have diverse preferences when interacting with an assistant, such as wanting concise or detailed responses, or preferring formal or polite tones. These preferences can be managed both through prompt engineering and through fine-tuning for specific groups. Users may also have varying levels of expertise, which can be matched to specific fine-tuned models; this is illustrated in Figure 3. As a security measure, prototyped in our framework, users can be registered in an Access Control List (ACL), granting different levels of access to internal resources. For example, some users may have the privilege to execute code or initiate services (e.g., a honeypot). To ensure that assistant conversations adhere to various rules (e.g., gender-neutral language, politeness), the Llama Guard model (Inan et al., 2023) is used as a filter, along with its commonly used taxonomy in LLMs, protecting conversations by enforcing these guidelines.
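A minimal sketch of these two safeguards follows; the ACL structure and action names are hypothetical, while the Llama Guard call assumes the publicly released Hugging Face checkpoint of (Inan et al., 2023).

# Sketch of ACL gating plus a Llama Guard safety check. The ACL structure is
# hypothetical; the Llama Guard usage assumes the public HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ACL = {"alice": {"execute_code", "start_honeypot"}, "bob": {"read_logs"}}

def is_allowed(user: str, action: str) -> bool:
    return action in ACL.get(user, set())

guard_tok = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")
guard = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b", torch_dtype=torch.float16, device_map="auto")

def conversation_is_safe(chat: list[dict]) -> bool:
    """Llama Guard classifies the dialogue against its taxonomy; 'safe' means pass."""
    ids = guard_tok.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    out = guard.generate(input_ids=ids, max_new_tokens=32, pad_token_id=0)
    verdict = guard_tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")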
The Fine-Tuning Process. The process for fine-tuning a base model to better align with cybersecurity knowledge and related tasks is illustrated in Figure 3. This workflow and the source code are independent of the chosen base input model; for our prototype and evaluation, Llama 3.1-8B-Instruct (Dubey et al., 2024) was used along with the datasets presented in Section 3. The final version of the model, CyberGuardian2LLM, is split into three categories based on expertise level. This example is generic and can be adapted according to user needs. With larger preference datasets, the methods can also be tailored to other factors, such as the text style of assistant responses and combinations thereof.
Base Fine-Tuning. For this step, we follow the supervised fine-tuning (SFT) methodology commonly used in previous research. The model processes batches of dataset samples and compares them to the corresponding ground truth using cross-entropy as the loss function. At predefined intervals, the model's performance is evaluated using the perplexity metric.
# Sample a batch B from D_emb, containing inputs and desired model answers
B = SampleBatch(D_emb)
# Pass this through the model to get the predicted answers
Predictions = CyberGuardian2LLM(B[Input])
# Loss represents how close the output of the model is compared to the ground truth in the batch
Loss = CrossEntropy(Predictions, B[Answer])
# Update weights of CyberGuardian2LLM using the gradient of Loss
Listing 1: Fine-tuning process.
Direct Responses and Preferences Fine-Tuning. The loss function of DPO (Chen et al., 2024) is adapted to our use case. Specifically, during the fine-tuning process, pairs of prompts (x) with chosen and rejected answers (y_c and y_r, respectively) are sampled from the dataset D_qa. These are passed through two models: a) the base model, π_base, i.e., the model trained in the previous step (CyberGuardian2LLM_base), and b) the model under current fine-tuning, π_θ, representing the output model CyberGuardian2LLM.
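For completeness, the standard DPO objective on which this adaptation builds, written in the notation above (σ is the logistic function and β a hyperparameter controlling the deviation from the base model), is:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{base}}) = -\,\mathbb{E}_{(x, y_c, y_r) \sim D_{qa}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\mathrm{base}}(y_c \mid x)} - \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{base}}(y_r \mid x)} \right) \right]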
Conversation Management. The observations from the work of (Maharana et al., 2024) are followed to avoid losing the conversation context over many iterations. The conversation history between the user and the assistant is handled using two structures:
• The last 5 message pairs exchanged in the conversation: the short-term memory, Mem_sh.
• A summary of all other, older messages condensed into a single paragraph: the long-term memory, Mem_l.
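A minimal sketch of this two-tier memory, assuming an LLM-backed summarizer callable, is shown below; the class and method names are illustrative.

# Sketch of the two-tier conversation memory (Mem_sh / Mem_l). The summarizer
# is a placeholder for an LLM summarization prompt; names are illustrative.
from collections import deque

class ConversationMemory:
    def __init__(self, summarize, short_window: int = 5):
        self.mem_sh = deque(maxlen=short_window)  # last 5 (user, assistant) pairs
        self.mem_l = ""                           # one-paragraph rolling summary
        self.summarize = summarize                # LLM-backed summarizer callable

    def add_turn(self, user_msg: str, assistant_msg: str):
        if len(self.mem_sh) == self.mem_sh.maxlen:
            # Oldest pair falls out of the short window: fold it into the summary.
            old_user, old_assistant = self.mem_sh[0]
            self.mem_l = self.summarize(self.mem_l, old_user, old_assistant)
        self.mem_sh.append((user_msg, assistant_msg))

    def context(self) -> str:
        recent = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.mem_sh)
        return f"Summary of older conversation: {self.mem_l}\n{recent}"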
Adaptive RAG. Our methods follow the ideas from (Jeong et al., 2024) in using RAG in combination with the AI agent to check two main things, listed below:
Figure 2: Adding client specifics to the assistant knowledge. Three types of data can be added to completely control the answer generation: the base fine-tuning data, RAG documents, and tools to interact with an infrastructure. The blue colored components represent the output of adding the client-specific data into the previously fine-tuned model and RAG support.
Figure 3: The process of fine-tuning Llama 3.1-8B-Instruct (as an example) to a model aligned more closely with the cybersecurity field. There are two main steps in this process: 1) the base model fine-tuning, and 2) aligning with a teacher and a human-in-the-loop. The final models are slightly fine-tuned (Step 3) for different specifics; in our demonstration, to consider the level of expertise of the human users.
• Is the retrieved information relevant to the user's question? This check is needed because the retrieval mechanisms tend to assign high scores to pieces of documents that might not be relevant. Thus, the LLM is used as a judge, leveraging its natural language understanding capabilities.
• Can the information be obtained from other sources, for example using existing tools registered by the client (e.g., web search, internal platform search, etc.), or even via a human-in-the-loop, by reporting that no sources were found and asking for help?
Due to space constraints, the prompts utilized are available in the repository. Briefly, the LLM is instructed to: (a) determine if the retrieved content is relevant to the question (binary response: yes or no); (b) perform a hallucination check by assessing whether the final generated answer is grounded in the provided documents; and (c) evaluate the quality of the response by ascertaining if it adequately addresses the original user's question.
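A hedged sketch of these three graders is given below; llm stands for any callable that returns model text, and the prompt wording here is only indicative, since the actual prompts live in the repository.

# Sketch of the three self-check graders (relevance, hallucination, quality).
def yes_no(llm, prompt: str) -> bool:
    return llm(prompt).strip().lower().startswith("yes")

def is_relevant(llm, question: str, chunk: str) -> bool:
    return yes_no(llm, f"Is this document relevant to the question?\n"
                       f"Question: {question}\nDocument: {chunk}\nAnswer yes or no.")

def is_grounded(llm, answer: str, chunks: list[str]) -> bool:
    docs = "\n---\n".join(chunks)
    return yes_no(llm, f"Is the answer grounded in the documents below?\n"
                       f"Answer: {answer}\nDocuments: {docs}\nAnswer yes or no.")

def addresses_question(llm, question: str, answer: str) -> bool:
    return yes_no(llm, f"Does the answer adequately address the question?\n"
                       f"Question: {question}\nAnswer: {answer}\nAnswer yes or no.")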
Tools and Interaction with the Backend Systems. To enhance SOC specialists' productivity and response times, our work utilizes advanced methods focusing on user interaction with the LLM and system processes; these methods are referred to as agents and tools. ReACT agents (Yao et al., 2023) are the core of our Agentic AI implementation. This method first involves prompt engineering to provide the LLM with brief descriptions of the available tools. The LLM model is then fine-tuned over a few epochs using 50 examples of tool calls and parameter extraction. The main steps are: act (call specific tools), observe (pass tool output back to the model), and reason (determine the next action based on the tool output). Figure 4 shows how the LLM interacts with various tools in the framework. At each step, the LLM evaluates the problem and the available tools, and decides whether to invoke a tool that interacts with the organization's infrastructure or to provide a response. This method allows for task decomposition, where complex queries are broken down into manageable pieces and resolved using appropriate tools. The set of tools implemented in the source code as examples covers commonly used operations such as ACL management, database access, office tools, Python code generation (by internally invoking another specialized model, Code Llama 7B (Rozière et al., 2024)), and a safety checker tool (invoking Llama Guard (Inan et al., 2023)).
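The following sketch shows the shape of such a ReAct loop (reason, act, observe); the output-parsing conventions and tool names are illustrative assumptions, not the framework's implementation.

# Minimal ReAct-style loop sketch (reason / act / observe). Parsing conventions
# and tool names are illustrative assumptions.
def react_loop(llm, tools: dict, user_prompt: str, max_steps: int = 5) -> str:
    # Prompt engineering step: give the LLM brief descriptions of available tools.
    tool_docs = "\n".join(f"{name}: {fn.__doc__}" for name, fn in tools.items())
    transcript = f"Tools available:\n{tool_docs}\nQuestion: {user_prompt}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # reason: decide next action or answer
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):         # act: invoke the selected tool
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            result = tools[name](arg) if name in tools else f"Unknown tool: {name}"
            transcript += f"Observation: {result}\n"  # observe: feed result back
    return "Could not resolve the request within the step budget."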
5 EVALUATION
5.1 Evaluation Setup
To iterate over the architectural quality of the framework, and to test the Agentic AI and RAG components, the presented solution was first tested with groups of students (master's level, from the University of Bucharest). Each group was asked to simulate a distributed application considering a use case from different domains such as medical, gas stations, smart home systems, etc.
Figure 4: The flow of the Agentic AI: (1) receiving the user's prompt; (2) reformulating the current query based on the memories of previous conversations; (3) given the reformulated prompt, retrieving and filtering relevant content from the indexed data; (4) deciding whether the answer can be given directly or must be obtained via the services exposed on the client side (using tools/function calls); the decision is based on the LLM and some NLP-based processing techniques for easier identification of the entities and actions requested in the prompt; (5) checking the response for hallucinations and security; if the response matches the guidelines, returning it to the user and updating the internal knowledge; if it needs to be revised instead, adding the reviewer's response to the history so that the model can assess what was wrong with the original response the next time. Note that while the framework repository only contains illustrative examples, the tool suite, models, and validators can be customized by each client site.
In total, we counted 17 different applications across various domains. For each, the request was to gather at least 3-4 PDF manuals, either self-made or from the internet, that could represent the client data for that particular use case. The target was to have 100 pages of data for each case in order to perform RAG tests. Each team of students represented the Client class in Figure 2. The focus from an architectural point of view was to make the methods reusable across use cases.
Table 1 shows the different values used to fine-tune the current version of the model, as described in Section 3. The high temperature of 0.8 used when generating the responses from the teacher model was able to produce diverse, yet still mostly correct, answers. In this sense, we followed the observations of (Ivison et al., 2024), which mention that better preference data leads to the largest improvements among all other factors.
5.2 Quantitative and Qualitative Evaluation
There are two research questions that we address in our study:
RQ1. How well does CyberGuardian2LLM understand the cybersecurity domain?
RQ2. What is the actual feedback of the human users engaged in the experiments?
RQ1 Analysis. To quantitatively assess the effectiveness of the fine-tuned CyberGuardian2LLM, we measure response usefulness using an established method (Zheng et al., 2023), involving automated evaluation with a larger model, specifically GPT-4, serving as the judge.
Table 1: Table containing the values used in Section 4. The first group represents the dataset dimensions, the second is the number of humans who participated in the experiments, while the retrieval parameters are presented in the last group.

Description                     Value
Dataset
  N_qdoc                        5000
  N_qbase                       4000
  N_qdoc                        1000
  N_qhumansP                    30%

No. of humans participating     83
  D_qa^H beginners              40
  D_qa^H intermediate           39
  D_qa^H seniors                4
This approach has also been applied in the cybersecurity domain by (Paduraru et al., 2024) and (Tihanyi et al., 2024), whose datasets we also reuse in the process of fine-tuning our model.
Five topics in cybersecurity relevant to SOC specialists have been selected:
• System protection against security risks and malware.
• Cryptography and authentication mechanisms.
• Configuring security protocols such as firewalls and intrusion detection systems (IDS).
• Network security infrastructure (including firewalls, VPNs, web proxies, IDS/IPS).
• Investigation of data breaches and leaks.
These clusters are denoted as Topics = (tp_i)_{i=1,...,5}. The judge, GPT-4, is prompted
. The judge, GPT-4, is prompted
to generate 20 questions for each of the five topics,
totaling 100 questions, denoted by Que. This set
is created by inserting the topic variable into the
template shown in Listing 2.
Consider yourself an interviewer for a SOC specialist position. Ask 20 relevant, different questions on the topic {topic_var}
Listing 2: The template prompt used to ask questions to the judge LLM.
We measure the CyberGuardian2LLM responses for each Q ∈ Que against two vanilla (not fine-tuned) Llama 3.1 models (Llama 3.1-8B-Instruct and 70B-Instruct) and against the model from (Paduraru et al., 2024). To compare the answers head-to-head, the judge is prompted to respond with the one it prefers (Listing 3). The template variable for each question, Q_var, is filled in, along with the responses of the two compared models, LLM1_resp and LLM2_resp. The observation from the results shown in Table 2 is that, in the synthetic experiments, CyberGuardian2LLM's answers are preferred over the previous ones, the exception being the bigger model class (70B), which needs significantly more computational resources for inference (GPU memory: 9.6 GB vs. 84 GB). The results suggest that a fine-tuned model can surpass or come close to much larger models (which are sometimes impractical for end-user machines), as also reported in the literature (Yao et al., 2023).
Table 2: Head-to-head comparison of response preferences, with GPT-4 serving as the evaluator. The second column shows the percentage of cases where each model was preferred over CyberGuardian2LLM. The Llama versions shown in the first two rows are the public vanilla versions, without any fine-tuning.

Model                 Preferred over CyberGuardian2LLM
Llama3.1-8B           29%
Llama3.1-70B          61%
CyberGuardianLLM      33%
Given the question: {Q_var}, respond which of the following two answers you prefer. Write only Version 1 or Version 2.
# Version 1: {LLM1_resp}.
# Version 2: {LLM2_resp}.
Listing 3: The template for classifying the answers with the judge LLM.
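A sketch of this head-to-head judging procedure, reusing the templates of Listings 2 and 3 with a judge callable wrapping GPT-4, could look as follows; the helper names are illustrative.

# Sketch of the judging loop; `judge` wraps GPT-4, prompts mirror Listings 2 and 3.
def generate_questions(judge, topic: str) -> list[str]:
    prompt = ("Consider yourself an interviewer for a SOC specialist position. "
              f"Ask 20 relevant, different questions on the topic {topic}")
    return judge(prompt).splitlines()

def prefers_first(judge, question: str, resp1: str, resp2: str) -> bool:
    prompt = (f"Given the question: {question}, respond which of the following two "
              "answers you prefer. Write only Version 1 or Version 2.\n"
              f"# Version 1: {resp1}.\n# Version 2: {resp2}.")
    return "version 1" in judge(prompt).lower()

def preference_rate(judge, questions, model_a, model_b) -> float:
    """Fraction of questions where the judge prefers model_a over model_b."""
    wins = sum(prefers_first(judge, q, model_a(q), model_b(q)) for q in questions)
    return wins / len(questions)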
RQ2 Analysis. To evaluate from a different perspective, during each use case the assistant emulated signals that could indicate an attack. The 83 human participants were asked to use the assistant to solve the issue. Solving these issues involves tasks such as: finding the flooding IPs, blacklisting them within the firewall database, starting a honeypot server, analyzing data flow specifics in terms of size and provenance, and so on. An emulator for different types of attacks, including DDoS or ransomware attacks (de Neira et al., 2023), has been used for evaluation. The snapshots of the attacks provided various data tables with statistics on the utilization of resources (e.g., servers, routers, local hub systems) and connection logs of users, including their location, time, and resources consumed in the network. The deployment interface was handled via a REST API, and the client was implemented with Streamlit (https://streamlit.io/) libraries and tools (including visualizations). All these simulations and screenshots can be found in the repository.
The task was correctly solved by 65 people within a single instance of the test. The rest had to restart the test and try once or twice more. Their feedback after each question-response pair was collected using a rating score from 1 to 5 and an optional natural language description. The average score of the responses is 4.17. One important observation is that users considered that the cybersecurity assistant helped them in two ways:
• they could ask contextual questions around the topic they were trying to solve;
• the functionalities of each use case were retrieved faster (RAG), and actions could be taken using natural language requests.
6 CONCLUSIONS AND FUTURE WORK
The CyberGuardian framework demonstrates the potential to enhance the efficiency and effectiveness of Security Operations Centers (SOCs) through the application of LLMs and Agentic AI. By leveraging techniques such as Retrieval Augmented Generation (RAG) and ensuring secure human-chatbot interactions, CyberGuardian 2 addresses various cybersecurity tasks, including database management, firewall configuration, and code execution. The framework's adaptability to different user experience levels and its client-independent architecture make it a versatile tool for diverse applications. Future work will focus on further fine-tuning the model, expanding its capabilities, and integrating more complex multi-agent functions to support advanced cybersecurity operations.
ACKNOWLEDGEMENTS

This research was partially supported by the project "Romanian Hub for Artificial Intelligence - HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027, MySMIS no. 334906, and by the European Union's Horizon Europe research and innovation programme under grant agreement no. 101070455, project DYNABIC.
REFERENCES

Abdelhamid, S. et al. (2023). Cybersecurity awareness, education, and workplace training using socially enabled intelligent chatbots. In Creative Approaches to Technology-Enhanced Learning for the Workplace and Higher Education, pages 3-16, Cham. Springer Nature Switzerland.
Al-Hawawreh, M., Aljuhani, A., and Jararweh, Y. (2023). ChatGPT for cybersecurity: practical applications, challenges, and future directions. Cluster Computing, 26(8):3421-3436.
Arora, A., Arora, A., and McIntyre, J. (2023). Developing chatbots for cyber security: Assessing threats through sentiment analysis on social media. Sustainability, 15(17).
Asai, A. et al. (2023). Self-RAG: Learning to retrieve, generate, and critique through self-reflection.
Carbonell, J. and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of ACM SIGIR, pages 335-336.
Chen, C., Liu, Z., Du, C., Pang, T., Liu, Q., Sinha, A., Varakantham, P., and Lin, M. (2024). Bootstrapping language models with DPO implicit rewards.
de Neira, A. B., Kantarci, B., and Nogueira, M. (2023). Distributed denial of service attack prediction: Challenges, open issues and opportunities. Computer Networks, 222:109553.
Dubey, A. et al. (2024). The Llama 3 herd of models.
Franco, M. F. et al. (2020). SecBot: a business-driven conversational agent for cybersecurity planning and management. In 2020 16th International Conference on Network and Service Management (CNSM), pages 1-7.
Fung, Y.-C. et al. (2022). A chatbot for promoting cybersecurity awareness. In Cyber Security, Privacy and Networking, pages 379-387, Singapore. Springer Nature Singapore.
Goel, K. and Chandak, M. (2024). HIRO: Hierarchical information retrieval optimization. CoRR, abs/2406.09979.
Guo, S. et al. (2024). Direct language model alignment from online AI feedback.
Inan, H., Upasani, K., et al. (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
Ivison, H. et al. (2024). Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback.
Jeong, S. et al. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity.
Khanda, R. (2024). Agentic AI-driven technical troubleshooting for enterprise systems: A novel weighted retrieval-augmented generation paradigm.
Maharana, A. et al. (2024). Evaluating very long-term conversational memory of LLM agents. In Proceedings of ACL, pages 13851-13870, Bangkok, Thailand. Association for Computational Linguistics.
Motlagh, F. N. et al. (2024). Large language models in cybersecurity: State-of-the-art.
Mughal, A. A. (2022). Building and securing the modern security operations center (SOC). International Journal of Business Intelligence and Big Data Analytics, 5(1):1-15.
OpenAI et al. (2024). GPT-4 technical report.
Paduraru, C., Patilea, C., and Stefanescu, A. (2024). CyberGuardian: An interactive assistant for cybersecurity specialists using large language models. In Proc. of ICSOFT'24, volume 1, pages 442-449. SciTePress.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Rozière, B. et al. (2024). Code Llama: Open foundation models for code.
Shafee, S., Bessani, A., and Ferreira, P. M. (2024). Evaluation of LLM chatbots for OSINT-based cyber threat awareness.
Shaqiri, B. (2021). Development and refinement of a chatbot for cybersecurity support. Master's thesis, University of Zurich, Zurich, Switzerland.
Tejonath Reddy, K. (2024). How deep learning chatbots empower cybersecurity against phishing attacks. International Center for AI and Cyber Security Research and Innovations (CCRI).
Tihanyi, N., Ferrag, M. A., Jain, R., Bisztray, T., and Debbah, M. (2024). CyberMetric: A benchmark dataset based on retrieval-augmented generation for evaluating LLMs in cybersecurity knowledge. In 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pages 296-302.
Touvron, H. et al. (2023). Llama 2: Open foundation and fine-tuned chat models.
Yan, S.-Q. et al. (2024). Corrective retrieval augmented generation.
Yang, J. et al. (2023). A systematic literature review of information security in chatbots. Applied Sciences, 13(11).
Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. In ICLR'23.
Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 46595-46623.