Fine Tuning LLMs vs Non-Generative Machine Learning Models: A Comparative Study of Malware Detection

Gheorghe Balan (1,2), Ciprian-Alin Simion (1,2) and Dragoş Teodor Gavriluţ (1,2)
(1) "Al. I. Cuza" University, Faculty of Computer Science, Iasi, Romania
(2) Bitdefender Laboratory, Iasi, Romania
{asimion, gbalan, dgavrilut}@bitdefender.com
Keywords:
Fine Tuning LLMs, Neural Network, API Sequence, Malware Detection.
Abstract:
The emergence of Generative AI has provided various scenarios where Large Language Models can be used to replace older technologies. The cyber-security industry has been an early adopter of these technologies, mostly for scenarios that involve security operation centers, support or cyber-attack visibility. This paper aims to compare how well Large Language Models behave against traditional machine learning models for malware detection with respect to various constraints that apply to a security product, such as inference time, memory footprint, detection rate and false positive rate. In this paper we fine-tuned 3 open-source models (LLama2-13B, Mistral, Mixtral) and compared them with 18 classical machine learning models (feed-forward neural networks, SVMs, etc.) using more than 135,000 benign and malicious binary samples. The goal was to identify scenarios/cases where large language models are suited for the task of malware detection.
1 INTRODUCTION
The rise of Generative AI has opened multiple possibilities in terms of automation that allowed cyber-security vendors to use Large Language Models for tasks like:
- support;
- attack visibility and explainability;
- security operation centers.
In most cases, these models are used as a second opinion in security operation centers or to validate detections given by other technologies.
However, since this type of model relies on vast datasets for its training, it has the potential to be used in other security-related scenarios. One thing that needs to be taken into consideration is that these models rely on various forms of natural language and, as such, they are more likely to provide a proper result for data presented in the same way.
In terms of malware identification, this could be achieved using a list of API calls (as they would reflect the behavior of a malware sample). This input can also be obtained from technologies that already exist in cyber-security technology stacks, such as sandboxes or emulators.
On the other hand, there are several constraints that various technologies in a security suite have (such as performance, detection rate (i.e. recall), inference time, etc.). It is also important to notice that the temperature hyper-parameter (specific to LLMs) might not be that useful in situations where a deterministic result is required (for example, in cases where detection validation or various QA (quality assurance) processes are required).
Another relevant aspect is that while LLMs started as cloud services, there are several models (see https://huggingface.co/models) that can be tested locally. This potentially reduces the service cost and the various privacy issues that come with a cloud-based model.
With this in mind, this paper attempts to evaluate whether such out-of-the-box models and fine-tuned versions of them can actually be used for malware detection and in what capacity.
The evaluation is done taking into consideration various constraints and limitations of current detection technology stacks. We focused on API calls as they describe the way a program works (and, assuming various API descriptions were part of the training dataset, using them might provide a model with enough information to infer whether a file is malicious).
It is also important to evaluate whether these large language models are comparable with existing machine learning technologies. From this perspective, we elaborated an experiment where multiple non-generative machine learning models are evaluated against 3 large language models over a sample set of malicious and benign files that were executed in an emulator so that the ordered list of API calls could be extracted.
The comparison takes into consideration various
metrics such as the detection rate (i.e. recall), false
positive rate, inference time and memory footprint.
The rest of the paper is organized as follows: Sec-
tion 2 presents similar research, Section 3 explains
the problem we are tackling in this paper, Section 4
and Section 5 present our experiment, Section 6 dis-
cusses the results of our experiment and finally Sec-
tion 7 draws some conclusions.
2 RELATED WORK
As with all other fields, cyber-security researchers got very excited when seeing LLMs' capabilities and tried to use them as creatively as possible. Even if there are many papers discussing the use of LLMs in an offensive manner (Charan et al., 2023), (Karanjai, 2022), (Botacin, 2023), (Pa Pa et al., 2023), we intentionally leave them out, as this paper focuses on testing their power to detect malicious behaviour.
The authors of (Motlagh et al., 2024) have gathered multiple papers that focus on protecting, defending and detecting, but also on adversarial uses of LLMs. A very interesting analysis they have done is counting the number of published papers by the function presented. Their research shows that, by the time of their publishing, there were at least 30 papers focused on offensive functions like Reconnaissance, Initial Access, Execution, Defense Evasion or Credential Access. All of these are MITRE ATT&CK techniques (https://attack.mitre.org/matrices/enterprise/).
On the other hand, the papers that focus on defensive techniques are spread across directions like Identify, Protect, Detect, Respond and Recover; as the authors (Motlagh et al., 2024) show, they have found over 30 papers that delve into the aforementioned techniques as well.
One great application of LLMs is web content filtering, as shown in (Vörös et al., 2023). In this paper the authors show how they achieved better results (up to a 9% increase in accuracy) using LLMs when compared to standard deep learning algorithms. In addition, they showed how they fine-tuned a large language model using only 10,000 samples and achieved better performance than the current state-of-the-art solution that was trained on 10 million samples. One of the biggest problems with this kind of algorithm is even in its name: they are large, sometimes too large to make them usable in a practical scenario. Probably the most impressive result of this paper (Vörös et al., 2023) is that using a smaller model (175 times smaller) they attained performance levels comparable to the original LLM (770 million parameters). Some of the models they used are BERT (Devlin et al., 2019), eXpose and GPT-3 Babbage.
A lot of malware, especially zero-day threats, leverages exploits in legitimate software. These exploits end up in these programs most often by mistake. The paper (Omar, 2023) proposes a new framework named VulDetect, tasked with detecting vulnerable code. The framework utilizes GPT-2, BERT and LSTM to detect vulnerabilities in C, C++ and Java code by employing Knowledge Distillation in a teacher-student configuration. The results were compared with VulDeBERT (Kim et al., 2022) and LSTM, all trained and tested on four datasets of vulnerable code: SARD (Zhou and Verma, 2022), SeVC (Shoeybi et al., 2020), Devign (Zhou et al., 2019) and D2A (Zheng et al., 2021). Their best performing model was the one based on GPT-2, with 93.59% accuracy when tested against the SARD dataset.
Another interesting research effort, done by Rahali et al., proposes a malware detection framework named MalBERT (Rahali and Akhloufi, 2021). In their experiments they used a dataset of Android APK files that they downloaded and processed by extracting the AndroidManifest.xml and removing unnecessary information from it. They started from over 13 million files (APKs) from the Androzoo public dataset and, after processing, ended up with 12K benign samples and 10K malware samples. The authors pursued two endeavours: binary classification (malware/benign) and multi-class classification (e.g. spyware, dropper, clicker, etc.). When compared with LSTM, XLNet (Yang et al., 2020a), RoBERTa (Liu et al., 2019) and DistilBERT (Sanh et al., 2020), the best performing model was BERT (Devlin et al., 2019), in both binary and multi-class classification. On the first task, BERT achieved an accuracy of 97%, with the next best model, XLNet, at 95%. On the multi-class classification task, BERT attained an accuracy of 91%, with the next best model, LSTM, at 85%.
When it comes to analysing malware samples, one of the most used input formats is Application Programming Interface (API) call sequences. The SLAM (Chen et al., 2020) (Sliding Local Attention Mechanism) framework takes advantage of just that, and more. After the authors get a handle on the API sequences for their samples, they also categorize them by behaviour into 17 categories. After that, they construct two-dimensional input vectors containing the API sequence (as numbers) and the category index sequence. The researchers argue that, by doing so, the sample embeds a stronger correlation between semantics and structural information. The dataset contained 110K benign samples and 27K malware samples. Each sample was truncated at 2000 APIs or padded with 0's at the end of the sequence to match that size. The proposed framework involved five steps:
- splitting the input vector;
- initializing the local attention window from the split;
- running the Convolutional Neural Network training step;
- concatenating the results;
- applying Softmax on the concatenation.
Finally, they used Random Forest (RF), Attention CNN LSTM (ACLM) (shiqi, 2019) and a two-stream CNN-Attention model (TCAM) (Yang et al., 2020b) as baselines to compare SLAM against. After a 10-fold cross-validation, their research shows that the SLAM framework attained an average accuracy of 97%, with the next best model being TCAM, at 92%.
3 MALWARE DETECTION
CHALLENGES
Cyber-security has always been characterized by the cat-and-mouse game between security products and malware writers, where each one reacts to the changes introduced by the other. The advancements in the machine learning field in the last decade indicated a potential edge for security vendors, as complex neural network architectures could be harder for malware to bypass. However, even if in theory this seems to be correct, the same advancements in GANs (Generative Adversarial Networks) balanced the field.
We evaluate a machine learning model using two metrics (related to cyber-security):
- proactivity - the ability to detect new malware samples that are discovered long after the model was trained;
- genericity - the ability to detect new samples that have nothing or very little in common with the samples used in the training set.
Those two metrics are always strongly linked to the input data. This means that, on one hand, using a relevant feature set (for example, behavior information) could potentially create a model that is harder to evade. On the other hand, knowing certain limitations might allow an attacker to find exactly what needs to be changed in a malware sample to avoid detection. With this in mind, let us enumerate certain constraints that are required for a model to be used in practice:
- inference speed;
- false positive rate;
- detection rate;
- memory footprint;
- model update size.
Some of these constraints are correlated with the way a model is being used, as follows:
- using a model for real-time protection (i.e., blocking access to an object until its scan is complete and deleting the object afterwards if it is deemed malicious) implies that the model must be fast (in terms of inference), since access to the scanned object is locked until the verdict from the model is received. Usually this implies smaller models. It also implies a low (close to 0) false positive rate, as blocking a clean object might have a serious impact (e.g., blocking the access to a system file might block the entire endpoint);
- an on-premise model requires a small memory footprint (as one needs to be certain that the model runs on multiple architectures, with limited resources in some cases). This also implies a reduced update size;
- in contrast, a cloud model is not limited by size or hardware architecture. However, using a cloud model implies that you cannot scan all accessed objects, as one needs to take into consideration the time needed to connect to the cloud. This means that a cloud model is not always a good choice for real-time protection, where all accessed objects have to be scanned.
The more a product relies on real-time protection, the more attacks are stopped before they happen, but the models have to be small to achieve a good inference time as well as a small memory footprint and a reduced update size. This is specific to the anti-malware protection component of a security product. However, using larger models implies scanning objects asynchronously and, as such, losing the protective capabilities. At the same time, a cloud model is not limited by memory size, architecture or update requirements and can be harder to evade. This is usually the case with security analytics components such as EDR (Endpoint Detection and Response) or XDR (Extended Detection and Response).
With this in mind, we are attempting to validate whether large language models can be used in any of the previously described scenarios. Due to their complex architecture, we expect them to be more resilient to evasion techniques and, as such, to provide better proactivity and genericity in terms of identifying new threats. At the same time, all of the above restrictions must be preserved, leading to the question of whether these models can truly be used for threat identification and, if so, in what capacity.
4 DATABASES
The LLMs' inference time for each sample is relatively long for a real-world scenario, where a decision should be taken as fast as possible. Therefore, it is more suitable to use them as an additional decision layer, where the previous layer(s) will likely filter out common benign files (for speed improvements). With this in mind, we wanted to mimic such a training environment, where the number of benign samples to be processed is lower than the number of malicious samples. Hence, our initial-api-sequences-database consists of sequences of APIs extracted from 136,383 samples (58,472 benign and 77,911 malicious). This initial file database was split into two smaller file databases (training-api-sequences-database - 38,472 benign / 57,911 malicious - and testing-api-sequences-database - the remaining 20,000 samples for each class).
The APIs were extracted using a proprietary emulator provided by a security company. The average number of APIs extracted for clean samples is 276, while for malicious samples it is 1,392. As an observation, malicious files yielded more APIs during the emulation phase.
Moreover, differences between benign and malicious files can also be observed by looking at the top ten APIs extracted for each class (Table 1 and Table 2).
Two particular APIs stood out in the malicious dataset: kernel32 ReadFile and kernel32 SleepEx. This is due to how malicious files are often implemented. Usually an attacker tends to evade automated analysis by implementing long sleeps (now an obsolete technique) and multiple read operations in order to gather data from the running environment (used mostly by ransomware and password / data stealers).
Table 1: Top ten APIs seen in benign train dataset.
# API Count
1 kernel32 FlsGetValue 2015725
2 kernel32 HeapFree 1177867
3 kernel32 GetProcAddress 1121622
4 kernel32 TlsGetValue 773809
5 kernel32 SetLastError 630800
6 kernel32 WriteFile 350663
7 kernel32 MultiByteToWideChar 215998
8 kernel32 ReadFile 199316
9 kernel32 WideCharToMultiByte 188462
10 kernel32 lstrcmpiW 185232
Table 2: Top ten APIs seen in malicious train dataset.
# API Count
1 kernel32 TlsGetValue 4848697
2 kernel32 lstrcpynA 4200079
3 kernel32 FlsGetValue 3155210
4 kernel32 GetProcAddress 3030600
5 kernel32 SleepEx 2492721
6 msvbvm60 vbaFreeVar 2239934
7 kernel32 ReadFile 2126757
8 kernel32 HeapFree 1956913
9 oleaut32 SysFreeString 1857905
10 kernel32 WriteFile 1444253
Our databases were then used as follows:
- step 1 - we obtained a new database, training-llm-vt-detecting-engines-count, by acquiring the number of VirusTotal engines (https://www.virustotal.com/gui/home/search) detecting each sample in training-api-sequences-database; for each sample we also stored the sequence of API calls; this database is used to fine-tune the LLMs;
- step 2 - the sequences from testing-api-sequences-database were fed to the fine-tuned LLM models and the results were saved (llm-results-database);
- step 3 - we applied a feature selection algorithm over training-api-sequences-database and created a new database (training-api-sequences-feature-database) which contains, for each sample, only binary values (1 if the selected feature is found in the API sequence list, 0 otherwise);
- step 4 - we trained several machine learning models, tested the resulting models on testing-api-sequences-database and stored the obtained results in ml-models-results-database;
- step 5 - we compared the obtained results (ml-models-results-database, llm-results-database).
5 EXPERIMENT SETUP
We divided our experiment into two larger parts: one where we fine-tune and evaluate 3 open-source Large Language Models (LLama2-13b, Mistral-7b-v0.3 and Mixtral-8x7b-v0.1), and one where we looked into non-generative machine learning models that can be used for malware detection. In both cases we evaluated the detection rate (recall), the false positive rate and the inference time. For both the fine-tuned LLMs and the ML models, we use the same database, training-api-sequences-database, to fine-tune / train, and testing-api-sequences-database to test the obtained LLM / ML models.
The experiment was conducted on a virtual machine with 4 RTX 4090TI GPUs, 128 GiB RAM, 8 vCPUs, and 512 GiB storage.
5.1 Fine Tuning LLMs
The Large Language Models used in this experiment were: LLama2-13b (Rozière et al., 2024), Mistral (Jiang et al., 2023) and Mixtral (Jiang et al., 2024). Each of these models was pulled from the HuggingFace repository (https://huggingface.co/) and deployed locally.
To interact with these models, we first constructed the prompts. After some empirical tests with different forms of prompts, where we requested qualifiers, simple binary verdicts or even class attribution confidence percentages, we settled on asking for a number on a scale from 0 to 9, where 0 means unlikely malware and 9 likely malware. We constructed the prompt by concatenating 3 parts.
The first part, the prefix, was: "Given the following API call sequence: ". The second part, the api_seq, was obtained for every sample's sequence by taking each API call name and stripping its module/class name, resulting only in the function name. We stripped the class name for two reasons: minimizing the size of the prompts and having as few repetitive strings as possible in the prompt. After obtaining all the function names (e.g., acledit EditAuditInfo becomes EditAuditInfo), we added them into a sequence separated by commas, whilst maintaining their original order. Lastly, we added the suffix, where we asked the models to respond with a digit: "On a scale from 0 to 9 where 0 is very unlikely and 9 is very likely, how likely is it that this sequence belongs to a malware file? Respond with a single digit. Don't provide additional information, just the digit. Even if you are not sure, just provide a digit."
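As an illustration, the prompt construction described above can be sketched in Python as follows (the helper name build_prompt, the underscore separator between the module name and the function name, and the exact joining punctuation are our assumptions, not details taken from the actual implementation):

    PREFIX = "Given the following API call sequence: "
    SUFFIX = ("On a scale from 0 to 9 where 0 is very unlikely and 9 is very likely, "
              "how likely is it that this sequence belongs to a malware file? "
              "Respond with a single digit. Don't provide additional information, "
              "just the digit. Even if you are not sure, just provide a digit.")

    def build_prompt(api_sequence):
        # keep only the function name (strip the module prefix) and preserve the call order
        names = [api.split("_", 1)[-1] for api in api_sequence]
        return PREFIX + ", ".join(names) + ". " + SUFFIX

    prompt = build_prompt(["kernel32_GetProcAddress", "kernel32_ReadFile", "kernel32_SleepEx"])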
Even if the number of available VirusTotal engines is 101 (https://virustotal.readme.io/docs/list-file-engines), we found out that, in our training-llm-vt-detecting-engines-count database, the maximum number of detecting engines is 73. Moreover, we have to consider that, from time to time, an antivirus engine might generate false positives, therefore a clean file might get detected by a small number of engines. On the other hand, some new malicious files are at first detected by only a few antivirus engines. Hence, in order to down-scale these scores to [0, 9] and keep a balanced approach, we used the following formula: max(min(int(vt_detecting_engines_count / 7.5), 9), 0) (Table 3).
Table 3: Train score distribution.
Score Count Score Count
0 39009 5 7183
1 1371 6 12961
2 1978 7 17879
3 2972 8 7578
4 5291 9 161
The score was appended to the prompt as the expected answer ("Answer: {digit}"). Due to training environment constraints we decided to fine-tune on a maximum of 4096 tokens; hence, the length of the API sequences was limited in order to fit this restriction, keeping only the first approx. 4,000 APIs.
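A minimal sketch of this label construction (reusing build_prompt from the previous sketch; the truncation helper below cuts at the API level as a simplification, whereas the actual limit was enforced at 4096 tokens):

    def scale_vt_score(vt_detecting_engines_count):
        # down-scale the 0..73 engine counts observed in our data to the 0..9 label range
        return max(min(int(vt_detecting_engines_count / 7.5), 9), 0)

    def build_training_example(api_sequence, vt_detecting_engines_count, max_apis=4000):
        digit = scale_vt_score(vt_detecting_engines_count)
        prompt = build_prompt(api_sequence[:max_apis])  # keep roughly the first 4,000 APIs
        return prompt + " Answer: " + str(digit)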
The fine-tuning process was implemented in Python and made use of the HuggingFace transformers and peft (LoraConfig) Python modules.
- Each of the LLM models was loaded in 4-bit, with double quantization, quantization type "nf4" and compute type bfloat16, being mapped on all 4 GPUs.
- From the tokenizer perspective, for each prompt we added the bos and eos tokens and right padding (with eos) to ensure the fixed token length of 4096. This is necessary to ensure a smooth fine-tuning process.
- For LoraConfig we decided to set bias to "all", as we are working with sequences of APIs and each token might be important. Rank was set to 32, the Alpha parameter to 64, dropout to 0.05 and task type to "CAUSAL_LM". For llama2 and mistral we set the target modules to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head. For mixtral, we replaced gate_proj, up_proj, down_proj with w1, w2 and w3.
- For TrainingArguments we used a train batch size of 3 with 2-step gradient accumulation, a small learning rate of 2.5e-5, bf16, and "paged_adamw_8bit" as suggested in the HuggingFace docs (https://huggingface.co/docs/transformers/v4.29.1/en/perf_train_gpu_one). The number of steps was set to 3750. A sketch of this configuration is given below.
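The configuration above roughly corresponds to the following HuggingFace setup (a sketch under the stated hyper-parameters; the model identifier, the output directory and the omitted Trainer / dataset plumbing are our assumptions):

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, TrainingArguments)
    from peft import LoraConfig, get_peft_model

    model_id = "mistralai/Mistral-7B-v0.3"  # assumed identifier; analogous for LLama2 and Mixtral
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                                 device_map="auto")  # spread over the 4 GPUs
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token   # pad with eos ...
    tokenizer.padding_side = "right"            # ... to the right, up to 4096 tokens

    lora = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05, bias="all",
                      task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj", "lm_head"])
    model = get_peft_model(model, lora)

    args = TrainingArguments(output_dir="llm-malware-ft", per_device_train_batch_size=3,
                             gradient_accumulation_steps=2, learning_rate=2.5e-5,
                             bf16=True, optim="paged_adamw_8bit", max_steps=3750)
    # the Trainer instantiation and the tokenized prompt dataset are omitted here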
The times needed to fine-tune the models are the
following: Llama2-13b - 62h 14m, Mistral-7b - 35h
28m, Mixtral-8x7b - 46h 50m.
5.2 Testing Fine Tuned LLMs
In the testing phase we used the same prompt architecture, this time applied to testing-api-sequences-database. For each call to the LLMs we gathered the response and the duration of the call. We then needed to parse the textual response so we could extract a numerical verdict. We did so in two steps. The first step was to apply a regular expression to the received response, "Answer (\d{1})". If no valid match was found, we moved to the second step, where we applied two more generic regular expressions: "(\d{1})" and "([0|1|2|3|4|5|6|7|8|9]{1})". If we still did not get any matches, we considered the response unusable and discarded it.
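A minimal sketch of this two-step parsing, using the regular expressions listed above:

    import re

    def parse_verdict(response):
        # step 1: look for the explicit "Answer <digit>" pattern
        m = re.search(r"Answer (\d{1})", response)
        if m:
            return int(m.group(1))
        # step 2: fall back to the more generic single-digit patterns
        for pattern in (r"(\d{1})", r"([0|1|2|3|4|5|6|7|8|9]{1})"):
            m = re.search(pattern, response)
            if m:
                return int(m.group(1))
        return None  # unusable response, discarded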
At this point, we had a relation between a sample, a model, a numerical verdict and the evaluation time. With all these, we calculated the false positive rate, the detection rate (i.e. recall) and the average response time for every model.
In an effort to improve the LLM results, we also compounded their individual results in three voting mechanisms:
1. Average - in this approach we chose the result to be the average value of all 3 models' verdicts.
2. Based on majority - in this system we went through all samples and checked all 3 models for their numerical verdict on the 0 to 9 scale. If two or three models agreed on the same numerical verdict, we selected it. This decision would help limit the impact these models would have in a real-life scenario.
3. Based on a veto mechanism - in this approach we looked for at least one model's numerical verdict to fall in a given interval (0 to t for clean, or t to 9 for malware, where t is the threshold value).
As for the individual models, we also computed the results of the voting systems with respect to the 8 selected threshold values; a sketch of these voting schemes is given below.
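A sketch of the three voting schemes (verdicts holds the three per-model scores for one sample, t is one of the selected thresholds, and True means the sample is classified as malicious; the fall-back to the average when all three models disagree is our assumption, as the description above does not cover that case):

    def classify_average(verdicts, t):
        # 1. Average: the mean of the three scores is compared against the threshold
        return sum(verdicts) / len(verdicts) >= t

    def classify_majority(verdicts, t):
        # 2. Majority: if two or three models agree on the same score, use that score
        for v in set(verdicts):
            if verdicts.count(v) >= 2:
                return v >= t
        return classify_average(verdicts, t)  # assumed fall-back when all three disagree

    def classify_veto_malicious(verdicts, t):
        # 3a. Veto Malicious: one score in [t, 9] is enough to flag the sample
        return any(v >= t for v in verdicts)

    def classify_veto_clean(verdicts, t):
        # 3b. Veto Clean: one score below t is enough to mark the sample as clean
        return all(v >= t for v in verdicts)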
5.3 Non-Generative Models: Feature
Mining and Selection
In order to use the traditional, non-generative models, we have to extract specific features from the API call sequences. Hence, from a feature mining perspective, a 3-step process was implemented:
1. from the training files database (training-api-sequences-database) three boolean feature types were derived:
   - traditional features - 1 if a certain API was found in the API sequence, 0 otherwise;
   - mapping features - 1 if two specific APIs were found in the API sequence, 0 otherwise;
   - sub-sequence features - 1 if a sub-sequence of length 2 was found in the API sequence, 0 otherwise;
2. next, the obtained features were sorted using an F2 metric score:
   F2(F_i) = round((5.0 * F_i[malicious]) / (5.0 * F_i[malicious] + 4.0 * (total_benign - F_i[benign]) + F_i[benign]) * 100.0, 2)
   where F_i[malicious] denotes how many malicious files contain feature F_i, F_i[benign] denotes how many benign files contain feature F_i, total_benign denotes the total number of benign files, and round rounds the obtained value to two decimals;
3. based on previous work done by (Balan et al., 2023), we limited the number of features used to validate our non-generative machine learning models to 600. The dataset containing the samples represented with these 600 features will be further referred to as training-api-sequences-feature-database.
After the first step of the described methodology, we obtained, for each file, 2,730 traditional features, 677,716 mapping features and 55,784 sub-sequence features of length 2. At the end of our feature selection algorithm, the resulting database contained samples with 76 traditional features, 510 mapping features and 14 sub-sequence features. A sketch of this feature mining and scoring process is given below.
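A minimal sketch of the feature extraction and scoring steps described above (helper names are ours; reading the "mapping" features as unordered co-occurring API pairs is our interpretation of the description):

    from itertools import combinations

    def extract_features(api_sequence):
        apis = set(api_sequence)                               # traditional features
        pairs = {frozenset(p) for p in combinations(apis, 2)}  # mapping features (API pairs)
        subseqs = set(zip(api_sequence, api_sequence[1:]))     # sub-sequences of length 2
        return apis, pairs, subseqs

    def f2_score(malicious_with_f, benign_with_f, total_benign):
        # F2-style score used to rank a feature F_i (see the formula above)
        value = 5.0 * malicious_with_f / (
            5.0 * malicious_with_f + 4.0 * (total_benign - benign_with_f) + benign_with_f)
        return round(value * 100.0, 2)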
5.4 Non-Generative ML Models
A total of 18 machine learning configurations were used to validate our approach. To implement the models we used the scikit-learn (https://scikit-learn.org/stable/) and xgboost (https://xgboost.readthedocs.io/en/stable/python/python_api.html) Python packages. For each model we used a 3-fold cross-validation approach. We have two reasons behind this approach:
- We wanted to check the variance between the different results obtained from the 3 validation splits. This would outline whether the model is highly sensitive to the training data and whether it is prone to over-fitting. It was also a good way to identify potential outliers residing in our dataset.
- As we trained multiple models, training time was something we also needed to take into consideration. A larger fold number (e.g., 20 folds) would have increased the training time and the associated costs.
From a time perspective, each model's training time was negligible (only a matter of seconds). We used a virtual machine with 6 vCPUs, 72 GiB RAM and 1 RTX 2080 Ti.
- XGBoost - XGB - as shown in many conducted studies, XGBoost should obtain good results in binary classification problems; the authors of (Li et al., 2024) showed that it can be used to slightly reduce the number of false positives; in our implementation, we decided to keep the default parameters.
- Multinomial Naive Bayes (MultinomialNB) - MNB - if we are to consider the API sequence as a story that a sample tells us, then MultinomialNB might yield decent results, as shown in (Singh et al., 2019); on the other hand, in previous similar research (Balan et al., 2023) the authors stated that the MultinomialNB model does not perform well on APIs; however, we decided to keep it so we would have a large number of models to compare the LLMs to; similarly, we kept the default parameters as given by the Python scikit-learn package implementation.
- Logistic Regression, Support Vector Machines, Decision Trees and Random Forest are among the most tested models for the malicious software detection problem (Senanayake et al., 2021); we made the following variations in hyper-parameters:
  - LogisticRegression - LR1 - max_iter set to 1000 and l2 regularization;
  - LogisticRegression - LR2 - max_iter set to 100 and l2 regularization;
  - LinearSVC - SV1 - default sklearn implementation, with C=0.0001;
  - LinearSVC - SV2 - default sklearn implementation, with C=0.001;
  - DecisionTreeClassifier - DTG - criterion set to gini;
  - DecisionTreeClassifier - DTE - criterion set to entropy;
  - RandomForestClassifier - RF1 - n_estimators set to 30, max_depth set to 9;
  - RandomForestClassifier - RF2 - n_estimators set to 50, max_depth set to 12.
- a BaggingClassifier - BGL - applied over a DecisionTreeClassifier with the gini criterion; when compared to multiple linear regression model-based classifiers, Bagging-DT scored almost the best accuracy, as shown in (Şahın et al., 2022).
- an AdaBoostClassifier - ADB - applied over a DecisionTreeClassifier with the gini criterion; AdaBoost has been widely used in the malware detection problem; recent research (Al-haija et al., 2022) shows how it can outperform state-of-the-art models.
- a VotingClassifier-Hard - VCH - (hard voting) which has all the above defined models as estimators; similar research (Bakır, 2024) yielded good results.
- a CustomOneSideVotingClassifier-Benign - VCB - a custom implementation of a voting classifier; if at least one classifier yields a benign prediction then the file is classified as benign; applied on the same models as VotingClassifier-Hard.
- a CustomOneSideVotingClassifier-Malicious - VCM - a custom implementation of a voting classifier; if at least one classifier yields a malicious prediction then the file is classified as malicious; applied on the same models as VotingClassifier-Hard.
- Neural Networks, implemented in three different configurations:
  - LegacyNN - LeN - and LightNN - LiN - as defined in (Balan et al., 2023);
  - a ThirdNN - TNN - 2 hidden layers of 32 and 16 neurons with 'ReLU' activation, an output layer with 'sigmoid' activation, 'RMSProp' as the optimizer and 'binary_crossentropy' as the loss function.
These models were used with the features resulting from the feature selection method described in the previous subsection; a training sketch is given below.
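The non-generative models rely on standard scikit-learn / XGBoost APIs; below is a minimal sketch of a few of the configurations listed above, evaluated with 3-fold cross-validation (the random matrix is only a placeholder standing in for the real 600-feature boolean database):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X = np.random.randint(0, 2, size=(1000, 600))  # placeholder boolean features
    y = np.random.randint(0, 2, size=1000)         # placeholder labels (1 = malicious)

    models = {
        "XGB": XGBClassifier(),
        "LR1": LogisticRegression(max_iter=1000, penalty="l2"),
        "SV1": LinearSVC(C=0.0001),
        "DTG": DecisionTreeClassifier(criterion="gini"),
        "RF1": RandomForestClassifier(n_estimators=30, max_depth=9),
    }
    for name, model in models.items():
        recall = cross_val_score(model, X, y, cv=3, scoring="recall")
        print(name, recall.mean(), recall.std())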
6 RESULTS
6.1 Fine Tuned LLMs
After all processing concluded, we obtained 39,956 usable responses for LLama2-13b, 39,961 for Mistral and 39,958 for Mixtral.
Given the way we constructed our prompt, in order to analyze our results we needed a way to clearly separate the malicious verdicts from the benign ones. To do this, we chose 1, 2, 3, 4, 5, 6, 7 and 8 as thresholds. Everything equal to or above the threshold was considered malicious and everything below was considered benign. Tables 4, 5, 6, 7, 8, 9 and 10 show the accuracy, detection rate (i.e. recall) and false positive rate for each considered threshold and each Large Language Model.
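For reference, the per-threshold metrics reported in these tables can be computed as follows (a sketch; scores holds the parsed 0-9 verdicts and labels the ground truth, 1 for malicious):

    def metrics_at_threshold(scores, labels, t):
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
        fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
        fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
        tn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 0)
        accuracy = (tp + tn) / len(labels)
        recall = tp / (tp + fn)
        fpr = fp / (fp + tn)
        return accuracy, recall, fpr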
Table 4: Llama2-13b Results.
Threshold Acc Recall FPR
1 82.25 99.06 34.54
2 82.95 98.79 32.87
3 82.85 97.82 32.1
4 82.72 94.74 29.29
5 80.69 85.53 24.13
6 76.71 71.23 17.83
7 68.9 41.82 4.06
8 54.03 8.07 0.09
Table 5: Mistral-7B Results.
Threshold Acc Recall FPR
1 91.02 99.19 17.13
2 90.87 98.87 17.12
3 90.65 98.38 17.06
4 90.27 97.57 17.02
5 89.82 96.31 16.66
6 85.32 83.94 13.3
7 70.3 41.15 0.6
8 52.75 5.42 0.0
Table 6: Mixtral-8x7B Results.
Threshold Acc Recall FPR
1 86.67 99.16 25.8
2 86.78 99.15 25.55
3 86.79 99.15 25.55
4 86.76 98.87 25.33
5 84.08 90.72 22.54
6 78.57 78.85 21.71
7 68.74 39.74 2.33
8 52.13 4.16 0.0
One noticeable observation here is that none of the models was able to obtain a proper balance (in terms of practical usage) between the detection rate and the false positive rate. For example, Llama2-13b obtained a 99.06% recall for t=1; at the same time, its false positive rate is 34.54% (making it unfeasible for practical usage). On the other hand, a model with a low false positive rate (0.6%), such as Mistral-7b (t=7), only managed to obtain a recall of 41.15% (also not good enough for industry detection standards).
Table 7: Average Results.
Threshold Acc Recall FPR
1 80.62 99.69 38.41
2 83.53 99.57 32.47
3 88.01 99.14 23.11
4 89.52 97.64 18.58
5 89.26 90.24 11.71
6 79.02 64.69 6.68
7 64.47 29.01 0.14
8 51.11 2.12 0.0
Table 8: Majority Results.
Threshold Acc Recall FPR
1 90.5 98.81 17.8
2 90.43 98.54 17.66
3 90.15 97.8 17.48
4 89.3 95.7 17.09
5 85.61 86.93 15.72
6 81.42 77.23 14.39
7 69.1 39.12 0.97
8 52.29 4.49 0.0

Table 9: Veto Clean Results.
Threshold Acc Recall FPR
1 92.42 98.4 13.54
2 92.27 97.85 13.29
3 91.82 96.44 12.8
4 90.54 92.73 11.65
5 85.41 79.67 8.86
6 78.03 62.41 6.39
7 64.05 28.13 0.1
8 51.11 2.12 0.0

Table 10: Veto Malicious Results.
Threshold Acc Recall FPR
1 79.94 99.69 39.77
2 80.54 99.69 38.58
3 80.63 99.69 38.4
4 81.13 99.66 37.36
5 82.02 98.72 34.66
6 80.94 92.3 30.4
7 74.39 54.62 5.87
8 55.51 11.03 0.08
When it comes to response times, Table 11 shows the average time per request and the total time for each model.

Table 11: Average duration of a single LLM request and total duration for each model, on the malicious and benign test sets. A=Llama2-13b, B=Mistral, C=Mixtral.
Model  Malicious 1-Avg.  Malicious Total  Benign 1-Avg.  Benign Total
A      5s                27h45m           6s             33h
B      6s                33h              4s             22h
C      12s               67h              12s            67h
6.2 Non-Generative ML Models

By applying the training methodology described for the ML models, we obtained (Table 12) the best accuracy for DTE - 96.98%. However, this is not necessarily the model that one would choose in a real-world scenario. Depending on one's goal, it may be more suitable to use VCM or VCB, as the best recall was obtained by VCM (98.64%) and the lowest FP rate by VCB (1%). However, it is important to keep in mind that both of these models (VCM, VCB) depend directly on the other models, which comes with an increase in evaluation time and in the total model bandwidth used.
The worst accuracy is obtained by MNB; having no other strong points, it is clear that this model is not suited for solving this problem.

Table 12: ML Models results sorted by Acc (descending).
Mdl Acc Recall FPR F1 F2
DTE 96.98 96.83 2.87 96.97 96.89
XGB 96.96 97.26 3.34 96.96 97.15
DTG 96.93 96.78 2.93 96.92 96.84
BGL 96.92 97.06 3.23 96.92 97.01
LeN 96.62 96.55 3.31 96.62 96.57
VCH 95.94 95.68 3.8 95.92 95.77
RF2 95.82 96.16 4.53 95.83 96.03
TNN 95.38 94.57 3.82 95.34 94.88
LiN 94.78 95.66 6.1 94.84 95.32
LR1 93.12 95.1 8.86 93.24 94.34
RF1 93.01 95.79 9.77 93.19 94.73
LR2 92.9 95.18 9.39 93.05 94.32
SV2 92.59 94.41 9.22 92.72 93.72
ADB 92 95.2 11.19 92.24 93.99
SV1 91.38 93.68 10.9 91.57 92.82
VCB 89.18 79.35 1 87.99 82.6
VCM 88.18 98.64 22.27 89.29 94.68
MNB 85.17 81.79 11.46 84.64 82.91
6.3 Comparison
What follows is a comparison between the generative and the non-generative models' results, comparing them on each of the constraints stated in Section 3:
- Inference Time - When it comes to inference time, a non-generative model has a negligible response time (tens of milliseconds), whilst the LLMs' response time ranged between 4 and 13 seconds. As previously mentioned, in the context of blocking the opening or execution of a file on an endpoint, a security product cannot have a time impact bigger than hundreds of milliseconds at most. Besides the actual inference time, one must also take into account the round-trip of the HTTP call in the case of a cloud deployment. Summarizing, in both cases the LLMs are not suitable.
- Detection Rate (Recall) - From the detection rate perspective, all LLM configurations score higher than the traditional models. However, their suitability for a real-life scenario is highly dependent on an additional method to lower the number of false positives. On the other hand, the non-generative models achieve results similar to the ones obtained by the LLMs in terms of recall, but with a visibly lower false positive rate.
- False Positive Rate - Sometimes more important than the detection rate is the false positive rate. While not blocking a malicious sample may have either a small or a big impact on an endpoint, blocking a clean sample can have a devastating impact on a device. For example, if the blocked clean file is an operating system file, that endpoint may be rendered unusable. Looking at the results obtained by the LLMs in our experiment, the false positive rate is acceptable only for LLama2-13b with t = 8, Mistral-7b with t ∈ {7, 8}, Mixtral-8x7b with t = 8, Average with t ∈ {7, 8}, Majority with t ∈ {7, 8}, Vote Malicious with t = 8 and Vote Clean with t ∈ {5, 6, 7}, but in all of these cases the detection rate is not.
In contrast, in the case of the non-generative models, almost all false positive rates are single-digit values and, of these, almost half are lower than 5%.
- Memory Footprint - LLMs require a lot of memory (even if quantized). While this is not a problem if the model is executed on a server or in a cloud service, where such resources (RAM) are usually available, one cannot assume that each endpoint will have sufficient resources (in terms of memory) to allow such a model to run. As such, using them on low-end devices, or in general on devices where the amount of memory is unknown, does not seem feasible.
- Model Update Size - With larger sizes come larger updates. While this is not a problem for a cloud service (where you only need to update once), it is a problem if the model is distributed locally (especially if the number of customers using the models is large - e.g., millions). This means that each time you have an update for a model, each one of them will have to download that update (and there is a price for the bandwidth that, in this case, will not be insignificant).
Moreover, comparing our results with similar research done in (Sánchez et al., 2024), we can observe that fine-tuning Large Language Models increases the accuracy value. For example, using transfer learning, they obtained an accuracy of 58.17% for Mistral with a context window size of 8192. Compared to their result, after fine-tuning we managed to obtain an accuracy of 91.02% for a threshold value of 1, with only 4096 tokens. However, their best model, BigBird, with a context size of 4096, scored an accuracy of 86.67%, which is close to the results obtained by our models.
7 CONCLUSION
In terms of real-time protection, large language models are not suited (at least for the moment) for this task. The main disadvantages are (in order):
1. Long inference time (in these cases, the inference process should not take more than a couple of milliseconds).
2. Detection rate (recall) and false positive rate (in particular, the false positive rate should be close to 0).
3. Memory footprint (a decent model requires a lot of memory that most consumer endpoints do not have).
4. Cost (for scenarios where the models are stored locally and updates are needed, the cost will increase linearly with the number of customers).
With the advancement of NPUs (Neural Processing Units), combined with fine-tuning LLM models for specific detection tasks, most of the previous disadvantages might be solved. For the moment, non-generative models seem to produce better results for this type of scenario.
However, we consider that LLMs can be successfully used as an additional detection layer in a threat detection environment where the inference time and the false positive rate could be negligible. For example, such solutions might be deployed in a sandboxed environment, where the time needed to draw a conclusion is a matter of seconds/minutes. Moreover, in a sandboxed execution, multiple techniques to identify benign files might be deployed in order to reduce the FP rate.
In terms of a model for a security analytics platform (EDR, XDR or SIEM), these models can be a good option, but only after fine-tuning for specific detection tasks. It should also be pointed out that, even in this case, running a model locally might not be that easy due to memory constraints. While most of these systems have a cloud component, in scenarios where privacy is relevant, the memory footprint might be an issue.
REFERENCES
Al-haija, Q. A., Odeh, A. J., and Qattous, H. K. (2022).
Pdf malware detection based on optimizable decision
trees. Electronics.
Bakır, H. (2024). Votedroid: a new ensemble voting clas-
sifier for malware detection based on fine-tuned deep
learning models. Multimedia Tools and Applications.
Balan, G., Simion, C.-A., Gavrilut, D., and Luchian, H.
(2023). Feature mining and classifier selection for api
calls-based malware detection. Applied Intelligence,
53:29094–29108.
Botacin, M. (2023). Gpthreats-3: Is automatic malware
generation a threat? In 2023 IEEE Security and Pri-
vacy Workshops (SPW), pages 238–254.
Charan, P. V. S., Chunduri, H., Anand, P. M., and Shukla,
S. K. (2023). From text to mitre techniques: Exploring
the malicious use of large language models for gener-
ating cyber attack payloads.
Chen, J., Guo, S., Ma, X., Li, H., Guo, J., Chen, M., and
Pan, Z. (2020). Slam: A malware detection method
based on sliding local attention mechanism. Security
and Communication Networks, 2020:1–11.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford,
C., Chaplot, D. S., de las Casas, D., Bressand, F.,
Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R.,
Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T.,
Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mis-
tral 7b.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A.,
Savary, B., Bamford, C., Chaplot, D. S., de las Casas,
D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G.,
Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-
A., Stock, P., Subramanian, S., Yang, S., Antoniak, S.,
Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix,
T., and Sayed, W. E. (2024). Mixtral of experts.
Karanjai, R. (2022). Targeted phishing campaigns using
large scale language models.
Kim, S., Choi, J., Ahmed, M. E., Nepal, S., and Kim, H.
(2022). Vuldebert: A vulnerability detection system
using bert. In 2022 IEEE International Symposium
on Software Reliability Engineering Workshops (ISS-
REW), pages 69–74.
Li, Z., Zhu, H., Liu, H., Song, J., and Cheng, Q. (2024).
Comprehensive evaluation of mal-api-2019 dataset
by machine learning in malware detection. ArXiv,
abs/2403.02232.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach.
Motlagh, F. N., Hajizadeh, M., Majd, M., Najafi, P., Cheng,
F., and Meinel, C. (2024). Large language models in
cybersecurity: State-of-the-art.
Omar, M. (2023). Detecting software vulnerabilities using
language models.
Pa Pa, Y. M., Tanizaki, S., Kou, T., van Eeten, M., Yosh-
ioka, K., and Matsumoto, T. (2023). An attacker’s
dream? exploring the capabilities of chatgpt for de-
veloping malware. In Proceedings of the 16th Cyber
Security Experimentation and Test Workshop, CSET
’23, page 10–18, New York, NY, USA. Association
for Computing Machinery.
Rahali, A. and Akhloufi, M. A. (2021). Malbert: Using
transformers for cybersecurity and malicious software
detection.
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., and Synnaeve, G. (2024). Code llama: Open foundation models for code.
Sánchez, P. M. S., Celdrán, A. H., Bovet, G., and Pérez, G. M. (2024). Transfer learning in pre-trained large language models for malware detection based on system calls. MILCOM 2024 - 2024 IEEE Military Communications Conference (MILCOM), pages 853-858.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2020).
Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter.
Senanayake, J. M. D., Kalutarage, H. K., and Al-Kadri,
M. O. (2021). Android mobile malware detection us-
ing machine learning: A systematic review. Electron-
ics.
shiqi, L. (2019). Android malware analysis and detection
based on attention-cnn-lstm. Journal of Computers,
pages 31–43.
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper,
J., and Catanzaro, B. (2020). Megatron-lm: Training
multi-billion parameter language models using model
parallelism.
Singh, G., Kumar, B., Gaur, L., and Tyagi, A. (2019). Comparison between multinomial and bernoulli naïve bayes for text classification. 2019 International Conference on Automation, Computational and Technology Management (ICACTM), pages 593-596.
Vörös, T., Bergeron, S. P., and Berlin, K. (2023). Web content filtering through knowledge distillation of large language models.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R., and Le, Q. V. (2020a). Xlnet: Generalized autore-
gressive pretraining for language understanding.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R., and Le, Q. V. (2020b). Xlnet: Generalized autore-
gressive pretraining for language understanding.
Zheng, Y., Pujar, S., Lewis, B., Buratti, L., Epstein, E.,
Yang, B., Laredo, J., Morari, A., and Su, Z. (2021).
D2a: A dataset built for ai-based vulnerability detec-
tion methods using differential analysis.
Zhou, X. and Verma, R. M. (2022). Vulnerability detection
via multimodal learning: Datasets and analysis. In
Proceedings of the 2022 ACM on Asia Conference on
Computer and Communications Security, ASIA CCS
’22, page 1225–1227, New York, NY, USA. Associa-
tion for Computing Machinery.
Zhou, Y., Liu, S., Siow, J., Du, X., and Liu, Y. (2019). De-
vign: Effective vulnerability identification by learn-
ing comprehensive program semantics via graph neu-
ral networks.
Şahın, D. Ö., Akleylek, S., and Kılıç, E. (2022). Linregdroid: Detection of android malware using multiple linear regression models-based classifiers. IEEE Access, 10:14246-14259.