
of them will have to download that update (and the
bandwidth cost in this case will not be insignificant).
Moreover, comparing our results with similar research done in (Sánchez et al., 2024), we can observe that fine-tuning Large Language Models increases the accuracy. For example, using transfer learning, they obtained an accuracy of 58.17% for Mistral with a context window size of 8192. Compared to their result, after fine-tuning we managed to obtain an accuracy of 91.02% for a threshold value of 1, with only 4096 tokens. However, their best model, BigBird, with a context size of 4096, scored an accuracy of 86.67%, which is close to the results obtained by our models.
7 CONCLUSION
In terms of real-time protection, large language models are not suited (at least for the moment) for this task. The main disadvantages are (in order):
1. Long inference time (in these cases, the inference process should not take more than a couple of milliseconds)
2. Detection (recall) and false positive rate (in particular, the false positive rate should be close to 0)
3. Memory footprint (a decent model requires more memory than most consumer endpoints have)
4. Cost (for scenarios where the models are stored locally and updates are needed, the cost will increase linearly with the number of customers)
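The two metrics in point 2 can be made concrete with a minimal sketch; the function names and the confusion-matrix counts below are illustrative, not taken from our experiments:

```python
def recall(tp: int, fn: int) -> float:
    """Detection rate: fraction of malicious samples correctly flagged."""
    return tp / (tp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign samples wrongly flagged; for endpoint
    protection this should be close to 0."""
    return fp / (fp + tn)

# Hypothetical scan of 10,000 files: 95 of 100 malicious files
# detected, 8 of 9,900 benign files misflagged.
print(recall(95, 5))                     # 0.95
print(round(false_positive_rate(8, 9892), 6))
```

Even a seemingly small false positive rate matters at endpoint scale: flagging 8 out of 9,900 benign files would translate into many broken installations across millions of customers.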
With the advancement of NPUs (Neural Processing Units), combined with fine-tuning LLMs for specific detection tasks, most of the previous disadvantages might be solved. For the moment, non-generative models seem to produce better results for this type of scenario.
However, we consider that LLMs can be successfully used as an additional detection layer in a threat detection environment where the impact of the inference time and of the false positive rate is negligible. For example, such solutions might be deployed in a sandboxed environment where the time needed to draw a conclusion is a matter of seconds or minutes. Moreover, in a sandboxed execution, multiple techniques to identify benign files might be deployed in order to reduce the FP rate.
In terms of a model for a security analytics platform (EDR, XDR or SIEM), these models can be a good option, but only after fine-tuning for specific detection tasks. It should also be pointed out that even in this case, running a model locally might not be easy due to memory constraints. While most of these systems have a cloud component, in scenarios where privacy is relevant, the memory footprint might be an issue.
REFERENCES
Al-haija, Q. A., Odeh, A. J., and Qattous, H. K. (2022). Pdf malware detection based on optimizable decision trees. Electronics.
Bakır, H. (2024). Votedroid: a new ensemble voting classifier for malware detection based on fine-tuned deep learning models. Multimedia Tools and Applications.
Balan, G., Simion, C.-A., Gavrilut, D., and Luchian, H. (2023). Feature mining and classifier selection for api calls-based malware detection. Applied Intelligence, 53:29094–29108.
Botacin, M. (2023). Gpthreats-3: Is automatic malware generation a threat? In 2023 IEEE Security and Privacy Workshops (SPW), pages 238–254.
Charan, P. V. S., Chunduri, H., Anand, P. M., and Shukla, S. K. (2023). From text to mitre techniques: Exploring the malicious use of large language models for generating cyber attack payloads.
Chen, J., Guo, S., Ma, X., Li, H., Guo, J., Chen, M., and Pan, Z. (2020). Slam: A malware detection method based on sliding local attention mechanism. Security and Communication Networks, 2020:1–11.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mistral 7b.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. (2024). Mixtral of experts.
Karanjai, R. (2022). Targeted phishing campaigns using large scale language models.
Kim, S., Choi, J., Ahmed, M. E., Nepal, S., and Kim, H. (2022). Vuldebert: A vulnerability detection system using bert. In 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 69–74.
Li, Z., Zhu, H., Liu, H., Song, J., and Cheng, Q. (2024). Comprehensive evaluation of mal-api-2019 dataset by machine learning in malware detection. ArXiv, abs/2403.02232.
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence