
aged by the kernel. This region is granted a high privilege level, and ordinary user programs are not permitted to access it directly. By contrast, user space refers to the memory area outside the kernel space. User space contains nearly all software a user runs on a machine, such as office applications, web servers, multimedia software, and command-line tools. These programs access kernel functions through system calls. Consequently, most ordinary programs cannot access kernel functions or hardware devices without system calls (as an exception, some operations, such as numerical computations, can directly invoke CPU instructions). System-call sequences therefore reveal the OS functionalities used by a program during its execution.
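To make this boundary concrete, the following minimal sketch (assuming Linux on x86-64 and Python's ctypes; the syscall number and message are our own illustrative choices, not taken from this study) invokes the write(2) system call directly through libc, showing that even trivial output crosses from user space into the kernel via a system call:

import ctypes

# A minimal sketch, assuming Linux on x86-64: user-space code reaches
# kernel functionality only through system calls. Here write(2) is
# invoked directly via libc's syscall() wrapper.
libc = ctypes.CDLL(None, use_errno=True)
SYS_WRITE = 1          # syscall number for write(2) on x86-64 Linux
STDOUT_FD = 1
msg = b"hello from user space\n"
libc.syscall(SYS_WRITE, STDOUT_FD, msg, len(msg))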
The evaluation of a program through its execution and the analysis of its behavior is known as dynamic analysis (DA). DA is primarily used for cybersecurity (Section 2) and programming-language translation (Yoneda et al., 2024). DA can be performed at various levels of granularity. At finer levels, machine-instruction- or memory-access-level analyses are often employed (e.g., Cohen and Nissim, 2018). However, such fine-grained analyses present challenges, including the significant overhead they incur and the additional effort required to render the collected data interpretable to humans. By contrast, system-call-level analyses are coarse-grained: they are more selective in the information they capture, such as system-call invocations, which results in lower overhead and easier interpretation than fine-grained methods. We refer to the data obtained from system-call-level DA as “DA data” in the following sections.
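As an illustration of what such DA data look like, the sketch below collects a system-call trace with strace (an assumption for illustration; the paper does not specify its tracing tool, and the target path "./sample" is a hypothetical placeholder):

import subprocess

# A minimal sketch of collecting system-call-level DA data, assuming a
# Linux host with strace installed. -f follows child processes, -tt adds
# timestamps, -o writes the trace to a file.
subprocess.run(
    ["strace", "-f", "-tt", "-o", "trace.txt", "./sample"],
    timeout=60,
    check=False,  # the sample may exit with a nonzero status
)

# Each trace line records one invocation, e.g.:
#   12:00:01.000000 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
with open("trace.txt") as f:
    syscall_sequence = [
        line.split("(", 1)[0].split()[-1] for line in f if "(" in line
    ]
print(syscall_sequence[:10])  # e.g., ['execve', 'brk', 'mmap', ...]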
1.3 Objective of this Study
The objective of this study is to determine whether a given malware sample truly exhibits malicious behavior in a test environment. Despite numerous studies exploring malware detection using various levels of DA and machine learning (ML), conclusively demonstrating with clear evidence that malicious behavior occurred in a test environment remains a significant challenge. Because DA involves executing a target program and converting its behavior into time-series data, the absence of malicious behavior during the analysis period is a critical limitation. This possibility must be verified, because even when a user attempts to execute malware, the intended functionality may not be achieved. For example, as explained in Section 1.1, malware may conceal its malicious behavior using anti-analysis techniques or fail to execute correctly because the necessary libraries are absent from the test environment. If this limitation is not verified, there is a risk of generating behavioral data that lack any trace of malicious activity; nevertheless, traditional ML-based malware-detection methods can still classify such samples as malicious.
We emphasize that this issue is fundamentally distinct from false positives (i.e., misclassifying cleanware as malware). Our concern is the scenario in which ML models classify data lacking malicious behavior as malware, thereby creating an illusion of successful detection. This is possible because ML does not directly detect malicious behavior; instead, it establishes a decision boundary between labels based on the provided features. Therefore, we do not consider “whether the classified label is malware or cleanware” and “whether malicious behavior was exhibited” to be equivalent questions. In other words, rather than comparing malware with cleanware, the determination should be based on whether the malware actually exhibited malicious behavior.
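The following toy sketch illustrates this point (the traces, features, and classifier are hypothetical illustrations, not the paper's dataset or method): a classifier trained on system-call features assigns a label to every input, including a trace from a malware sample whose payload never ran.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical traces represented as strings of system-call names;
# a linear classifier learns a decision boundary between the labels.
malware = ["openat read mmap connect sendto", "fork execve connect sendto"]
cleanware = ["openat read write close", "brk mmap openat read close"]
y = [1, 1, 0, 0]  # 1 = malware, 0 = cleanware

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(malware + cleanware)
classifier = LinearSVC().fit(X, y)

# A trace from a malware sample whose payload never executed still
# receives a label; the prediction says nothing about whether malicious
# behavior actually occurred during the analysis period.
stalled_sample = vectorizer.transform(["openat read close"])
print(classifier.predict(stalled_sample))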
For these reasons, we consider the inability to provide conclusive evidence that a sample exhibits malicious behavior in a test environment to be a key issue. To address it, we tested the following four hypotheses:
(1) DA data can provide evidence of the malicious behavior exhibited by malware.
(2) DA data can provide evidence of the anti-analysis behavior exhibited by malware.
(3) By inputting DA data into a large language model (LLM), the data can be interpreted and the malicious behavior of malware can be explained (a minimal sketch follows this list).
(4) Likewise, the anti-analysis behavior of malware can be explained by inputting DA data into an LLM.
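A minimal sketch of how hypotheses (3) and (4) could be exercised is shown below, assuming the OpenAI Python client; the prompt wording and the trace excerpt are hypothetical illustrations, not the paper's actual prompts or data.

from openai import OpenAI

# A minimal sketch: feed a system-call trace excerpt to an LLM and ask
# for an evidence-based judgment. The trace lines are hypothetical.
client = OpenAI()
trace_excerpt = (
    'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3\n'
    "connect(4, {sa_family=AF_INET, sin_port=htons(4444), ...}) = 0\n"
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a malware analyst interpreting system-call traces."},
        {"role": "user",
         "content": "Does this trace show malicious or anti-analysis behavior? "
                    "Cite specific system calls as evidence.\n" + trace_excerpt},
    ],
)
print(response.choices[0].message.content)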
Our proposal is to shift the focus from an examination of the attack method, such as determining the malware family, to an evaluation of the maliciousness of the behaviors exhibited by the program. We employed LLMs because some of them, such as ChatGPT 4o (https://chatgpt.com/?model=gpt-4o), possess background knowledge of system calls and malware behavior; we therefore believe that they can interpret DA data and explain the occurrence of malicious activity with concrete evidence. Additionally, we believe that LLMs are well suited to DA data, since DA data consist of highly interpretable, string-formatted records that provide insights into the behavior of computer programs. Moreover, one of the key advantages of LLMs is their discussion capability. That is, after interpreting the DA data that indicate malware behavior, LLMs