CyLLM-DAP: Cybersecurity Domain-Adaptive Pre-Training
Framework of Large Language Models
Khang Mai, Razvan Beuran and Naoya Inoue
Japan Advanced Institute of Science and Technology (JAIST), Ishikawa, Japan
Keywords:
Large Language Models (LLMs), Domain-Adaptive Pre-Training, Cybersecurity-Specific LLMs, MITRE
Att&CK, LLM4Security.
Abstract:
Recently, powerful open-source LLMs, such as Llama 3, have become alternatives to commercial
ones, especially in sensitive or regulated industries. In cybersecurity, most LLM utilization relies on custom
fine-tuning or post-training methods, such as prompt engineering. Although domain-adaptive pre-training has
been proven to improve the model’s performance in the specialized domain, it is less used in cybersecurity due
to the cumbersome implementation effort. This paper introduces CyLLM-DAP, a framework for expediting
the domain specialization process of LLMs in cybersecurity by simplifying the data collection, preprocessing,
and pre-training stages in low-resource settings. We demonstrate how CyLLM-DAP can be utilized to collect
and process data, and to develop cybersecurity-specific LLMs (CyLLMs) based on state-of-the-art open-source
models (Llama 3 and Mistral v0.3). The effectiveness of domain-adaptive pre-training is confirmed via two
experiments for text classification and Q&A tasks. Our evaluation results show that, when compared with gen-
eral base or instruct models, injecting the LLMs with cybersecurity knowledge allows the models to generally
perform better in every fine-tuning epoch for the text classification task; and brings a performance gain of up
to 4.75% for the Q&A task (comparable to domain-adaptive pre-training in other domains). The framework,
the generated CyLLMs, and the data are publicly available for use in cybersecurity applications.
1 INTRODUCTION
Large Language Models (LLMs) are transformer-
based models (Vaswani et al., 2017) with billions of
parameters, enabling extraordinary language under-
standing capabilities. Central to their development
is the pre-training phase, where these models are ex-
posed to vast amounts of unstructured text. This criti-
cal step infuses the models with extensive knowledge,
allowing them to understand and generate human-
like text effectively. The outcome of this pre-training
process is the creation of base or foundation LLMs.
The foundation models can be further fine-tuned for
language problem-solving (e.g., text classification,
Q&A, summarization, etc.), often with significantly
less data and fewer computing resources. The result of such
a fine-tuning process is the instruct LLMs.
Commercial LLMs (e.g., OpenAI's GPTs (Brown
et al., 2020)) are general-purpose instruct models with
strong task-solving capacity. However, commercial
LLMs may involve data transmission to third-party
servers, raising privacy and security concerns in sen-
sitive industries and businesses. More recently, high-
tech companies (e.g., Meta, Mistral) have published
highly capable open-source LLMs that are competitive with commercial LLMs.

Figure 1: Main methods for utilizing LLMs in downstream tasks.

With a high level of
customization and data transparency, developing pri-
vate LLMs based on these open-source models has
become a new trend.
Users can choose among three main methods to
utilize open-source LLMs, as shown in Figure 1.
1. Domain-Adaptive Pre-Training (as shown by
the blue box and dashed lines in Figure 1): To
optimize the model’s performance in a particular
domain, users can employ domain-adaptive pre-
training of the base models on unstructured text
from the relevant knowledge domain. Domain-
adaptive pre-training is a type of unsupervised
training because the training data does not come
with explicit annotations or labels. This approach
results in domain-specific base models.
2. Custom Fine-Tuning (presented with orange box
and lines in Figure 1): Custom fine-tuning uti-
lizes annotated (supervised) datasets to adapt the
model’s capabilities for more effective perfor-
mance in specific tasks. Custom fine-tuning can
be implemented for general and domain-specific
models, resulting in custom fine-tuned models.
3. Prompt Engineering (shown with red ellipse and
dotted lines in Figure 1): Users design questions
in a specific format (prompts) to interact with
fine-tuned (instruct) LLMs. To ensure that LLMs
are provided with the correct context, Retrieval-
Augmented Generation (RAG) can be used to re-
trieve relevant documents for the question and in-
corporate these documents into the prompt.
As presented in Figure 1, domain-adaptive pre-
training, custom fine-tuning, and prompt engineering
with RAG are not mutually exclusive. They can be
combined to improve the LLM-based application’s
overall performance.
LLMs have been utilized in various cybersecu-
rity sub-domains, marking a new branch of cyber-
security research, namely LLM for cybersecurity
(LLM4Security) (Zhang et al., 2024). However, the
domain specialization of LLMs in cybersecurity is
usually under-discussed for different reasons. First,
the cybersecurity domain is enormous and diverse,
with vague boundaries with other domains. This pre-
vents any ready-made solution for collecting
and preprocessing the data. To the best of our knowl-
edge, no cybersecurity datasets are big enough for di-
rect use in the LLM domain specialization. Second,
the cost of implementing a solution to collect and pre-
process data is significant when compared with the
cost of RAG or fine-tuning.
This paper introduces a simple but effective
framework based on domain-adaptive pre-training,
named CyLLM-DAP, to ease the cybersecurity spe-
cialization process of LLMs. Using the framework,
cybersecurity-specific LLMs (termed CyLLMs) are
created by exposing two open-source LLMs (Llama
3 and Mistral v0.3) to 30 GB of cybersecurity
text from heterogeneous sources. These CyLLMs are
evaluated in two experiments, showing their capacity
to improve fine-tuning effectiveness when compared
to baseline LLMs. In our GitHub repository
(https://github.com/cyb3rlab/CyLLM-DAP), we dis-
cuss the usage of the published framework, models,
and training data. Note that we are actively updating
the framework and related assets.
The remainder of the paper is structured as fol-
lows. In Section 2, we present background concepts
together with relevant works. Section 3 describes
in detail the framework architecture and implementa-
tion. We then discuss the use of the proposed frame-
work for the cybersecurity specialization of LLMs in Sec-
tion 4. The experiment to evaluate the CyLLMs is
mentioned in Section 5. The paper ends with a con-
clusion and references.
2 BACKGROUND AND RELATED
WORKS
In this section, we first introduce large language mod-
els (LLMs) and the backbone architecture used in
most LLMs: transformers. We then present the con-
cept of domain-adaptive pre-training. Finally, we dis-
cuss how LLMs are used in the cybersecurity domain.
2.1 Large Language Models
In 2017, Google researchers introduced transform-
ers (Vaswani et al., 2017) with a multi-head atten-
tion mechanism, marking a new era in deep learning
(DL) research. The unique attention mechanism al-
lows transformers to capture dependencies and pat-
terns in the input in parallel. Additionally, the effec-
tiveness of transformers is based on transfer learn-
ing, in which models reuse knowledge they have al-
ready acquired and apply it to a new but related prob-
lem. For example, pre-training with textual finan-
cial data can improve the model’s performance in
solving finance-related problems. There are gener-
ally three main transformer architectures: sequence-
to-sequence (encoder-decoder) models, decoder-only
models, and encoder-only models. We mainly discuss
decoder-only and encoder-only models, which are the
most successful transformer-based architectures.
Encoder-only models, such as BERT (Devlin
et al., 2019), work as an embedding module, taking
the input sequence and outputting its representative vec-
tor. When adapting to downstream tasks, different
types of layers, such as the softmax layer for classifi-
cation tasks, are added on top of the encoder models
to generate suitable answers (e.g., class labels).
Decoder-only models (GPT models) focus solely
on the decoder component of the original transformer
architecture. These GPT models (Radford et al.,
2018) come in different sizes, ranging from millions
to billions of parameters. GPT models with billions
of parameters focusing on language understanding are
called Large Language Models (LLMs).
The next token prediction is a core mechanism
used in training LLMs. In this mechanism, the model
learns to predict the next word (or token) in a se-
quence based on the context provided by the preced-
ing words. In the training phase, when a sequence of
k tokens (s_1, s_2, ..., s_k) is inputted, the model learns to
maximize the log-likelihood

$$\sum_{i=1}^{k} \log P_{\Theta}(s_i \mid s_1, s_2, \dots, s_{i-1}),$$

where $\Theta$ denotes the model's parameters. The model's pa-
rameters are then updated, allowing it to minimize the
prediction error between its output token and the ex-
pected one. This process iterates over vast amounts
of text data, allowing the model to learn natural lan-
guage’s statistical properties and patterns.
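To make this objective concrete, the following minimal PyTorch sketch (ours, not taken from any specific LLM codebase) computes the shifted cross-entropy loss that implements next-token prediction; the function name and tensor shapes are illustrative assumptions.

```python
# Illustrative sketch of the next-token prediction objective (assumed shapes).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits:    (batch, seq_len, vocab_size), scores produced by the decoder
    # input_ids: (batch, seq_len), token ids s_1 ... s_k
    shift_logits = logits[:, :-1, :]   # predictions for positions 2 ... k
    shift_labels = input_ids[:, 1:]    # targets s_2 ... s_k
    # Cross-entropy equals the negative sum of log P_Theta(s_i | s_1 ... s_{i-1}),
    # averaged over all predicted positions.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```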
The recent rise in popularity of open-source lan-
guage model frameworks, as opposed to commercial
ones like GPT-4, is attributed to their long-term ad-
vantages. Open-source frameworks offer greater cus-
tomization, allowing users to adjust model parame-
ters completely or partially to suit their training re-
quirements. They also enable users to train mod-
els from scratch for experimentation or personal use.
Furthermore, the increasing availability of cheaper
and higher-performance computers makes training
and hosting models in production more cost-effective.
Privacy concerns also drive the preference for open-
source frameworks, as companies and organizations
aim to maintain control over their data. Recently,
the Llama series (Touvron et al., 2023) and the Mis-
tral series (Jiang et al., 2023) have been among the best
open-source LLMs, comparable in performance with
closed-source LLMs, such as GPT-4.
2.2 Domain-Adaptive Pre-Training
The published LLMs are originally designed for
general-purpose problem-solving. They struggle to
perform well when tasked with understanding spe-
cialized knowledge domains. This is because LLMs
are fundamentally statistical models that learn pat-
terns from text data after exposure to extensive
amounts of text. Insufficient domain-specific
data in the pre-training stage reduces the LLM's perfor-
mance in the target domain.
Domain-adaptive pre-training, or continual pre-
training, is an advanced technique to customize a
general-purpose language model for a specific do-
main by continuing its training using the domain’s
data. This process aims to improve the model’s per-
formance in a particular field without starting from
scratch. As shown in the survey (Ling et al., 2023)
regarding the LLM domain specialization, this tech-
nique has been extensively applied in various domains
(e.g., finance, law) to improve the LLM performance
in such domains. For example, Wu et al. (Wu et al.,
2023) developed PMC-LLaMA by pre-training the base
model (Llama 2) with biomedical papers and books
and subsequently fine-tuning it for instruction fol-
lowing in the medical domain. In the evaluation of
this research, domain-adaptive pre-training allows the
model to achieve a performance gain of 2.94%.
In cybersecurity, domain-adaptive pre-training is
not a new concept. Before the era of LLMs, there
were already various cybersecurity-specific encoder-
only models, such as SecureBERT (Aghaei et al.,
2023), CyBERT (Ranade et al., 2021) and Cy-
SecBERT (Bayer et al., 2024). To our knowledge,
no public framework currently supports this domain-
adaptive pre-training task for LLMs in cybersecurity.
Table 1 shows the advancements of CyLLM-DAP
over similar frameworks/models for the cybersecurity
domain specialization pre-training process.
2.3 LLM Applications in Cybersecurity
Text data is an abundant and important source of infor-
mation in many domains. In cybersecurity, text data
can be obtained from Cyber Threat Intelligence (CTI)
reports, emails, system logs, network design, security
guidelines, etc. Text analysis and generation are un-
avoidable tasks in many cybersecurity solutions. At
the simplest level, generating a summary report to ex-
plain the system’s output is beneficial.
LLMs have been utilized in cybersecurity for dif-
ferent purposes. For example, LLMs can be used
in various scenarios and training roles in cyberse-
curity training and education. Greco et al. (Greco
et al., 2024) present various promising strategies uti-
lizing LLMs for PETA (Phishing Education, Training,
and Awareness). LocalIntel (Mitra et al., 2024) is an
LLM-based framework for generating organizational
threat intelligence. Moreover, LLMs can also be used
for software vulnerability detection (Ferrag et al.,
2024), malware dynamic analysis (Yan et al., 2023),
hardware security and policy generation (Tarek et al.,
2024), etc.
In a recent study (Zhang et al., 2023), various
methods in the field are compared for addressing po-
tential vulnerabilities in software systems. The study
demonstrates that the use of transformers with pre-
trained knowledge outperformed traditional machine-
learning methods in the context of software vulnera-
bility repair. Moreover, after pre-training these mod-
els on extensive codebase data, there is a notable 9.4%
increase in accuracy. The authors suggest that pre-
training the models with knowledge closely aligned
with the target downstream task is a promising ap-
proach for enhancing their performance.

Table 1: Comparison between CyLLM-DAP and other domain-adaptive frameworks/models in cybersecurity.
Columns: Model Type; Dataset Publication; Model Publication; Data Collecting Scripts; Filtering and Preprocessing Scripts; Domain-Adaptive Pre-training Scripts; Framework Publication.
SecureBERT: encoder-only
CyBERT: encoder-only
CySecBERT: encoder-only
CyLLM-DAP: decoder-only
In a comprehensive survey conducted by Zhang et
al. (Zhang et al., 2024) on the use of LLMs in cyberse-
curity, more than 180 projects across over ten down-
stream tasks are analyzed. The survey explicitly ref-
erenced only one paper (Jiang et al., 2024) that inte-
grated domain-adaptive pre-training into the proposed
approach for binary code analysis. This highlights
the current lack of discussion on domain-adaptive pre-
training for LLMs in cybersecurity applications.
3 CyLLM-DAP ARCHITECTURE
This section presents the general architecture of
CyLLM-DAP. As shown in Figure 2, the framework
contains six main components to support the domain-
adaptive process in cybersecurity, including:
1. Data Collection: To collect data from different
sources.
2. Relevance Filtering: To filter out irrelevant doc-
uments from the dataset.
3. Quality Filtering: To ensure the quality of the
data, various methods can be utilized to filter out
low-quality documents from the dataset.
4. Data Anonymization: To protect individuals’
private and sensitive information.
5. Data Deduplication: To ensure the uniqueness of
each document.
6. Training: To provide training scripts for the
domain-adaptive pre-training process.
The framework provides a high level of customization
by following object-oriented programming principles. To work
with the framework, users can choose a default work-
flow or create a personal workflow of components to
meet their needs. In addition to this, they can also im-
plement their own components, using the inheritance
mechanism in object-oriented programming.
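As an illustration of this inheritance-based customization, the sketch below defines a hypothetical custom filter component; the class and method names (BaseFilter, apply) are assumptions for illustration and do not necessarily match CyLLM-DAP's actual interfaces.

```python
# Hypothetical sketch of extending the framework via inheritance;
# BaseFilter and its interface are assumed, not CyLLM-DAP's actual names.
class BaseFilter:
    def apply(self, document: dict) -> bool:
        """Return True if the document should be kept."""
        raise NotImplementedError

class MinLengthFilter(BaseFilter):
    """Custom filter that keeps only documents above a length threshold."""
    def __init__(self, min_chars: int = 500):
        self.min_chars = min_chars

    def apply(self, document: dict) -> bool:
        return len(document.get("text", "")) >= self.min_chars

# A personal workflow could then chain default and custom components:
workflow = [MinLengthFilter(min_chars=300)]  # plus collectors, other filters, etc.
```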
Note that not all of the framework’s components
are implemented from scratch by us. For such compo-
nents, we reuse the best efforts of other researchers
and developers. We clearly mention this information
when providing more details for each component in
the following sections.
3.1 Data Collection
To ensure the generalizability of LLMs, training data
should be collected from multiple data sources. The
diversity of data exposes the model to a wide range
of language patterns, ensuring fairness in the lan-
guage capacity of the models across different scenar-
ios. Different data sources are currently supported by
CyLLM-DAP, including:
1. Web Data: Web data is text data obtained from
web pages. This is the most common and most abun-
dant type of data. The Common Crawl
dataset (Patel, 2020) is a web dataset that is
crawled from the internet. The Common Crawl
organization maintains this dataset by conduct-
ing regular crawls, which started in 2007. Cur-
rently, Common Crawl is the biggest dataset with
hundreds of TiB of data, spanning over billions
of web pages. When working with this type of
dataset, users can choose to work with the HTML
representation (WARC format) or plain text (WET
format) extracted from those web pages.
Our framework supports the data collection using
both of the mentioned data formats. However,
the text extraction mechanism originally imple-
mented by Common Crawl to create WET data is
not optimal, as discussed in (Li et al., 2024). In-
stead, we use the trafilatura Python library (Bar-
baresi, 2021) to extract high-quality text from the
raw HTML data (WARC format).
2. Academic Papers: Academic papers are another
great source of data, providing high-quality text
written by researchers. Currently, S2OCR (Lo
et al., 2020) is the largest dataset of English aca-
demic papers that are well-established and main-
tained. Any papers contained inside S2OCR have
already been transformed from different data for-
mats (e.g., PDF, latex source code) into a user-
friendly data type. Besides metadata such as cor-
pus ID, the paper content is annotated with labels
for easy extraction of titles, abstracts, and para-
graphs. Currently, CyLLM-DAP only supports
this dataset for collecting academic papers.

Figure 2: CyLLM-DAP's components and example workflow in practical use ((a) framework components; (b) example workflow).
3. Hugging Face Hub: Hugging Face (HF) (De-
langue, 2024) is a famous machine-learning
framework containing open-source libraries and
tools for developing and deploying advanced nat-
ural language processing (NLP) models. In addi-
tion, the HF hub is a platform where researchers
and developers can share or access a significant
number of developed models and datasets. There-
fore, the hub is an important data source in terms
of abundance, diversity, and data quality. In
CyLLM-DAP, we currently support the data col-
lection from Wikipedia and RedPajama (Togeth-
erAI, 2023). Users can quickly adapt the frame-
work for other datasets hosted in the HF hub.
4. Books: Books are another high-quality source of
data that should be included when pre-training
LLMs. While CyLLM-DAP currently supports
the download of books from online sources, users
should take care of copyright issues when
downloading them.
5. Code: Pre-training LLMs with code data has been
proven effective for downstream tasks relevant to
programming code. Code data can be collected
from online code hosting or question-and-answer
platforms (e.g., GitHub, StackOverflow).
CyLLM-DAP implements various collectors to
support data collection from the mentioned sources
and transform them into text-based data. Note that
data collectors in CyLLM-DAP are implemented in
streamlined modes to avoid downloading all the raw
data into local storage. For this reason, collectors
come with simple forms of relevance filters (discussed
in 3.2.1) to remove irrelevant documents on the fly.
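The sketch below illustrates this on-the-fly pattern for Common Crawl WARC data: records are streamed with the warcio library, pre-filtered with a toy keyword check, and converted to text with trafilatura. The file path and keyword list are placeholders, and the actual CyLLM-DAP collectors are more elaborate.

```python
# Illustrative sketch: streaming WARC records, simple keyword pre-filtering,
# and text extraction with trafilatura (path and keywords are placeholders).
import trafilatura
from warcio.archiveiterator import ArchiveIterator

KEYWORDS = ("malware", "vulnerability", "data security")  # toy keyword list

def collect(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # high-quality text from raw HTML
            if text and any(k in text.lower() for k in KEYWORDS):
                yield {"url": record.rec_headers.get_header("WARC-Target-URI"),
                       "text": text}

for doc in collect("CC-MAIN-example.warc.gz"):  # placeholder file name
    print(doc["url"])
```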
3.2 Data Filtering
In this section, we discuss two main components of
CyLLM-DAP for data filtering: relevance filtering
and quality filtering. Generally, a data filter will take
a document as an input and output a boolean value
indicating if the document meets some predefined re-
quirements. Data filters can be used as part of the
data collectors for preliminary filtering or as separate
modules in the workflow.
3.2.1 Relevance Filtering
When preparing data for LLM’s domain specializa-
tion, it is crucial to determine the relevance of data
elements, such as documents, to the target knowledge
domain. Depending on the characteristics of the target
domain, different strategies can be adopted appropri-
ately. For relevance filtering in the cybersecurity do-
main, as briefly mentioned earlier, the following char-
acteristics should be considered:
1. Vague Domain Boundary: The cybersecurity do-
main does not have a clear boundary separating
itself from other domains. Computer systems
are deployed in many domains to automate their
operations with high accuracy. However, these
computer-based solutions also come with risks
and cybersecurity challenges. To effectively ad-
dress cybersecurity issues in such systems, it is
necessary to have cross-domain knowledge. For
instance, preventing online fraud in the internet
banking system requires a combination of cyber-
security and financial expertise.
2. Breadth and Diversity: Cybersecurity encompasses
various sub-domains, including hardware secu-
rity, software security, data security, and code
security. The specificity of these sub-domains
varies, with some being quite broad, like CTI,
and others containing more specialized or niche
knowledge. For example, Generative AI Security
is a special cybersecurity sub-domain, pertaining
to security issues within generative AI-based tools
and solutions, such as LLMs.
In CyLLM-DAP, there are two main methods used
for relevance filtering, as follows.
Keyword-Based Relevance Filtering. The
keyword-based approach begins with a list of cy-
bersecurity keywords, regex, and URL patterns.
CyLLM-DAP currently uses a list of 500 cyberse-
curity keywords (e.g., data security, data safety) and
regex patterns (e.g., for CVE identifiers). For URL patterns, we
curate a list of 300 cybersecurity news sites and
blogs (e.g., www.scmagazine.com). Note that these
lists are being updated frequently.
The keywords and patterns used in CyLLM-DAP
are enclosed in filters, each of which has a differ-
ent method for assessing various components (such
as URLs, text, and titles) of the document for rele-
vance filtering. One important advantage of keyword-
based filtering is its running speed and multiprocess-
ing friendliness. Each filter requires little memory to
run the function. For this reason, we recommend us-
ing these keyword-based filters as the preliminary ap-
proach to handling big data sources.
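A minimal sketch of such a keyword-based filter is given below; the keyword, regex, and URL lists are tiny placeholders for the roughly 500 keywords and 300 sites used in CyLLM-DAP.

```python
# Illustrative keyword/regex/URL-based relevance filter (placeholder lists).
import re

CYBER_KEYWORDS = {"data security", "data safety", "malware"}   # ~500 in practice
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)  # regex example
CYBER_SITES = ("www.scmagazine.com",)                          # ~300 in practice

def is_relevant(url: str, title: str, text: str) -> bool:
    # URL check: known cybersecurity news sites and blogs.
    if any(site in url for site in CYBER_SITES):
        return True
    # Keyword check on title and body text.
    haystack = f"{title} {text}".lower()
    if any(keyword in haystack for keyword in CYBER_KEYWORDS):
        return True
    # Regex check, e.g., CVE identifiers.
    return bool(CVE_PATTERN.search(text))
```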
Model-Based Relevance Filtering. In model-
based filtering, language models are used to deter-
mine a document’s relevance to the cybersecurity
domain. A DL model is trained to classify documents
into cybersecurity and non-cybersecurity categories.
For this purpose, an encoder-only model, already pre-
trained in the context of cybersecurity, is fine-tuned
with a labeled dataset. After the fine-tuning, the
model can provide a cybersecurity relevance score
for an input document. With a threshold value of 0.5,
any document with a cybersecurity relevance score
under 0.5 will be removed from the dataset.
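The sketch below shows one way such a fine-tuned classifier could be applied; the checkpoint path and the relevance label are placeholders, and the details depend on how the classifier was trained.

```python
# Illustrative model-based relevance filter; checkpoint and label are placeholders.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/fine-tuned-cybersecurity-classifier",  # placeholder checkpoint
)

def is_cyber_relevant(text: str, threshold: float = 0.5) -> bool:
    result = classifier(text, truncation=True)[0]  # e.g., {"label": "CYBER", "score": 0.93}
    # "CYBER" is an assumed label name for the relevant class.
    score = result["score"] if result["label"] == "CYBER" else 1.0 - result["score"]
    return score >= threshold
```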
3.2.2 Quality Filtering
In LLM development, data quality is one of the most
important factors in ensuring the model's performance.
Across different LLM projects, various quality metrics
have been devised based on observations of actual data.
These quality signals/metrics are mostly calculated by
analyzing the text’s statistics, such as the ratio be-
tween the number of sentences starting with bullet
points and other sentences.
In CyLLM-DAP, inspired by other LLM
works (TogetherAI, 2023), we develop functions
for quality metrics and rules. The metric functions
take a document as input and output the relevant metric
scores. Additionally, rules in CyLLM-DAP are
implemented as filters, in which some thresholds
are applied to the document’s quality metrics to
determine if it is a good document. For example, the
mentioned ratio should not exceed 0.7 to indicate a
good document.
While these metrics are proven effective, they are
not universal and should be chosen wisely depending
on the target tasks. Besides the list of default metrics
and rules for the cybersecurity domain, LLM developers
can design their own functions to meet their needs.
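As an illustration, the following sketch implements the bullet-point ratio mentioned above (approximated at the line level) and wraps it in a rule with the 0.7 threshold; the function names are ours, not necessarily CyLLM-DAP's.

```python
# Illustrative quality metric and rule (names are ours, threshold from the text).
def bullet_ratio(document_text: str) -> float:
    """Ratio of lines starting with a bullet point to all non-empty lines
    (a line-level approximation of the sentence-based metric)."""
    lines = [line.strip() for line in document_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    bullets = sum(line.startswith(("-", "*", "•")) for line in lines)
    return bullets / len(lines)

def bullet_rule(document_text: str, threshold: float = 0.7) -> bool:
    """A 'good' document should not exceed the bullet-ratio threshold."""
    return bullet_ratio(document_text) <= threshold
```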
3.3 Data Deduplication
Duplicate entries in training datasets can result in
over-fitting and create a false impression of improved
performance during training. Recent research, as
highlighted in (Lee et al., 2022), underscores the im-
portance of data deduplication, emphasizing its po-
tential to enhance model performance in a more bal-
anced and comprehensive manner. To deduplicate
data in general, we first need to identify duplicates
using similarity search and remove them.
In an exhaustive similarity search, each pair of
documents in the dataset is usually compared to de-
termine their similarity. This mechanism imposes a
prohibitively high time cost and is not suitable for handling
large amounts of data. In CyLLM-DAP, we use a
popular method for efficient similarity search, namely
LSH (Locality Sensitive Hashing) (Dasgupta et al.,
2011). In general, LSH aims to reduce the data di-
mensions while maintaining local distances between
data points. The output of this method is buckets of
similar documents. Subsequently, only the longest
document from each bucket is retained, while the others
are removed from the dataset.
When deduplicating small datasets, the whole pro-
cess can be done in the computer’s RAM. However, a
large amount of intermediate data (e.g., dense hash
signatures) is generated during the run of LSH. For
this reason, CyLLM-DAP splits the deduplicating
process into smaller steps and stores any intermedi-
ate data in local storage. This allows CyLLM-DAP to
work with large datasets.
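A minimal in-memory sketch of this LSH-based deduplication, using the datasketch library, is shown below; the shingling scheme and threshold are simplifications of CyLLM-DAP's multi-step, disk-backed implementation.

```python
# Illustrative near-duplicate grouping with MinHash LSH (simplified, in-memory).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):   # simple word-level shingles
        m.update(token.encode("utf8"))
    return m

def deduplicate(docs: dict[str, str], threshold: float = 0.8) -> set[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    signatures = {}
    for doc_id, text in docs.items():
        signatures[doc_id] = minhash(text)
        lsh.insert(doc_id, signatures[doc_id])
    keep, seen = set(), set()
    for doc_id in docs:
        if doc_id in seen:
            continue
        bucket = lsh.query(signatures[doc_id])              # ids of similar documents
        seen.update(bucket)
        keep.add(max(bucket, key=lambda i: len(docs[i])))   # keep the longest one
    return keep
```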
3.4 Data Anonymization
Data anonymization is a data sanitizing technique that
protects the individual’s privacy or sensitive informa-
tion. In this technique, personally identifiable infor-
mation (PII) is detected and removed (or modified)
from the text before feeding to LLMs. As a result,
LLMs do not memorize personal information, which
could otherwise be extracted by cyber attackers at the
inference stage.
CyLLM-DAP utilizes the Presidio anonymizer (Mi-
crosoft, 2024) developed by Microsoft to implement
its anonymizing function. Currently, only four types
of PII are considered: email, URL, phone
number, and credit card. More PII types will be added
in the future. Additionally, users can create cus-
tomized anonymizing functions and integrate them
into CyLLM-DAP.
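The sketch below shows how the Presidio analyzer and anonymizer can be combined for the four PII types listed above; it uses Presidio's default replacement operators, which may differ from CyLLM-DAP's configuration.

```python
# Illustrative PII anonymization with Microsoft Presidio (default operators).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize(text: str) -> str:
    findings = analyzer.analyze(
        text=text,
        entities=["EMAIL_ADDRESS", "URL", "PHONE_NUMBER", "CREDIT_CARD"],
        language="en",
    )
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

print(anonymize("Contact alice@example.com or call +1-202-555-0101."))
```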
3.5 Training
In CyLLM-DAP, we create various scripts that users
can directly use for domain specialization. These
scripts are mostly based on the HuggingFace library
and its code examples. Since HuggingFace provides
leading libraries and tools for working with Machine
Learning and LLM models, using their examples en-
sures a robust yet simple approach to LLM develop-
ment.
The provided scripts follow two common LLM
pre-training paradigms, considering the available
computer resources. One is full pre-training, in
which all model parameters are altered dur-
ing the pre-training process. This approach requires
a large amount of GPU RAM with a possibility of
catastrophic forgetting of old knowledge. The other
is to partially update the model’s parameters, which
is more suitable for a low-resource computing envi-
ronment.
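A condensed sketch of such a HuggingFace-based continual pre-training script is shown below; the model name, data file, and hyperparameters are placeholders, and the partial-update (LoRA) variant discussed in Section 4.2 would wrap the model before training.

```python
# Illustrative continual (domain-adaptive) pre-training with HuggingFace
# Transformers; model name, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"          # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token      # collator needs a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "cyber_corpus.txt"})  # placeholder
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cyllm-ckpt",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # mlm=False selects the causal (next-token) language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```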
4 CyLLMs CREATION USING
CyLLM-DAP
In this section, we discuss the development process of
cybersecurity-specific foundation LLMs (CyLLMs).
Two CyLLM versions (CyLlama3 and CyMistral)
are based on two corresponding open-source LLMs:
Llama-3-8B and Mistral-7B-v0.3. We first present
the data preparation process. Then, we discuss the
domain-adaptive pre-training process to specialize the
foundation models in cybersecurity knowledge. Both
of these tasks are implemented using CyLLM-DAP.
Lastly, we examine the impact of domain specializa-
tion through experiments on two distinct downstream
tasks. The whole process is demonstrated in Figure 3,
in which the data preparation and domain-adaptive
pre-training process are discussed in Section 4.1 and
4.2, respectively. Subsequently, the experiment is pre-
sented in Section 5.
4.1 Data Preparation
The training data used in the specialization process
of the LLMs is collected via CyLLM-DAP using the de-
fault settings. Table 2 shows the data sources from which
we collect data. The data collection process is
conducted over six months (from Jan. to Jun. 2024)
using four computing nodes with high-speed internet.
As we can see in Table 2, there is no code data in-
cluded since we are not focusing on solving code-
related downstream tasks. We plan to include code
data for code-related downstream tasks in our future
work.
Table 2: Cybersecurity text data for domain-adaptive pre-training.

Data sources      Size     Original Sources
Wikipedia         500 MB   Wikipedia dataset on HuggingFace hub
Academic papers   3 GB     S2ORC
Books             200 MB   Online book libraries
Web data          26 GB    Common Crawl, RedPajama
Total             30 GB
Using CyLLM-DAP, we apply various data fil-
ters to the dataset to ensure its quality and relevance.
These include:
1. Language filtering: In this dataset, only text writ-
ten in the English language is kept.
2. Relevance filtering: In this step, the preliminary
relevance filters (with keywords and patterns) are
applied during the data collection. Model-based
filtering is also used as a separate module. Re-
garding academic papers, we use the API search
function provided by the S2ORC authors to obtain
pertinent documents related to a list of cyberse-
curity keywords.
3. Quality filtering: All the default metrics and rules
implemented by CyLLM-DAP for the cybersecu-
rity domain are used. Note that we also filter out
toxic documents at this step.
4. Deduplication: We apply the deduplication pro-
cess on the dataset to remove duplicated docu-
ments.
5. Anonymization: The default anonymization func-
tion of CyLLM-DAP is used.
Figure 3: The development process of CyLLMs with CyLLM-DAP.

4.2 Domain-Adaptive Pre-Training

As mentioned in Section 3.2.1, the cybersecurity do-
main does not have a clear boundary with other
knowledge domains. Depending on the application,
solving cybersecurity downstream tasks may require
knowledge from other domains. Specializing the
models in the cybersecurity domain should not cause
knowledge loss in other domains. For this reason, we
avoid fully training the model, in which all of its
parameters would be updated. Instead, a small number of
parameters is set to be altered during the continual
pre-training process using LoRA (Low-Rank Adap-
tation) (Hu et al., 2021). LoRA is one of the most
popular methods for working with LLMs, aiming to
enable LLM training in low-resource settings. LoRA
works by freezing the model's weights and introduc-
ing smaller trainable matrices of parameters. In this way,
only a few weights are updated to adapt the model
to new information.
Since most model parameters are frozen dur-
ing the domain-adaptive pre-training process, catas-
trophic forgetting is alleviated. This preserves the
model’s pre-trained knowledge relevant to other do-
mains. In LoRA training, rank and alpha are two
important hyperparameters controlling the size of the
trainable matrices. We set these two hyperparameters to
be high (64 and 128, respectively) to effectively inject
new knowledge into the models.
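Under these settings, the LoRA setup could look like the following sketch using the peft library; the target modules are an assumption, as they are not specified in this paper.

```python
# Illustrative LoRA configuration with the stated rank/alpha (peft library);
# the base model id and target_modules are assumptions, not from the paper.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

lora_config = LoraConfig(
    r=64,                                  # rank of the trainable low-rank matrices
    lora_alpha=128,                        # scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA weights are trainable
```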
The domain-adaptive pre-training process uses
H100 GPU computers, with 123 hours for train-
ing CyLlama3 and 135 hours for CyMistral for one
epoch. Our approach involves utilizing a context
length of 1024 and a batch size of 64 without ap-
plying quantization to the model. Quantization is a
method used to reduce GPU memory consumption
during LLM pre-training by mapping high-precision
values to lower precision. While this approach can ef-
fectively reduce GPU memory usage, it can also im-
pact the quality of the pre-training process. Since our
models can be accommodated within GPU memory
during training with LoRA, we opt not to employ
quantization in order to maintain the quality of the
pre-training process.
The result of the domain-adaptive pre-training
process is two cybersecurity-specific foundation
LLMs, namely CyLLMs. These CyLLMs come in
two versions, each based on its respective foundation
model. To elaborate, CyLlama3 is the cybersecurity-
specific version of Llama-3-8B, while CyMistral is
built on Mistral-7B-v0.3. In the following section, we
will carry out experiments on these CyLLMs to assess
their performance on downstream tasks.
5 EVALUATION
In this section, we employ two downstream tasks to
validate the effectiveness of CyLLMs (generated us-
ing CyLLM-DAP) in various cybersecurity applica-
tions, particularly in tasks involving custom datasets.
Several important considerations are as follows:
1. In this evaluation, our focus is not on creating
multi-purpose instruct LLMs that can excel in ev-
ery cybersecurity task. Instead, we conduct two
experiments where relevant models are fine-tuned
separately.
2. The evaluation aims to confirm the performance
of these CyLLMs on private and custom datasets.
This is a common use case for open-source LLMs
in the field of cybersecurity. Therefore, utilizing
private datasets that have not been publicly dis-
closed is more appropriate for this purpose.
Table 3: The list of LLMs involved in the experiments and the type of training method already applied to them.

Group B (BaseMistral, BaseLlama3): general pre-training.
Group I (InstructMistral, InstructLlama3): general pre-training + general fine-tuning.
Group C (CyMistral, CyLlama3): general pre-training + domain-adaptive pre-training.
As shown in Table 3, there are three groups of LLMs
related to the experiments. Each group contains cor-
responding models from the Llama 3 and Mistral v0.3
LLM families.
Group B includes the base (foundation) models
Llama-3-8B (BaseLlama3) and Mistral-7B-v0.3
(BaseMistral), originating from the respective compa-
nies (Meta and Mistral). Base models are developed via
general pre-training.
Group I includes Llama-3-8B-Instruct (InstructL-
lama3) and Mistral-7B-Instruct-v0.3 (Instruct-
Mistral), as originally published by the respective
companies. These models have undergone both
general pre-training and general fine-tuning.
Group C comprises cybersecurity-specific LLMs
(CyLLMs), namely CyLlama3 and CyMistral.
These are developed based on BaseLlama3 and
BaseMistral, respectively.
To confirm the effectiveness of the domain-
adaptive pre-training, all the models in groups B, I,
and C are custom fine-tuned to solve the task. Af-
ter the fine-tuning process, we calculate and compare
their performances using metrics. The details for each
experiment are presented in the following sections.
5.1 Task 1: Text Classification
This text classification experiment aims to assess
LLM’s ability to generate short responses to iden-
tify the category of a cybersecurity text among many
classes. For this experiment, we create a dataset of
145,000 data samples. Each data sample contains
a text relevant to one cyberattack technique labeled
with its MITRE ATT&CK’s technique ID. MITRE
ATT&CK (Strom et al., 2018) is a globally accessi-
ble knowledge base that provides standardized knowl-
edge to cybersecurity practitioners regarding attack-
ing tactics, techniques, procedures and malicious en-
tities such as attackers, campaigns, and tools. This
task is a multi-class classification with 628 ATT&CK
technique IDs as classes. In general, the dataset gen-
eration has two stages:
Table 4: The task-specific dataset used in the experiment with examples.

Text classification (145,000 samples):
- System message: You are a cybersecurity expert. Below is an instruction that describes a task in the cybersecurity domain, paired with an input that provides further context. Write a response that appropriately completes the request.
- Instruction: You are given a text description of a procedure example. Identify the MITRE ATT&CK technique used.
- Input: Siloscape leverages a sophisticated form of API call concealment...
- Output: Obfuscated Files or Information T1027

Question & Answering (22,000 samples):
- System message: You are a helpful cybersecurity expert.
- Question: What is the malware LoudMiner?
- Answer: LoudMiner is a cryptocurrency miner that uses virtualization software to siphon system resources...
1. Stage 1: We utilize an automatic report analysis
framework, namely RAF-AG (Mai et al., 2025),
to initiate the data-generating process. Using the
framework, the input CTI reports are transformed
into corresponding cyberattack paths. For each at-
tack path, the attacking technique ID of an attack
step, along with its correlated text, is gathered.
2. Stage 2: We use OpenAI GPT 3.5, a powerful
closed-source LLM, for data augmentation, a pro-
cess by which an input text is paraphrased into
various text patterns.
We utilize the Alpaca prompt template (Taori
et al., 2023) to format the data into a single prompt.
Because instruct models (group I) are originally
aligned with different prompt templates during their
general fine-tuning, only B and C models are involved
in this experiment. After the custom fine-tuning, the
models are expected to recognize the most probable
technique ID for the input text.
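For reference, a data sample from Table 4 can be rendered into the Alpaca template roughly as follows; this follows the standard Alpaca layout, and minor formatting details may differ from our exact fine-tuning setup.

```python
# Illustrative Alpaca-style prompt formatting for a text-classification sample.
ALPACA_TEMPLATE = """{system}

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

prompt = ALPACA_TEMPLATE.format(
    system=("You are a cybersecurity expert. Below is an instruction that describes "
            "a task in the cybersecurity domain, paired with an input that provides "
            "further context. Write a response that appropriately completes the request."),
    instruction=("You are given a text description of a procedure example. "
                 "Identify the MITRE ATT&CK technique used."),
    input="Siloscape leverages a sophisticated form of API call concealment...",
    output="Obfuscated Files or Information T1027",
)
```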
During fine-tuning, a single A100 40 GB com-
puter is used with LoRA (rank of 64 and alpha of
128) and a learning rate of 1e-4. 90% of the dataset
is allocated for fine-tuning, while the remaining por-
tion is reserved for evaluation. Model performance is
documented at the end of each epoch over a span of
ten epochs. During the evaluation phase, we gather
LLM’s responses for all data samples in the evalua-
tion set. Subsequently, we extract technique IDs from
these responses and compare them with the expected
ones to calculate the F1 score.
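The scoring step could be sketched as below; the technique-ID regex and the micro-averaged F1 are assumptions about the exact evaluation details.

```python
# Illustrative scoring: extract technique IDs from responses and compute F1
# (the regex and micro-averaging are assumptions about the exact setup).
import re
from sklearn.metrics import f1_score

TECH_ID = re.compile(r"T\d{4}(?:\.\d{3})?")   # e.g., T1027 or T1027.002

def extract_id(response: str) -> str:
    match = TECH_ID.search(response)
    return match.group(0) if match else "NONE"

def score(responses: list[str], expected: list[str]) -> float:
    predicted = [extract_id(r) for r in responses]
    return f1_score(expected, predicted, average="micro")
```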
Figure 4: The performance of C models (CyLLM) and B models (Base models) measured via the F1 score for text classification: (a) Llama 3 series; (b) Mistral v0.3 series. The performance differences in percentage between C models and B models are shown in red. Horizontal violet dashed lines and numbers show the maximum F1 scores in each chart.

As we can see from Figure 4, introducing cyber-
security knowledge to the LLMs results in overall
performance improvements compared to the baseline
foundation models (B models) (see values highlighted
in red). This impact is consistently observed in ev-
ery fine-tuning epoch with Mistral v0.3 models (the
right chart). Conversely, the effectiveness of cyber-
security knowledge varies across epochs for Llama 3
models. This effectiveness becomes more apparent
when the number of epochs exceeds six. Fur-
thermore, models from the Mistral v0.3 series (shown
in the right chart) demonstrate superior performance,
with a maximum F1 score of 0.969 compared to 0.917
for the Llama 3 series (displayed in the left chart).
5.2 Task 2: Question & Answer
In this experiment, we develop a simple Q&A chatbot
that can accurately respond to user inquiries regarding
CTI, such as attacking techniques, malware, and at-
tackers. The experiment aims to evaluate the LLM's abil-
ity to generate long responses, with a stringent re-
quirement for the LLMs to adhere closely to the ex-
pected answers. This is particularly crucial in situa-
tions where the chatbot needs to provide accurate and
comprehensive information without any fabrication.
We utilize a dataset of 22,000 conversation sam-
ples, encompassing questions and corresponding ex-
pected answers. Additionally, the dataset features a
system message prompting LLM to assume the role
of a cybersecurity expert during the interaction. A
data sample can be seen in Table 4.
All the models in Table 3 are involved in this ex-
periment. They are subsequently fine-tuned with the
dataset to acquire the Q&A capability. Note that we
employ the default chat templates originally created
by the companies behind each model family. In eval-
uating the text generated by the fine-tuned LLMs, we
take into account the following metrics:
1. GLEU (Google BLEU) score (Wu et al., 2016)
is commonly used to evaluate the performance of
translators by measuring the similarity between
machine-translated text and the expected output.
2. The BERTScore (Zhang et al., 2020) lever-
ages contextual embeddings generated by the
BERT (Devlin et al., 2019) model. This metric is
widely used for assessing the similarity between
texts and is proven to produce results similar to
human evaluations.
3. A high-quality commercial LLM (GPT-4o) will
act as a judge (LLM-as-a-judge) to evaluate the
generated response in terms of Understandabil-
ity (clarity of the response), Relevance (adequacy
and pertinence of the information), Naturalness
(human-like quality of the response), and Hallu-
cination (incorporation of incorrect information).
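For instance, the GLEU score and BERTScore listed above can be computed with HuggingFace's evaluate library, as in the sketch below; this is one possible implementation, not necessarily the exact tooling used in our evaluation.

```python
# Illustrative computation of GLEU (google_bleu) and BERTScore via the
# HuggingFace `evaluate` library; not necessarily the exact tooling used here.
import evaluate

gleu = evaluate.load("google_bleu")
bertscore = evaluate.load("bertscore")

predictions = ["LoudMiner is a cryptocurrency miner that abuses virtualization software."]
references = ["LoudMiner is a cryptocurrency miner that uses virtualization software "
              "to siphon system resources..."]

gleu_result = gleu.compute(predictions=predictions,
                           references=[[r] for r in references])
bert_result = bertscore.compute(predictions=predictions,
                                references=references, lang="en")

print(gleu_result["google_bleu"])                         # corpus-level GLEU score
print(sum(bert_result["f1"]) / len(bert_result["f1"]))    # mean BERTScore F1
```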
Table 5 shows the performance of related mod-
els for Q&A fine-tuning regarding six main metrics:
GLEU score, BERTScore, Understandability, Natu-
ralness, Relevance, and Hallucination. Word count is
a sub-metric mainly used for reference. From this ta-
ble, we can observe that:
In general, incorporating cybersecurity knowl-
edge into the models enhances their performance
compared to models equipped only with general
knowledge. CyMistral stands out as the top-
performing model (in Mistral v0.3 series) across
all six metrics, while CyLlama3 outperforms oth-
ers in 4 out of 6 metrics within the Llama 3 mod-
els. When comparing the best with the second-
best model in each series, integrating cyberse-
curity knowledge leads to performance improve-
ments of 4.75% (GLEU score) for the Llama
series, and 3.07% (GLEU score) and 2.85%
(BERTScore) for the Mistral series. These perfor-
mance gains are comparable with those acquired
by injecting domain-specific knowledge in other
domain areas (e.g., medical (Wu et al., 2023)).

Table 5: The performance of the created LLMs for the Q&A task. The underlined scores show the best models within the same family (same background color). The bold scores show the best model for specific metrics among all of the involved models. Unlike the other metrics, a lower Hallucination score is better.

Model            GLEU score  BERTScore  Understandability  Naturalness  Relevance  Hallucination  Average Length
BaseLlama3       0.569       0.718      0.846              0.77         0.618      0.565          115.615
BaseMistral      0.778       0.91       0.957              0.937        0.835      0.119          64.699
InstructLlama3   0.565       0.707      0.828              0.747        0.603      0.592          116.474
InstructMistral  0.781       0.911      0.954              0.934        0.828      0.118          64.01
CyLlama3         0.596       0.717      0.854              0.762        0.63       0.563          111.943
CyMistral        0.805       0.937      0.959              0.939        0.84       0.103          64.022
Comparing model families, Mistral models gen-
erally outperform Llama 3 counterparts, particu-
larly in terms of the GLEU score. The leading
Mistral model, CyMistral, achieves a GLEU score
of 0.805, whereas the best Llama3 model, CyL-
lama3, scores 0.596. This performance differ-
ence can be attributed to Llama3 models generat-
ing longer responses compared to Mistral models
(as indicated in the word count column). The dis-
crepancy in length between the generated text and
the expected text significantly impacts the com-
putation of the GLEU score, resulting in a lower
output score for Llama 3 models.
It is evident that incorporating cybersecurity
knowledge does not result in a significant per-
formance improvement for the Understandability
and Naturalness metrics. This is primarily be-
cause all the models involved are based on foun-
dational models pre-trained with high-quality text
to generate coherent and natural language. Ad-
ditionally, Llama 3 models cannot capture all the
necessary information present in the expected an-
swer as effectively as the Mistral models (as in-
dicated in the Relevance metric). In terms of the
Hallucination metric, the Llama 3 models tend to
generate supplementary information, which may
be inaccurate, thereby resulting in a high Halluci-
nation score as judged by the LLM.
Generally, LLM judgment is beneficial and trustworthy
because the findings align with other deterministic
metrics like BERTScore and GLEU score. Further-
more, it can be inferred that domain-adaptive pre-
training for Llama 3 models is not as effective as for
Mistral models. This indicates the need for additional
considerations when utilizing Llama 3 models.
6 CONCLUSION
This paper presented CyLLM-DAP, a framework de-
veloped to facilitate essential tasks in the cyberse-
curity specialization process for open-source large
language models. The framework consists of mod-
ules that can be utilized as they are or customized to
gather data and ensure data quality before conducting
domain-adaptive pre-training.
We illustrated the use of CyLLM-DAP for cre-
ating domain-adaptive pre-trained models, which gen-
erally outperform the baseline models when fine-tuned
for the text classification
task. Moreover, comparable to domain-adaptive pre-
training in other domains, the LLM’s cybersecurity
specialization can yield an important performance im-
provement of up to 4.75% (for the Q&A task) when com-
pared with the general base and instruct models.
In our future work, we plan to address this study’s
limitations by incorporating a diversity of model
sizes, families, and data sources. The framework,
models, and cybersecurity data are accessible to the
public and will undergo regular updates to facilitate
the integration of LLMs in the cybersecurity field.
REFERENCES
Aghaei, E., Niu, X., Shadid, W., and Al-Shaer, E. (2023).
SecureBERT: A Domain-Specific Language Model
for Cybersecurity.
Barbaresi, A. (2021). Trafilatura: A Web Scraping Library
and Command-Line Tool for Text Discovery and Ex-
traction. In Proceedings of the Joint Conference of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing: Sys-
tem Demonstrations, pages 122–131. Association for
Computational Linguistics.
Bayer, M., Kuehn, P., Shanehsaz, R., and Reuter, C. (2024).
CySecBERT: A domain-adapted language model for
the cybersecurity domain. ACM Trans. Priv. Secur.,
27(2).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
et al. (2020). Language models are few-shot learn-
ers. In Advances in Neural Information Processing
Systems, volume 33, pages 1877–1901. Curran Asso-
ciates, Inc.
Dasgupta, A., Kumar, R., and Sarlós, T. (2011). Fast
locality-sensitive hashing. In Proceedings of the 17th
ACM SIGKDD international conference on Knowl-
edge discovery and data mining, pages 1073–1081.
Delangue, C. (2024). Hugging Face. https://huggingface.co/.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Pro-
ceedings of the 2019 Conference of the North, pages
4171–4186. Association for Computational Linguis-
tics.
Ferrag, M. A., Battah, A., Tihanyi, N., Jain, R., Maimut,
D., Alwahedi, F., et al. (2024). SecureFalcon: Are We
There Yet in Automated Software Vulnerability De-
tection with LLMs?
Greco, F., Desolda, G., and Viganò, L. (2024). Support-
ing the Design of Phishing Education, Training and
Awareness interventions: An LLM-based approach.
In 2nd International Workshop on CyberSecurity Ed-
ucation for Industry and Academia.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. (2021). LoRA: Low-Rank
Adaptation of Large Language Models.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C.,
Chaplot, D. S., et al. (2023). Mistral 7B.
Jiang, N., Wang, C., Liu, K., Xu, X., Tan, L., and Zhang,
X. (2024). Nova: Generative Language Models for
Assembly Code with Hierarchical Attention and Con-
trastive Learning.
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D.,
Callison-Burch, C., and Carlini, N. (2022). Dedupli-
cating Training Data Makes Language Models Bet-
ter. In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 8424–8445. Association
for Computational Linguistics.
Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., et al.
(2024). DataComp-LM: In search of the next genera-
tion of training sets for language models.
Ling, C., Zhao, X., Lu, J., Deng, C., Zheng, C., Wang, J.,
et al. (2023). Domain Specialization as the Key to
Make Large Language Models Disruptive: A Com-
prehensive Survey.
Lo, K., Wang, L. L., Neumann, M., Kinney, R., and Weld,
D. (2020). S2ORC: The Semantic Scholar Open Re-
search Corpus. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, pages 4969–4983, Online. Association for
Computational Linguistics.
Mai, K., Lee, J., Beuran, R., Hotchi, R., Ooi, S. E., Kuroda,
T., and Tan, Y. (2025). RAF-AG: Report analysis
framework for attack path generation. Computers &
Security, 148:104125.
Microsoft (2024). Microsoft Presidio - data protection
and anonymization SDK. https://microsoft.github.io/presidio/.
Mitra, S., Neupane, S., Chakraborty, T., Mittal, S., et al.
(2024). LOCALINTEL: Generating Organizational
Threat Intelligence from Global and Local Cyber
Knowledge.
Patel, J. M. (2020). Introduction to Common Crawl
Datasets, pages 277–324. Apress.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever,
I. (2018). Improving Language Understanding by
Generative Pre-Training.
Ranade, P., Piplai, A., Joshi, A., and Finin, T. (2021). Cy-
BERT: Contextualized embeddings for the cybersecu-
rity domain. In 2021 IEEE International Conference
on Big Data (Big Data), pages 3334–3342.
Strom, B. E., Applebaum, A., Miller, D. P., Nickels, K. C.,
Pennington, A. G., and Thomas, C. B. (2018). Mitre
Att&ck: Design and Philosophy.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X.,
Guestrin, C., Liang, P., and Hashimoto, T. B. (2023).
Stanford Alpaca: An Instruction-following LLaMA
model.
Tarek, S., Saha, D., Saha, S. K., Tehranipoor, M., and Farah-
mandi, F. (2024). SoCureLLM: An LLM-driven Ap-
proach for Large-Scale System-on-Chip Security Ver-
ification and Policy Generation.
TogetherAI (2023). RedPajama: an open dataset for train-
ing large language models.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y., Bashlykov, N., et al. (2023). Llama 2:
Open Foundation and Fine-Tuned Chat Models.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., et al. (2017). Attention is All you Need. In Pro-
ceedings of the 31st International Conference on Neu-
ral Information Processing Systems (NIPS’17), vol-
ume 30, pages 6000–6010. Curran Associates, Inc.
Wu, C., Lin, W., Zhang, X., Zhang, Y., Wang, Y., and Xie,
W. (2023). PMC-LLaMA: Towards Building Open-
source Language Models for Medicine.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., et al. (2016).
Google’s Neural Machine Translation System: Bridg-
ing the Gap between Human and Machine Translation.
Yan, P., Tan, S., Wang, M., and Huang, J. (2023). Prompt
Engineering-assisted Malware Dynamic Analysis Us-
ing GPT-4.
Zhang, J., Bu, H., Wen, H., Chen, Y., Li, L., and Zhu, H.
(2024). When LLMs Meet Cybersecurity: A System-
atic Literature Review.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and
Artzi, Y. (2020). BERTScore: Evaluating text gen-
eration with BERT. In International Conference on
Learning Representations.