CyLLM-DAP: Cybersecurity Domain-Adaptive Pre-Training Framework of Large Language Models

Khang Mai, Razvan Beuran, Naoya Inoue

2025

Abstract

Recently, powerful open-source LLMs, such as Llama 3, have become alternatives to commercial ones, especially in sensitive or regulated industries. In cybersecurity, most LLM utilization relies on custom fine-tuning or post-training methods, such as prompt engineering. Although domain-adaptive pre-training has been proven to improve a model's performance in a specialized domain, it is less used in cybersecurity due to the cumbersome implementation effort. This paper introduces CyLLM-DAP, a framework for expediting the domain specialization of LLMs in cybersecurity by simplifying the data collection, preprocessing, and pre-training stages in low-resource settings. We demonstrate how CyLLM-DAP can be used to collect and process data, and to develop cybersecurity-specific LLMs (CyLLMs) based on state-of-the-art open-source models (Llama 3 and Mistral v0.3). The effectiveness of domain-adaptive pre-training is confirmed via two experiments covering text classification and Q&A tasks. Our evaluation results show that, compared with general base or instruct models, injecting the LLMs with cybersecurity knowledge allows the models to perform better in every fine-tuning epoch for the text classification task, and brings a performance gain of up to 4.75% for the Q&A task (comparable to domain-adaptive pre-training in other domains). The framework, the generated CyLLMs, and the data are publicly available for use in cybersecurity applications.

Paper Citation


in Harvard Style

Mai K., Beuran R. and Inoue N. (2025). CyLLM-DAP: Cybersecurity Domain-Adaptive Pre-Training Framework of Large Language Models. In Proceedings of the 11th International Conference on Information Systems Security and Privacy - Volume 2: ICISSP; ISBN 978-989-758-735-1, SciTePress, pages 24-35. DOI: 10.5220/0013094800003899


in Bibtex Style

@conference{icissp25,
author={Khang Mai and Razvan Beuran and Naoya Inoue},
title={CyLLM-DAP: Cybersecurity Domain-Adaptive Pre-Training Framework of Large Language Models},
booktitle={Proceedings of the 11th International Conference on Information Systems Security and Privacy - Volume 2: ICISSP},
year={2025},
pages={24-35},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013094800003899},
isbn={978-989-758-735-1},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 11th International Conference on Information Systems Security and Privacy - Volume 2: ICISSP
TI - CyLLM-DAP: Cybersecurity Domain-Adaptive Pre-Training Framework of Large Language Models
SN - 978-989-758-735-1
AU - Mai K.
AU - Beuran R.
AU - Inoue N.
PY - 2025
SP - 24
EP - 35
DO - 10.5220/0013094800003899
PB - SciTePress