
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., et al. (2020). Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Dasgupta, A., Kumar, R., and Sarlós, T. (2011). Fast locality-sensitive hashing. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1073–1081.
Delangue, C. (2024). Hugging Face. https://huggingface.co/.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Ferrag, M. A., Battah, A., Tihanyi, N., Jain, R., Maimut, D., Alwahedi, F., et al. (2024). SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection with LLMs?
Greco, F., Desolda, G., and Viganò, L. (2024). Supporting the Design of Phishing Education, Training and Awareness Interventions: An LLM-based Approach. In 2nd International Workshop on CyberSecurity Education for Industry and Academia.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., et al. (2023). Mistral 7B.
Jiang, N., Wang, C., Liu, K., Xu, X., Tan, L., and Zhang, X. (2024). Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning.
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2022). Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445. Association for Computational Linguistics.
Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., et al. (2024). DataComp-LM: In search of the next generation of training sets for language models.
Ling, C., Zhao, X., Lu, J., Deng, C., Zheng, C., Wang, J., et al. (2023). Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey.
Lo, K., Wang, L. L., Neumann, M., Kinney, R., and Weld, D. (2020). S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.
Mai, K., Lee, J., Beuran, R., Hotchi, R., Ooi, S. E., Kuroda, T., and Tan, Y. (2025). RAF-AG: Report analysis framework for attack path generation. Computers & Security, 148:104125.
Microsoft (2024). Microsoft Presidio - data protection and anonymization SDK. https://microsoft.github.io/presidio/.
Mitra, S., Neupane, S., Chakraborty, T., Mittal, S., et al. (2024). LOCALINTEL: Generating Organizational Threat Intelligence from Global and Local Cyber Knowledge.
Patel, J. M. (2020). Introduction to Common Crawl Datasets, pages 277–324. Apress.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
Ranade, P., Piplai, A., Joshi, A., and Finin, T. (2021). CyBERT: Contextualized embeddings for the cybersecurity domain. In 2021 IEEE International Conference on Big Data (Big Data), pages 3334–3342.
Strom, B. E., Applebaum, A., Miller, D. P., Nickels, K. C., Pennington, A. G., and Thomas, C. B. (2018). MITRE ATT&CK: Design and Philosophy.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following LLaMA model.
Tarek, S., Saha, D., Saha, S. K., Tehranipoor, M., and Farahmandi, F. (2024). SoCureLLM: An LLM-driven Approach for Large-Scale System-on-Chip Security Verification and Policy Generation.
TogetherAI (2023). RedPajama: An Open Dataset for Training Large Language Models.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., et al. (2017). Attention is All you Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), volume 30, pages 6000–6010. Curran Associates, Inc.
Wu, C., Lin, W., Zhang, X., Zhang, Y., Wang, Y., and Xie, W. (2023). PMC-LLaMA: Towards Building Open-source Language Models for Medicine.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., et al. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Yan, P., Tan, S., Wang, M., and Huang, J. (2023). Prompt Engineering-assisted Malware Dynamic Analysis Using GPT-4.
Zhang, J., Bu, H., Wen, H., Chen, Y., Li, L., and Zhu, H. (2024). When LLMs Meet Cybersecurity: A Systematic Literature Review.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.