
A key limitation of this study lies in the inadequacy of BLEU, ROUGE, and BERTScore for evaluating vulnerability explanations. These metrics, standard in NLP tasks, primarily reward surface overlap with a reference text and therefore fail to capture whether an explanation is semantically correct for this domain. Additionally, hardware constraints, particularly limited VRAM, posed significant obstacles during model execution, highlighting the need for substantial memory resources.
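To illustrate why n-gram overlap metrics are a poor fit here, the following sketch implements a simplified unigram precision (the core of BLEU, without brevity penalty or smoothing; the example sentences are hypothetical, not drawn from the study's data). Two paraphrases that describe the same flaw share almost no vocabulary and score poorly:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.
    A simplified core of BLEU: no brevity penalty, no smoothing."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

# Two explanations of the same null-dereference flaw: semantically
# equivalent, but with little surface-level word overlap.
ref = "the pointer is dereferenced before the null check"
cand = "the code uses the address without verifying it is non-null first"
print(ngram_precision(cand, ref))  # low score despite equivalent meaning
```

A metric like this rewards lexical echoes of the reference rather than a correct diagnosis, which is why it under-rates valid but differently worded explanations.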
Problems Found. Several operational challenges were observed. The Gemma-2 27B model frequently ran out of memory, requiring manual restarts. Similarly, the CodeGeeX4 9B model exhibited inconsistent response times, often taking excessively long to generate outputs, which led to the adoption of a 150-second execution time limit. Many models also occasionally produced unusable outputs, such as repeated line breaks or redundant phrases.
Future Work. Future research should focus on fine-tuning the models, although larger models will require correspondingly more memory for this process. Exploring alternative RAG techniques and their impact on explanation quality is another promising direction. Additionally, the explanations, particularly the why? and fix? components, could support tasks aimed at automating vulnerability repair. Investigating collaborative reasoning techniques, in which multiple LLMs interact to produce more contextualized explanations, is another avenue. Finally, surveying experienced programmers about model performance could provide valuable insights into practical applications and user preferences.
REFERENCES
Chen, Y., Ding, Z., Alowain, L., Chen, X., and Wagner, D. (2023). DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, RAID 2023, pages 654–668, New York, NY, USA. Association for Computing Machinery.
Fu, M. and Tantithamthavorn, C. (2022). LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), pages 608–620.
Fu, M., Tantithamthavorn, C., Le, T., Nguyen, V., and Phung, D. (2022). VulRepair: A T5-based Automated Software Vulnerability Repair. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, pages 935–947, New York, NY, USA. Association for Computing Machinery.
Hin, D., Kan, A., Chen, H., and Babar, M. A. (2022). LineVD: Statement-level Vulnerability Detection using Graph Neural Networks. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR '22), pages 596–607, Los Alamitos, CA, USA. IEEE Computer Society.
Hu, S., Huang, T., Ilhan, F., Tekin, S. F., and Liu, L. (2023). Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives. In 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pages 297–306, New York, NY, USA. IEEE.
Nguyen, V.-A., Nguyen, D. Q., Nguyen, V., Le, T., Tran, Q. H., and Phung, D. (2022). ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection. In 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pages 178–182.
Rozière, B. et al. (2024). Code Llama: Open Foundation Models for Code.
Team, G. et al. (2024). Gemma 2: Improving Open Language Models at a Practical Size.
Wang, X., Hu, R., Gao, C., Wen, X.-C., Chen, Y., and Liao, Q. (2024). ReposVul: A Repository-level High-quality Vulnerability Dataset. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, ICSE-Companion '24, pages 472–483, New York, NY, USA. Association for Computing Machinery.
Wu, Y., Jiang, N., Pham, H. V., Lutellier, T., Davis, J., Tan, L., Babkin, P., and Shah, S. (2023). How Effective Are Neural Networks for Fixing Security Vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023, pages 1282–1294, New York, NY, USA. Association for Computing Machinery.
Zhang, Q., Fang, C., Yu, B., Sun, W., Zhang, T., and Chen, Z. (2024). Pre-Trained Model-Based Automated Software Vulnerability Repair: How Far are We? IEEE Transactions on Dependable and Secure Computing, 21(4):2507–2525.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
Zheng, Q. et al. (2023). CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673–5684.
A Study on Vulnerability Explanation Using Large Language Models