Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection

Lucas B. Germano, Lincoln Q. Vieira, Ronaldo Goldschmidt, Julio Cesar Duarte, Ricardo Choren

2025

Abstract

Software security underpins data privacy and system reliability. Vulnerabilities introduced during the development cycle can lead to privilege escalation, resulting in data exfiltration or denial-of-service attacks. Static code analyzers rely on predefined rules, so they often miss errors outside those patterns, suffer from high false-positive rates, and demand labor-intensive rule creation. Machine learning offers a flexible alternative that can leverage extensive datasets of real and synthetic vulnerability data. This study examines the impact of bias in synthetic datasets on model training. Using CodeBERT for C/C++ vulnerability classification, we compare models trained on biased and unbiased data, incorporating commonly overlooked preprocessing steps to remove biases. Results show that the unbiased model achieves 98.5% accuracy, compared to 63.0% for the biased model, underscoring the critical need to address dataset biases in training.
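The abstract's central claim, that spurious artifacts in synthetic data distort measured model performance, can be sketched with a toy, self-contained example. This is not the paper's CodeBERT pipeline; the marker token, snippet contents, and helper names below are invented purely for illustration of how a classifier that latches onto a dataset artifact looks accurate on biased data but degrades once the artifact is removed:

```python
# Toy illustration (NOT the paper's method) of dataset bias in synthetic
# vulnerability data. Assume the biased synthetic set always embeds a
# marker token ("CWE_PAD") in vulnerable samples; a degenerate model that
# keys on that artifact scores perfectly on the biased set but falls to
# chance once preprocessing strips the artifact.

def make_samples(n, biased):
    """Generate (code_snippet, label) pairs; label 1 = vulnerable."""
    samples = []
    for i in range(n):
        vulnerable = i % 2 == 1
        body = "strcpy(dst, src);" if vulnerable else "strncpy(dst, src, n);"
        # In the biased set, every vulnerable sample carries the artifact.
        marker = " /* CWE_PAD */" if (biased and vulnerable) else ""
        samples.append((body + marker, int(vulnerable)))
    return samples

def artifact_classifier(snippet):
    """A degenerate 'model' that predicts from the spurious marker alone."""
    return int("CWE_PAD" in snippet)

def accuracy(samples):
    correct = sum(artifact_classifier(s) == y for s, y in samples)
    return correct / len(samples)

biased_acc = accuracy(make_samples(100, biased=True))    # artifact present
unbiased_acc = accuracy(make_samples(100, biased=False)) # artifact removed
print(biased_acc, unbiased_acc)  # → 1.0 0.5
```

The gap between the two numbers mirrors, in miniature, why the paper's preprocessing to remove such biases matters before training or evaluating a real model.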

Paper Citation


in Harvard Style

Germano L., Vieira L., Goldschmidt R., Duarte J. and Choren R. (2025). Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 504-511. DOI: 10.5220/0013156800003890


in Bibtex Style

@conference{icaart25,
author={Lucas Germano and Lincoln Vieira and Ronaldo Goldschmidt and Julio Duarte and Ricardo Choren},
title={Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={504-511},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013156800003890},
isbn={978-989-758-737-5},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection
SN - 978-989-758-737-5
AU - Germano L.
AU - Vieira L.
AU - Goldschmidt R.
AU - Duarte J.
AU - Choren R.
PY - 2025
SP - 504
EP - 511
DO - 10.5220/0013156800003890
PB - SciTePress