Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection
Lucas B. Germano, Lincoln Q. Vieira, Ronaldo Goldschmidt, Julio Cesar Duarte, Ricardo Choren
2025
Abstract
Software security ensures data privacy and system reliability. Vulnerabilities introduced during the development cycle can lead to privilege escalation, enabling data exfiltration or denial-of-service attacks. Static code analyzers, based on predefined rules, often fail to detect errors beyond these patterns and suffer from high false-positive rates, while rule creation remains labor-intensive. Machine learning offers a flexible alternative that can leverage extensive datasets of real and synthetic vulnerability data. This study examines the impact of bias in synthetic datasets on model training. Using CodeBERT for C/C++ vulnerability classification, we compare models trained on biased and unbiased data, incorporating often-overlooked preprocessing steps to remove biases. Results show that the unbiased model achieves 98.5% accuracy, compared to 63.0% for the biased model, emphasizing the critical need to address dataset biases in training.
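The abstract mentions "often-overlooked preprocessing steps to remove biases" without detailing them here. One well-known bias in synthetic C/C++ vulnerability corpora is that flawed samples are annotated with comments (e.g. "/* FLAW: ... */"), letting a classifier shortcut-learn the label from comment text rather than code semantics. The sketch below is a hypothetical illustration of such a debiasing step (comment stripping), not the paper's actual pipeline:

```python
import re


def strip_c_comments(source: str) -> str:
    """Remove // line comments and /* ... */ block comments from C/C++ code.

    Hypothetical preprocessing step: synthetic vulnerability datasets often
    mark flawed lines in comments, which leaks the label into the input text.
    Note: this naive regex approach does not handle comment-like sequences
    inside string literals; a real pipeline would use a lexer.
    """
    # Remove block comments first (non-greedy, spanning newlines), ...
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # ... then line comments up to the end of each line.
    source = re.sub(r"//[^\n]*", "", source)
    # Collapse leftover runs of spaces/tabs so token counts stay stable.
    return re.sub(r"[ \t]+", " ", source)


if __name__ == "__main__":
    sample = "int x = 0; /* FLAW: integer overflow */\nx++; // trigger here\n"
    print(strip_c_comments(sample))
```

Stripping label-bearing comments forces the model to rely on the code itself, which is one plausible reason a debiased training set would generalize better than a biased one.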
Paper Citation
in Harvard Style
Germano L., Vieira L., Goldschmidt R., Duarte J. and Choren R. (2025). Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 504-511. DOI: 10.5220/0013156800003890
in Bibtex Style
@conference{icaart25,
author={Lucas Germano and Lincoln Vieira and Ronaldo Goldschmidt and Julio Duarte and Ricardo Choren},
title={Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={504-511},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013156800003890},
isbn={978-989-758-737-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection
SN - 978-989-758-737-5
AU - Germano L.
AU - Vieira L.
AU - Goldschmidt R.
AU - Duarte J.
AU - Choren R.
PY - 2025
SP - 504
EP - 511
DO - 10.5220/0013156800003890
PB - SciTePress
ER -