
Limitations. While the debiasing approach in this work significantly improved model performance on the Juliet C/C++ 1.3 dataset, it is tailored specifically to that dataset. The identified biases, such as the static-function and cascade patterns, stem from the synthetic nature of SARD's Juliet project, so the method may not generalize to other datasets with different biases. Moreover, identifying these patterns is a manual process that depends heavily on the dataset under analysis, which limits its scalability.
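To make concrete what such a manual pattern check involves, the sketch below scans Juliet-style C files for one illustrative cue, the static-function pattern, and tallies it per label. The file layout, the filename-based labeling, and the regular expression are assumptions made for illustration; they are not the exact procedure used in this work.

```python
import re
from pathlib import Path

# Heuristic: a line beginning a function definition/declaration marked `static`.
STATIC_FN = re.compile(r"^\s*static\s+\w[\w\s\*]*\(", re.MULTILINE)

def count_static_functions(source: str) -> int:
    """Count occurrences of the static-function cue in one source file."""
    return len(STATIC_FN.findall(source))

def scan(root: str) -> dict:
    """Tally the cue separately for files assumed to be labeled by filename:
    'bad' (vulnerable) vs. 'good' (non-vulnerable)."""
    counts = {"bad": 0, "good": 0}
    for path in Path(root).rglob("*.c"):
        label = "bad" if "bad" in path.stem.lower() else "good"
        counts[label] += count_static_functions(path.read_text(errors="ignore"))
    return counts

if __name__ == "__main__":
    # Hypothetical dataset location; a large good/bad imbalance in the
    # tallies would flag `static` as a label-correlated artifact.
    print(scan("juliet/testcases"))
```

Each new cue requires hand-writing a check of this kind, which is precisely the scalability limitation noted above.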
Another limitation of this work is the potential for overfitting to the bias-free dataset. While the model performs exceptionally well on the sanitized version of the SARD dataset, there is a risk that it has learned to recognize specific patterns or cues inherent to the cleaned synthetic data rather than developing a broader understanding of vulnerability detection. Because SARD is synthetic, it may still contain subtle artifacts that human researchers overlook but that the model can exploit to predict labels. This could lead to an overestimation of the model's actual capability: real-world code lacks the artificial patterns introduced by test-case generation algorithms and presents a more complex and noisy environment for vulnerability detection.
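One way to probe for this kind of residual shortcut reliance is to re-evaluate the trained model after stripping surface cues, such as comments and telltale identifier names, and compare accuracies. The sketch below outlines this idea; `classify` is a hypothetical wrapper around the trained model, and the cue-stripping rules are deliberately crude illustrations rather than this paper's method.

```python
import re

def strip_surface_cues(code: str) -> str:
    """Remove comments and rename Juliet-style identifiers that leak labels."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    code = re.sub(r"\b(good|bad)\w*\b", "fn", code)         # telltale names
    return code

def shortcut_gap(classify, samples):
    """samples: list of (code, label) pairs. Returns the accuracy drop
    between original and cue-stripped inputs; a large gap suggests the
    model keyed on surface artifacts rather than vulnerability logic."""
    orig = sum(classify(c) == y for c, y in samples) / len(samples)
    masked = sum(classify(strip_surface_cues(c)) == y
                 for c, y in samples) / len(samples)
    return orig - masked
```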
Future Work. Future research should focus on automating the detection of biases in synthetic datasets, or on exercising greater care during dataset creation to avoid introducing skewed patterns in the first place. Additionally, models trained on synthetic datasets should be rigorously evaluated on real-world datasets to better assess their generalization capabilities. While synthetic datasets like Juliet provide extensive test cases, real-world code exhibits greater complexity and diversity in vulnerabilities, making evaluation on such data essential for robust and practical model assessment.
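As a starting point for such automation, one could screen a dataset for lexical features that are suspiciously predictive of the label on their own, since near-perfect single-token predictors are candidate artifacts of the generation process. The sketch below assumes a pre-tokenized dataset and an illustrative frequency threshold; it is a screening heuristic, not a complete bias detector.

```python
from collections import Counter

def token_label_skew(samples, min_count=50):
    """samples: list of (token_list, label) pairs with label in {0, 1}.
    Returns tokens sorted by how far P(label=1 | token) deviates from the
    base rate; the most skewed tokens are candidate dataset artifacts."""
    by_token = Counter()       # documents containing each token
    pos_by_token = Counter()   # of those, how many are labeled 1
    n_pos = sum(y for _, y in samples)
    base = n_pos / len(samples)
    for tokens, y in samples:
        for t in set(tokens):
            by_token[t] += 1
            pos_by_token[t] += y
    skew = {
        t: abs(pos_by_token[t] / c - base)
        for t, c in by_token.items() if c >= min_count
    }
    return sorted(skew.items(), key=lambda kv: -kv[1])
```

Tokens surfaced by such a screen would still need human review, but ranking them automatically would reduce the manual effort that limits the present approach.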