Enriching the Semantic Representation of the Source Code with Natural Language-Based Features from Comments for Improving the Performance of Software Defect Prediction

Anamaria Briciu, Mihaiela Lupea, Gabriela Czibula, Istvan Gergely Czibula

2024

Abstract

This study belongs to a recent research direction that aims to improve software defect prediction by using additional knowledge such as source code comments. The proposed semantic representation of source code fuses programming language features learned from the code with natural language features extracted from the code comments. Two types of language models are applied to learn the semantic features: (1) the pre-trained models CodeBERT and RoBERTa, for code embedding and textual embedding respectively; (2) the doc2vec model, used for both code embedding and comment embedding. Each of these two semantic representations, in two variants (code features only, and code features fused with comment features), is used separately with the XGBoost classifier in experiments conducted on the Calcite dataset. The results show that adding the natural language features from the comments increases software defect prediction performance.
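
The two input variants described in the abstract (code features only, and code features fused with comment features) can be illustrated with a minimal sketch. It makes two assumptions that the abstract does not confirm: fusion is implemented as plain vector concatenation, and a toy hashed bag-of-tokens embedder (`toy_embed`, a hypothetical helper) stands in for CodeBERT, RoBERTa, or doc2vec so the example runs with the standard library alone. The function names, the vector size, and the sample snippet are all illustrative, and the classification step with XGBoost is left out.

```python
import hashlib
import re
from typing import List

DIM = 8  # toy embedding size; real language models produce much larger vectors


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Stand-in embedder: hashed bag-of-tokens, normalised to sum to 1.

    This is NOT CodeBERT, RoBERTa, or doc2vec. It only mimics their
    interface: text in, fixed-length numeric vector out.
    """
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # bucket the token by its hash
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def code_only_features(code: str) -> List[float]:
    """Variant 1: features learned from the source code alone."""
    return toy_embed(code)


def fused_features(code: str, comments: str) -> List[float]:
    """Variant 2: code features fused with comment features.

    Assumption: fusion is concatenation of the two vectors.
    """
    return toy_embed(code) + toy_embed(comments)


# Hypothetical sample, not taken from the Calcite dataset.
code = "int divide(int a, int b) { return a / b; }"
comments = "Divides a by b. Caller must ensure b is not zero."

x_code = code_only_features(code)
x_fused = fused_features(code, comments)
print(len(x_code), len(x_fused))
```

Under these assumptions, each variant would yield one feature vector per software entity, and the resulting feature matrices would then be passed separately to the classifier so that the two variants can be compared.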

Paper Citation


in Harvard Style

Briciu A., Lupea M., Czibula G. and Gergely Czibula I. (2024). Enriching the Semantic Representation of the Source Code with Natural Language-Based Features from Comments for Improving the Performance of Software Defect Prediction. In Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE; ISBN 978-989-758-696-5, SciTePress, pages 132-143. DOI: 10.5220/0012688400003687


in Bibtex Style

@conference{enase24,
author={Anamaria Briciu and Mihaiela Lupea and Gabriela Czibula and Istvan Gergely Czibula},
title={Enriching the Semantic Representation of the Source Code with Natural Language-Based Features from Comments for Improving the Performance of Software Defect Prediction},
booktitle={Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE},
year={2024},
pages={132-143},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012688400003687},
isbn={978-989-758-696-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE
TI - Enriching the Semantic Representation of the Source Code with Natural Language-Based Features from Comments for Improving the Performance of Software Defect Prediction
SN - 978-989-758-696-5
AU - Briciu A.
AU - Lupea M.
AU - Czibula G.
AU - Gergely Czibula I.
PY - 2024
SP - 132
EP - 143
DO - 10.5220/0012688400003687
PB - SciTePress