In this study, we employ various ML and DL al-
gorithms to classify vulnerable code changes of sev-
eral projects with real vulnerability data, and compare
two different source code representations for prediction.
We formulate our research question as follows: To
what extent do different kinds of source code representations
predict vulnerability-inducing code changes?
The contributions of our study are summarized as fol-
lows:
• We utilize two different source code representation
techniques, namely software metrics and AST-based
embeddings, and compare their effects on predicting
vulnerable code changes.
• We use the SmartSHARK dataset (Trautsch et al.,
2021), which is comprehensive in terms of
commits and software metrics but does not contain
the projects' vulnerabilities. We therefore mined
vulnerabilities from the National Vulnerability
Database (NVD)¹ and linked them to their associated
fixing commits through project-specific issue
tracking systems. We also identified
vulnerability-inducing changes using SZZ. As a result,
we propose an extended version of SmartSHARK that
includes reported vulnerabilities.
• We perform predictions at the code-change level
instead of the file, component, or method level, as
this gives instant feedback. The proposed method can
analyse code changes and predict vulnerabilities
after each commit. Because the model continuously
checks only code changes, both the complexity and
the analysis time are reduced.
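The labelling step described in the second contribution can be sketched as follows. This is a minimal, simplified illustration of the SZZ idea, not the authors' implementation: for every line deleted by a vulnerability-fixing commit, it looks up the commit that last modified that line and flags that commit as vulnerability-inducing. The `Commit` record, the `blame` mapping, the file name, and all commit identifiers are hypothetical; a real pipeline would obtain blame data from the version control system.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """Hypothetical, simplified commit record (not the SmartSHARK schema)."""
    sha: str
    # (file path, line number in the parent revision) pairs removed by this commit
    deleted_lines: list = field(default_factory=list)


def find_inducing_commits(fixing_commit, blame):
    """Return the SHAs of commits that last touched the lines a fix deleted.

    `blame` maps (file path, line number) to the SHA of the commit that
    last modified that line before the fix; here it is plain toy data.
    """
    inducing = set()
    for location in fixing_commit.deleted_lines:
        origin = blame.get(location)
        # Skip lines without blame data and never blame the fix itself.
        if origin is not None and origin != fixing_commit.sha:
            inducing.add(origin)
    return inducing


# Toy data: commit "f1" fixes a vulnerability by deleting two lines.
fix = Commit("f1", deleted_lines=[("Parser.java", 10), ("Parser.java", 11)])
blame = {("Parser.java", 10): "a1", ("Parser.java", 11): "b2"}

inducing = find_inducing_commits(fix, blame)
# Label every commit in a (hypothetical) history for change-level prediction.
labels = {sha: sha in inducing for sha in ["a1", "b2", "c3"]}
print(labels)
```

Published SZZ variants add further filtering, for example ignoring comment-only or whitespace-only lines, which this sketch omits.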
2 RELATED WORK
Predicting vulnerabilities from source code requires
some form of code modeling, either as metrics or as
other (token, embedding, or graph) representations.
Traditional software metrics such as size, complexity,
coupling, code churn, and fault history are widely
used to predict software defects and yield promising
performance (Li and Shao, 2019; Tosun and Bener,
2009). Several studies
also investigate the use of software metrics for vul-
nerability prediction: Shin and Williams (Shin and
Williams, 2008) investigate the validity of a hypothe-
sis that asserts software complexity is the enemy of
software security. They explore the usage of nine
different complexity metrics, commonly utilized in
software defect prediction, to predict security issues.
Their analysis of the Mozilla JavaScript Engine, which
aims to identify vulnerability-prone code parts, reports
that those nine metrics have a weak correlation with
vulnerabilities. Later, Shin and Williams (Shin and
Williams, 2013) also validate the aforementioned
hypothesis. They observe that the correlations between
complexity metrics and vulnerabilities are weak but
statistically significant.
¹ National Vulnerability Database. https://nvd.nist.gov
Chowdhury and Zulkernine (Chowdhury and
Zulkernine, 2011) investigate whether code complex-
ity, coupling and cohesion, i.e., structural metrics, can
be used for vulnerability prediction. They build various
classifiers using these metrics for 52 Mozilla Firefox
releases, and conclude that structural metrics are
useful for predicting vulnerabilities.
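To make the notion of a correlation between a software metric and vulnerabilities concrete, the following sketch computes a Spearman rank correlation in plain Python. The choice of Spearman's coefficient, the per-file complexity values, and the vulnerability counts are all assumptions made for illustration; they are not taken from the cited studies, and the significance test those studies report is omitted here.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-file data: a complexity metric and reported vulnerabilities.
complexity = [3, 12, 7, 25, 9, 4]
vulnerabilities = [0, 1, 0, 1, 2, 0]

rho = spearman(complexity, vulnerabilities)
print(f"Spearman rho = {rho:.3f}")
```

A value of rho near zero indicates a weak monotonic relationship; whether it is statistically significant depends on the sample size and requires a separate test.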
With the growing success of natural language
processing models, text-mining-based methods have
emerged as an alternative approach for extracting
source code features. Alon et al. (Alon et al., 2019)
propose code2vec as a neural network-based model
to represent source code as a continuous distributed
vector. First, the Abstract Syntax Tree (AST) of the
code is decomposed into a set of paths; the method
then learns an atomic representation of each path
while simultaneously learning how to aggregate the
set of paths. Lozoya et al. (Lozoya et al., 2021)
propose a code embedding technique called commit2vec,
based on code2vec. Instead of embedding the code
itself, this technique represents code changes in
order to classify security-relevant commits.
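The path decomposition step can be illustrated with a short sketch. It uses Python's built-in `ast` module as a stand-in for the parser a tool such as code2vec would use, and it covers only the extraction of (terminal, path, terminal) triples, not the learning step. The choice of terminal node types, the `^` (up) and `_` (down) separators, and the function names are assumptions made for this example.

```python
import ast
from itertools import combinations

# Node types treated as terminals in this sketch (an assumption).
TERMINALS = (ast.Name, ast.arg, ast.Constant)


def terminal_label(node):
    """Return the identifier or literal text carried by a terminal node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    return repr(node.value)


def collect_terminals(tree):
    """Return one root-to-terminal list of nodes per terminal, in source order."""
    found = []

    def visit(node, trail):
        trail = trail + [node]
        if isinstance(node, TERMINALS):
            found.append(trail)
            return  # do not descend below a terminal
        for child in ast.iter_child_nodes(node):
            visit(child, trail)

    visit(tree, [])
    return found


def path_contexts(source):
    """Decompose the AST of `source` into (terminal, path, terminal) triples."""
    terminals = collect_terminals(ast.parse(source))
    contexts = []
    for a, b in combinations(terminals, 2):
        # Count the shared prefix; its last node is the lowest common ancestor.
        shared = 0
        while a[shared] is b[shared]:
            shared += 1
        up = [type(n).__name__ for n in reversed(a[shared:])]
        top = type(a[shared - 1]).__name__
        down = [type(n).__name__ for n in b[shared:]]
        path = "_".join(["^".join(up + [top])] + down)
        contexts.append((terminal_label(a[-1]), path, terminal_label(b[-1])))
    return contexts


contexts = path_contexts("def add(x, y):\n    return x + y\n")
for context in contexts:
    print(context)
```

In an embedding model, each such triple would be mapped to a vector and the vectors aggregated into a single representation of the code fragment; that part is outside the scope of this sketch.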
Furthermore, word embedding techniques have
also been used to transform source code into numerical
vectors. Harer et al. (Harer et al., 2018) generated
embeddings for C/C++ tokens with the popular word2vec
model and utilized these for vulnerability prediction. Henkel
et al. (Henkel et al., 2018) applied the GloVe model
to extract word embeddings from the AST of the C
source code. Fang et al. (Fang et al., 2020) propose
the FastEmbed technique for vulnerability prediction
based on an ensemble of ML models.
Hanifi et al. (Hanifi et al., 2023) proposed a 1D
CNN-based method for vulnerability prediction at the
function level. The method transforms the AST of the
source code into a numerical vector while preserving
the structural and semantic information in the source
code. Şahin et al. (Şahin et al., 2022) propose a
vulnerability prediction model that uses different
code representations to determine whether a function
at a specific code change is vulnerable. They
represent the function versions as node embeddings
learned from their ASTs, and build models using two
Graph Neural Networks with node embeddings, as well as
a Convolutional Neural Network (CNN) and a Support
Vector Machine (SVM) with token representations.
Moreover, there exist studies that compare the per-
formance of software metrics-based and text mining-
ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering