Automated Software Vulnerability Detection Using CodeBERT and
Convolutional Neural Network
Rabaya Sultana Mim, Abdus Satter, Toukir Ahammed and Kazi Sakib
Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh
Keywords:
Source Code, Vulnerability Detection, CodeBERT, Centrality Analysis, Convolutional Neural Network.
Abstract:
As software programs continue to grow in size and complexity, the prevalence of software vulnerabilities
has emerged as a significant security threat. Detecting these vulnerabilities has become a major concern due
to the potential security risks they pose. Though Deep Learning (DL) approaches have shown promising
results, previous studies have encountered challenges in simultaneously maintaining detection accuracy and
scalability. In response to this challenge, our research proposes a method of automated software Vulnerability
detection using CodeBERT and Convolutional Neural Network called VulBertCNN. The aim is to achieve
both accuracy and scalability when identifying vulnerabilities in source code. This approach utilizes the
pre-trained CodeBERT embedding model in graph-based analysis of source code and then applies complex network
analysis theory to convert a function’s source code into an image, taking into account both syntactic and
semantic information. Subsequently, a text convolutional neural network is employed to detect vulnerabilities
from the generated images of code. In comparison to three existing CNN-based methods, TokenCNN, VulCNN
and ASVD, our experimental results demonstrate a noteworthy improvement in accuracy from 78.6% to
95.7% and in F1 measure from 62.6% to 89%, increases of 21.7% and 26.3% respectively.
This underscores the effectiveness of our approach in detecting vulnerabilities in large-scale source code.
Hence, developers can employ these findings to promptly apply effective patches on vulnerable functions.
1 INTRODUCTION
Software vulnerabilities pose an increasing risk to
software systems making them susceptible to attacks
and potential damage, thereby raising security con-
cerns (Alves et al., 2016). In 2023, the Open Source
Security and Risk Analysis (OSSRA) conducted a
comprehensive study involving 1703 codebases with
audit data. The findings indicated that 76% of the code was open source. Moreover, 48% of the codebases exhibited high-risk vulnerabilities, and 84% of these vulnerabilities were associated with open source security flaws. Consequently, to enhance the security of software, it is crucial to employ advanced methods for detecting vulnerabilities on a large scale.
Recently, several approaches for vulnerability detection using Deep Learning (DL) have emerged, falling into two categories: the text-based approach
(Li et al., 2018; Zou et al., 2019) and the graph-
based approach (Zhou et al., 2019; Cheng et al.,
2021). Prior studies (Li et al., 2018; Zou et al., 2019;
Mim et al., 2023a) focusing on text-based identifica-
tion of source code vulnerabilities applied static pro-
gram analysis or natural language processing. However, these approaches often fall short because they disregard the semantics of the source code. To address these
limitations, program analysis is employed to repre-
sent source code semantics as a graph. Graph analy-
sis methods such as Graph Neural Networks (GNN),
are then applied to identify vulnerabilities. While these graph-based approaches excel at vulnerability identification, their scalability is challenging, especially when compared to text-based approaches.
Text-based approaches, lacking the ability to capture
inter-dependencies between different lines of source
code, result in lower accuracy. On the other hand,
graph-based methods achieve high accuracy but strug-
gle with scalability in complex scenarios with many
nodes in a graph representing statements in the pro-
gram’s source code. The most recent vulnerability detection systems, VulCNN (Wu et al., 2022) and ASVD (Mim et al., 2023b), attempt to combine both text-based and graph-based approaches to gather syntactic and semantic data from source code. However,
VulCNN’s scalability is improved compared to eight state-of-the-art vulnerability detectors. Still, its detection performance is deemed unsatisfactory, as it leverages Sent2Vec (Moghadasi and Zhuang, 2020) embedding of source code, which struggles to capture intricate code context and may misinterpret code semantics. VulCNN only considers three network centralities (Freeman et al., 2002) (degree centrality, Katz centrality, and closeness centrality) for calculating the importance of each line of source code. The impact of incorporating other centrality measures, such as eigenvector centrality and betweenness centrality, on vulnerability detection has not been explored yet. These centralities, combined with DL-based approaches, might have a greater impact on vulnerability detection, as they account for the overall effect of a node, i.e., a statement, in the source code.
To address these issues, we propose an en-
hanced automated Vulnerability detection method us-
ing CodeBERT and Convolutional Neural Network
called VulBertCNN, which uses the CodeBERT (Feng et al., 2020) embedding model, leveraging the advantages of large-scale pre-training to capture both syntactic and semantic information of source code. Our method comprises three phases. First, using
the source code as input, we create a Program Depen-
dency Graph (PDG) containing data flow and control
flow information. In the second phase, CodeBERT
embedding is applied to each node of the generated
PDG. In the third phase, we utilize different combi-
nations of five centralities on each node, with each
centrality corresponding to a channel, to create an im-
age. Finally, a Convolutional Neural Network (CNN)
(Krizhevsky et al., 2017) model is trained on the pro-
duced images to identify functions as either vulnera-
ble or non-vulnerable.
To ensure the model’s effectiveness in large-scale
vulnerability scanning, we explored the pre-trained CodeBERT embedding model with various centrality combinations and applied the proposed model to two benchmark datasets, SARD and Big-Vul (Fan et al., 2020), containing 40,584 and 188,770 functions respectively. Results reveal that it outperforms the state-of-the-art vulnerability detectors in terms of accuracy and F1 measure by 21.7% and 26.3%.
In summary, the contributions of this paper are as
follows.
We analyzed the shortcomings of three recent vulnerability detection approaches, TokenCNN, VulCNN and ASVD, in performing syntactic and semantic analysis, along with the selection of node centralities for generating images from source code for CNN. We then proposed an efficient combination of node centralities to generate images.
We utilize the language model to develop a tech-
nique for representing vulnerable source code to
detect software vulnerabilities. In contrast to pre-
vious studies (Wu et al., 2022; Mim et al., 2023b)
that employed the Sent2Vec embedding method,
we incorporate the CodeBERT embedding model.
Furthermore, we design a more efficient Convo-
lutional Neural Network (CNN) model by inte-
grating five centralities. This design aims to re-
duce computational overhead while enhancing the
overall detection performance.
Our experimental results indicate that Vul-
BertCNN outperforms the state-of-the-art meth-
ods (Russell et al., 2018; Wu et al., 2022; Mim
et al., 2023b) on two benchmark datasets SARD
and BigVul (Fan et al., 2020) significantly in
terms of accuracy and F1 measure.
Paper Organization. The remainder of this paper is
structured as follows. Section 2 gives an overview
of previous studies on vulnerability detection and
presents the motivation for our improvements. Sec-
tion 3 introduces the technical route of VulBertCNN.
Section 4 presents the experimental setup and results
analysis. Section 5 discusses the threats to validity of our work. Section 6 includes future research directions and concludes this paper.
2 RELATED WORK
This section focuses on research conducted to detect vulnerabilities. Vulnerability detection techniques vary in their degree of automation, typically falling into three main categories: manual, semi-automatic, and fully automatic methods. Manual tech-
niques rely on human experts to create vulnerability
patterns, but they may not cover all potential vulnera-
bilities, resulting in lower detection efficiency in real
world scenarios for tools like Checkmarx, FlawFinder and RATS. Semi-automatic techniques (Shankar et al.,
2001; Yamaguchi et al., 2015; Shar et al., 2014) in-
volve human experts extracting specific features, such
as API symbols (Yamaguchi et al., 2012) and sub-
trees, import and function calls (Neuhaus et al., 2007)
which are then fed into traditional machine learning
models to identify vulnerabilities. In contrast, fully automatic techniques (Cheng et al., 2021; Duan et al.,
2019; Li et al., 2021; Lin et al., 2017) leverage deep
learning to automatically extract features and gener-
ate vulnerability patterns without the need for manual
expert input. Deep learning based techniques (Cheng
et al., 2021; Zhou et al., 2019; Russell et al., 2018;
Li et al., 2018; Zou et al., 2019) can be further cat-
egorized into text-based approaches and graph-based
approaches.
2.1 Text Based Approach
The text-based approaches involve treating a pro-
gram’s source code as text and employing natural lan-
guage processing techniques to identify vulnerabili-
ties. Russell et al. presented a TokenCNN model,
which utilizes lexical analysis for obtaining source
code tokens and employs a Convolutional Neural Net-
work (CNN) for the detection of vulnerabilities (Rus-
sell et al., 2018). Rahman et al. introduced a method-
level detection approach that minimizes the search
space, determining similarity scores between source
code and bug reports. Based on these scores, it
ranks methods that are identified as vulnerable (Rah-
man et al., 2016). Li et al. proposed Vuldeepecker,
which gathers code gadgets through program slic-
ing, converts them into vector formats, and utilizes
Bidirectional Long Short Term Memory (BLSTM)
models for vulnerability detection (Li et al., 2018).
Zou et al. presented an enhanced method called µVulDeePecker, which incorporates code attention, integrating control dependence into Vuldeepecker’s program processing technique to identify multi-class vulnerabilities (Zou et al., 2019). Mim et al. pro-
posed VFDetector which uses information retrieval
based method for detecting vulnerability from source
code using vulnerability reports. VFDetector calcu-
lates the textual similarity score between a vulnerability report’s description and the source code of a software system; a higher similarity score suggests a highly vulnerable system (Mim et al., 2023a). These text-based approaches suffer from poor detection performance due to their reliance on static analysis of source code, neglecting source code semantics and treating the whole source code as plain text.
2.2 Graph Based Approach
To overcome the limitations of text-based approaches,
researchers have turned to dynamic program analy-
sis, converting source code semantics into a graph
and employing graph analysis for vulnerability de-
tection. Zhou et al. introduced an approach uti-
lizing a graph neural network with a convolutional
module to identify vulnerabilities, achieving com-
plete graph-level classification through node pooling
(Zhou et al., 2019). Cheng et al. segmented the pro-
gram dependency graph into subgraphs after distill-
ing program semantics, integrating these subgraphs
into a graph neural network to train a vulnerability
detector (Cheng et al., 2021). While graph-based ap-
proaches prove more effective in identifying vulnera-
bilities, they suffer from scalability issues compared
to text-based strategies. Addressing these, Wu et al.
proposed VulCNN, which leverages a program depen-
dency graph (PDG) to extract information from each
line of code. Centrality analysis on the PDG quanti-
fies the significance of each node in a specific func-
tion, considering three centralities: degree, katz, and
closeness. This analysis produces an image capturing
graph features from three perspectives. Subsequently,
a convolutional neural network (CNN) is trained to
detect vulnerabilities (Wu et al., 2022). Though VulCNN addresses both syntactic and semantic analysis, its detection performance is not satisfactory.
Text-based vulnerability detection techniques of-
ten overlook program semantics, leading to inaccurate
results. Conversely, graph-based techniques offer ac-
curacy but face scalability challenges, primarily due
to the substantial number of nodes in program graphs.
Consequently, there is a need for automated vulner-
ability detection techniques that strike a balance be-
tween accuracy and scalability, considering both as-
pects simultaneously.
3 PROPOSED METHODOLOGY
This section proposes VulBertCNN (Vulnerability de-
tection with CodeBERT based Convolutional Neu-
ral Network) which consists of three major phases:
Graph Generation, Feature Extraction and Vulnera-
bility Detection, as shown in Figure 1. The details are described in the following subsections.
Figure 1: Overview of proposed vulnerability detection system using CodeBERT and Convolutional Neural Network.
3.1 Graph Generation
As shown in Figure 2, this phase initially normalizes the source code of a function before performing static analysis to obtain the function’s Program Dependency Graph (PDG). Since the goal of VulBertCNN is to concurrently detect vulnerabilities with accuracy and scalability, static analysis is first performed to translate the program semantics of the source code into a graph representation. Since a single function typically implements a specific task, this phase concentrates on finding vulnerabilities at a more fine-grained level (i.e., function level), given the coarse granularity of file-level vulnerability detection. The normalization is performed in three steps. A sample function transformation at the three normalization steps is demonstrated in Figure 2. The steps are:
Step 1: Eliminate the comments from the source code, because they have no effect on the semantics of the program.
Step 2: One-to-one mapping of user-defined vari-
ables to symbolic names is performed. For ex-
ample, the variable named “value” is mapped to
symbolic name “VAR1”.
Step 3: One-to-one mapping of user-defined
functions to symbolic names is performed. Such
as the function named “Vulnerable()” is converted
to symbolic name “FUN1()”.
Figure 2: Steps of Source Code Normalization.
After normalization, the PDG of the function is extracted using Joern (Yamaguchi et al., 2014), an open source code analysis tool for C/C++. Each line of code in the function represents a node in the PDG.
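To make the normalization concrete, the following is a minimal Python sketch of the three steps. The regular expressions and the example name mappings (value to VAR1, Vulnerable to FUN1) are illustrative assumptions; an actual implementation may rely on a proper C/C++ parser to identify user-defined names.

```python
import re

def normalize(source: str) -> str:
    """Sketch of the three normalization steps described above."""
    # Step 1: strip /* ... */ block comments and // line comments.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    # Steps 2-3: one-to-one mapping of user-defined variable and
    # function names to symbolic names. The names are assumed to be
    # known here; a real implementation would extract them with a parser.
    name_map = {"value": "VAR1", "Vulnerable": "FUN1"}
    for old, new in name_map.items():
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return source
```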
3.2 Feature Extraction
In this feature extraction step of our research, we em-
ploy CodeBERT (Feng et al., 2020), a pre-trained
BERT model that integrates both Natural Language
(NL) and Programming Language (PL) encodings
creating a comprehensive model suitable for fine-
tuning on source code tasks. Trained on an extensive
dataset sourced from code repositories and program-
ming documents, CodeBERT demonstrates enhanced
effectiveness in software program training and source
code analysis. Firstly, each node of program depen-
dency graph which represents a statement of source
code in tokenized. Then each tokenized lines of code
is given input to pre-trained codeBERT model and
output contains contextual vector representation of
each token. This phase has two steps : Tokenization
and CodeBert Embedding.
Tokenization: During the pre-training phase, input
data is constructed by combining two segments using special tokens: [CLS], w1, w2, ..., wn, [SEP], where [CLS] serves as a classification token. This input
structure involves one segment representing natural
language text and the other representing code from a
specific programming language. The [CLS] token is
a special token placed before the two segments. Fol-
lowing standard text processing in Transformer, natu-
ral language text is treated as a sequence of words and
divided into WordPieces (Wu et al., 2016), while a
code snippet is considered a sequence of tokens. The
output of CodeBERT includes contextualized vector
representations for each token, encompassing both
natural language and code, as well as the represen-
tation of [CLS], which serves as a summarized repre-
sentation.
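As an illustration of this input structure, the sketch below runs the public CodeBERT tokenizer on one statement. Since CodeBERT uses a RoBERTa-style vocabulary, its special tokens are printed as <s> and </s>, playing the roles of [CLS] and [SEP] described above; the exact token split shown in the comment is indicative only.

```python
from transformers import AutoTokenizer

# Public pre-trained CodeBERT checkpoint (Feng et al., 2020).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

ids = tokenizer("int x = value + 1;")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# Output resembles: ['<s>', 'int', 'Ġx', 'Ġ=', 'Ġvalue', 'Ġ+', 'Ġ1', ';', '</s>']
```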
CodeBert Embedding: In the context of our re-
search, a snippet of source code is extracted as a PDG from the graph generation phase. Processing the PDG
during this feature extraction phase includes obtain-
ing a comprehensive code representation which is vi-
tal for subsequent model construction. We leverage
CodeBert (Feng et al., 2020), designed to extract code
features using a transformer-based method, specifi-
cally effective for source code-related tasks with self-
supervised learning objectives. CodeBERT takes the PDG of a single function of source code as the raw input, then processes and splits it into individual statements $S_i$. Each statement is tokenized using CodeBERT’s pretrained BPE tokenizer. After collecting $S = \{S_1, S_2, \ldots, S_n\}$, the entire function is input into CodeBERT, enabling the acquisition of function-level and statement-level code representations. This CodeBERT embedding converts each statement into an equivalent vector representation with dimension size 768. Finally, an embedded function-level graph with statement-level features is extracted from this feature extraction phase.
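A minimal sketch of this embedding step is shown below, assuming the Hugging Face transformers library and the microsoft/codebert-base checkpoint. Taking the vector at the [CLS] position as the statement-level representation is one common choice; mean-pooling the token vectors would be another.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_statement(statement: str) -> torch.Tensor:
    """Map one PDG node (one normalized statement) to a 768-d vector."""
    inputs = tokenizer(statement, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    # The vector at the [CLS] position summarizes the statement.
    return out.last_hidden_state[0, 0]

vec = embed_statement("VAR1 = FUN1(VAR2);")
print(vec.shape)  # torch.Size([768])
```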
3.3 Image Construction
After completing CodeBert embedding, each node in
the new embedded PDG is substituted with its corre-
sponding embedded vector. In this phase, our goal is to convert the new PDG into an image efficiently, taking into account how various lines of code contribute to program semantics. To generate an image of a function from the PDG, it’s essential to have information
tion from the PDG, it’s essential to have information
about the connections between the nodes. The weight
of each connection is determined by evaluating the
node’s contributions to program semantics. Treating
the PDG as a social network graph, node connection
weights are determined using social network central-
ity analysis. This analysis is employed to evaluate the significance of each node, or in other words, each line of code. Centrality (Freeman et al., 2002) con-
cepts, originally introduced in social network analysis
aim to assess a node’s importance within the network.
Various fields including biological and transportation
networks have successfully applied centrality analysis
demonstrating its utility in network assessment. Our
paper considers five centrality metrics which are dis-
cussed below.
Degree Centrality: Degree centrality for a node is
determined by counting both the incoming edges (in-
degree) and outgoing edges (outdegree), essentially
representing the number of links associated with the
node. To obtain the standardized value, the degree is divided by N−1, where N represents the total number of nodes in the network.
Katz Centrality: It calculates the centrality value
of a node by taking into account the centrality of its
neighboring nodes. This is determined by summing
the number of directly connected nodes and the num-
ber of indirectly connected nodes through these im-
mediate neighbors. The Katz centrality of a node n can be expressed as follows:

$x_n = \alpha \sum_{j} A_{nj} x_j + \beta$   (1)
In the given expression, A represents the Adjacency
Matrix, and α, β and λ stand for the Attenuation Fac-
tor, Initial Centrality Controller and Eigenvalues of
the graph G respectively. These parameters are in-
strumental in assigning increased weight to nearby
neighbors (via β), while simultaneously applying a
penalty to distant links (utilizing α). Notably, α must
be smaller than the inverse of the largest eigenvalue
of the adjacency matrix to ensure the proper compu-
tation of Katz centrality allowing for an accurate mea-
surement by considering the influence of various fac-
tors.
$\alpha < \frac{1}{\lambda_{max}}$   (2)
Closeness Centrality: Closeness centrality mea-
sures how close a node is to every other node in
the graph. It is computed by averaging the shortest
path lengths between the nodes within the graph. A
node with a smaller average distance signifies greater
closeness to the center of the graph. The average dis-
tance is essentially the inverse of the node’s proximity
centrality among all x-1 accessible nodes.
$C(x) = \dfrac{x-1}{\sum_{n=1}^{x-1} d(n,x)}$   (3)

In this equation, x represents the total number of nodes in the graph and d(n, x) denotes the distance between nodes n and x.
Eigenvector Centrality: Eigenvector centrality as-
sesses a node’s overall impact based on the centrality
of its neighboring nodes. In accordance with the given definition, the i-th element of the vector n corresponds to the eigenvector centrality of node i.

$A\mathbf{n} = \lambda \mathbf{n}$   (4)

Here, the eigenvalue λ is associated with the adjacency matrix A.
Betweenness Centrality: Betweenness centrality considers the shortest paths between nodes in a graph, representing the percentage of all pairs of shortest paths that traverse a specific node a. Mathematically, it is expressed as:

$C_B(a) = \sum_{s,t \in N} \dfrac{\sigma(s,t \mid a)}{\sigma(s,t)}$   (5)
In this formula, N signifies the set of nodes, σ(s,t) denotes the count of shortest paths between nodes s and t, and σ(s,t|a) represents the count of those paths passing through node a (where a is distinct from s and t). The standardized degree centrality mentioned earlier is computed as:

$A_x = \dfrac{\text{Degree}(x)}{N-1}$   (6)
Due to the typical composition of an RGB image with three channels (Red, Green, and Blue), prior studies utilize three different centrality measures: degree centrality, Katz centrality, and closeness centrality. These centrality measures provide insights into the importance of various lines of code within a function from distinct perspectives. By incorporating two additional centrality measures, betweenness centrality and eigenvector centrality, we can achieve a more comprehensive assessment of each line of code’s contribution to the overall program semantics, because
Eigenvector centrality assesses the importance of a
node based on its connections to other highly central
nodes, offering insight into critical elements within a
PDG. Betweenness centrality identifies nodes that act
as key intermediaries, revealing critical pathways for
information flow or dependencies between different
modules or functions in a program. Both measures
provide valuable insights into the structural and func-
tional importance of nodes within a network, aiding
in understanding program semantics and dependen-
cies in source code.
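As an illustration, all five centralities can be computed on a PDG with the NetworkX library, as sketched below. The parameter values for Katz centrality and the undirected conversion for eigenvector centrality are illustrative assumptions, not settings reported in this paper.

```python
import networkx as nx

def centrality_scores(pdg: nx.DiGraph) -> dict:
    """Per-node scores for the five centralities discussed above."""
    return {
        "degree":      nx.degree_centrality(pdg),
        "katz":        nx.katz_centrality_numpy(pdg, alpha=0.1, beta=1.0),
        "closeness":   nx.closeness_centrality(pdg),
        # Eigenvector centrality is computed on the undirected view
        # here to guarantee convergence on arbitrary PDGs.
        "eigenvector": nx.eigenvector_centrality_numpy(pdg.to_undirected()),
        "betweenness": nx.betweenness_centrality(pdg),
    }
```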
In summary, we compute the centrality values for all nodes in the new embedded PDG. Subsequently, we arrange the resulting vectors in accordance with the number of lines of code, each multiplied by the corresponding centrality measure. These arranged vectors represent the “Degree channel”, “Katz channel”, and “Closeness channel”; by additionally applying betweenness centrality and eigenvector centrality analysis, we obtain the “Betweenness channel” and “Eigenvector channel”. Ultimately, the combination of these five channels is utilized to generate the final image representation of a source code function.
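A sketch of this channel construction follows, assuming embeddings maps each PDG node to its 768-dimensional CodeBERT vector and scores is the dictionary produced by the centrality helper sketched earlier; ordering rows by line number is an assumption consistent with the description above.

```python
import numpy as np

CHANNELS = ("degree", "katz", "closeness", "eigenvector", "betweenness")

def build_image(embeddings: dict, scores: dict) -> np.ndarray:
    """Scale each statement vector by each centrality score and stack
    the results: one channel per centrality, one row per line of code."""
    nodes = sorted(embeddings)                    # rows follow line order
    channels = [
        np.stack([scores[name][n] * embeddings[n] for n in nodes])
        for name in CHANNELS
    ]
    return np.stack(channels)                     # shape: (5, num_lines, 768)
```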
3.4 Vulnerability Classification
Deep learning algorithms have outperformed previous
technologies in various domains, such as speech and
image recognition. By utilizing effective hierarchical
feature extraction methods and unsupervised or semi-
supervised feature learning, deep learning presents
the advantage of replacing manual feature acquisition.
In the domain of image processing, Convolutional Neural Networks (CNN) have garnered significant attention. This is attributed to their ability not only to eliminate the need for manual image preparation but also to enable users to extract features at a level comparable to human capabilities. Following the image generation phase, a function’s source code is transformed into an image.
Figure 3: Cumulative Distribution Function of total number
of lines of code of a function.
To identify vulnerabilities, the initial step involves training a CNN model with images. While CNN typically utilizes equal-sized input images, the varying number of statements in each function within our input dataset necessitates an adjustment. To determine a threshold for fixed-size images, an analysis is conducted to assess the number of lines of code in each function. Figure 3 shows the Cumulative Distribution Function (CDF) of the total number of statements (lines of code) in functions from the input dataset. It is observed that over 99% of the func-
tions contain fewer than 200 statements. After exper-
imenting with various threshold values (ranging from
40 to 200 statements) for vulnerability detection, a de-
cision is made to set the cutoff at the first 100 state-
ments of a function. This decision is based on con-
siderations of detection accuracy and related runtime
overhead. For functions with fewer than 100 state-
ments, zeros are padded at the end of the vectors. In
functions with more than 100 statements, the vector’s
tail is discarded. The input images are typically of sizes 3×100×768, 4×100×768, or 5×100×768, where 3, 4, and 5 denote the number of channels, and 768 denotes the dimension of the embedded vector.
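The fixed-size adjustment can be sketched as below; the function pads zero rows at the tail or truncates to the first 100 statements, matching the cutoff chosen above.

```python
import numpy as np

def fix_length(image: np.ndarray, max_lines: int = 100) -> np.ndarray:
    """Force every function image to shape (channels, max_lines, 768)."""
    c, n, d = image.shape
    if n >= max_lines:
        return image[:, :max_lines, :]            # discard the tail
    pad = np.zeros((c, max_lines - n, d), dtype=image.dtype)
    return np.concatenate([image, pad], axis=1)   # zero-pad at the end
```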
Figure 4: CNN Classification of VulBertCNN.
In Figure 4, the input image shape is specified as 5×100×768, where 5 represents five distinct channels: “Degree” (D), “Katz” (K), “Closeness” (C), “Eigenvector” (E), and “Betweenness” (B). These
channels are created to incorporate different aspects.
Following the generation of fixed-size images, they
are utilized for training a CNN model. The CNN
model employs various convolution filters, each with a shape of m×768, as depicted in Figure 4. This design
allows each filter to independently cover the entire ca-
pacity of the CodeBert embedding. The variable m
denotes the size of the filter, indicating the number of
consecutive sentences to be considered.
Table 1: Hyper-Parameter Settings in VulBertCNN.

Hyper-Parameter        Value
Learning Rate          0.0001
Number of Epochs       100
Batch Size             64
Loss Function          Cross Entropy Loss
Activation Function    ReLU
Optimizer              Adam
In our proposed method, we extract features from
different parts of the image by employing various fil-
ter sizes ranging from 1 to 10, each with 64 feature
maps. This process is followed by max pooling and
the application of the Rectified Linear Unit (ReLU)
(Dahl et al., 2013) activation function across the en-
tire model. The datasets are initially divided based on
their original split. However, in cases where the split
information is unavailable, we adopt a default split of
80:10:10 for training, validation and testing sets re-
spectively. The hyperparameters utilized in the CNN
architecture are detailed in Table 1.
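The description above can be realized as the following PyTorch sketch. The filter sizes (1 to 10), the 64 feature maps per size, ReLU, and max pooling follow the text; the fully connected output layer and other wiring details are assumptions.

```python
import torch
import torch.nn as nn

class VulBertTextCNN(nn.Module):
    """Sketch of the text-CNN classifier over (channels, 100, 768) images."""
    def __init__(self, channels: int = 5, dim: int = 768):
        super().__init__()
        # One convolution per filter size m = 1..10; each filter spans
        # the full embedding dimension, so it reads m whole statements.
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, 64, kernel_size=(m, dim))
            for m in range(1, 11)
        )
        self.fc = nn.Linear(64 * 10, 2)   # vulnerable / non-vulnerable

    def forward(self, x):                 # x: (batch, channels, 100, 768)
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(x)).squeeze(3)        # (batch, 64, L)
            feats.append(torch.max(h, dim=2).values)  # max pool over time
        return self.fc(torch.cat(feats, dim=1))

model = VulBertTextCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # per Table 1
criterion = nn.CrossEntropyLoss()
```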
4 EVALUATION AND RESULT
ANALYSIS
In this section, we perform experiments to compare
the detection accuracy of VulBertCNN with state-of-
the-art solutions TokenCNN (Russell et al., 2018),
VulCNN (Wu et al., 2022) and ASVD (Mim et al.,
2023b). Before delving into the effectiveness of Vul-
BertCNN, we provide details on the implementation
specifics.
4.1 Experiment Setup
The proposed method is implemented in Python (ver-
sion 3.11.5). We conducted experiments using an
ASUS TUF Gaming laptop featuring an Intel Core
i7-8th generation CPU on a Windows server. The pro-
cessor in the laptop has six cores, each with a maxi-
mum operating frequency of 2.5 GHz.
4.2 Datasets
To evaluate the effectiveness of VulBertCNN, we use
two benchmark vulnerability datasets: SARD and
Big-Vul (Fan et al., 2020) in alignment with state-
of-the-art methods. The datasets were collected from
two sources: National Institute of Standards and
Technology (NIST) and Common Vulnerability Ex-
posure (CVE) database. The details of the experimen-
tal datasets are presented below in Table 2.
Table 2: Details of Datasets.

Dataset   Total     Vul      Non-Vul   Vul (%)
SARD      40,584    13,684   26,900    33.71
Big-Vul   188,770   10,670   178,100   5.65

SARD: The software assurance reference dataset (SARD) encompasses a significant volume of production, synthetic, and academic security flaws (i.e., bad functions) and non-vulnerable functions (i.e., good functions). Our paper concentrates on identi-
fying vulnerabilities specifically in C/C++, thus we
exclusively target functions written in C/C++ within
SARD. The dataset from SARD comprises 12,300 in-
stances of vulnerable functions and 21,000 instances
of non-vulnerable functions. Recognizing the po-
tential lack of realism in synthetic programs within
SARD, we supplement our data with another dataset
derived from real-world software. For real-world
vulnerabilities, we utilize the National Vulnerabil-
ity Database (NVD) as our primary data source, re-
sulting in 1,384 vulnerable functions from various
open-source C/C++ projects. To complement this,
we randomly select a subset of non-vulnerable func-
tions from the dataset which contains non-vulnerable
functions from diverse open-source projects, ensur-
ing a balanced representation. The ultimate dataset
comprises 13,684 functions with vulnerabilities and
26,900 functions without vulnerabilities.
Big-Vul: We utilized the benchmark dataset Big-
Vul, created by Fan et al. (Fan et al., 2020). This
dataset contains reliable and comprehensive code vul-
nerabilities directly associated with the publicly ac-
cessible CVE database. Notably, the construction of
this dataset involved a significant investment of man-
ual resources to ensure its high quality. Addition-
ally, it stands out for its substantial scale, being one
of the most extensive vulnerability datasets available.
The dataset is compiled from 348 open source Github
projects spanning from 2002 to 2019, covering 91 dis-
tinct Common Weakness Enumeration (CWE) cate-
gories. This comprehensive dataset includes approx-
imately 188,700 C/C++ functions, with 5.6% iden-
tified as vulnerable, equivalent to 10,600 vulnerable
functions. It offers detailed ground-truth informa-
tion at the function level, specifying which functions
within a codebase are susceptible to vulnerabilities.
4.3 Evaluation Metrics
To effectively evaluate model predictions, we estab-
lished and defined ground truth values as follows:
True Positive (TP) is the number of vulnerable sam-
ples correctly detected as vulnerable. True Negative
(TN) is the number of non-vulnerable samples cor-
rectly identified as not vulnerable. False Positive (FP)
is the number of non-vulnerable samples incorrectly
classified as vulnerable. False Negative (FN) is the
number of vulnerable samples erroneously identified
as not vulnerable. Hence, we employ four metrics
for our experiments, namely: Accuracy: This met-
ric indicates the number of samples that are correctly
classified into their respective classes (e.g., positive or
negative labels for vulnerable or non-vulnerable func-
tion).
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (7)

Precision: Precision is the ratio of samples correctly classified as vulnerable to the total number of samples classified as vulnerable.

$\text{Precision} = \dfrac{TP}{TP + FP}$   (8)

Recall: Recall is the ratio of correctly detected vulnerable samples to the total number of vulnerable samples.

$\text{Recall} = \dfrac{TP}{TP + FN}$   (9)

F1 Score: The F1 Score combines precision and recall, providing a simple and convenient way to compare classifiers.

$\text{F1 Score} = 2 \cdot \dfrac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$   (10)
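For completeness, the four metrics can be computed with scikit-learn as in the sketch below; the label vectors are purely illustrative.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Illustrative predictions (1 = vulnerable, 0 = non-vulnerable).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```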
4.4 Detection Performance Evaluation
To assess the effectiveness of the proposed technique,
we conducted experiments with a total of 16 combinations of centralities on the benchmark SARD and Big-Vul datasets. As we observed in Figure 3 that more than 99% of the functions have fewer than 200 lines of code, we initiated our evaluations by selecting thresholds of 40, 60, 80, 100, 120, 140, 160, 180, and 200 lines. The experimental results of our proposed method with different combinations of centralities on the SARD dataset are presented in Figure 5. Since TokenCNN, VulCNN and ASVD are all evaluated on this dataset, VulBertCNN is evaluated on the same dataset to enable a direct comparison of detection performance.
We conducted CodeBERT embedding and centrality analysis on the graph derived from the source code. Since the centralities align with the channels of an image, we conducted our experiment considering three, four, or five channels representing various combinations of the aforementioned five centralities. The experiment consisted of 100 epochs, and the final outcomes of the maximum detection performance for each combination using three, four, or five centralities, or no centralities, are illustrated in Figure 6. We examined the following four specific cases to observe the influence of centrality on vulnerability detection.
Figure 5: Comparison of the detection accuracy for dif-
ferent centralities using CodeBert embedding and CNN on
SARD dataset.
Vulnerability Detection with CodeBert Embed-
ding Using 3 Centralities: We experimented with
different combinations of centralities to enhance vul-
nerability detection. Initially, we explored combi-
nations of three centralities, selecting from a pool
of five. The optimal performance, with a maximum
accuracy of 90.9%, was achieved using the DCE
(Degree-Closeness-Eigenvector) centralities, as illus-
trated in Figure 6(a). In comparison, the state-of-the-
art VulCNN detector, employing the DKC (Degree-
Katz-Closeness) combination, achieved a maximum
accuracy of 83%.
Vulnerability Detection with CodeBert Embed-
ding Using 4 Centralities: An image constructed
with four centralities corresponds to a four-channel
image. The highest accuracy, reaching 95.7% was
observed with the DCEB combination (Figure 6(b)).
Additionally, DCEB and CKEB demonstrated sub-
stantial detection performance in comparison to other
combinations achieving accuracies of 95.7% and
95.4%, respectively.
Vulnerability Detection with CodeBert Embed-
ding Using 5 Centralities: For the case of five cen-
tralities, representing an image with five channels, the
maximum detection accuracy reached 88% (Figure
6(c)), which was not as significant as the performance
achieved with four centralities.
Vulnerability Detection with CodeBert Embed-
ding Without Using Centrality: To assess the im-
pact of centrality analysis on vulnerability detection,
we conducted another experiment which is Vulner-
ability Detection with CodeBert Embedding using
CNN without centrality (VulBertCNN-wc). This ex-
periment achieved only 83% accuracy. However,
upon incorporating centrality analysis, the accuracy
significantly improved to approximately 96%. This
underscores the importance of considering centrality
measures in code lines to enhance the accuracy of vul-
nerability identification in software.
In summary, the analysis highlights that the best accuracy (95.7%) was achieved with combinations of four centralities, specifically CKEB, DCEB, and DKEB, on the SARD dataset. Furthermore, VulBertCNN emphasizes the significant role of CodeBERT, eigenvector centrality and betweenness centrality in effective vulnerability detection.
4.5 Baseline Methods Comparison
We evaluate VulBertCNN’s performance in vulnerability detection against three state-of-the-art detection approaches based on image processing: TokenCNN, VulCNN and ASVD.
4.5.1 TokenCNN
Russell et al. annotate the source code and transform
it into the corresponding matrix. They utilize convolutional neural networks, ensemble learning, and a random forest classifier for detecting vulnerabilities
in the code (Russell et al., 2018). In TokenCNN, no semantic relation is considered; only simple lexical analysis is performed on the source code, which is then given as input to the CNN. That is why it cannot accurately detect vulnerabilities in source code.
4.5.2 VulCNN
Wu et al. (Wu et al., 2022) developed Vul-
CNN, a vulnerability detection method using CNN.
They used a dataset with both vulnerable and non-
vulnerable C/C++ functions, created program depen-
dency graphs (PDGs), and applied a sentence em-
bedding technique i.e., Sent2Vec (Pagliardini et al.,
2018) to each statement. Utilizing centrality tech-
niques, they transformed functions into images and
trained a CNN model. In a case study involving
projects such as Libav, Xen and Seamonkey, VulCNN
detected 73 previously unknown vulnerabilities. VulCNN achieves 83% accuracy, but it considers only three centralities; other centralities are yet to be explored.
4.5.3 ASVD
Mim et al. (Mim et al., 2023b) developed ASVD, which extends the VulCNN method with sentence embedding and five centralities. The detection accuracy obtained by ASVD is 88%, which outperforms the accuracy of VulCNN (83%) by 5%; we targeted to improve this further by leveraging the CodeBERT embedding instead of the Sent2Vec embedding approach.
Figure 6: Maximum vulnerability detection performance in VulBertCNN with CodeBert embedding and different centralities: (a) number of lines of code analyzed with 3 centralities; (b) with 4 centralities; (c) with 5 centralities.
4.5.4 Analysis of Comparison
To evaluate the efficiency of our embedding approach utilizing the CodeBERT model (as detailed in Section 3), we performed experiments to assess the performance of both the Sent2Vec and CodeBERT embed-
mance of both the Sent2vec and CodeBERT embed-
ding models. The results were then compared with
those of three related studies, TokenCNN, VulCNN
and ASVD. Table 3 provides a summary of the evaluation results in terms of Accuracy (A), Precision (P), Recall (R) and F1 score.
Table 3: Evaluation of VulBertCNN using CodeBert Embedding.

Dataset   Methods      A     P     R     F1
SARD      TokenCNN     78.6  55.7  71.6  61.9
          VulCNN       83.4  86.9  81.9  84.3
          ASVD         88.2  89.5  85.3  87.3
          VulBertCNN   95.7  90.4  87.3  89.0
Big-Vul   TokenCNN     61.4  50.6  79.7  62.6
          VulCNN       80.3  83.3  78.2  82.3
          ASVD         89.6  94.2  86.5  90.2
          VulBertCNN   91.8  89.1  87.4  88.2
Table 3 indicates that our embedding approach utilizing the CodeBERT model has demonstrated significantly superior evaluation outcomes compared to existing state-of-the-art approaches. Our method achieved an accuracy of 95.7% and an F1 score of 89% on the SARD dataset and 91.8% accuracy on the Big-Vul dataset, showcasing a substantial enhancement over other existing methods. Moreover,
both Precision (90.4%) and Recall (87.3%) metrics
exhibited improvements on the SARD dataset. While the F1 measure with CodeBERT (88.2%) on Big-Vul did not outperform the Sent2Vec-based ASVD (90.2%), accuracy on Big-Vul still improved. In such scenarios, the accuracy metric proves more suitable for evaluating vulnerability detection performance.
In summary, the result analysis reveals that our
embedding method using CodeBERT outperforms the
Sent2Vec embedding method used by VulCNN and
ASVD. By combining CNN with CodeBERT embed-
dings and centrality analysis, we achieved a signifi-
cant improvement in the F1-score increasing it from
62.6% to 89% and detection accuracy from 78.6% to
95.7% which is about 21.7% and 26.3% improvement
compared to the state-of-the-art approaches.
5 THREATS TO VALIDITY
In this section, we discuss the potential threats which may affect the validity of this study.
External Validity: External validity presents chal-
lenges in generalizing study findings. Differences
in programming languages (e.g., C/C++ vs. Java or
Python) and caution in extending results to various
code vulnerabilities are key considerations. General-
izing findings to a broader range of open-source and
industrial systems requires careful handling due to the
complexity of obtaining and analyzing diverse indus-
trial systems. To overcome these challenges, future
plans include collecting data from industrial systems
across industries and countries to enhance dataset di-
versity.
Internal Validity: Internal validity concerns the ac-
curacy of causal inferences in our vulnerability de-
tection research. Variations in dataset characteristics
and algorithmic parameters pose potential confound-
ing issues. To enhance internal validity, we employ
a rigorous experimental design, exercise precise vari-
able control, and conduct sensitivity analyses. For in-
stance, strategic dataset partitioning over time ensures
a robust assessment of temporal dynamics, contribut-
ing to the internal validity of our vulnerability classi-
fication methodology.
Construct Validity: Construct validity focuses on the
tool used for extracting Program Dependency Graphs
(PDGs), namely Joern. While commonly used, Jo-
ern may have inherent flaws. Despite this, we chose
Joern for PDG extraction and performed manual re-
views to identify and address any issues. There’s a
potential threat from modifying baseline approaches,
but we mitigate this risk by retrieving the original
source code directly from GitHub repositories asso-
ciated with analyzed techniques.
Criterion Validity: In vulnerability detection, met-
rics like precision, recall, and F-measure quantify the
alignment of identified vulnerabilities with actual vul-
nerable functions. A high criterion validity indicates that our algorithm effectively predicts vulnerabilities in line with widely accepted standards.
6 CONCLUSION AND FUTURE
WORK
This paper introduces an automated software vulnerability detection approach with CodeBERT and Convolutional Neural Network, named VulBertCNN, aiming to overcome the limitations of state-of-the-art individual text- and graph-based approaches in vulnerability detection.
The proposed approach focuses on integrating the CodeBERT embedding model with multiple centralities in image generation from PDGs to assess the overall impact of each line of code within a function, thereby determining its vulnerability status. The evaluation involves the generation of 16 centrality combinations derived from 5 centralities, revealing that the highest accuracy is attained with a combination of 4 centrality measures. This improves accuracy over the previous state-of-the-art techniques from 78.6% to 95.7% and the F1-score from 62.6% to 89%. It is observed that leveraging CodeBERT embedding with CNN plays an effective role in vulnerability detection.
Future plans involve optimizing program dependency graph generation time with tools like Frama-C and incorporating dynamic analysis for improved detection. Additionally, efforts will be made to narrow
down the search space within a function by comparing
source code with vulnerability reports from National
Vulnerability Database aiming to identify statement-
level vulnerabilities.
ACKNOWLEDGEMENTS
This research is supported by the fellowship from
Information and Communication Technology (ICT)
Division, Ministry of Posts, Telecommunications
and Information Technology, Bangladesh. No-
56.00.0000.052.33.001.23-09; Date: 04.02.2024.
REFERENCES
Alves, H., Fonseca, B., and Antunes, N. (2016). Software
metrics and security vulnerabilities: dataset and ex-
ploratory study. In 2016 12th European Dependable
Computing Conference (EDCC), pages 37–44. IEEE.
Cheng, X., Wang, H., Hua, J., Xu, G., and Sui, Y. (2021).
Deepwukong: Statically detecting software vulnera-
bilities using deep graph neural network. ACM Trans-
actions on Software Engineering and Methodology
(TOSEM), 30(3):1–33.
Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Im-
proving deep neural networks for lvcsr using rectified
linear units and dropout. In 2013 IEEE international
conference on acoustics, speech and signal process-
ing, pages 8609–8613. IEEE.
Duan, X., Wu, J., Ji, S., Rui, Z., Luo, T., Yang, M., and Wu,
Y. (2019). Vulsniper: Focus your attention to shoot
fine-grained vulnerabilities. In IJCAI, pages 4665–
4671.
Fan, J., Li, Y., Wang, S., and Nguyen, T. N. (2020). A c/c++
code vulnerability dataset with code changes and cve
summaries. In Proceedings of the 17th International
Conference on Mining Software Repositories, pages
508–512.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong,
M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. (2020).
Codebert: A pre-trained model for programming and
natural languages. arXiv preprint arXiv:2002.08155.
Freeman, L. C. et al. (2002). Centrality in social networks:
Conceptual clarification. Social network: critical con-
cepts in sociology. Londres: Routledge, 1:238–263.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). Im-
agenet classification with deep convolutional neural
networks. Communications of the ACM, 60(6):84–90.
Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., and Chen, Z. (2021).
Sysevr: A framework for using deep learning to detect
software vulnerabilities. IEEE Transactions on De-
pendable and Secure Computing, 19(4):2244–2258.
Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S.,
Deng, Z., and Zhong, Y. (2018). Vuldeepecker: A
deep learning-based system for vulnerability detec-
tion. arXiv preprint arXiv:1801.01681.
Lin, G., Zhang, J., Luo, W., Pan, L., and Xiang, Y. (2017).
Poster: Vulnerability discovery with function repre-
sentation learning from unlabeled projects. In Pro-
ceedings of the 2017 ACM SIGSAC conference on
computer and communications security, pages 2539–
2541.
Mim, R. S., Ahammed, T., and Sakib, K. (2023a). Iden-
tifying vulnerable functions from source code using
vulnerability reports.
Mim, R. S., Khatun, A., Ahammed, T., and Sakib, K.
(2023b). Impact of centrality on automated vulner-
ability detection using convolutional neural network.
In 2023 International Conference on Information and
Communication Technology for Sustainable Develop-
ment (ICICT4SD), pages 331–335. IEEE.
Moghadasi, M. N. and Zhuang, Y. (2020). Sent2vec: A new
sentence embedding representation with sentimental
semantic. In 2020 IEEE International Conference on
Big Data (Big Data), pages 4672–4680. IEEE.
Neuhaus, S., Zimmermann, T., Holler, C., and Zeller, A.
(2007). Predicting vulnerable software components.
In Proceedings of the 14th ACM conference on Com-
puter and communications security, pages 529–540.
Pagliardini, M., Gupta, P., and Jaggi, M. (2018). Unsuper-
vised learning of sentence embeddings using compo-
sitional n-gram features. In Proceedings of the 2018
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages
528–540.
Rahman, S., Rahman, M. M., and Sakib, K. (2016). An
improved method level bug localization approach us-
ing minimized code space. In Evaluation of Novel
Approaches to Software Engineering: 11th Interna-
tional Conference, ENASE 2016, Rome, Italy, April
27–28, 2016, Revised Selected Papers 11, pages 179–
200. Springer.
Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer,
J., Ozdemir, O., Ellingwood, P., and McConley, M.
(2018). Automated vulnerability detection in source
code using deep representation learning. In 2018 17th
IEEE international conference on machine learning
and applications (ICMLA), pages 757–762. IEEE.
Shankar, U., Talwar, K., Foster, J. S., and Wagner, D.
(2001). Detecting format string vulnerabilities with
type qualifiers. In 10th USENIX Security Symposium
(USENIX Security 01).
Shar, L. K., Briand, L. C., and Tan, H. B. K. (2014).
Web application vulnerability prediction using hy-
brid program analysis and machine learning. IEEE
Transactions on dependable and secure computing,
12(6):688–707.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi,
M., Macherey, W., Krikun, M., Cao, Y., Gao, Q.,
Macherey, K., et al. (2016). Google’s neural ma-
chine translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144.
Wu, Y., Zou, D., Dou, S., Yang, W., Xu, D., and Jin, H.
(2022). Vulcnn: An image-inspired scalable vulner-
ability detection system. In Proceedings of the 44th
International Conference on Software Engineering,
pages 2365–2376.
Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. (2014).
Modeling and discovering vulnerabilities with code
property graphs. In 2014 IEEE symposium on secu-
rity and privacy, pages 590–604. IEEE.
Yamaguchi, F., Lottmann, M., and Rieck, K. (2012). Gen-
eralized vulnerability extrapolation using abstract syn-
tax trees. In Proceedings of the 28th annual computer
security applications conference, pages 359–368.
Yamaguchi, F., Maier, A., Gascon, H., and Rieck, K.
(2015). Automatic inference of search patterns for
taint-style vulnerabilities. In 2015 IEEE Symposium
on Security and Privacy, pages 797–812. IEEE.
Zhou, Y., Liu, S., Siow, J., Du, X., and Liu, Y. (2019). De-
vign: Effective vulnerability identification by learn-
ing comprehensive program semantics via graph neu-
ral networks. Advances in neural information process-
ing systems, 32.
Zou, D., Wang, S., Xu, S., Li, Z., and Jin, H. (2019).
µvuldeepecker: A deep learning-based system for
multiclass vulnerability detection. IEEE Transactions
on Dependable and Secure Computing, 18(5):2224–
2236.