CodeGrapher: An Image Representation Method to Enhance Software
Vulnerability Prediction
Ramin Fuladi (1, https://orcid.org/0000-0003-4142-1293) and Khadija Hanifi (2, https://orcid.org/0000-0001-7044-3315)
1 Ericsson Research, Istanbul, Turkey
2 Sabanci University, Istanbul, Turkey
Keywords:
Software Vulnerability Prediction, CodeGrapher, ML Algorithms, Semantic Relations, Source Code Analysis,
Similarity Distance Metrics, Image Generation.
Abstract:
Contemporary software systems face a severe threat from vulnerabilities, prompting exploration of innovative
solutions. Machine Learning (ML) algorithms have emerged as promising tools for predicting software vulner-
abilities. However, the diverse sizes of source codes pose a significant obstacle, resulting in varied numerical
vector sizes. This diversity disrupts the uniformity needed for ML models, causing information loss, increased
false positives, and false negatives, diminishing vulnerability analysis accuracy. In response, we propose
CodeGrapher, which preserves semantic relations within source code during vulnerability prediction. Our approach
involves converting numerical vector representations into image sets for ML input, incorporating similarity
distance metrics to maintain vital code relationships. Using Abstract Syntax Tree (AST) representation and
skip-gram embedding for numerical vector conversion, CodeGrapher demonstrates potential to significantly
enhance prediction accuracy. Leveraging image scalability and resizability addresses challenges from varying
numerical vector sizes in ML-based vulnerability prediction. By converting input vectors to images with a set
size, CodeGrapher preserves semantic relations, promising improved software security and resilient systems.
1 INTRODUCTION
Software vulnerabilities are inherent weaknesses in
code that arise during the development or commit-
ting process and can be exploited by malicious actors
to gain unauthorized access to a system and execute
harmful actions (Hanifi et al., 2023). To mitigate po-
tential risks, organizations employ various methods to
identify and predict these vulnerabilities. Two widely
adopted approaches are static and dynamic analysis
(Palit et al., 2021). Static analysis involves scruti-
nizing the source code or binary code without actu-
ally executing it. This method seeks to identify exist-
ing vulnerabilities by examining the code’s structure,
syntax, and potential logical flaws. It is particularly
useful during the early stages of development when
the codebase is accessible and can be analyzed thor-
oughly. Static analysis tools can quickly scan large
codebases, providing valuable insights into potential
vulnerabilities without the need for runtime execution
(Schiewe et al., 2022). On the other hand, dynamic
analysis involves running the code and analyzing its
behavior during runtime. By observing the code in
action, this method can detect vulnerabilities that may
not be apparent in the static analysis phase. Dynamic
analysis is advantageous when dealing with complex
and interactive applications, as it allows for a better
understanding of how the software responds to real-
world inputs and interactions (Lin et al., 2020). The
advantage of static analysis lies in its ability to de-
tect vulnerabilities before the code is executed, which
can potentially save significant time and resources in
the development process. It can catch issues early on,
reducing the chances of encountering critical vulnera-
bilities later in the software lifecycle. Moreover, static
analysis can be automated, making it scalable and ef-
ficient for large-scale projects (Halepmollası et al.,
2023). Despite its benefits, static analysis may pro-
duce false positives or false negatives, meaning that
it can flag non-existing vulnerabilities or miss ac-
tual flaws due to the inherent complexities of code
analysis. Additionally, static analysis tools may not
capture vulnerabilities arising from runtime-specific
conditions or interactions. Recently, Machine Learn-
ing (ML) and Artificial Intelligence (AI) techniques
have been integrated into static analysis tools to en-
hance vulnerability detection. By leveraging ML/AI,
these tools can adapt and improve their accuracy over
time. However, applying ML/AI to static code analy-
sis presents its own set of challenges. The structural
differences between code and traditional text data can
hinder the effectiveness of standard ML algorithms,
requiring specialized approaches and tailored models
for optimal results (Bilgin et al., 2020). In order to
leverage ML algorithms for predicting software vul-
nerabilities, it is essential to represent the source code
as numerical vectors. This process, commonly re-
ferred to as code embedding, aims to transform the
code into meaningful numerical representations. This
can be achieved either through manual extraction of
features such as lines of code or code complexity, or
by employing ML-based techniques to automatically
learn the vector representation of the code.
Several methods have been proposed for code em-
bedding (Hanifi et al., 2023; Bilgin et al., 2020; Şahin
et al., 2022; Duan et al., 2019; Halepmollası et al.,
2023), but they all encounter a common challenge:
the variability in the length of the source code, which
results in varying sizes of the output vectors. More-
over, in source code, multiple lines can be intercon-
nected, meaning that one line of code may be depen-
dent on or related to another line within the code. This
interrelation and semantic code relationship are also
relevant when dealing with vulnerabilities. However,
ML models often require fixed input sizes, necessitat-
ing the vector representation of source codes to be ad-
justed to match the model’s input size. Consequently,
some vectors need to be truncated, while others must
be padded with zeros to achieve uniformity in input
size (Telang and Wattal, 2007). The process of modi-
fying vector sizes through truncation or padding leads
to information loss, adversely impacting the accuracy
of vulnerability analysis. As a result, the occurrence
of false negatives and false positives increases, under-
mining the overall effectiveness of the analysis. Ad-
dressing this challenge is crucial for enhancing the
accuracy and reliability of software vulnerability pre-
diction using ML algorithms.
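To make this limitation concrete, the following minimal Python sketch (illustrative only, with an arbitrary target length; it is not part of CodeGrapher) shows the conventional truncate-or-pad preprocessing that forces variable-length code vectors into a fixed ML input size:

import numpy as np

def pad_or_truncate(vec, target_len=512):
    # Force a variable-length code vector into a fixed length:
    # vectors longer than target_len lose their tail (information loss),
    # shorter ones are zero-padded (diluted signal).
    vec = np.asarray(vec, dtype=np.float32)
    if len(vec) >= target_len:
        return vec[:target_len]
    return np.pad(vec, (0, target_len - len(vec)))

# Two functions of very different sizes end up with the same shape,
# but the larger one has lost most of its content.
short_fn = np.random.rand(60)
long_fn = np.random.rand(4000)
print(pad_or_truncate(short_fn).shape, pad_or_truncate(long_fn).shape)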
In this study, we introduce a novel approach called
CodeGrapher, which aims to convert the numerical
vector representation of source code into one or more
images. Because source codes differ in size, the
resulting images also vary in size. By resizing the
images to a fixed, global dimension (i.e., n×n), they
can be utilized by ML methods to predict
vulnerabilities in the source code. CodeGrapher
addresses the challenge of varying numerical vector
sizes resulting from differences in the sizes of source
codes. By resolving this discrepancy in input sizes
for ML models during static analysis, our solution en-
sures that all relevant information within the source
code is retained.
The conversion of numerical vector representa-
tions into images is advantageous for resizing due
to the inherent adaptability of image variables. Im-
ages can be easily resized to a standardized n×n for-
mat without compromising the integrity of the infor-
mation they encapsulate. During the resizing pro-
cess, CodeGrapher ensures that relevant details within
the images are preserved, maintaining the essential
characteristics of the source code representations.
This adaptability and preservation of information con-
tribute to the effectiveness of CodeGrapher in ad-
dressing the challenges posed by varying source code
sizes in the context of static analysis for machine
learning models.
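As a minimal illustration of this resizing step (OpenCV is used here purely as an example library, and the target size of 128 is an assumption; the paper fixes neither), a matrix derived from a source-code vector can be rescaled to a common n×n shape:

import numpy as np
import cv2

def resize_code_image(matrix, n=128):
    # Interpolation-based resizing maps matrices of any size to a fixed
    # n x n image while keeping their overall spatial pattern.
    return cv2.resize(matrix.astype(np.float32), (n, n),
                      interpolation=cv2.INTER_AREA)

small = np.random.rand(40, 40)
large = np.random.rand(700, 700)
print(resize_code_image(small).shape, resize_code_image(large).shape)  # (128, 128) twice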
To achieve this, we employ similarity distance
metrics on the numerical representation of the source
code, which allows us to generate an image based on
the vector input. Subsequently, the size of the im-
age is adjusted to a fixed value to align with the re-
quirements of ML models. In converting the source
code to the numerical representation, we utilize the
Abstract Syntax Tree (AST) representation along with
the skip-gram embedding algorithm, which is an ML-
based algorithm commonly used in Natural Language
Processing (NLP). This research makes significant
contributions to the field of software vulnerability
prediction and software security. The main contribu-
tions of this paper can be listed as follows:
Innovative Numerical-to-Image Transforma-
tion: Our innovative solution converts numeri-
cal vectors of source code into images, enhanc-
ing ML model inputs. This unique transformation
preserves semantic relationships within code ele-
ments, bridging code analysis and image process-
ing for intuitive vulnerability prediction.
Semantic Relationship Preservation: By inte-
grating similarity distance metrics into image gen-
eration, we improve the accuracy of vulnerability
prediction by preserving code element relations,
minimizing false results in software analysis.
Uniform ML Model Input: Our solution suc-
cessfully preserves all source code information
while addressing varying numerical vector sizes,
ensuring uniformity in ML model input. This en-
hances accuracy in vulnerability prediction, con-
tributing to the development of resilient software
systems.
The remainder of this paper is structured as fol-
lows: In Section 2, we provide a comprehensive re-
view of relevant studies. Section 3 delves into a de-
tailed explanation of our proposed approach, while
Section 4 presents the implementation of CodeGra-
pher. Section 5 provides the threats to validity of our
findings. Finally, in Section 6, we draw our study to a
conclusion.
2 RELATED WORK
The use of numerical vector representation is crucial
for harnessing ML algorithms in predicting software
vulnerabilities. This technique, referred to as code
embedding (Kanade et al., 2020), revolves around
converting the source code into numerical vectors.
This transformation can be accomplished either man-
ually, where features such as line of code and code
complexity are extracted, or automatically through
ML-based techniques that learn vector representa-
tions (Alon et al., 2019).
Several techniques have been proposed for code
embedding. Alon et al. (Alon et al., 2019) introduced
code2vec, a neural network-based model that repre-
sents source code as a continuous distributed vector.
They break down the Abstract Syntax Tree (AST) of
the code into paths and learn the atomic representa-
tion of each path, aggregating them as a set. Lozoya
et al. (Lozoya et al., 2021) built upon code2vec and
developed commit2vec, which focuses on embedding
representations of code changes. They utilized the
obtained representations for vulnerability fixing com-
mit prediction. Furthermore, word embedding tech-
niques, like word2vec (Alon et al., 2019), have been
used to convert source code into numerical vectors.
Harer et al. (Harer et al., 2018) applied word2vec
on C/C++ tokens to generate word embedding rep-
resentations for ML-based software vulnerability pre-
diction. Henkel et al. (Henkel et al., 2018) utilized
the GloVe model to extract word embedding repre-
sentations from the AST of C source code. Fang et
al. (Fang et al., 2020) introduced FastEmbed, em-
ploying the FastText technique (Feutrill et al., 2018),
and achieved an F1-score of 0.586. Şahin et al.
(Şahin et al., 2022) proposed a vulnerability predic-
tion model using different source code representa-
tions. They explored whether a function at a spe-
cific code change is vulnerability-inducing or not by
representing function versions as node embeddings
learned from their AST. They built models using
Graph Neural Networks (GNNs) with node embed-
dings, Convolutional Neural Networks (CNNs), and
Support Vector Machines (SVMs) with token repre-
sentations. Their experimental analysis on the Wire-
shark project showed that the GraphSAGE model
achieved the highest AUC rate with 96%, while the
CNN model achieved the highest recall and precision
rates with 77% and 82%, respectively. In a study by
Bilgin et al. (Bilgin et al., 2020), the authors intro-
duced a technique for software vulnerability predic-
tion at the function level in C code. They achieve
this by converting the Abstract Syntax Tree (AST)
of the source code into a numerical vector. Subse-
quently, they utilize a 1D Convolutional Neural Net-
work (CNN) for software vulnerability prediction.
Similarly, in another study by Duan et al. (Duan et al.,
2019), the authors utilized the Control Flow Graph
(CFG) and AST as graph representations to predict
vulnerabilities. By employing soft attention, they ex-
tract high-level features crucial for vulnerability pre-
diction from the graphs. Moreover, Zhou et al. (Zhou
et al., 2019) proposed a function-level software vul-
nerability prediction method based on a graph repre-
sentation. Their approach incorporates not only the
AST but also dependency and natural code sequence
information to enhance the prediction process.
Previous studies utilize embedding techniques for
software vulnerability prediction, facing challenges
with ML algorithms requiring uniform vector lengths,
often resulting in truncation or zero-padding. These
studies often overlook semantic relations between
code components. In our research, we introduce a
novel approach converting code vectors into images,
preserving semantic relations and resolving variable-
sized outputs. Utilizing images captures intricate
code relationships, maintaining fixed output size, and
improving vulnerability prediction accuracy and ef-
fectiveness.
3 PROPOSED APPROACH
Figure 1 serves as a visual representation of the so-
lution devised to standardize the input dimensions for
ML-based static analysis tools. The primary goal is to
ensure uniformity in the input data size. The process
commences by acquiring a numerical vector represen-
tation of the source code. This vector, along with one
or more similarity distance functions like cosine sim-
ilarity or dot product, is then input into the CodeGra-
pher module. The application of these distance func-
tions serves a dual purpose: first, to identify similari-
ties among elements within the vector, and second, to
retain the underlying semantic information associated
with the source code. The pivotal role of the CodeG-
rapher module is to convert the numerical vector into
one or more images, either in RGB or gray-scale for-
mat, adhering to predetermined dimensions. These
resultant images constitute the module’s outputs. This
transformation is facilitated by the utilization of the
aforementioned distance function(s).
Figure 1: An illustrative scheme outlining the key steps for
CodeGrapher implementation.
To elaborate on
the image generation process, Figure 2 provides a
visual breakdown. The process involves selecting a
window of size w and systematically sliding this win-
dow across the numerical vector. With each window
position, corresponding matrices are generated. The
diagram showcases the resulting matrices for two in-
stances: w = 1 and w = k. These matrices are subse-
quently transformed by mapping their values to the
range of [0, 255]. Furthermore, their format is al-
tered to uint8, effectively transitioning the matrices
into image format. Upon successful image genera-
tion, the images’ dimensions are adjusted to match
the required input dimensions of the machine learning
model. The generated images are then harnessed by
the ML model to identify potential software vulnera-
bilities present in the source code. The process is de-
picted in Figure 2.
Figure 2: Visual Breakdown of Image Generation Process
for Different Window Sizes (w=1 and w=k).
Importantly, the number of outputs
from the CodeGrapher module is adaptable based on
the configuration settings. For instance, if there exists
only one distance function and one window, the out-
put will be a single image. Conversely, if D distinct
distance functions and L separate windows are em-
ployed, the number of generated images will amount
to D × L. This level of flexibility allows for the pro-
duction of varying numbers of images, depending on
the specific demands of the analysis procedure. This
accommodates diverse analysis requirements and pro-
vides versatility in tailoring the solution to the specific
needs of software vulnerability prediction.
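The steps above can be condensed into a short Python sketch. This is one plausible reading of the module rather than the authors' implementation: the window is slid with a stride equal to its size, every (distance function, window size) pair produces one pairwise similarity matrix, and each matrix is scaled to [0, 255], cast to uint8, and resized to an assumed n×n model input size.

import numpy as np
import cv2

def cosine_sim(a, b):
    # Cosine similarity between two equal-length vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def code_vector_to_images(vec, distance_fns, window_sizes, n=128):
    # For every (distance function, window size) pair: cut the vector into
    # consecutive windows, build the pairwise similarity matrix over the
    # windows, map it to [0, 255], cast to uint8, and resize to n x n.
    vec = np.asarray(vec, dtype=np.float32)
    images = []
    for dist in distance_fns:
        for w in window_sizes:
            windows = [vec[i:i + w] for i in range(0, len(vec) - w + 1, w)]
            m = len(windows)
            sim = np.zeros((m, m), dtype=np.float32)
            for i in range(m):
                for j in range(m):
                    sim[i, j] = dist(windows[i], windows[j])
            lo, hi = sim.min(), sim.max()
            scaled = (sim - lo) / (hi - lo + 1e-9) * 255.0    # map values to [0, 255]
            img = scaled.astype(np.uint8)                     # switch to image format
            images.append(cv2.resize(img, (n, n), interpolation=cv2.INTER_AREA))
    return images

# Example: 2 distance functions x 2 window sizes -> D x L = 4 images.
vector = np.random.rand(200).astype(np.float32)
dot_product = lambda a, b: float(a @ b)
imgs = code_vector_to_images(vector, [cosine_sim, dot_product], [1, 10])
print(len(imgs), imgs[0].shape)   # 4 (128, 128)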
4 CodeGrapher IMPLEMENTATION
For a comprehensive illustration of the process, we
conducted an experiment focused on predicting vul-
nerabilities within C-source code at the function level.
The schematic representation of the system employed
in this experiment is visualized in Figure 3. This
system leverages a Convolutional Neural Network
(CNN) and adopts an image-based representation of
the source code as input to successfully predict exist-
ing vulnerabilities in C-source code.
Figure 3: The main steps involved in implementing Code-
Grapher to construct a system for predicting source code
vulnerabilities.
To facilitate a step-by-step understanding of each
stage, an illustrative code segment is presented below.
This code snippet provides a simple way to check if
a given integer is even or odd by returning 1 for even
numbers and 0 for odd numbers.
int isEven(int num) {   // returns 1 if num is even, 0 if odd
    if (num % 2 == 0)
        return 1;
    return 0;
}
Now, we will proceed to demonstrate each step
using the provided code sample:
4.1 Tokenization and AST
Representation
The initial step involves the application of lexical
analysis, whereby the source code is transformed into
a sequence of tokens. This transformation is evident
when examining the provided example source code,
which is transformed into the following token sequence:
int (keyword), isEven (identifier), LPAREN (delimiter),
int (keyword), num (identifier), RPAREN (delimiter),
LBRACE (delimiter), if (keyword), LPAREN (delimiter),
num (identifier), % (operator), 2 (constant), == (operator),
0 (constant), RPAREN (delimiter), return (keyword), 1 (constant),
; (symbol), return (keyword), 0 (constant), ; (symbol), RBRACE (delimiter)
Upon obtaining the token sequence from the
source code, the parsing process is initiated. Within
this parsing phase, the tokens generated during lexi-
cal analysis undergo a conversion process, typically
resulting in the creation of a data structure such as
a parse tree. This hierarchical structure offers a vi-
sual representation of the input code’s structure while
simultaneously validating its syntactical correctness
based on the rules defined by a context-free gram-
mar (CFG). The output of this step often manifests
as an Abstract Syntax Tree (AST) representation of
the original source code. The AST, which stems from
the tokens extracted during lexical analysis, encom-
passes both the structural layout and semantic insights
pertaining to the code. This convergence of struc-
tural and semantic information within the AST has
led to the development of a trend in source code anal-
ysis known as AST-based intelligent analysis (Chen
et al., 2019). Figure 4 illustrates the AST generated
to hierarchically represent the syntactic structure of
the isEven function.
Figure 4: AST of the isEven Function.
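For reference, a tree comparable to the one in Figure 4 can be printed with Pycparser, the parser used later in Section 4.4; the snippet below is a minimal usage sketch.

from pycparser import c_parser

code = """
int isEven(int num) {
    if (num % 2 == 0)
        return 1;
    return 0;
}
"""

parser = c_parser.CParser()
ast = parser.parse(code)     # build the AST from the token stream
ast.show(attrnames=True)     # prints FuncDef, Decl, ParamList, If, BinaryOp, Return, ...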
4.2 Utilizing Embedding for Code
Representation
Developing effective representations for source code
is a complex endeavor, largely due to the intricate
nature of programming languages, the incorporation
of libraries, and the diverse coding styles employed
by various developers. Leveraging established tech-
niques from the realm of Natural Language Process-
ing (NLP), we endeavor to extract embedding features
from source code. However, it’s important to note that
the structure of source code differs significantly from
that of regular textual content.
In light of these challenges, we opt for an ap-
proach that involves utilizing AST as the founda-
tion of our code representation technique. The con-
version of source code into its abstract structure al-
lows us to capture the inherent syntax and structure
of the code. To bridge the gap between AST and
textual representation while maintaining information
about each node’s location in the AST, we employ the
Breadth-First Search (BFS) technique. Subsequently,
we employ Skip-Gram, a word embedding technique,
to translate AST nodes into numerical vectors. The
comprehensive steps we follow to extract embedding
features are elaborated below:
1. Normalizing Identifier Names: While AST
predominantly captures structural and content-
related aspects, it omits certain details. For in-
stance, grouping parentheses are implicit within
the tree structure and are not represented as dis-
tinct nodes in the AST. Conversely, some node con-
tents, such as function and variable names, are irrele-
vant to our purpose because they carry no vulnerability-
related information. To mitigate this, prior to using
AST nodes, we apply a normalization process that re-
places such non-essential names with uniquely prede-
fined placeholders. In the examined example, isEven
is replaced with FUNCTION NAME, and num is re-
placed with VARIABLE NAME.
2. Conversion to Word Vectors: To ensure the
normalized AST is transformed into a one-
dimensional array without sacrificing the relation-
ships among AST nodes, we adopt the BFS tech-
nique. However, leaf nodes remain attached to
their parent nodes, like the node BinaryOp and
the operator %, as they serve as features rather
than separate entities. The resulting array is then
fed into the embedding model to derive the feature
matrix. An equivalent word vector corresponding
to the normalized AST of the example code is pre-
sented below:
[FuncDef, Decl, Compound, FuncDecl, If, Return,
ParamList, TypeDecl, BinaryOp: ==, Return,
Constant: int, Decl, IdType: int, BinaryOp: %,
Constant: int, Constant: int, TypeDecl, ID: num,
Constant: int, IdType: int]
3. Conversion to Numerical Vectors (Skip-Gram):
The Skip-Gram method is employed to transition
the aforementioned word vector into a numeri-
cal vector. Skip-Gram operates by extracting nu-
merical features based on the relationships among
neighboring nodes. This approach preserves con-
textual information, which is then mapped into the
resulting numerical vector (Bamler and Mandt,
2017). The process involves two distinct steps:
Step 1: we generate a feature matrix or a
lookup dictionary using the SkipGram method,
trained with the VDISC dataset (Russell et al.,
2018). By the end of this pre-processing stage,
we obtain an embedding feature matrix, re-
ferred to as the ’Dictionary’. This matrix rep-
resents each word as a numerical vector in R^(20×1),
considering its location within the code.
As an example, the associated numerical repre-
sentation of the 'FuncDef' node is shown below:
FuncDef: [-2.5182416, 3.1283994, 2.2289238, 0.7242722,
1.4296024, 1.5872365, 2.0136333, 0.49053535,
0.82888806, 3.0382762, 2.8487883, 1.5573912,
0.26117662, 1.3050934, 1.3061347, 0.31573528,
3.5838423, 1.6379417, 4.9378858, 0.04924774]
Step 2: we use the acquired ’Dictionary’ to
transform the word vector into a numerical one.
This involves substituting each node with its
corresponding numerical representation from
the Dictionary. Consequently, each function is
depicted by a single numerical vector, with a
length of 20 × N, where N is the number of nodes in
the function's AST; the length therefore varies with
the size of the function.
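A compact Python sketch of the three steps above is given below. It is an illustrative reconstruction rather than the authors' code: gensim's Word2Vec (with sg=1) stands in for the Skip-Gram implementation, vector_size=20 mirrors the 20-dimensional embeddings described above, leaf attributes are attached to node labels in a simplified way, and the training corpus here is just the single example sequence instead of the VDISC dataset.

from collections import deque

from gensim.models import Word2Vec
from pycparser import c_parser, c_ast

code = "int isEven(int num) { if (num % 2 == 0) return 1; return 0; }"
ast = c_parser.CParser().parse(code)

def normalize(node):
    # Step 1: replace variable and function names with predefined placeholders.
    if isinstance(node, c_ast.Decl) and node.name:
        node.name = "FUNCTION_NAME" if isinstance(node.type, c_ast.FuncDecl) else "VARIABLE_NAME"
    if isinstance(node, c_ast.ID):
        node.name = "VARIABLE_NAME"
    if isinstance(node, c_ast.TypeDecl) and node.declname:
        node.declname = "VARIABLE_NAME"
    for _, child in node.children():
        normalize(child)

def label(node):
    # Attach a salient attribute to the node type, e.g. BinaryOp:% or Constant:int.
    name = type(node).__name__
    if isinstance(node, c_ast.BinaryOp):
        return name + ":" + node.op
    if isinstance(node, c_ast.Constant):
        return name + ":" + node.type
    if isinstance(node, c_ast.ID):
        return name + ":" + node.name
    if isinstance(node, c_ast.IdentifierType):
        return name + ":" + " ".join(node.names)
    return name

def bfs_words(root):
    # Step 2: flatten the normalized AST into a word sequence with breadth-first search.
    words, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        words.append(label(node))
        queue.extend(child for _, child in node.children())
    return words

normalize(ast)
sequence = bfs_words(ast)

# Step 3: Skip-Gram (sg=1) maps every AST word to a 20-dimensional vector;
# collecting the per-node vectors yields the 20 x N function representation.
model = Word2Vec([sequence], vector_size=20, window=5, min_count=1, sg=1)
numeric = [model.wv[word] for word in sequence]
print(len(sequence), numeric[0].shape)   # number of AST words, (20,)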
4.3 Code to Image Representation
After transforming the source code into vector form,
the vector is fed into the CodeGrapher module to gen-
erate an image-based representation. For optimal uti-
lization of the CodeGrapher module, an additional in-
put in the form of a distance function is required. In
this specific study, we’ve selected cosine similarity as
our distance function. Cosine similarity is a math-
ematical metric utilized to evaluate the likeness be-
tween two vectors by computing the cosine of the an-
gle formed between them. This metric finds applica-
tions across diverse domains such as natural language
processing, recommendation systems, and image pro-
cessing. One of its notable features is its disregard for
vector magnitude, instead concentrating solely on di-
rectional alignment within a multidimensional space.
This characteristic proves valuable when comparing
documents, images, or other data where the mag-
nitude may not offer as much meaningful informa-
tion as the relative orientation of vectors. It’s impor-
tant to note that in this study, we empirically set the
window size w to 10. As an illustrative example, Fig-
ure 5 provides visual representations of two distinct
source codes. One of these codes contains vulnera-
bilities related to buffer overflow (CWE120), while
the other code is devoid of such vulnerabilities. CWE
(Common Weakness Enumeration) is a comprehen-
sive catalog of software and hardware vulnerabilities,
and CWE120 is a specific category pinpointing the
weakness associated with buffer overflow vulnerabil-
ities (Sane, 2020). Buffer overflow, as described in
CWE120, is a critical software vulnerability where
a program writes more data into a buffer (temporary
data storage) than it can accommodate (Sane, 2020).
This overflow can overwrite adjacent memory, poten-
tially causing crashes, unauthorized access, or even
remote code execution.
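For completeness, the cosine similarity used as the distance function can be computed directly; the short NumPy sketch below (with illustrative vectors) shows that the metric depends only on direction, not on magnitude.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: ~1.0 for parallel vectors,
    # 0.0 for orthogonal ones, independent of their lengths.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 10 * u))              # ~1.0: scaling does not change the result
print(cosine_similarity(u, [3.0, 2.0, 1.0]))     # < 1.0: different direction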
4.4 Dataset
In this study, we employed the publicly available
Draper VDISC Dataset, as described in (Russell et al.,
2018), to conduct our experiments. This dataset
(a) Non-vulnerable
code snippet
(b) Vulnerable code
snippet (CWE120)
Figure 5: Images representing two distinct source codes: (a)
non-vulnerable source code, (b) vulnerable code containing
CWE120.
comprises an extensive collection of function-level
source code samples gathered from various sources,
including open-source projects such as the Debian
Linux distribution (unk, 2023a), public git reposi-
tories on GitHub (unk, 2023b), and the synthetic
SATE IV Juliet Test Suite (Black and Black, 2018)
from NIST's SAMATE project. While the SATE IV
Juliet Test Suite contains synthetic code, it constitutes
only a small portion (approximately 1%) of the entire
dataset. The authors of the dataset meticulously la-
beled the function-level codes based on findings from
three different static analyzers, categorizing them into
five distinct groups of Common Weakness Enumera-
tion (CWE) vulnerabilities (Russell et al., 2018).
This categorization involved labeling functions
flagged by static code analyzers as ”vulnerable” for
the respective CWE category, while labeling oth-
ers as ”non-vulnerable” functions. The dataset was
thoughtfully partitioned into training (80%), valida-
tion (10%), and test (10%) sets to ensure the indepen-
dence of these subsets, with strict avoidance of dupli-
cate samples between training and test sets. Further-
more, we constructed balanced subsets in select ex-
periments to gauge the detectability of different vul-
nerability categories on an equal footing. It’s worth
mentioning that the referenced dataset primarily con-
tains functions written in C and C++ languages. For
our specific implementation, we focused on functions
written in the C language and ensured their parseabil-
ity using the Pycparser (Bendersky, 2019) parser, re-
sulting in several subsets derived from the original
training, validation, and test datasets.
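The parseability filter described above can be approximated with a small helper; the example function strings below are placeholders, since the exact loading code for the VDISC files is not given in the paper.

from pycparser import c_parser

parser = c_parser.CParser()

def is_parseable(func_src):
    # Keep only function strings for which Pycparser can build an AST.
    try:
        parser.parse(func_src)
        return True
    except Exception:
        return False

# Placeholder inputs standing in for function-level samples from the dataset.
functions = [
    "int isEven(int num) { if (num % 2 == 0) return 1; return 0; }",
    "int broken(int x) { return x + ; }",   # invalid syntax, will be filtered out
]
kept = [f for f in functions if is_parseable(f)]
print(len(kept), "of", len(functions), "functions kept")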
4.5 Vulnerability Prediction Results
Based on CNN Model
After the transformation of source code into image
representations, these visual depictions become the
input data for our machine learning models. In this
study, we employ a Convolutional Neural Network
(CNN) as our chosen model for predicting vulner-
Table 1: Vulnerability prediction results.

          Method in (Bilgin et al., 2020)      Our proposed solution
CWE       Precision   Recall   F1              Precision   Recall   F1
CWE119    0.504       0.515    0.509           0.813       0.841    0.824
CWE120    0.415       0.440    0.427           0.764       0.782    0.773
CWE469    0.060       0.187    0.090           0.460       0.552    0.502
CWE476    0.701       0.521    0.598           0.932       0.910    0.921
Other     0.218       0.353    0.270           0.624       0.652    0.640
abilities at the function-level within C source code.
We select parameters such as the number of layers,
the quantity of filters, and the dimensions of the fil-
ters empirically. Each filter within the neural network
is defined as a 3 × 3 kernel. Collectively, the CNN
classifier encompasses an impressive 33,573,505 pa-
rameters, all of which are learned during the training
phase. For activation functions, all layers, except the
final one which employs the Softmax activation func-
tion, utilize Rectified Linear Unit (ReLU) functions.
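Since only a few architectural details are reported (3 × 3 kernels, ReLU activations, a Softmax output layer, about 33.6 million parameters), the Keras sketch below is an illustrative stand-in with those properties rather than the exact network; the 128 × 128 grayscale input shape and the layer sizes are assumptions.

import tensorflow as tf

def build_cnn(input_shape=(128, 128, 1), num_classes=2):
    # A small CNN in the spirit of the paper: 3x3 filters, ReLU layers,
    # and a Softmax output over vulnerable / non-vulnerable classes.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # the parameter count of this sketch differs from the 33,573,505 reported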
To evaluate our proposed method and compare its
results with previous studies, we train the CNN model
separately for the mentioned five different CWEs. We
compare the performance results with those provided
in (Bilgin et al., 2020). Our evaluation is based on
standard metrics such as precision, recall, and F1-
score. The results summarized in Table 1 present a
comparative analysis of vulnerability prediction per-
formance between the method in (Bilgin et al., 2020)
and our proposed image-based solution. This comparison under-
scores the improved performance of our image-based
solution across various CWEs, with enhanced preci-
sion, recall, and F1-scores. It suggests the poten-
tial effectiveness of our approach in accurately detect-
ing vulnerabilities, highlighting its promise for robust
vulnerability analysis in software systems.
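The precision, recall, and F1-score values in Table 1 correspond to standard scikit-learn metrics; the labels below are placeholders for the model's test-set predictions.

from sklearn.metrics import precision_recall_fscore_support

# Placeholder ground-truth and predicted labels (1 = vulnerable, 0 = non-vulnerable).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print("precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))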
5 THREATS TO VALIDITY
It is important to acknowledge and address potential
threats to the validity of our research. We recognize
several key threats to the validity of our findings.
5.1 External Validity
Our evaluation centered on C, providing a detailed
examination of the CodeGrapher approach in a con-
trolled setting. However, this may limit the gener-
alizability of our findings to other programming lan-
guages. Future work will broaden comparisons across
multiple languages to enhance generalizability.
5.2 Scalability and Granularity
Our experiments primarily focused on predicting vul-
nerabilities within individual functions using Code-
Grapher. However, software development occurs at
multiple levels, posing a threat to external validity and
generalizability. Future research will explore Code-
Grapher’s scalability and versatility across different
levels of analysis, ensuring a comprehensive evalua-
tion of its applicability.
5.3 Vulnerability Type Coverage
In this study, we aimed to cover various vulnerabil-
ity types, including buffer overflows, NULL pointer
dereference, and pointer subtraction. However, the
software vulnerability landscape is vast and evolving,
potentially omitting certain types. This could impact
construct validity, as our findings may not represent
the full spectrum of vulnerabilities. Future research
will address this by evaluating additional vulnerabil-
ity types, enhancing the comprehensiveness of the
CodeGrapher approach.
By recognizing and addressing these threats to va-
lidity, we aim to provide a more robust and compre-
hensive foundation for the application of CodeGra-
pher in software vulnerability prediction, extending
its relevance to a wider array of programming lan-
guages, levels of analysis, and vulnerability types.
6 CONCLUSION
Software vulnerabilities pose significant threats to
system security, motivating the development of effec-
tive detection methods. ML algorithms hold promise,
yet varying source code lengths challenge accuracy,
leading to false results. To overcome this, we intro-
duce CodeGrapher, preserving semantic relations by
converting numerical vectors into image sets for ML
input. Image size adjustability ensures consistency,
while similarity distance metrics maintain code ele-
ment relationships. Experimental validation demon-
strates superior performance. Future enhancements
include adjusting window size and CNN parameters
to improve accuracy, advancing software security.
ACKNOWLEDGEMENTS
This work was funded by The Scientific and Techno-
logical Research Council of Turkey, under 1515 Fron-
tier R&D Laboratories Support Program with project
no: 5169902.
REFERENCES
(2023a). Debian – The Universal Operating System.
(2023b). GitHub: Let’s build from here.
Alon, U., Zilberstein, M., Levy, O., and Yahav, E. (2019).
code2vec: Learning distributed representations of code.
Proc. ACM Program. Lang., 3(POPL):40:1–40:29.
Bamler, R. and Mandt, S. (2017). Dynamic word embed-
dings. In ICML, pages 380–389. PMLR.
Bendersky, E. (2019). Github–eliben/pycparser: Complete
c99 parser in pure python.
Bilgin, Z., Ersoy, M. A., Soykan, E. U., Tomur, E., Çomak,
P., and Karaçay, L. (2020). Vulnerability prediction
from source code using machine learning. IEEE Ac-
cess, 8:150672–150684.
Black, P. E. and Black, P. E. (2018). Juliet 1.3 test suite:
Changes from 1.2. US Department of Commerce, Na-
tional Institute of Standards and Technology.
Chen, L., Ye, W., and Zhang, S. (2019). Capturing source
code semantics via tree-based convolution over api-
enhanced ast. In Proceedings of the 16th ACM Inter-
national Conference on Computing Frontiers, pages
174–182.
Duan, X., Wu, J., Ji, S., Rui, Z., Luo, T., Yang, M., and Wu,
Y. (2019). Vulsniper: Focus your attention to shoot
fine-grained vulnerabilities. In IJCAI, pages 4665–
4671.
Fang, Y., Liu, Y., Huang, C., and Liu, L. (2020). Fastembed:
Predicting vulnerability exploitation possibility based
on ensemble machine learning algorithm. Plos one,
15(2):e0228439.
Feutrill, A., Ranathunga, D., Yarom, Y., and Roughan, M.
(2018). The effect of common vulnerability scoring
system metrics on vulnerability exploit delay. In CAN-
DAR, pages 1–10. IEEE.
Halepmollası, R., Hanifi, K., Fouladi, R. F., and Tosun,
A. (2023). A comparison of source code represen-
tation methods to predict vulnerability inducing code
changes.
Hanifi, K., Fouladi, R. F., Unsalver, B. G., and Karadag, G.
(2023). Software vulnerability prediction knowledge
transferring between programming languages. arXiv
preprint arXiv:2303.06177.
Harer, J. A., Kim, L. Y., Russell, R. L., Ozdemir, O., Kosta,
L. R., Rangamani, A., Hamilton, L. H., Centeno, G. I.,
Key, J. R., Ellingwood, P. M., McConley, M. W., Op-
per, J. M., Chin, P., and Lazovich, T. (2018). Auto-
mated software vulnerability detection with machine
learning. CoRR, abs/1803.04497.
Henkel, J., Lahiri, S. K., Liblit, B., and Reps, T. W. (2018).
Code vectors: understanding programs through em-
bedded abstracted symbolic traces. In ACM Joint Meeting on
European Software Engineering Conference and Symposium on the
Foundations of Software Engineering (ESEC/SIGSOFT FSE), pages 163–174.
Kanade, A., Maniatis, P., Balakrishnan, G., and Shi, K.
(2020). Learning and evaluating contextual embed-
ding of source code. In ICML, pages 5110–5121.
PMLR.
Lin, G., Wen, S., Han, Q.-L., Zhang, J., and Xiang, Y.
(2020). Software vulnerability detection using deep
neural networks: a survey. Proceedings of the IEEE,
108(10):1825–1848.
Lozoya, R. C., Baumann, A., Sabetta, A., and Bezzi, M.
(2021). Commit2vec: Learning distributed represen-
tations of code changes. SN Comput. Sci., 2(3):150.
Palit, T., Moon, J. F., Monrose, F., and Polychronakis, M.
(2021). Dynpta: Combining static and dynamic anal-
ysis for practical selective data protection. In 2021
IEEE Symposium on Security and Privacy (SP), pages
1919–1937. IEEE.
Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer,
J., Ozdemir, O., Ellingwood, P., and McConley, M.
(2018). Automated vulnerability detection in source
code using deep representation learning. In 2018 17th
IEEE international conference on machine learning
and applications (ICMLA), pages 757–762. IEEE.
Şahin, S. E., Özyedierler, E. M., and Tosun, A. (2022). Pre-
dicting vulnerability inducing function versions using
node embeddings and graph neural networks. Infor-
mation and Software Technology, page 106822.
Sane, P. (2020). Is the OWASP Top 10 list comprehensive
enough for writing secure code? In Proceedings of the
2020 International Conference on Big Data in Man-
agement, pages 58–61.
Schiewe, M., Curtis, J., Bushong, V., and Cerny, T.
(2022). Advancing static code analysis with language-
agnostic component identification. IEEE Access,
10:30743–30761.
Telang, R. and Wattal, S. (2007). An empirical analysis of
the impact of software vulnerability announcements
on firm stock price. IEEE Transactions on Software
engineering, 33(8):544–557.
Zhou, Y., Liu, S., Siow, J., Du, X., and Liu, Y. (2019). De-
vign: Effective vulnerability identification by learn-
ing comprehensive program semantics via graph neu-
ral networks. Advances in neural information process-
ing systems, 32.