Software Vulnerability Prediction Knowledge Transferring Between Programming Languages

Khadija Hanifi (1, a), Ramin F. Fouladi (1, b), Basak Gencer Unsalver (2, c) and Goksu Karadag (2, d)

(1) Ericsson Security Research, Istanbul, Turkey

(2) Vodafone, Istanbul, Turkey

Keywords:

Software Security, Vulnerability Prediction, Source Code, Machine Learning, Transfer Learning.

Abstract:

Developing automated and smart software vulnerability detection models has been receiving great attention from both research and development communities. One of the biggest challenges in this area is the lack of code samples for all different programming languages. In this study, we address this issue by proposing a transfer learning technique to leverage available datasets and generate a model to detect common vulnerabilities in different programming languages. We use C source code samples to train a Convolutional Neural Network (CNN) model; then, we use Java source code samples to adapt and evaluate the learned model. We use code samples from two benchmark datasets: the NIST Software Assurance Reference Dataset (SARD) and the Draper VDISC dataset. The results show that the proposed model detects vulnerabilities in both C and Java code with an average recall of 75%. Additionally, we employ explainable AI to investigate how much each feature contributes to the knowledge transfer mechanisms between C and Java in the proposed model.

1 INTRODUCTION

Developers are concerned with the correctness of the code they write: it should run in the desired way and meet predefined design specifications. Along with that, they utilize different code analysis techniques to ensure that the written code is robust and free of weaknesses (called vulnerabilities) that could be exploited by attackers to carry out malicious activities. Managing software vulnerabilities involves a wide range of code analysis techniques to enhance security and to ensure the confidentiality, integrity, and availability of the system. Source code analysis techniques are usually classified into two main groups: static and dynamic code analysis (Palit et al., 2021). Static code analysis examines the source code without executing it. It utilizes a predefined set of rules and coding standards and checks how well the code meets them. Dynamic code analysis, on the other hand, entails running the code and examining the outcome. This involves testing different possible execution paths of the code and examining their outputs. Dynamic code analysis is able to find security issues caused by the program's interaction with other system components such as SQL databases or Web services. However, building a series of correct inputs for test coverage requires prior knowledge of the program steps. Although static and dynamic analysis take different approaches to finding vulnerabilities, they both suffer from a high false positive rate, which necessitates human expertise to review the results and costs extra time and effort.

a https://orcid.org/0000-0001-7044-3315
b https://orcid.org/0000-0003-4142-1293
c https://orcid.org/0000-0001-9426-8400
d https://orcid.org/0000-0002-4596-0983

Many studies in the literature have explored methods to decrease the false positive rate of both static and dynamic analysis. A common approach in these studies is the use of machine learning (ML) techniques to train vulnerability detection models (Lin et al., 2020). While ML-based models show potential for improving vulnerability prediction accuracy, they also face two significant challenges. Firstly, the process of addressing various types of vulnerabilities is resource-intensive and requires both software and human expert analysis. This is further complicated by the fact that commercially used source code is subject to intellectual property rights and considered confidential by enterprises, making it difficult to obtain access to real labeled vulnerability data. To mitigate this issue, synthetic datasets, such as the NIST Software Assurance Reference Dataset and the Draper VDISC dataset, are generated to simulate different vulnerability types for specific programming languages. However, these datasets are limited in scope, covering only published vulnerability types for a limited range of programming languages. Secondly, the structure and logic of each programming language are distinct, which results in language-specific feature extraction and varying numbers of features. This makes it challenging to develop a single model that can be used for all programming languages.

Hanifi, K., Fouladi, R., Unsalver, B. and Karadag, G.
Software Vulnerability Prediction Knowledge Transferring Between Programming Languages.
DOI: 10.5220/0011859800003464
In Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2023), pages 479-486
ISBN: 978-989-758-647-7; ISSN: 2184-4895
Copyright © 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

In this study, we propose a novel technique to address the challenge of limited vulnerability sample availability in some programming languages. Our approach utilizes available datasets to train a machine learning model for vulnerability detection in a related programming language and applies this knowledge to predict vulnerabilities in other programming languages with limited training samples. The main contributions of our work are:

• An automated and interpretable software vulnerability prediction model using machine learning techniques, which can learn from datasets in one programming language and apply its knowledge to predict vulnerabilities in other programming languages.

• A code representation method that converts source code into numerical vectors for machine learning analysis, combined with a syntax matching technique for matching related components between Abstract Syntax Trees (ASTs) from different programming languages.

• A new method for vulnerability prediction knowledge transfer, which involves customized processing steps for different programming languages and common final steps for vulnerability prediction.

We carried out a preliminary experiment to demonstrate our approach using the vulnerabilities common to the C and Java programming languages. Our results show that we were able to transfer vulnerability prediction knowledge from C to Java and that our machine learning model was able to make accurate predictions. Additionally, we leveraged explainable AI methods to better understand and verify the correctness of our model.

2 RELATED WORK

Software vulnerability prediction is rising dramatically in popularity in the research community and industry, especially among cyber security professionals dealing with vulnerability management and the secure software development life cycle (SSDLC). This spike in popularity is due to the rise in cyber security threats that exploit vulnerabilities for malicious purposes. Among the studies on software vulnerability prediction, ML-based approaches are ubiquitous (Hanif et al., 2021). In general, two main approaches, metric-based and pattern-based, are utilized with ML algorithms for software vulnerability prediction (Li et al., 2021). Metric-based approaches link software engineering metrics, such as code complexity metrics (Kalouptsoglou et al., 2022; Halepmollası et al., 2023) and developer activity metrics (Coskun et al., 2022), to software vulnerabilities and use those metrics to train ML models for vulnerability prediction. Although metric-based approaches are lightweight enough to analyze large-scale programs, they suffer from high false positive rates. Pattern-based approaches, on the other hand, improve the efficiency and automation of software vulnerability prediction. The authors of (Bilgin et al., 2020) proposed a method for function-level software vulnerability prediction for C code. While preserving the structural and semantic information in the source code, the method transforms the AST of the source code into a numerical vector and then utilizes a 1D CNN for software vulnerability prediction. The authors of (Duan et al., 2019) used the Control Flow Graph (CFG) and the AST as the graph representation and utilized soft attention to extract high-level features for vulnerability prediction. The authors of (Zhou et al., 2019) proposed a function-level software vulnerability prediction method based on a graph representation that utilizes AST, dependency, and natural code sequence information.

NIST SARD: https://samate.nist.gov/SARD

With respect to applying transfer learning to software vulnerability prediction, the authors of (Ziems and Wu, 2021) developed various deep learning Natural Language Processing (NLP) models to predict vulnerabilities in C/C++ source code. They used transfer learning to adapt models pre-trained on English, such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), for software vulnerability prediction. The performance is promising despite the structural differences between English and the C/C++ programming languages. The authors of (Lin et al., 2019) proposed a software vulnerability prediction method based on transfer learning and Long Short-Term Memory (LSTM). They combined several heterogeneous, cross-domain data sources to obtain a general representation of patterns for vulnerability prediction.

ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering


Although pattern-based approaches improve prediction performance with respect to the false positive rate, they are not adaptable and flexible enough to be reused in domains other than the one they were trained for. Transfer learning (TL)-based methods address this issue to some extent; however, they are mainly limited to very close domains, such as the same or similar programming languages. In this study, we address these issues by combining a pattern-based approach with transfer learning, so that the knowledge acquired for software vulnerability prediction in C source code can be used for software vulnerability prediction in Java source code.

3 PROPOSED METHOD

In this section, we discuss our proposed method for developing an ML-based, programming-language-agnostic software vulnerability prediction model. To this end, we trained a CNN model with software vulnerability samples of C source code, and then we updated the trained model to predict vulnerabilities in Java source code. The main phases of the proposed method are discussed in the following subsections.

3.1 Preprocessing Source Code

Extracting discriminating features from source code and representing them as numerical vectors is considered the most critical step in training an ML-based software vulnerability prediction model. The converted numerical representation needs to preserve not only the essential information from the source code but also the semantic relations between its key elements. Moreover, extracting the optimal feature set improves the vulnerability prediction performance and facilitates training a generic model.

In this phase, we explain the steps we follow to convert input source code into an ML-suitable numerical vector that contains the most important features for vulnerability prediction.

We demonstrate the steps of converting source code into a numerical vector on the code sample provided below, as it is simple and valid in both C and Java:

    int sum(int a, int b) {
        return a + b;
    }

We convert the input source code into a numerical

array through the following steps:

Step-1 (Tokenization): In this step, we first make sure that all non-code parts, such as comments, tabs, and newlines, are removed. Then, we convert the main code into a stream of tokens. Each token is a sequence of characters that can be treated as a unit in the grammar of the corresponding programming language. For this, we use two lexers developed explicitly for Java and C source code. The tokens extracted from the sum function are as follows:

'int': Keyword, 'sum': Identifier, '(': Separator, 'int': Keyword, 'a': Identifier, ',': Separator, 'int': Keyword, 'b': Identifier, ')': Separator, '{': Separator, 'return': Keyword, 'a': Identifier, '+': Operator, 'b': Identifier, ';': Separator, '}': Separator.
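The tokenization step can be illustrated with a small sketch. The Python example below is a minimal regex-based lexer and not the lexers used in this study; the KEYWORDS set, the token categories, and the tokenize function are assumptions made for illustration and cover only a small subset of C and Java.

```python
import re

# Assumed, deliberately tiny keyword set; real lexers know the full language.
KEYWORDS = {"int", "char", "void", "return", "if", "else", "while", "for"}

TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)   # line and block comments, dropped
  | (?P<ws>\s+)                       # spaces, tabs, newlines, dropped
  | (?P<word>[A-Za-z_]\w*)            # keywords and identifiers
  | (?P<number>\d+)                   # integer literals
  | (?P<separator>[(){}\[\];,])       # separators
  | (?P<operator>[+\-*/%=<>!&|]+)     # operators
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return a list of (text, category) pairs, skipping comments and whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind in ("comment", "ws"):
            continue
        if kind == "word":
            kind = "Keyword" if text in KEYWORDS else "Identifier"
        else:
            kind = kind.capitalize()
        tokens.append((text, kind))
    return tokens


tokens = tokenize("int sum(int a,int b){\n\treturn a+b; // add\n}")
```

For the sum function this yields sixteen tokens, starting with ('int', 'Keyword') and ('sum', 'Identifier') and ending with ('}', 'Separator').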

Step-2 (AST Generation, Normalization, and Simplification): The AST contains syntactic and semantic information about the source code and is therefore highly useful for code analysis. In this step, we extract the AST of the tokenized code using two parsers developed explicitly for Java and C source code. Figure 1 presents the Java AST extracted from the sum function.

Figure 1: Java AST of ”sum” function.

Because Java and C have distinct structures, there is no common parser for both languages, and thus the generated ASTs differ for each language. Also, the AST structure is flexible, and each node in the AST can have multiple attributes/children; with this structure, however, it is hard to follow the relationships between AST nodes after converting the AST to an array. Thus, we normalize and simplify the C and Java ASTs to ensure that equivalent ASTs are converted into the same numerical array by the end of the preprocessing phase. We do this in two steps. First, we normalize the AST tokens by replacing all identifier tokens with assigned abstract representations. For example, all method names are replaced with "m name", all variable names are replaced with "v name", etc. Then, considering the importance of structural relations between the AST nodes for vulnerability prediction, we simplify the tokens that represent features (leaves in the AST) rather than independent tokens by combining them with their parent token. Figure 2 shows the normalized and simplified AST of the function sum.


Figure 2: Normalized and simpliﬁed Java N-array-AST of

”sum” function.
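As a rough illustration of the identifier normalization in Step-2, the following Python sketch rewrites identifier tokens to abstract placeholders. It works on a flat token list rather than on AST nodes, and the placeholder spellings m_name and v_name, the function name normalize_identifiers, and the rule that an identifier followed by an opening parenthesis is a method name are all assumptions made for this example.

```python
def normalize_identifiers(tokens):
    """Replace concrete identifier names with abstract placeholders."""
    normalized = []
    for i, (text, kind) in enumerate(tokens):
        if kind == "Identifier":
            # Assumed heuristic: an identifier directly followed by "(" names a method.
            followed_by_paren = i + 1 < len(tokens) and tokens[i + 1][0] == "("
            text = "m_name" if followed_by_paren else "v_name"
        normalized.append((text, kind))
    return normalized


# Two functions that differ only in their identifier names.
sum_tokens = [("int", "Keyword"), ("sum", "Identifier"), ("(", "Separator"),
              ("int", "Keyword"), ("a", "Identifier"), (")", "Separator")]
add_tokens = [("int", "Keyword"), ("add", "Identifier"), ("(", "Separator"),
              ("int", "Keyword"), ("x", "Identifier"), (")", "Separator")]

same = normalize_identifiers(sum_tokens) == normalize_identifiers(add_tokens)
```

After normalization the two token lists are identical, which is the property that lets structurally equivalent code map to the same numerical array later in the pipeline.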

Step-3 (Conversion to a Complete Binary AST): In this step, we convert the AST from a general tree with an unbounded number of children per node into a complete binary tree. With this conversion, we not only preserve the relations between nodes but also unify the AST structure for both C and Java. We convert the regular AST (referred to as the N-array-AST) into a complete binary AST (referred to as the CB-AST), where all leaves have the same depth and all internal nodes have exactly two children. The conversion from the N-array-AST to the CB-AST follows these rules (Bilgin et al., 2020):

• The root of the N-array-AST is the root of the CB-AST.
• The left-child of a node in the CB-AST is the leftmost child of the node in the N-array-AST.
• The right-child of a node in the CB-AST is the right sibling of the node in the N-array-AST.
• When the node does not have children, its left-child is set to NULL.
• When the node is the rightmost child of its parent, its right-child is set to NULL.

The Java CB-AST for the sum function is shown in Figure 3. Additionally, to obtain a complete tree, NULL child nodes are added until we reach a level at which all nodes are NULL.
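The five conversion rules amount to a left-child/right-sibling transformation, which can be sketched as follows. This is an illustrative Python example under assumed names (Node, to_binary) and an assumed toy tree; it is not the implementation used in this study, and the padding with NULL nodes is left out here.

```python
class Node:
    """Node of an N-ary tree: a label plus an ordered list of children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def to_binary(node, right_siblings=()):
    """Return (label, left, right) following the left-child/right-sibling rules."""
    left = None   # stays None (NULL) when the node has no children
    right = None  # stays None (NULL) when the node is the rightmost child
    if node.children:
        # Left child: the leftmost child, which carries its own right siblings.
        left = to_binary(node.children[0], node.children[1:])
    if right_siblings:
        # Right child: the next sibling in the original tree.
        right = to_binary(right_siblings[0], right_siblings[1:])
    return (node.label, left, right)


# Toy tree: R has children A, B, C; A has a single child D.
tree = Node("R", [Node("A", [Node("D")]), Node("B"), Node("C")])
binary = to_binary(tree)
```

In the result, A becomes the left child of R, D the left child of A, B the right child of A, and C the right child of B, so parent/child and sibling relations are both kept in a tree with at most two children per node.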

Figure 3: Java binary AST of "sum" function.

Step-4 (Encoding to Numerical 3-Tuples): Since the majority of ML algorithms expect numerical inputs, we convert the CB-AST into a numerical vector. To this end, we leverage the encoding method presented in (Bilgin et al., 2020) to convert CB-AST nodes into n-tuples of numerical values. In this study, we used 3-tuples; the first number represents the type of the token, while the second and third numbers store additional information about the node. For example, the token "BasicType: int, 50" is encoded as (8.0, 103.0, 50.0). This representation can be viewed as a three-dimensional data structure where each dimension holds a value related to the associated token. Note that the numeric values used by the encoding mechanism are chosen arbitrarily and could be changed as long as different categories take different values.

Step-5 (Vector Representation): After replacing each node in the CB-AST with its 3-tuple numeric value, we transfer the CB-AST into a 1D array. To this end, we use the breadth-first search (BFS) algorithm to position the numeric tuples in their corresponding locations in the 1D array. The 1D array is filled with the CB-AST nodes starting from the root node, traversing all the nodes at the present level, then moving on to the nodes at the next depth level, and so on. Figure 4 shows the order in which BFS visits the nodes of the first three levels of the Java CB-AST.

Figure 4: Reading order of the CB-AST nodes.

The proposed approach preserves the structural relations in the source code via the respective positional relations between elements in the 1D array. Also, with the rich numerical encoding technique (explained in Step-4), the semantic information of the AST is preserved and embedded in the 1D array. Another significant advantage of the proposed numerical array representation is that each element in the 1D array acts as a feature, and hence it can be used directly as input to an ML model, which facilitates various kinds of automated and intelligent code analysis.
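A minimal sketch of the level-order flattening is given below. It assumes that each CB-AST node is a nested tuple (code, left, right) whose code is already a 3-tuple, that padded NULL nodes are encoded as (0.0, 0.0, 0.0), and that padding stops at the deepest level containing a real node; these choices, like the names height and flatten_bfs, are assumptions for illustration only.

```python
NULL = (0.0, 0.0, 0.0)  # assumed encoding of a padded NULL node


def height(tree):
    """Number of levels in the binary tree (0 for an empty tree)."""
    if tree is None:
        return 0
    _, left, right = tree
    return 1 + max(height(left), height(right))


def flatten_bfs(tree):
    """Visit the tree level by level and concatenate the 3-tuples into a 1D list.

    Missing children are replaced by NULL so that every level is full and a
    node's position in the list depends only on its position in the tree.
    """
    flat, level = [], [tree]
    for _ in range(height(tree)):
        next_level = []
        for node in level:
            if node is None:
                flat.extend(NULL)
                next_level += [None, None]
            else:
                code, left, right = node
                flat.extend(code)
                next_level += [left, right]
        level = next_level
    return flat


leaf = ((44.0, 0.0, 0.0), None, None)
tree = ((30.0, 0.0, 0.0), ((8.0, 103.0, 50.0), None, leaf), None)
vector = flatten_bfs(tree)
```

A tree with three levels always produces 7 node slots, that is 21 numbers, regardless of how many of those slots hold real nodes; this fixed layout is what allows positions in the array to act as features.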


3.2 Syntax Matching and Knowledge Transferring

The analysed tokens are clustered into three main groups: tokens that exist only in C, like CompoundLiteral; tokens that exist only in Java, like ClassDeclaration; and tokens that exist in both C and Java. We therefore prepared a dictionary that covers these three categories. However, the tokens that exist in both C and Java are equivalent but carry different names, because different lexers and parsers are used. For example, "FuncDef", "Constant", and "If" are used by the C lexer, whereas "MethodDeclaration", "BasicType", and "IfStatement" are used by the Java lexer. In order to use one model for both Java and C code, we need to generate the same numerical array for equivalent C and Java code. Thus, we created a table to match the tokens used by the compiler to construct different statements and expressions in each programming language. We prepared a mapping dictionary following the steps provided in Figure 5 to encode equivalent tokens identically. After generating the CB-AST, tokens are encoded using the prepared dictionary, which contains the equivalent tokens of C and Java with their 3-tuples. Table 1 provides examples of common tokens between C and Java along with their numerical 3-tuples.

Figure 5: Steps for matching C and Java tokens.

Table 1: Examples of common C and Java tokens.

C Token      Java Token          Numerical 3-tuple
DoWhile      DoStatement         (22.0, 0.0, 0.0)
If           IfStatement         (30.0, 0.0, 0.0)
Switch       SwitchStatement     (38.0, 0.0, 0.0)
TernaryOp    TernaryExpression   (39.0, 0.0, 0.0)
While        WhileStatement      (44.0, 0.0, 0.0)
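The mapping dictionary can be pictured as two per-language lookup tables that share the same 3-tuples. The sketch below uses only the five rows of Table 1; the names COMMON_TOKENS, ENCODING, and encode are hypothetical, and the full dictionary used in this study is not reproduced here.

```python
# Rows copied from Table 1: (C token, Java token, shared 3-tuple).
COMMON_TOKENS = [
    ("DoWhile", "DoStatement", (22.0, 0.0, 0.0)),
    ("If", "IfStatement", (30.0, 0.0, 0.0)),
    ("Switch", "SwitchStatement", (38.0, 0.0, 0.0)),
    ("TernaryOp", "TernaryExpression", (39.0, 0.0, 0.0)),
    ("While", "WhileStatement", (44.0, 0.0, 0.0)),
]

# One lookup table per language, both pointing at the same numeric codes.
ENCODING = {"C": {}, "Java": {}}
for c_name, java_name, code in COMMON_TOKENS:
    ENCODING["C"][c_name] = code
    ENCODING["Java"][java_name] = code


def encode(language, token):
    """Return the 3-tuple for a token name produced by the given language's parser."""
    try:
        return ENCODING[language][token]
    except KeyError:
        raise KeyError(f"no encoding for {language} token {token!r}") from None


same_code = encode("C", "While") == encode("Java", "WhileStatement")
```

Because equivalent constructs resolve to the same tuple, everything downstream of the encoding step (flattening and the CNN) can stay language-independent.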

3.3 Generic Model Generation

In this study, we aim to overcome the lack of sufficient training data samples for generalizing an ML model that predicts a software vulnerability in the source code of a given programming language (the target). To this end, we train an ML model for the same vulnerability type using a large number of data samples from another programming language and then transfer the knowledge to vulnerability prediction in the target language by applying a transfer learning method. The main approach we followed in this study (shown in Figure 6) is based on two steps. First, we trained and improved a model with a large dataset of vulnerability examples written in C. Then, we updated the model with the limited Java examples to predict the vulnerabilities that exist in Java code using the knowledge gained from C code.

Figure 6: General approach used for vulnerability prediction knowledge transferring.

To automatically detect the important features of vulnerabilities without the need for human intervention, we used a CNN architecture. Another advantage of using a CNN is that it extracts the relations and correlations between features by applying convolution across them.

For knowledge transfer, we developed an approach that differs from common transfer learning approaches (where the input data in the two datasets are similar but the output labels are different). In our case, we aim to get the same output labels (vulnerable and non-vulnerable) for Java code by leveraging the features and weights learned from the C dataset. However, the input Java code has structural and syntactic differences and needs to be mapped onto its equivalent C code constructs. Thus, as shown in Figure 7, we customized the first steps of our approach (including tokenization, AST generation, and encoding to numerical tuples) to process Java input code and convert it into numerical arrays equivalent to the ones extracted from C code. Then, we use the trained CNN for the feature extraction and classification steps. One of the advantages of this approach, other than knowledge transfer, is that the trained feature extraction and classification layers can be used as they are, and they can also be improved and updated with new data samples (C or Java).

Moreover, to decrease the impact of structural differences between the two programming languages (C and Java), we built the system to analyze source code at the function level.

Figure 7: Proposed model for vulnerability prediction

knowledge transferring.

4 DATASET

For this study, we used data samples from two datasets: the Draper VDISC Dataset (Russell et al., 2018) and the Software Assurance Reference Dataset (SARD) (Black, 2018). While the former provides the vulnerability samples for C source code, the latter provides those for Java source code. A brief summary of each dataset is provided in the following subsections.

4.1 Draper VDISC Dataset

The VDISC dataset consists of function-level source code in the C and C++ programming languages, labeled by static analysis for potential vulnerabilities. The data is provided in HDF5 files, and the raw source code is stored as UTF-8 strings. Five binary 'vulnerability' labels are provided for each function, corresponding to the four most common CWEs and a catch-all category: CWE-120 (3.7% of functions), CWE-119 (1.9% of functions), CWE-469 (0.95% of functions), CWE-476 (0.21% of functions), and CWE-other (2.7% of functions).

4.2 SARD Dataset

SARD is a growing collection of test programs with documented weaknesses. Test cases vary from small synthetic programs to large applications. The test cases in the SARD dataset come from four sources: wild code sampling (code with known bugs from industry and open source software), artificial code construction (code produced by researchers to cover a wide range of weaknesses), academic code (code collected from computer science and programming courses), and automatically generated code. The SARD dataset is provided as test cases that contain the code files for each CWE type. In this paper, we used Java source code test cases from the Juliet Test Suite.

4.3 Dataset Preparation

The model was evaluated on NULL Pointer Dereference vulnerabilities (CWE-476), a prevalent CWE for which suitable data is available for both the C and Java programming languages. For model training and validation, we obtained C data samples from the VDISC dataset, resulting in a prepared dataset of 120,609 non-vulnerable C functions and 1160 vulnerable C functions. However, the two classes were significantly imbalanced, with more than 100 non-vulnerable samples for every vulnerable sample, which hampers the model's ability to learn from vulnerable samples. To address this issue, we down-sampled the non-vulnerable class to a vulnerable-to-non-vulnerable ratio of 1:4. Additionally, we acquired Java samples from the SARD dataset and, after performing the preprocessing steps, generated a labeled dataset with 616 vulnerable Java functions and 328 non-vulnerable Java functions. Due to the limited number of Java samples in either class, we resorted to transfer learning to apply the ML model trained on C code to detect vulnerabilities in both C and Java code. Finally, the labels were converted to a binary classification format, with 0 indicating a non-vulnerable and 1 indicating a vulnerable code sample.
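The down-sampling to a 1:4 ratio could look like the sketch below. The text does not state how the retained non-vulnerable samples were chosen, so uniform random sampling with a fixed seed is an assumption here, as are the function name downsample and its parameters.

```python
import random


def downsample(samples, labels, ratio=4, seed=0):
    """Keep every vulnerable sample (label 1) and at most `ratio` times as many
    non-vulnerable samples (label 0), chosen uniformly at random."""
    rng = random.Random(seed)
    vulnerable = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    kept_clean = rng.sample(clean, min(len(clean), ratio * len(vulnerable)))
    keep = sorted(vulnerable + kept_clean)  # preserve the original order
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Class sizes taken from the text: 1160 vulnerable, 120,609 non-vulnerable.
labels = [1] * 1160 + [0] * 120609
samples = list(range(len(labels)))  # stand-ins for the encoded functions
kept_samples, kept_labels = downsample(samples, labels)
```

With the class sizes reported above, this keeps 1160 vulnerable and 4640 non-vulnerable samples, which matches the C training row of Table 2.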

5 RESULTS AND DISCUSSION

In this section, we present and discuss the performance of the proposed method. First, we present the experimental results of the proposed model. Then, we discuss the validity, limitations, and implications of this work.

5.1 Experimental Results

We used a binary classifier based on a 1D CNN that consists of two convolutional layers, a max-pooling layer, dropout layers, a fully connected layer, and a softmax layer. The CNN architecture was chosen for its ability to detect the important features automatically, without the need for human intervention. Also, by applying convolution across features, it extracts the relations and correlations between them. Dropout is applied to avoid over-fitting to the training data.

Since we are training a binary classifier, it is better to train it with a more balanced dataset. Thus, we used the down-sampled C dataset for the initial training of the CNN model. The unbalanced test set, on the other hand, represents the expected real-world scenario, in which few functions induce vulnerabilities. Therefore, we use the full test set to measure the performance of the model. The numbers of vulnerable and non-vulnerable samples used for the training and test steps are shown in Table 2. As can be seen, the Java samples are few and not enough to train a model on their own.

Table 2: Number of vulnerable and non-vulnerable samples

used in train and test sets.

Data Language Down-sampled Vulnerable Non-vulnerable

Train C Yes 1160 4640

Train Java No 305 167

Test C No 140 15,062

Test Java No 311 161

To assess the performance of the proposed model in predicting vulnerabilities, we analyze accuracy, precision, recall, F1-score, and Area Under the Curve (AUC). The confusion tables for the C and Java test sets are presented in Table 3, and the evaluation metrics for both C and Java source code are provided in Table 4. As can be observed, the developed model is able to detect vulnerabilities in both C and Java source code, with accuracies of 99% and 91%, respectively. Hence, the knowledge the model gained from vulnerable C samples was transferred and used to detect vulnerabilities in Java source code as well. Moreover, the additional Java samples used to update the model helped to improve it, as the precision, recall, F1-score, and AUC values increased.

Table 3: Confusion tables for C and Java examples (No = non-vulnerable, Yes = vulnerable).

C test set      Predicted No   Predicted Yes
Actual No       14977          85
Actual Yes      64             76

Java test set   Predicted No   Predicted Yes
Actual No       128            33
Actual Yes      10             301

Table 4: Vulnerability prediction results for both C and Java

samples.

Performance C Java Average

Accuracy 0.99 0.91 0.95

Precision 0.47 0.90 0.68

Recall 0.54 0.97 0.75

F1-Score 0.51 0.93 0.72

AUC 0.77 0.88 0.82
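The accuracy, precision, recall, and F1-score columns can be recomputed from the confusion tables in Table 3 with the standard definitions. The sketch below does this in plain Python; the function name metrics and the dictionary layout are choices made for this example, and AUC is omitted because it cannot be derived from a single confusion table.

```python
def metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion-table counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from Table 3 (Yes = vulnerable is the positive class).
c_scores = metrics(tn=14977, fp=85, fn=64, tp=76)
java_scores = metrics(tn=128, fp=33, fn=10, tp=301)

for name in ("accuracy", "precision", "recall", "f1"):
    print(f"{name:9s}  C={c_scores[name]:.2f}  Java={java_scores[name]:.2f}")
```

The recomputed values agree with Table 4 to within rounding; the only visible difference is the C F1-score, which comes out at about 0.50 against the reported 0.51. The gap between the high C accuracy and the much lower C precision and recall reflects the strong class imbalance of the C test set.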

5.2 Model Explanation

We utilize the LIME (Local Interpretable Model-agnostic Explanations) method to explain and interpret the predictions of our proposed model. We aim to examine the effect of our code representation method on the classification model for each programming language. To this end, we apply LIME to the C and Java test sets separately and analyze the features used to identify vulnerable source code in each language. The results of the LIME analysis are presented in Figure 8. We can see that the most influential features (encoded values) for differentiating vulnerable from non-vulnerable source code are the same for both programming languages. For example, the values 1277, 1280, 1219, 641, and 317 have a similar effect on Java and C source code. This indicates that our method has successfully converted the source code from both programming languages (C and Java) into a unique, equivalent, simple, ML-friendly numeric vector. In addition, it suggests that the proposed model is generic and can be used to detect the trained vulnerabilities in other programming languages once the same code representation method is applied.

Figure 8: Local explanation of vulnerable examples detection in C and Java test sets.

5.3 Threats to Validity

Data Insufficiency: The limited availability of data for both C and Java source code may impact the representativeness of the results. To address this, data from two different datasets were used. However, this could lead to potential biases in the results if the datasets are not fully representative of real-world applications.

Preprocessing: The preprocessing steps involved matching the syntax between C and Java tokens, which may have introduced errors or biases due to the structural differences between the two languages.

Model Training: To address the lack of labeled source code from real projects, synthetic code from the SARD and VDISC datasets was used for training the model. This could affect the validity of the results, as the model may not generalize well to real-world applications.

6 CONCLUSION AND FUTURE WORK

In this study, we focus on the problem of having vulnerability samples for some programming languages but not others. To overcome this problem, we design a method that extracts vulnerability prediction knowledge from available data samples and then uses it to predict vulnerabilities in another programming language. We also add the flexibility to update the model once new samples are provided. Specifically, we built a model that is able to detect vulnerabilities in both Java and C source code. We trained a CNN-based model with C source code from the VDISC dataset. Then, we modified the model to detect the learned vulnerabilities in Java source code, using Java samples extracted from the SARD dataset. Our experiments show that, despite the many differences between programming languages, one model can be trained to detect vulnerabilities in more than one programming language. This study could be further extended to detect vulnerabilities in other commonly used programming languages such as Python and JavaScript. It could also be improved by training the model on other common vulnerability types from different programming languages.

ACKNOWLEDGEMENTS

This work was funded by The Scientific and Technological Research Council of Turkey, under 1515 Frontier R&D Laboratories Support Program with project no: 5169902.

REFERENCES

Bilgin, Z., Ersoy, M. A., Soykan, E. U., Tomur, E., Çomak, P., and Karaçay, L. (2020). Vulnerability prediction from source code using machine learning. IEEE Access, 8:150672–150684.

Black, P. E. (2018). A software assurance reference dataset: Thousands of programs with known bugs. Journal of Research of the National Institute of Standards and Technology, 123:1.

Coskun, T., Halepmollasi, R., Hanifi, K., Fouladi, R. F., De Cnudde, P. C., and Tosun, A. (2022). Profiling developers to predict vulnerable code changes. In Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering, pages 32–41.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Duan, X., Wu, J., Ji, S., Rui, Z., Luo, T., Yang, M., and Wu, Y. (2019). Vulsniper: Focus your attention to shoot fine-grained vulnerabilities. In IJCAI, pages 4665–4671.

Halepmollası, R., Hanifi, K., Fouladi, R. F., and Tosun, A. (2023). A comparison of source code representation methods to predict vulnerability inducing code changes. In ENASE, page In Press.

Hanif, H., Nasir, M. H. N. M., Ab Razak, M. F., Firdaus, A., and Anuar, N. B. (2021). The rise of software vulnerability: Taxonomy of software vulnerabilities detection and machine learning approaches. Journal of Network and Computer Applications, 179:103009.

Kalouptsoglou, I., Siavvas, M., Kehagias, D., Chatzigeorgiou, A., and Ampatzoglou, A. (2022). Examining the capacity of text mining and software metrics in vulnerability prediction. Entropy, 24(5):651.

Li, X., Wang, L., Xin, Y., Yang, Y., Tang, Q., and Chen, Y. (2021). Automated software vulnerability detection based on hybrid neural network. Applied Sciences, 11(7):3201.

Lin, G., Wen, S., Han, Q.-L., Zhang, J., and Xiang, Y. (2020). Software vulnerability detection using deep neural networks: a survey. Proceedings of the IEEE, 108(10):1825–1848.

Lin, G., Zhang, J., Luo, W., Pan, L., De Vel, O., Montague, P., and Xiang, Y. (2019). Software vulnerability discovery via learning multi-domain knowledge bases. IEEE Transactions on Dependable and Secure Computing, 18(5):2469–2485.

Palit, T., Moon, J. F., Monrose, F., and Polychronakis, M. (2021). Dynpta: Combining static and dynamic analysis for practical selective data protection. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1919–1937. IEEE.

Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer, J., Ozdemir, O., Ellingwood, P., and McConley, M. (2018). Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE international conference on machine learning and applications (ICMLA), pages 757–762. IEEE.

Zhou, Y., Liu, S., Siow, J., Du, X., and Liu, Y. (2019). Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32.

Ziems, N. and Wu, S. (2021). Security vulnerability detection using deep learning natural language processing. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 1–6. IEEE.
