A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach

Tamás Aladics 1,2 a, Péter Hegedűs 1,2 b and Rudolf Ferenc 1 c

1 Department of Software Engineering, University of Szeged, Szeged, Hungary

2 FrontEndART Ltd., Szeged, Hungary

Keywords:

Just-in-Time Vulnerability Detection, Dataset, SZZ, Vulnerability Introducing Commits.

Abstract:

In the domain of vulnerability detection from the source code by applying static analysis, the number and quality of available datasets for creating and testing security analysis methods is quite low. To be precise, there are already several public datasets containing vulnerability fixing commits; however, vulnerability introducing commit datasets, which would be essential for creating and validating just-in-time vulnerability detection approaches, are scarce. In this paper, we propose a method based on SZZ (an algorithm originally developed to find bug introducing commits) with a specific filtering mechanism to create vulnerability introducing commit datasets from vulnerability fixes. The filtering phase involves measuring a relevance score for each vulnerability introducing commit candidate based on commit similarities. We generated a novel Java vulnerability introducing commit dataset from the existing project-KB repository to demonstrate our algorithm's capabilities. We also showcase the generated database and the effectiveness of our filtering method through several hand-picked examples from the dataset.

1 INTRODUCTION

Many software engineering-related tasks, such as quality assurance or testing, are now aided by machine learning, which relies heavily on the abundance of data. The availability of datasets is therefore crucial to train models reliably and to obtain a model that performs well in general.

Fortunately, when the goal is related to vulnera-

bility ﬁxes, there are already well established datasets

that can be relied on. These datasets typically contain

validated code changes (i.e. commits) that ﬁx a par-

ticular vulnerability described in a Common Vulner-

abilities and Exposures (CVE) (MITRE Corporation,

v 21) entry, a publicly disclosed security vulnerability

in a software system. One such dataset is published

as part of the repository “project-KB” (project kay-

bee) (Ponta et al., 2019) maintained by SAP.

This dataset contains CVE entries and their cor-

responding commit references of Java software sys-

tems that are known to have ﬁxed the security issues.

a https://orcid.org/0000-0002-4689-8878
b https://orcid.org/0000-0003-4592-6504
c https://orcid.org/0000-0001-8897-7403

Datasets like project-KB can be exploited in various use cases, such as aggregating security-related statistics, gaining insight into the security state of a system, and checking which version of a specific library or software contains a security risk. The latter use case is crucial for larger projects that use many other (typically open source) software components, as they need to know what kind of security issues the system is exposed to by using those external libraries.

There are tasks, however, that are not related to

vulnerability ﬁxes but to vulnerability introducing

commits, such as just-in-time vulnerability detection

and localization (Amin et al., 2019; Cao et al., 2021;

Li et al., 2013), when the purpose is to ﬁnd the vul-

nerable part of the system or detect the presence of a

security bug. In these cases, ﬁnding the appropriate

dataset can be challenging, if possible at all. This is

partly due to the fact that while ﬁxing commits are

sometimes available as part of the CVE entries (or at

least it is possible to examine the commit history and

apply heuristics to identify the ﬁxing commit (Ponta

et al., 2019)), the commit which introduced that par-

ticular vulnerability can be ambiguous and not trivial

to ﬁnd.

Our aim in this work is to help solve the lack of vulnerability introducing commit datasets by providing a method to automatically generate them from vulnerability fixing datasets. The procedure consists of two phases. The first involves running an implementation of the SZZ algorithm (Sliwerski et al., 2005) called SZZ Unleashed (Borg et al., 2019) for each fixing commit in the dataset, which results in a set of candidate introducing commits for each fixing commit. However, inspecting these candidate commits shows that the results from the first phase are hardly usable in practice due to a number of issues. The most prominent of these issues is that the number of candidates and, more importantly, the number of false positives can be very high.

Aladics, T., Hegedűs, P. and Ferenc, R.
A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach.
DOI: 10.5220/0011275200003266
In Proceedings of the 17th International Conference on Software Technologies (ICSOFT 2022), pages 68-78
ISBN: 978-989-758-588-3; ISSN: 2184-2833

Therefore, we designed a second phase, which in-

volves the ﬁltering of the results produced by the ﬁrst

phase to overcome the issues. The ﬁltering involves

selecting the top n most relevant commits, where n is

an arbitrarily chosen number, and commit relevance

is determined by a metric called the relevance score.

The relevance score is a heuristic that is assigned to

each candidate introducing commit and it measures

the commit’s vulnerability introducing relevance: the

higher it is the more likely that the commit is indeed

a real introducing commit. The calculation of the relevance score and the filtering process are discussed in more detail in Section 4.

Applying the method brieﬂy explained, we gen-

erated a novel Java vulnerability introducing com-

mit dataset from the project-KB vulnerability ﬁxing

dataset and made it publicly available

together with

the tools implementing the generation process. To

summarize, the main contributions of our work are

as follows:

• We propose a two-phase method to automatically

generate accurate vulnerability introducing com-

mit datasets from vulnerability ﬁxes;

• As part of our method, we suggest a way to

measure introducing commit relevance to a ﬁxing

commit, which we refer to as the relevance score;

• We provide a toolchain that can be used to gen-

erate new vulnerability introducing datasets from

existing repositories, similar to the project-KB

dataset;

• We publish a novel Java vulnerability introduc-

ing commit dataset created from the project-KB

repository using our proposed method and tool.

The rest of the paper is organized as follows. We list the works related to ours in Section 2. Section 3 gives a motivation for our research through a running example. We describe the technical details of the proposed two-phase method for generating vulnerability introducing commit datasets in Section 4. We demonstrate the usage of the proposed method by creating a novel Java vulnerability introducing commit dataset, which is presented in Section 5. In Section 6, we enumerate the possible threats to the validity of our work, while we conclude the paper in Section 7.

https://doi.org/10.5281/zenodo.5785239

2 RELATED WORK

Lately, the number of vulnerabilities is increasing at

an alarming rate, which is mainly traceable by the dis-

closed open source software vulnerability entries. Ac-

cording to the report by WhiteSource (whi, c 14), the

number of published open source software vulnerabil-

ities in 2020 rose by over 50% compared to the previ-

ous year, from 6111 to 9658. This sharp increase is an

unambiguous indicator of the ever-growing problem

of software vulnerabilities and also shows the urgent

need to understand them better and faster.

To understand security issues, analyze them, draw conclusions, or build tools that can help manage these issues, systematically gathered collections of data are essential. There are a couple of

datasets (Gkortzis et al., 2018; Ponta et al., 2019)

available containing information about vulnerability

ﬁxes (i.e. set of commits ﬁxing a known vulnera-

bility and the source code version before and after

this ﬁx). Most of them build on the information con-

tained in the Common Vulnerabilities and Exposures

(CVE) repository (cve, v 20) of publicly disclosed

vulnerabilities. CVE provides detailed information

about the speciﬁc vulnerability, in particular a unique

identiﬁer (CVE-ID), a description and a set of public

references. The National Vulnerability Database or

NVD (nvd, v 20) contains practically all vulnerabili-

ties in CVE (except some that are pending at the time

but will be added later) and extends them with addi-

tional information such as vulnerability type (CWE)

and severity scores (CVSS).

Vulnerability ﬁxing datasets leverage the informa-

tion present in the CVE and NVD databases, which

sometimes contain links to the actual ﬁxing patches of

a vulnerability. The project-KB dataset (pro, c 14) is

part of the project-KB repository maintained by SAP.

It contains manually curated entries of vulnerability

ﬁxes in Java projects, where most of these entries have

a corresponding CVE record. The authors describe

this dataset and publish a snapshot of it in a separate

publication (Ponta et al., 2019).

Another dataset that involves automatic collection

of CVE entries from NVD is VulinOSS (Gkortzis

et al., 2018), which contains reported vulnerabilities

of 8694 open-source project versions. As part of their

research, the authors supplemented the corresponding

source code with various source code metrics.

Yunhui Zheng et al. (Zheng et al., 2021) use static

analyzer tools to generate a ﬁxing commit dataset

speciﬁcally for machine learning uses. First, they col-

lect several candidate ﬁxing commits using machine

learning methods then they use differential analysis:

they run static analysis on the before and after com-

mit versions of the ﬁxing commit. If a set of issues

detected before the ﬁxing commit disappear in the af-

ter state, they label it as positive, otherwise the ﬁxing

commit is labeled as negative.

Guru Prasad Bhandari et al. (Bhandari et al.,

2021) as the main contribution of their research pub-

lished the tool CVEFixes, which can automatically

generate a ﬁxing commit database by parsing and val-

idating every record from NVD currently available.

The initial release in 2021 contained all published

CVEs up to 9 June, covering 5365 CVE records.

Datasets of vulnerability ﬁxing commits can be

used for a wide variety of downstream tasks, such as

locating security patches (Tan et al., 2021; Li and Pax-

son, 2017; Wang et al., 2021b; Wang et al., 2021a),

vulnerable code clone detection (Woo et al., 2021;

Xiao et al., 2020), and patch presence testing (Dai

et al., 2020; Falleri et al., 2014). Tan et al. (Tan et al.,

2021) facilitate security patch detection by vulnera-

bility commit correlation ranking. They ranked com-

mits by training a RankNet model on features they

parsed from commits and CVE vulnerability entries.

Our research has a similar goal but while they focused

on ﬁxing commits, we target vulnerability introducing

commits. Additionally, they used machine learning

models to achieve the ranking, while in our work we

follow a simpler approach.

As can be seen, there are various datasets available when the task is related to vulnerability fixes. In the case of vulnerability introducing commits, however, the available resources are much scarcer.

Meneely et al. and Shin et al. investigate source code

repository metadata in relation with CVE entries (Me-

neely and Williams, 2012; Meneely et al., 2013; Me-

neely et al., 2014). Using features like code churn and

lines of code they created a database from mappings

of CVEs to commits for the Mozilla Firefox Browser,

Apache HTTP server and parts of the RHEL Linux

kernel. However, this database is not publicly acces-

sible and also not scalable, since it is manually con-

structed.

One attempt to automate this process is an approach called VCCFinder by Henning Perl et al. (Perl

et al., 2015). In their work, the authors describe

a mapping of CVEs to GitHub commits in order

to create a vulnerability contributing commit (VCC)

database. This mapping is based on a heuristic that

involves the git blame command and some ﬁltering,

such as excluding lines in documentation. This work

has a similar goal to our research, even though we

took a different approach at some points.

In contrast to VCCFinder, in our work we used

an enhanced version of the well-known SZZ al-

gorithm (Sliwerski et al., 2005), called SZZ Un-

leashed (Borg et al., 2019) to ﬁnd the introducing

commits. Given a bug-fixing commit, the SZZ algorithm uses the git blame command (which maps each line of a file to the commit that last modified it) to find the commits that last touched the lines changed by the fix. After that, additional steps are made to filter out non bug-related commits by using various information, such as the bug report date. SZZ Unleashed provides various improvements over the base SZZ algorithm detailed in their work, like line-mappings and the support for git based issue trackers. As opposed

to VCCFinder, SZZ considers more information and

it will produce a set of candidate vulnerability intro-

ducing commits, while VCCFinder produces at most

one (the one with most lines blamed). This gives us

the possibility to identify multiple commits as intro-

ducing, which is the case in many real-world prob-

lems (a vulnerability can be introduced through mul-

tiple commits). We also provide more ﬂexibility on

the introducing commit ﬁltering phase, such as choos-

ing ﬁle extension, and we also propose a way to mea-

sure the relevance of each candidate commit as well

as each related commit ﬁle’s contribution score. This

way, the user can gain insight into the ranking process

and adjust it accordingly.
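To make the blame-based idea concrete, the following is a minimal in-memory sketch, not the implementation of SZZ or SZZ Unleashed. It assumes the blame information of the pre-fix state is already available as a dictionary; the function name `candidate_introducers`, both parameter names and the commit labels are hypothetical and chosen only for illustration.

```python
def candidate_introducers(
    blame: dict[str, dict[int, str]],
    fix_changes: dict[str, list[int]],
) -> set[str]:
    """Collect the commits that last modified the lines touched by a fix.

    blame:       file path -> {line number -> hash of the commit that last
                 modified that line}, taken from the state before the fix.
    fix_changes: file path -> line numbers (in the pre-fix state) that the
                 fixing commit deleted or modified.
    """
    candidates: set[str] = set()
    for path, lines in fix_changes.items():
        file_blame = blame.get(path, {})
        for line in lines:
            sha = file_blame.get(line)
            if sha is not None:
                candidates.add(sha)
    return candidates


# Hypothetical blame data: two different commits last touched the fixed lines.
blame = {"StaxDriver.java": {10: "commit-a", 11: "commit-b", 12: "commit-a"}}
fix_changes = {"StaxDriver.java": [11, 12]}
found = candidate_introducers(blame, fix_changes)
```

Because the result is a set, a fix that touches lines last modified by different commits naturally yields several candidates, which is the property the comparison with VCCFinder relies on.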

3 OVERVIEW AND MOTIVATION

In this section we demonstrate the motivation behind

our research and give intuition on how our method

works through a running example. We discuss our

method in more detail in Section 4.

As already mentioned, the starting point of our ap-

proach is having a vulnerability ﬁxing commit (VFC)

for which we want to generate a set of introducing

commits (VIC). VFCs can be found in VFC datasets

such as project-KB (Ponta et al., 2019), and in most

of the cases a VFC can be linked to a correspond-

ing CVE id (i.e. to the actual vulnerability it ﬁxes).

One such VFC is linked to the CVE 2016-3674 (cve, c

14a) entry, a vulnerability allowing an attacker to per-

form an XML external entity attack (Herzog, 2010)

in multiple components of the XStream (xst, c 14)

project, a Java to XML serializer library. This vulner-

ability occurs in multiple ﬁles, such as Dom4JDriver,

DomDriver, SjsxpDriver, StaxDriver, and 3 more.

commit sha: c9b121a88664988ccbabd83fa27bfc2a5e0bd139
++- xstream/src/.../io/xml/StaxDriver.java

// Before applying fix
protected XMLInputFactory createInputFactory() {
    return XMLInputFactory.newInstance();
}

// After applying fix
protected XMLInputFactory createInputFactory() {
    final XMLInputFactory instance = XMLInputFactory.newInstance();
    instance.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
    return instance;
}

Figure 1: Before and after applying the changes in file StaxDriver.java in project XStream as part of fixing CVE-2016-3674.

Figure 1 shows the affected source code state before and after applying the fix in the vulnerable StaxDriver.java file (only the relevant changed source code is shown). It can be observed that the commit fixing this vulnerability simply sets an appropriate flag on the XMLInputFactory. Our goal is to find the commit that introduced the changes that led to this vulnerability, that is, the commit that yielded the "Before applying fix" state in Figure 1. In this particular example, this is the commit that added the XMLInputFactory instantiation statement without setting the IS_SUPPORTING_EXTERNAL_ENTITIES flag to false.

To achieve this, we chose to use an open-source implementation of a recent variant of the SZZ algorithm, called SZZ Unleashed (Borg et al., 2019), to find a set of possible introducing commits. SZZ (Sliwerski et al., 2005) was originally designed to automatically identify the fix-inducing changes behind the lines that are changed in a bug-fixing commit. Since the vulnerability occurs in multiple files, it is very likely that it has been introduced through multiple commits. We run SZZ Unleashed as part of our own proposed tool, called BugIntroducerMiner, to generate the VICs. BugIntroducerMiner is a simple wrapper around SZZ Unleashed and its purpose is to run SZZ Unleashed on the commits stored in a VFC dataset, in our case on project-KB. In Figure 2, we can see the results of our tool for the VFC shown in Figure 1, which has the commit hash c9b121...

As discussed before, the fix is rather simple and involves adding a line that sets a specific flag. However, due to the fact that the vulnerability occurs in multiple files, SZZ found 17 candidate introducing commits. This is hardly manageable because many practical uses prefer to have only a few (ideally just one) introducing commits for each CVE entry. To make things worse, by manually checking the candidate commits we found that many of them are false positives or contribute little to the vulnerability introduction. For example, the changes are made in comments, or in source code next to the fix location (i.e. in a neighboring row that SZZ still considers), or happened in files that the user is not interested in (configuration files instead of Java source files).

CVE-2016-3674:
  commitsWithIntroducers:
    c9b121a88664988ccbabd83fa27bfc2a5e0bd139:
      [deec01beaa1bd878f7acda9f035a39238a217ae9,
       bba4bc28e62073f9baac9c58cbc14de958df3b7e,
       72efd4a37f0ab81d2dfeb013d35ec7cbed0510b1,
       ...
       1b0f802b01632954c6ba2a6605592e3e2975f72f,
       4fd39f2f2616d4ea9e1d25d30dc78931be01dfb0,
       c9794d2f905985c8e45fa4d77525c130a5fd0a20]
  repo: https://github.com/x-stream/xstream

Figure 2: Result of running the BugIntroducerMiner tool on the VFC corresponding to CVE 2016-3674.

In the figures, we show only part of the results to remain concise. Omitted data is marked with '...'

For these reasons, we applied an additional filtering step, which we implemented as another tool, FilterBugIntroducer. The filtering is based on ranking the candidate commits according to their relevance score, which is calculated by measuring the similarity between the candidate commits and the source code state before the fixing commit. The calculated scores for our running example can be seen in Figure 3. We will detail how these scores are calculated in the following sections, but here we briefly summarize their purpose. For each candidate VIC we calculate a score called relevance score (Relevance score in the figure) that measures how relevant that commit is as vulnerability introducing. This score is calculated by aggregating the contribution scores (Contribution in the figure) for each file corresponding to the VIC. The contribution score corresponds to the file's contribution to the vulnerability. It is calculated by multiplying the file's similarity to the fixing file (calculated as the portion of identical lines in all lines) with the fixing file's base score, where the base score is a simple metric that denotes the quantity of changes that happened in that specific file relative to all the changes in the commit. In the figure, Relevance score denotes the score we used to rank the candidate VICs.
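As a quick arithmetic check of this description, the sketch below recomputes the scores of the top-ranked candidate (4fd39...) from the values reported in Figure 3. The dictionary and variable names are ours, chosen for illustration; only the numbers come from the figure.

```python
import math

# File base scores of the VFC, as reported in Figure 3.
base_scores = {
    "SjsxpDriver.java": 0.25982952983227936,
    "StandardStaxDriver.java": 0.3634863898817707,
    "StaxDriver.java": 0.2073137200989827,
    "WstxDriver.java": 0.16937036018696727,
}

# Per-file similarities of candidate 4fd39..., as reported in Figure 3.
similarities = {
    "SjsxpDriver.java": 0.7142857142857143,
    "StaxDriver.java": 0.4,
    "WstxDriver.java": 0.5714285714285714,
}

# Contribution score = base score of the fixing file * file similarity.
contributions = {
    name: base_scores[name] * similarity
    for name, similarity in similarities.items()
}

# Relevance score = sum of the contribution scores of the candidate's files.
relevance_score = sum(contributions.values())

# Compare against the relevance score printed in Figure 3.
matches_figure = math.isclose(relevance_score, 0.3653010723123453, rel_tol=1e-9)
```

The recomputed value agrees with the relevance score shown in the figure up to floating-point rounding, and the four base scores add up to 1.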

For our running example, we choose the top 2 can-

didate VICs, which are displayed in Figure 3. The

ﬁx patch with the highest relevance score can be seen

in Section 5 (Figure 8), where we further discuss the

beneﬁts of our results over plain SZZ. After manually

inspecting these commits, we can conclude that the

ﬁltering produced reasonable results:

• 4fd39... introduces the vulnerability in multiple files, for example, file SjsxpDriver.java is created in this commit and the vulnerable part has not changed until the VFC. In file StaxDriver.java, the vulnerability is introduced in the method added in this commit (i.e. the method contains the instantiation without setting the appropriate flag).

• 72efd... changed several files that are patched in the fix with multiple smaller changes whose result is changed in the VFC.

After this brief overview and motivating example, in the following sections we elaborate on how we map VFCs to sets of VICs, explain how we calculate the relevance scores, and describe how these results can be used in general.

============ CVE-2016-3674 ============
Repo: https://github.com/x-stream/xstream
SHA: c9b121a88664988ccbabd83fa27bfc2a5e0bd139
File base scores:
  SjsxpDriver.java: 0.25982952983227936
  StandardStaxDriver.java: 0.3634863898817707
  StaxDriver.java: 0.2073137200989827
  WstxDriver.java: 0.16937036018696727
Introducing commit SHAs:
...
- 4fd39f2f2616d4ea9e1d25d30dc78931be01dfb0
  - SjsxpDriver.java:
      Similarity: 0.7142857142857143
      Contribution: 0.18559252130877096
  - StaxDriver.java:
      Similarity: 0.4
      Contribution: 0.08292548803959308
  - WstxDriver.java:
      Similarity: 0.5714285714285714
      Contribution: 0.09678306296398129
  Relevance score: 0.3653010723123453
...
- 72efd4a37f0ab81d2dfeb013d35ec7cbed0510b1
  - SjsxpDriver.java:
      Similarity: 0.2857142857142857
      Contribution: 0.07423700852350838
  - StandardStaxDriver.java:
      Similarity: 0.21428571428571427
      Contribution: 0.07788994068895086
  - StaxDriver.java:
      Similarity: 0.4
      Contribution: 0.08292548803959308
  - WstxDriver.java:
      Similarity: 0.21428571428571427
      Contribution: 0.036293648611492986
  Relevance score: 0.2713460858635453

Figure 3: The calculated relevance scores per candidate VIC (Relevance score), contribution scores for each file corresponding to a VIC (Contribution), and the base scores for each file in the VFC (File base scores).

4 METHODOLOGY

One of the main contributions of this paper is a two-phase method to generate VIC datasets from VFC databases. The two phases of the method are:

1. Identifying the Vulnerability Introducing Commits (VICs): Run SZZ Unleashed for each commit in a vulnerability fixing commit (VFC) database to identify a set of candidate VICs. We implemented a tool called BugIntroducerMiner that is able to perform this phase for databases structured like project-KB.

2. Filtering: Taking the previous phase's output (the SZZ results) as input, we perform a filtering phase (using another tool we created, called FilterBugIntroducer). The output of this phase is the top n most relevant commits ranked by their relevance scores, where n is an arbitrarily chosen number.

4.1 Phase 1 - Identifying the Introducing Commits

The input to our proposed VIC extraction algorithm is, as already discussed, a VFC database. We tailored this algorithm to databases structured like the project-KB dataset, which we briefly described in Section 2; however, the general idea discussed here can easily be applied to different datasets as well.

To understand the properties of a typical VFC database, we describe the structure of the project-KB dataset (i.e. the dataset we use to demonstrate our method), which can be seen in Figure 4a. The database organizes its data into folders named after the CVE identifiers of the vulnerabilities for which fixing commits are published. Each folder contains a statement.yaml file that describes the vulnerability fixing commits found for the CVE. Figure 4b shows an example statement file for the vulnerability referenced as CVE-2008-1728. It can be seen that a vulnerability fixing entry contains some metadata, such as the textual description of the vulnerability and the CVE id, and a fixes section that identifies the VFCs through data such as the repository URL, the branch and the commit hash.

The statement.yaml files contain all the necessary information about the VFCs in the database. Our goal was to extract the VICs for each VFC entry and we made some decisions regarding the parsing:

• A statement.yaml file's fixes section can have multiple elements. This happens when a vulnerability is fixed in multiple branches or in different repositories. Usually, the master branch of the main repository is the first element of the fixes section, so we chose that as the fix. Other entries are typically the duplicates of the same fix in other branches.

• A statement.yaml file's fix (an element of the fixes section) can have multiple associated commits. This happens when a vulnerability fix involves multiple commits. In such cases, we choose the latest commit as it will contain all the previous changes, and as such it represents the final, "fixed" state.

a) Repository structure

<repo_root>/
  statements/
    CVE-2005-3745/
      statement.yaml
    CVE-2006-1546/
      statement.yaml
    ...
  LICENSE.txt
  README.md
  ...

b) Example statement.yaml file

vulnerability_id: CVE-2008-1728
notes:
- text: ConnectionManagerImpl.java in Ignite Realtime
    Openfire 3.4.5 allows remote authenticated users to
    cause a denial of service (daemon outage) by
    triggering large outgoing queues without reading
    messages.
fixes:
- id: DEFAULT_BRANCH
  commits:
  - id: c9cd1e521673ef0cccb8795b78d3cbaefb8a576a
    repository: https://github.com/igniterealtime/Openfire

Figure 4: Project-KB structure (a) with an example statement file (b).
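The two parsing decisions above can be sketched as follows. This is an illustration rather than the code of BugIntroducerMiner: it assumes the statement file has already been parsed into a plain dictionary with the keys visible in the example statement file, and it assumes that the commits of a fix are listed oldest first, so that the last list element is the latest commit. The function name `select_vfc` is hypothetical.

```python
from typing import Optional


def select_vfc(statement: dict) -> Optional[dict]:
    """Pick one vulnerability fixing commit from a parsed statement file."""
    fixes = statement.get("fixes") or []
    if not fixes:
        return None
    # Decision 1: use the first element of the fixes section.
    commits = fixes[0].get("commits") or []
    if not commits:
        return None
    # Decision 2: use the latest commit of that fix.
    # Assumption: commits are listed oldest first, so the last one is the latest.
    chosen = commits[-1]
    return {
        "cve": statement.get("vulnerability_id"),
        "repository": chosen.get("repository"),
        "sha": chosen.get("id"),
    }


# The CVE-2008-1728 entry from the example statement file, as a dictionary.
statement = {
    "vulnerability_id": "CVE-2008-1728",
    "fixes": [
        {
            "id": "DEFAULT_BRANCH",
            "commits": [
                {
                    "id": "c9cd1e521673ef0cccb8795b78d3cbaefb8a576a",
                    "repository": "https://github.com/igniterealtime/Openfire",
                }
            ],
        }
    ],
}
vfc = select_vfc(statement)
```

If the listing order does not reflect commit time in a given dataset, the last step would have to compare commit dates from the repository instead.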

After parsing the database, we get a set of VFCs

for which we try to identify VICs by running SZZ Un-

leashed, an implementation of the SZZ algorithm. An

example output of this process can be seen in Figure 2

(the complete output of this phase contains multiple

such CVE elements).

In summary, phase 1 involves parsing every vul-

nerability ﬁxing entry in the source database to get a

set of VFCs. Then, SZZ Unleashed is run on each

VFC and the results from multiple SZZ Unleashed

runs are aggregated to get a candidate VIC database.

However, as we mentioned in Section 3, the SZZ

algorithm’s results are hardly usable as is for a num-

ber of reasons:

• SZZ takes into account every change in the com-

mit and it cannot be conﬁgured to detect changes

only in a special ﬁle type. So it is possible that

a change happens in a documentation ﬁle and the

change will generate many false positive introduc-

ing commits.

• SZZ does not offer a ranking of the provided results, so if a fix is complex, involving multiple files and multiple changes per file, the number of extracted introducing commits is too large to handle, even if we exclude false positives. In such cases, a way to choose the commit with most relevance to the vulnerability would be welcome. For example, running SZZ Unleashed on CVE-2016-2141 results in a VIC set of 634 elements.

• Results are problematic to explain as no detailed

information is provided of the way the candidate

VICs were chosen, making it hard to draw conclu-

sions from them.

Taking these issues into consideration, we de-

signed a ﬁltering phase, which aims to address the

above mentioned problems.

4.2 Phase 2 - Filtering

The input for this phase is the output of the ﬁrst phase,

that is a list in which every element is a pair of VFC

and a set of candidate VICs (like in Figure 2). Our

aim in this phase is to provide a score for each candi-

date VIC that measures its relevance to the VFC. We

refer to this score as relevance score, and it can be

calculated for a pair of VFC and VIC. The score calculation algorithm is shown in Figure 5a as pseudocode.

Relevance score is produced by iterating over each

ﬁle changed in the candidate introducing commit and

identifying and comparing it to the corresponding ﬁle

in the ﬁxing commit - if it exists. This usually in-

volves an enhanced name check for equality that also

considers name changes through the git history. In

the pseudo code, the function get_by_name refers to

this action and returns the corresponding ﬁxing ﬁle or

the None value if it is not found. If the value is not

None, it is possible to continue with calculating the

contribution score as the product of the ﬁxing ﬁle’s

base score and the similarity between the ﬁxing ﬁle

and the candidate introducing commit ﬁle. The ﬁnal

relevance score is simply the sum of the contribution

scores calculated for all pairs of changed ﬁles in the

ﬁxing and introducing commits.

a) Relevance score

relevance_score = 0
for introducing_file in introducing_commit.files:
    fixing_file = fixing_commit.fixing_files.get_by_name(introducing_file)
    if fixing_file is None:
        continue
    file_similarity_score = compute_similarity(fixing_file, introducing_file)
    contribution_score = fixing_file.base_score * file_similarity_score
    relevance_score += contribution_score

b) Base score

summed_length = sum(fixing_commit.patches)
for file in fixing_commit.all_files:
    if file.isJava():
        base_score = file.patch.length / summed_length
        fixing_commit.fixing_files.add(file, base_score)

Figure 5: Pseudo code for calculating relevance score (a) for a candidate VIC (introducing commit) and calculating base score (b) while selecting Java files. In both cases the VFC (fixing commit) is given.

The similarity method (denoted as

compute_similarity in the pseudo code) can

be an arbitrary function that quantiﬁes similarity

between two texts. Here, for the sake of simplicity,

we decided to use a straightforward method: we

counted the ratio of identical lines in the two ﬁles

after excluding empty lines.
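The line-based similarity described above could be sketched as follows. This is only an illustration under stated assumptions, not the tool's actual code: the text does not specify how the ratio is normalised, so we assume that lines are compared after stripping surrounding whitespace, that repeated lines are matched as a multiset, and that the count of shared lines is divided by the line count of the longer file.

```python
from collections import Counter


def compute_similarity(fixing_text: str, introducing_text: str) -> float:
    """Ratio of identical non-empty lines shared by two texts (0.0 to 1.0)."""
    # Drop empty lines; stripping whitespace is an assumption of this sketch.
    fixing_lines = [ln.strip() for ln in fixing_text.splitlines() if ln.strip()]
    introducing_lines = [ln.strip() for ln in introducing_text.splitlines() if ln.strip()]
    if not fixing_lines or not introducing_lines:
        return 0.0
    # Multiset intersection: a repeated line matches only as often as it
    # occurs in both files.
    shared = sum((Counter(fixing_lines) & Counter(introducing_lines)).values())
    # Assumed normalisation: divide by the size of the longer file.
    return shared / max(len(fixing_lines), len(introducing_lines))
```

For example, two three-line files that share two lines receive a score of 2/3, while a file compared with an empty one receives 0.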

The other operand of the product is the base score, which estimates the share a file has in the VFC (i.e., the ratio of changed lines in the file to the total number of lines changed in the fixing patch). Every corresponding file in the candidate VIC is weighted by this score, since a VIC contributes more to the vulnerability if the files it changes play a greater part in the fix (more changes were needed in them as part of the vulnerability fix, so they have more faulty parts). This score is determined beforehand by the algorithm presented in Figure 5b. While calculating the base scores, we can filter the files of the fixing commit by type; in this particular example, only Java files contribute to the final relevance score of the VIC. Note, however, that this filter is easy to change, so the method can be extended to any file type.
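To make the base score concrete, the following sketch mirrors Figure 5b. The `ChangedFile` class, the `changed_lines` field, and the `extensions` parameter are hypothetical names introduced only for this illustration; we assume that the patch length is measured in changed lines and that, as in the figure, the total is taken over all files of the fixing commit, not only the Java ones.

```python
from dataclasses import dataclass


@dataclass
class ChangedFile:
    name: str
    changed_lines: int  # assumed measure of the file's patch length


def compute_base_scores(files, extensions=(".java",)):
    """Map each selected file to its share of the lines changed in the fix."""
    # The denominator covers every file in the commit, as in Figure 5b.
    total = sum(f.changed_lines for f in files)
    if total == 0:
        return {}
    # Only files of the selected types receive a base score.
    return {
        f.name: f.changed_lines / total
        for f in files
        if f.name.endswith(tuple(extensions))
    }
```

With a fix that changes 30 lines in `A.java`, 10 in `B.java`, and 10 in `README.md`, the base scores are 0.6 and 0.2 for the two Java files, and the README is ignored. Passing a different `extensions` tuple is how this sketch would extend the filter to other file types.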

After this second, filtering phase, the final output of the method is the proposed VIC dataset, in which each VFC is paired with a set of VICs. The set of VICs is filtered by ranking the candidates by relevance score and keeping only the top n elements. A part of the dataset extracted this way from project-KB can be seen in Figure 6 (see Section 5 for the details).
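The scoring and ranking steps of this phase can be summarised in a short sketch. It assumes that the base scores and per-file similarities have already been computed and are passed in as plain dictionaries keyed by file name; the function names and this data layout are choices made for the illustration, not the interface of the published tools.

```python
def relevance_score(base_scores, similarities):
    """Sum base_score * similarity over files changed in both commits.

    base_scores:  file name -> base score of the file in the VFC
    similarities: file name -> similarity between the candidate VIC's
                  version of the file and the fixing version
    """
    return sum(
        base_scores[name] * similarity
        for name, similarity in similarities.items()
        if name in base_scores  # files absent from the VFC contribute nothing
    )


def select_top_vics(scores_by_sha, n=2):
    """Rank candidates by relevance and keep at most n with a positive score."""
    relevant = [(sha, score) for sha, score in scores_by_sha.items() if score > 0]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [sha for sha, _ in relevant[:n]]
```

For instance, a candidate that touches a file with base score 0.6 at similarity 0.5, plus a file that is not part of the fix, obtains a relevance score of 0.3; only the file shared with the VFC counts.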

5 RESULTS

In this section, we showcase the usage of our pro-

posed method by presenting the tools we developed

to perform the two phases: BugIntroducerMiner and

FilterBugIntroducer. We also describe the dataset

extracted from the project-KB database using these

tools,

and brieﬂy discuss the improvement of our

method over simply running the plain SZZ on the ﬁx-

ing commits.

5.1 BugIntroducerMiner

To perform the first phase of our method (see Section 4.1), we developed a Java tool called BugIntroducerMiner. Information about its prerequisites and additional details (such as the exact parametrization) can be found in its README.md file located in the replication package (both the extracted VIC dataset and the tools are publicly available: https://doi.org/10.5281/zenodo.5785239).

BugIntroducerMiner is a simple Java program,

which iterates over the directory structure of a

project-KB like database and for each entry it runs

the bug introducer ﬁnder script from the SZZ Un-

leashed implementation. SZZ Unleashed is basi-

cally a toolchain, using a number of Python and

Java programs that, among other things, mine com-

mits from issue trackers, ﬁlter the results or perform

the bug introducing commit search. We use one of

these programs, the szz_find_bug_introducers-

<version_number>.jar ﬁle that searches for bug in-

troducing commits.

For each invocation of the jar ﬁle, BugIntro-

ducerMiner prepares the necessary inputs (for de-

tails, see the SZZ Unleashed repository (szz, c 14)),

for example, it clones the repository containing the

vulnerability and its ﬁx. After running the pro-

gram, we get the results in a JSON ﬁle called

fix_and_bug_introducing_pairs.json. Note

that this ﬁle contains the results for a single run of

SZZ Unleashed but we need to run SZZ Unleashed on

the whole set of VFCs and aggregate its results with

BugIntroducerMiner. An example output of this process is shown in Figure 2 (the complete output might contain multiple instances of this structure).
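The aggregation step can be pictured with the following sketch. BugIntroducerMiner itself is a Java program; this Python fragment only illustrates the idea of merging per-run results. It assumes that every run leaves its `fix_and_bug_introducing_pairs.json` in a separate subdirectory and that each file holds a JSON list of `[fixing_sha, introducing_sha]` pairs; both the directory layout and the file format are assumptions of the example.

```python
import json
from collections import defaultdict
from pathlib import Path


def aggregate_pairs(results_root):
    """Merge per-run pair files into a mapping: fixing SHA -> introducing SHAs."""
    introducers = defaultdict(set)
    pattern = "fix_and_bug_introducing_pairs.json"
    for path in sorted(Path(results_root).rglob(pattern)):
        with path.open(encoding="utf-8") as handle:
            # Assumed format: a list of [fixing_sha, introducing_sha] pairs.
            for fixing_sha, introducing_sha in json.load(handle):
                introducers[fixing_sha].add(introducing_sha)
    # Sorted lists make the aggregated output deterministic.
    return {fix: sorted(vics) for fix, vics in introducers.items()}
```

Using a set per fixing commit removes duplicate pairs that may appear when the same fix is processed in more than one run.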

5.2 FilterBugIntroducer

To perform the second phase of our method (see Section 4.2), we developed the tool FilterBugIntroducer, a Python program that calculates the relevance scores introduced in Section 4, ranks the commits based on this score, and outputs the final VIC database. As with the other tool, information regarding the setup can be found in its README.md file.

The tool iterates over every CVE entry in the input VFC database and calculates the relevance scores for all the candidate VICs. To this end, for each VFC it iterates over the corresponding VICs. For each VIC, it calculates the similarity score for the files that are also present in the VFC. To get information about the commits and their files, the tool uses the GitHub API and some URL-specific mechanisms to overcome limitations of the API, such as the limit on commit files. The tool then aggregates the data according to the method described in Section 4.2 to calculate the overall relevance score. If the relevance score is greater than zero, the commit is considered relevant. It is important to note that for each VFC we only consider the first m relevant VICs, where m can be set by the optional parameter --introducing-commit-limit (default is 30). This prevents entries with a large number of VICs from running unexpectedly long.
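One possible reading of this limit is sketched below: candidates are scored lazily in their original order, and scoring stops as soon as m relevant commits have been found. The function name and the `score` callback are hypothetical; whether the tool counts the first m relevant commits or the first m candidates overall is not spelled out here, so the sketch should be taken as an interpretation.

```python
from itertools import islice


def first_relevant(candidates, score, limit=30):
    """Return up to `limit` (sha, score) pairs whose score is positive.

    Candidates are scored lazily, so scoring stops once the limit is reached.
    """
    scored = ((sha, score(sha)) for sha in candidates)
    relevant = ((sha, value) for sha, value in scored if value > 0)
    return list(islice(relevant, limit))
```

Because the generators are evaluated on demand, the potentially expensive `score` call (which in practice involves fetching files) is never made for candidates that come after the m-th relevant commit.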

After calculating the relevance scores, the tool se-

lects the top n VICs with the highest scores, where n

can be set by the optional parameter --n (default is 2).

The ﬁnal result will be stored in the structure shown

in Figure 6 and will be saved to a YAML ﬁle named

filtered-results.yaml by default. The example

in the ﬁgure is extracted by running the tool with the

top n parameter set to 2 (i.e. at most 2 commits with

highest relevance scores are kept).

CVE-2008-1728:
  commitsWithIntroducers:
    c9cd1e521673ef0cccb8795b78d3cbaefb8a576a:
    - 6088e21ca06fb62790d9ea02faf8c884302e0cd9
  repo: https://github.com/igniterealtime/Openfire
CVE-2008-6505:
  commitsWithIntroducers:
    04fcefa44bae1263c7cad6986a9dafed67f0164f:
    - e05d71ba329337ba63784555fbbe9bb8e0290543
    - 78e853bcb32ea91b84a070b3d2dc03ab14bc6b23
  repo: https://github.com/apache/struts
...

Figure 6: The resulting VIC dataset structure (a YAML ﬁle).
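As an illustration of the structure in Figure 6, the sketch below serialises an in-memory dataset into the same layout. It emits the text by hand to stay free of third-party dependencies; a real implementation would more likely rely on a YAML library. The function name and the dictionary layout are assumptions made for this example.

```python
def dump_dataset(dataset):
    """Render {cve_id: {"commitsWithIntroducers": {...}, "repo": url}} as YAML text."""
    lines = []
    for cve_id, entry in dataset.items():
        lines.append(f"{cve_id}:")
        lines.append("  commitsWithIntroducers:")
        for fixing_sha, introducing_shas in entry["commitsWithIntroducers"].items():
            # Each fixing commit maps to the list of its selected VICs.
            lines.append(f"    {fixing_sha}:")
            lines.extend(f"    - {sha}" for sha in introducing_shas)
        lines.append(f"  repo: {entry['repo']}")
    return "\n".join(lines) + "\n"


example = {
    "CVE-2008-1728": {
        "commitsWithIntroducers": {
            "c9cd1e521673ef0cccb8795b78d3cbaefb8a576a": [
                "6088e21ca06fb62790d9ea02faf8c884302e0cd9",
            ],
        },
        "repo": "https://github.com/igniterealtime/Openfire",
    },
}
print(dump_dataset(example), end="")
```

Running the example prints the first entry of Figure 6, with one fixing commit and its single introducing commit.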

Some notes regarding the usage of FilterBugIntro-

ducer:

• Caching: The program generates heavy HTTP traffic (for the project-KB database it accesses over 130,000 HTTP URLs), mainly in the GitHub domain. To avoid unnecessary traffic and to enable multiple runs of the tool, we implemented a caching mechanism. If caching is enabled (as it is by default), the tool can be re-run with different parameters without generating additional traffic (for example, to extract a new dataset with a different VIC limit). However, the cache might take up considerable hard disk space (for project-KB, it takes up 13 GB).

• Documenting: By setting the --document argu-

ment to a path on the ﬁle system, the tool will

output the relevance and contribution scores while

calculating them for the VICs. An excerpt of

such a ﬁle can be seen in Figure 3, where we see

the scores generated for the VFC linked to CVE-

2016-3674.
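The caching idea from the notes above can be sketched as a small disk cache keyed by the hash of the requested URL. The `UrlCache` class and the injected `fetch` callable are hypothetical and serve only to show the principle; how FilterBugIntroducer actually lays out its cache is not described here.

```python
import hashlib
from pathlib import Path


class UrlCache:
    """Store fetched response bodies on disk so repeated runs cause no new traffic."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable: url -> bytes, e.g. an HTTP GET wrapper

    def get(self, url):
        # A hash of the URL gives a fixed-length, file-system-safe key.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        path = self.cache_dir / key
        if path.exists():
            return path.read_bytes()
        body = self.fetch(url)
        path.write_bytes(body)
        return body
```

Since every distinct URL is stored as a separate file, a cache of this kind grows with the number of accessed URLs, which is consistent with the considerable disk usage mentioned in the caching note.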

5.3 Vulnerability Introducing Commit

Database from project-KB

Our final contribution is a VIC dataset extracted from the project-KB VFC database with the tools implementing our proposed method. The VIC dataset follows the structure shown in Figure 6 and contains 564 VFC entries, each with at least one and at most two VICs assigned to it, whereas the unfiltered SZZ-generated dataset had between 1 and nearly 700 VIC entries for each VFC. While generating the dataset, more than 110,000 files were considered (corresponding to fixing and introducing commits) from 198 open source projects.

To demonstrate our approach, we present two

hand-picked examples to highlight the method’s ef-

fectiveness:

• CVE-2016-3674: This vulnerability is already described in Section 3; here, we elaborate further on the impact of our filtering on the VIC list extracted by SZZ. Recall that the fix to this vulnerability can be seen in Figure 1. We also mentioned that SZZ Unleashed generated 17 candidate commits, two of which are shown in Figure 7. As can be seen, the changes in commits deec... and 3adb... are clearly unrelated to the vulnerability, and as such their relevance scores are lower than those of the selected commits (deec... has a relevance score of 0.055, 3adb... has 0.014). Moreover, these commits changed only one of the 7 files that are part of the vulnerability.

Our method chose commit 4fd3..., which has the highest relevance score (see Figure 3). Figure 8 shows that it is indeed the commit that introduced the vulnerability in the StaxDriver.java file by instantiating an object without setting the appropriate flag. Furthermore, this commit also changed two other files that contributed to the vulnerability in relevant places.

• CVE-2016-2141: This vulnerability is fixed in a commit (cve, c 14b) that spans a large number of files (77), some of which are not Java source files (since project-KB is a Java vulnerability dataset, non-Java files should not be considered). Running SZZ Unleashed on this fix (as part of running BugIntroducerMiner) generates 634 candidate introducing commits. Such a high number of VICs is unacceptable for most applications, so filtering is essential. Using our method, we can conclude that the most relevant VIC has the SHA e2453... (https://github.com/belaban/JGroups/commit/e24538a4590684d910dbdac8762c85881f519dd5), with a relevance score of 0.23. This indeed seems to be a good pick, as it is associated with 12 vulnerable files, mostly with changes that are present in the fixing commit (usually because these files were created in this commit and the vulnerable parts have never been changed since).

deec01beaa1bd878f7acda9f035a39238a217ae9
++- xstream/src/.../io/xml/StaxDriver.java
-   private boolean repairingNamespace = false;
+   /**
+    * @deprecated since 1.2, use an explicit call to {@link #setRepairingNamespace(boolean)}
+    */
    public StaxDriver(QNameMap qnameMap, boolean repairingNamespace) {
-       this(qnameMap, repairingNamespace, new XmlFriendlyReplacer());
+       this(qnameMap, new XmlFriendlyReplacer());
+       setRepairingNamespace(repairingNamespace);
...

3adb51d6c3a1a20adf88f091b200dde676d10352
++- xstream/src/.../io/xml/StaxDriver.java
+   import com.thoughtworks.xstream.io.HierarchicalStreamDriver;
    import com.thoughtworks.xstream.io.HierarchicalStreamReader;
+   import java.io.InputStream;
+   import java.io.OutputStream;
    public HierarchicalStreamReader createReader(Reader xml) {
+       loadLibrary();
+       try {
+           return new StaxReader(qnameMap, createParser(xml));
+       }
...

Figure 7: Parts of two commits falsely identified as vulnerability introducing by SZZ Unleashed for CVE-2016-3674.

4fd39f2f2616d4ea9e1d25d30dc78931be01dfb0
++- xstream/src/.../io/xml/StaxDriver.java
+   protected XMLInputFactory createInputFactory() {
+       return XMLInputFactory.newInstance();
+   }

Figure 8: Part of an introducing commit for CVE-2016-3674, which is selected with the highest relevance score.

6 THREATS TO VALIDITY

The provided vulnerability introducing dataset has not been fully validated manually; therefore, we cannot rule out the possibility that it includes false positive vulnerability introducing commits. To mitigate this threat, we performed manual validation on a small random sample, which confirmed that all the introducing commits included in the sample are correct. Nonetheless, a complete manual validation is among our future plans.

The dataset contains at most two vulnerability introducing commits for each vulnerability fix. There is a chance that there are more valid introducing commits that we omit from the dataset. However, we also provide the dataset extraction tools, with which the dataset can be re-generated with an adjusted number of introducing commits.

As the published vulnerability introducing dataset is extracted from the project-KB database, its quality and accuracy influence our dataset. However, such problems in project-KB are highly unlikely, as it is a manually curated dataset; therefore, this threat has a very low probability.

7 CONCLUSIONS AND FUTURE

WORK

In our work, we focused on source code-related vul-

nerability datasets, which are fundamental building

blocks of vulnerability scanning and detection meth-

ods. Although datasets containing ﬁxing patches for

some vulnerabilities already exist for various pro-

gramming languages, there is a lack of so-called vul-

nerability introducing commit datasets, which would

be essential for creating and validating just-in-time

vulnerability detection approaches.

To address this issue, we proposed a novel method

that maps vulnerability ﬁxing commits (VFCs) to a

set of vulnerability introducing commits (VICs) using

a recent implementation of the well-known SZZ al-

gorithm. Empirical results show that applying SZZ in

itself introduces a lot of false-positive commits; there-

fore, we extended the algorithm with an additional ﬁl-

tering phase. We deﬁned a so-called relevance score

for each commit that quantiﬁes the level of connection

between a ﬁxing and an introducing commit in terms

of common ﬁles and source code they affect. With

this relevance score, we were able to rank introduc-

ing commits reliably and perform ﬁltering by keeping

only the highest-ranked elements.

We implemented our approach and published it as

two tools (implementing the two phases) described in

detail. To demonstrate the usage of these tools, we ran

them on a VFC database called project-KB and as our

main contribution, we extracted and published a new

vulnerability introducing dataset based on project-

KB. We manually inspected a sample of the produced

results and concluded that our method: i) correctly

assigns the highest scores to commits that introduce

vulnerable code parts, and ii) commits ranked at the

bottom are irrelevant for introducing the vulnerable

behavior.

Despite the encouraging ﬁrst results, there are

some possible directions that we would like to address

in future work:

• We chose a simple approach to measure similarity between texts. Investigating other ways of quantifying similarity between source code files could probably increase the accuracy of the method.

• Taking inspiration from the work of Tan et al. (Tan

et al., 2021), the ranking could be performed

with the use of machine learning models such as

RankNet. It would probably increase the resource

usage in exchange for a possibly more robust and

accurate method.

• Manually validating all the extracted VICs would

improve the conﬁdence in the dataset quality and

further strengthen the validity of our proposed

method.

• Building efﬁcient just-in-time vulnerability detec-

tion algorithms based on machine learning models

trained on the extracted VICs dataset.

ACKNOWLEDGMENTS

The presented work was carried out within the SETIT project (2018-1.2.1-NKP-2018-00004), supported by project TKP2021-NVA-09, and the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program (MILAB). The research was partly supported by the EU-funded project AssureMOSS (Grant no. 952647) as well.

Project no. 2018-1.2.1-NKP-2018-00004 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the 2018-1.2.1-NKP funding scheme. Project TKP2021-NVA-09 was implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021-NVA funding scheme.

Furthermore, Péter Hegedűs was supported by the Bolyai János Scholarship of the Hungarian Academy of Sciences and the ÚNKP-21-5-SZTE-570 New National Excellence Program of the Ministry for Innovation and Technology.

REFERENCES

(2021. dec. 14.a). CVE-2016-3674: https://nvd.nist.gov/vuln/detail/cve-2016-3674.

(2021. dec. 14.b). JGroups fixing commit: https://github.com/belaban/jgroups/commit/38a882331035ffed205d15a5c92b471fd09659c.

(2021. dec. 14.). SAP - Project KB: https://github.com/sap/project-kb/tree/master/vulnerability-data.

(2021. dec. 14.). The state of open source vulnerabilities 2021: https://www.whitesourcesoftware.com/resources/research-reports/the-state-of-open-source-vulnerabilities/.

(2021. dec. 14.). SZZ Unleashed: https://github.com/wogscpar/szzunleashed.

(2021. dec. 14.). XStream: https://github.com/x-stream/xstream.

(2021. nov. 20.). The MITRE Corporation - Common Vulnerabilities and Exposures: https://www.cve.org/.

(2021. nov. 20.). U.S. National Institute of Standards and Technology - National Vulnerability Database: https://nvd.nist.gov/.

Amin, A., Eldessouki, A., Magdy, M. T., Abdeen, N.,

Hindy, H., and Hegazy, I. (2019). Androshield: Au-

tomated android applications vulnerability detection,

a hybrid static and dynamic analysis approach. Infor-

mation, 10(10).

Bhandari, G. P., Naseer, A., and Moonen, L. (2021).

Cveﬁxes: Automated collection of vulnerabilities

and their ﬁxes from open-source software. CoRR,

abs/2107.08760.

Borg, M., Svensson, O., Berg, K., and Hansson, D. (2019).

Szz unleashed: an open implementation of the szz

algorithm - featuring example usage in a study of

just-in-time bug prediction for the jenkins project.

Proceedings of the 3rd ACM SIGSOFT International

Workshop on Machine Learning Techniques for Soft-

ware Quality Evaluation - MaLTeSQuE 2019.

Cao, S., Sun, X., Bo, L., Wei, Y., and Li, B. (2021).

Bgnn4vd: Constructing bidirectional graph neural-

network for vulnerability detection. Information and

Software Technology, 136:106576.

Dai, J., Zhang, Y., Jiang, Z., Zhou, Y., Chen, J., Xing, X.,

Zhang, X., Tan, X., Yang, M., and Yang, Z. (2020).

BScout: Direct Whole Patch Presence Test for Java

Executables. USENIX Association, USA.

Falleri, J.-R., Morandat, F., Blanc, X., Martinez, M., and

Monperrus, M. (2014). Fine-grained and accurate

source code differencing. ASE 2014 - Proceedings of

the 29th ACM/IEEE International Conference on Au-

tomated Software Engineering.

Gkortzis, A., Mitropoulos, D., and Spinellis, D. (2018).

Vulinoss: A dataset of security vulnerabilities in open-

source systems. In 2018 IEEE/ACM 15th Interna-

tional Conference on Mining Software Repositories

(MSR), pages 18–21.

Herzog, S. (2010). Xml external entity attacks (xxe). Re-

trieved October, 13:2013.

Li, F. and Paxson, V. (2017). A large-scale empirical study

of security patches. In Proceedings of the 2017 ACM

SIGSAC Conference on Computer and Communica-

tions Security, CCS ’17, page 2201–2215, New York,

NY, USA. Association for Computing Machinery.

Li, H., Kim, T., Bat-Erdene, M., and Lee, H. (2013).

Software vulnerability detection using backward trace

analysis and symbolic execution. In 2013 Interna-

tional Conference on Availability, Reliability and Se-

curity, pages 446–454.

Meneely, A., Srinivasan, H., Musa, A., Tejeda, A. R.,

Mokary, M., and Spates, B. (2013). When a patch

goes bad: Exploring the properties of vulnerability-

contributing commits. In 2013 ACM / IEEE Interna-

tional Symposium on Empirical Software Engineering

and Measurement, pages 65–74.

Meneely, A., Tejeda, A. C. R., Spates, B., Trudeau, S., Neu-

berger, D., Whitlock, K., Ketant, C., and Davis, K.

(2014). An empirical investigation of socio-technical

code review metrics and security vulnerabilities. In

Proceedings of the 6th International Workshop on So-

cial Software Engineering, SSE 2014, page 37–44,

New York, NY, USA. Association for Computing Ma-

chinery.

Meneely, A. and Williams, O. (2012). Interactive churn

metrics: socio-technical variants of code churn. ACM

SIGSOFT Software Engineering Notes, 37:1–6.

MITRE Corporation (2021. nov. 21.). CVE - Common

Vulnerabilities and Exposures. https://cve.mitre.org/.

[Online; accessed 29-April-2020].

Perl, H., Dechand, S., Smith, M., Arp, D., Yamaguchi,

F., Rieck, K., Fahl, S., and Acar, Y. (2015). Vc-

cﬁnder: Finding potential vulnerabilities in open-

source projects to assist code audits. In Ray, I., Li,

N., and Kruegel, C., editors, Proceedings of the 22nd

ACM SIGSAC Conference on Computer and Commu-

nications Security, Denver, CO, USA, October 12-16,

2015, pages 426–437. ACM.

Ponta, S. E., Plate, H., Sabetta, A., Bezzi, M., and Dan-

gremont, C. (2019). A manually-curated dataset of

ﬁxes to vulnerabilities of open-source software. In

Proceedings of the 16th International Conference on

Mining Software Repositories.

Sliwerski, J., Zimmermann, T., and Zeller, A. (2005). When

do changes induce ﬁxes? volume 30.

Tan, X., Zhang, Y., Mi, C., Cao, J., Sun, K., Lin, Y.,

and Yang, M. (2021). Locating the security patches

for disclosed oss vulnerabilities with vulnerability-

commit correlation ranking. In Proceedings of the

2021 ACM SIGSAC Conference on Computer and

Communications Security, CCS ’21, page 3282–3299,

New York, NY, USA. Association for Computing Ma-

chinery.

Wang, X., Wang, S., Feng, P., Sun, K., and Jajodia,

S. (2021a). Patchdb: A large-scale security patch

dataset. In 2021 51st Annual IEEE/IFIP Interna-

tional Conference on Dependable Systems and Net-

works (DSN), pages 149–160.

Wang, X., Wang, S., Feng, P., Sun, K., Jajodia, S., Ben-

chaaboun, S., and Geck, F. (2021b). Patchrnn: A

deep learning-based system for security patch identi-

ﬁcation.

Woo, S., Park, S., Kim, S., Lee, H., and Oh, H. (2021).

Centris: A precise and scalable approach for identify-

ing modiﬁed open-source software reuse. In Proceed-

ings of the 43rd International Conference on Software

Engineering, ICSE ’21, page 860–872. IEEE Press.

Xiao, Y., Chen, B., Yu, C., Xu, Z., Yuan, Z., Li, F., Liu, B.,

Liu, Y., Huo, W., Zou, W., and Shi, W. (2020). MVP:

detecting vulnerabilities using patch-enhanced vulner-

ability signatures. In Capkun, S. and Roesner, F., edi-

tors, 29th USENIX Security Symposium, USENIX Se-

curity 2020, August 12-14, 2020, pages 1165–1182.

USENIX Association.

Zheng, Y., Pujar, S., Lewis, B. L., Buratti, L., Epstein,

E. A., Yang, B., Laredo, J., Morari, A., and Su, Z.

(2021). D2A: A dataset built for ai-based vulnerability

detection methods using differential analysis. CoRR,

abs/2102.07995.

ICSOFT 2022 - 17th International Conference on Software Technologies