PreSTyDe: Improving the Performance of within-project Defects

Prediction by Learning to Classify Types of Software Faults

Gabriela Czibula

, Ioana-Gabriela Chelaru

, Arthur Molnar

and Istvan Gergely Czibula

Department of Computer Science, Babes¸-Bolyai University, Cluj-Napoca, Romania

Keywords:

Software Defect Prediction, Machine Learning, Unsupervised Learning, Deep Neural Network.

Abstract:

Software defect prediction (SDP) is an important task within software development. It is a challenging activity,

as the detection of software modules that are prone to malfunction in new versions of software contributes to

an improved testing process and also increases the quality of the software. In this paper, we propose a two-

stage hybrid approach for predicting the error-proneness of the application classes in an upcoming version of

a software project by employing a taxonomy of defects unsupervisedly uncovered from the previous software

releases. The ﬁrst stage of the proposed approach consists of an unsupervised labelling of software defects

from the available versions of the analysed software system. During the second stage, a supervised classiﬁer

is used to predict the error proneness during the software project’s evolution employing the taxonomies of

defects uncovered in the previous stage. Experiments carried out with Calcite software in a SDP scenario

within a project highlighted that the performance of predicting software defects during a project evolution

increases by approximately 5%, in terms of the average Area under the Receiver Operating Characteristic

curve, by developing predictors for different classes of software defects.

1 INTRODUCTION

Software defect prediction (SDP) represents an active

research area in the search-based software engineer-

ing ﬁeld, being an important task within software de-

velopment. With the current expansion of the pro-

gramming industry, SDP is receiving increased atten-

tion from the scientiﬁc community. Software defects

are errors such as logic or implementation faults that

can cause the system to behave incorrectly or pro-

duce erroneous results. Thus, the detection of soft-

ware modules that are prone to malfunction in new

versions of software contributes to an improved test-

ing process and also increases the quality of the soft-

ware. SDP is helpful in process management and to

improve software maintenance and evolution.

As SDP provides us with difﬁcult learning con-

texts, most models may not be suitable for such a

task. Firstly, the principal characteristic of SDP is

that we are given a codebase, along with its track-

ing history of issues and bugs, and we are required

to detect software bugs in the code of other reposito-

https://orcid.org/0000-0001-7852-681X

https://orcid.org/0000-0002-9274-6349

https://orcid.org/0000-0002-4113-2953

https://orcid.org/0000-0003-0076-584X

ries, follow the evolution of the system, and suggest

revisions, with the ﬁnal goal of having integrated such

intelligent algorithms into our daily IDEs. Secondly,

most software project releases have very few bugs,

thanks to quality control; therefore, the defect class

is considerably under-represented. That may result in

classiﬁers that simply return the majority (non-defect)

class, without analysing the data at all, and having

very high accuracy despite poor precision and recall

values. However, Machine learning (ML) is still an

effective tool for complex classiﬁcation tasks. Fur-

thermore, more sophisticated deep learning (DL) has

consistently had groundbreaking improvements in re-

cent months. Two main directions have been investi-

gated in the ﬁeld of SDP (Wang et al., 2016). As a ﬁrst

line of research, the performance of machine learn-

ing techniques has been intensively investigated. In

terms of binary classiﬁers proposed to detect software

faults, there is a wide range of approaches, from con-

ventional ML predictors (Linear Regression, Deci-

sion Trees, Artiﬁcial Neural Networks, Support Vec-

tor Machines, Ensemble Learning (Zhou et al., 2022))

to recurrent neural networks (RNNs) (Pachouly et al.,

2022) and DL models (Miholca et al., 2022).

In SDP, another direction of great interest for the

research community concerns the input features. Re-

214

Czibula, G., Chelaru, I., Molnar, A. and Czibula, I.

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults.

DOI: 10.5220/0012732300003687

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2024), pages 214-225

ISBN: 978-989-758-696-5; ISSN: 2184-4895

cent feature engineering literature reveals many ap-

proaches proposed to learn features from software ar-

tifacts through DL models: Deep Belief Networks

(Wang et al., 2016), Convolutional Neural Networks

(CNNs) (Li et al., 2017), Long-Short Term Neural

Networks , CNNs combined with Graph Neural Net-

works (Zhou et al., 2022), etc.

Although having the upper hand in feature ex-

traction, DL models can still struggle to classify im-

balanced data, and therefore one-class classiﬁcation

(OCC) adapted ML models were also investigated as

a possible solution to improve defect prediction qual-

ity. Even if one-class predictive models are employed

in the ML literature to address various unbalanced

classiﬁcation tasks, the literature on the use of OCC

models for SDP is still scarce (Chen et al., 2016)

(Moussa et al., 2022) (Ciubotariu et al., 2023) and the

results are not reliable. One-class SVM (OCSVM)

models trained on one class of instances (defective

or non-defective) were investigated in various SDP

scenarios and data sets, but the results revealed poor

performance mainly for the within-project scenario

(Ciubotariu et al., 2023) (Moussa et al., 2022). The

binary classiﬁers such as Random Forest ﬁnetuned

seem to still have a better performance than the OCC

models. Recently, (Zhang et al., 2022) proposed the

ADGAN-SDP model for anomaly detection based on

Bidirectional Generative Adversarial networks. The

proposed model was trained on non-defects, and it

was designed to capture relevant features for the

non-faulty class. The results show that GAN-based

models achieve good results, provided that there are

enough training data samples.

Besides the data imbalancement problem, which

signiﬁcantly affects the performance of supervised

classiﬁers, another important issue in SDP is related

to the feature-based representation of the software en-

tities (modules, classes, components, etc.). In most

open source data sets used in the SDP literature (data

sets from public repositories such as NASA (Shep-

perd et al., 2018) or SeaCraft (Menzies et al., 2017)),

software metrics are used to characterise the software

entities, and these data sets may not be appropriate for

training models that must perform in real-world sce-

narios. Recent studies in the SDP literature (Miholca

et al., 2022) highlighted that features extracted from

the source code may be more relevant to distinguish

between faults and non-faults. Another important is-

sue revealed by (Czibula et al., 2023) is that, while

defects can be of various types (e.g. numeric errors,

pointer issues, security vulnerabilities, etc.) and it is

very likely that each defect type has its own prop-

erties, existing SDP approaches consider all defect

types together and try to come up with a universal de-

fect prediction model. By considering speciﬁc types

of defects, it is very likely to increase the speciﬁcity

of defect predictors, as we may expect that speciﬁc

classes of faults could have common characteristics.

The current research starts from our previous ﬁnd-

ings (Czibula et al., 2023) in which we emphasised

that speciﬁc types of software defects have particular

behaviours and introduced, as a proof of concept, an

unsupervised learning (UL) based approach for min-

ing behavioural patterns for speciﬁc classes of soft-

ware defects. In this paper, we make a further step

towards our broader goal of developing defect predic-

tors for particular defect types by introducing, as a

proof of concept, a two-stage hybrid approach named

PreSTyDe (Predictor for Speciﬁc Types of Defects) to

validate our hypothesis that the performance of pre-

dicting software defects during software project evo-

lution (within-project SDP scenario) would improve

by developing predictors for speciﬁc classes of soft-

ware defects. The ﬁrst stage of the proposed approach

consists of an unsupervised labelling of software de-

fects from the available versions of the analysed soft-

ware system. Then, a supervised classiﬁer is used to

predict the error proneness of the software entities in

an upcoming version of the software project by em-

ploying the taxonomy of defects unsupervisedly un-

covered from the previous versions. Experimental

evaluation is carried out with Apache Calcite (Begoli

et al., 2018) software, an open source framework for

data management. To the best of our knowledge, our

proposed approach is new in the SDP literature. In

summary, in this paper we aim to answer the follow-

ing research questions:

RQ1. Could the performance of predicting the error-

proneness of the software entities in a spe-

ciﬁc version of a software project be enhanced

by employing a taxonomy of defects unsuper-

visedly uncovered from the previous software

versions?

RQ2. Does the proposed approach result in im-

proved software defect prediction performance

in the case of a complex, open-source applica-

tion?

The response to RQ1 and RQ2 will help guide

the effort to create more speciﬁc SDP models for real

life software projects and give an insight into the va-

lidity of our assumption that more performant SDP

models can be developed if the models are more spe-

ciﬁc to a set of defect types. Such models developed

for speciﬁc defect types may be better customized

to express the speciﬁcity of different types of de-

fects, which may require different kind of informa-

tion from the software system to be used as input into

the model. For instance, for defects related to some

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults

215

arithmetic overﬂow, the model should be fed with in-

struction and expression level information from the

source code, while if the defects are related to some

incorrect/ambiguous speciﬁcation, the model should

use the speciﬁcations and comments from the code or

other sources as input for the model.

The remainder of the paper is organised as fol-

lows. Section 2 introduces the PreSTyDe approach

and the methodology employed. Section 3 presents

our case study used to assess the performance of

PreSTyDe, then the experimental results and research

ﬁndings are presented, discussed, and compared with

related work in Section 4. Section 5 discusses the

threats to the validity of the study, while Section 6

presents directions to further extend our current work.

2 METHODOLOGY

We introduce the PreSTyDe two-stage hybrid ap-

proach for predicting the error-proneness during a

software project’s evolution employing a taxonomy of

software defects learnt from the available releases of

the software. Even if the literature contains several

defect type classiﬁcations and taxonomies (e.g., the

Orthogonal Defect Classiﬁcation proposed by IBM,

Defect Origins, Types and Modes proposed by HP or

IEEE Standard Classiﬁcation for Software Anoma-

lies), none of these taxonomies became widely ap-

plied in practice (Wagner, 2008) or employed in the

SDP research literature. In addition, labelling defects

according to such taxonomies is not a common prac-

tice in the industry yet.

As the results of existing SDP research have not

been adopted within the industry, it is clear that the

state of the art can be further improved. Thus, instead

of employing an existing taxonomy of software de-

fects, we propose learning such a taxonomy of faults

from the available releases of the analysed software

project. Such an approach of uncovering, through un-

supervised learning, the taxonomy of defects speciﬁc

to a certain software project may offer higher ﬂexibil-

ity and may increase the SDP performance.

Empirical studies have shown that the frequency

of software defects and their type can differ between

applications. In a study that focused on server-side

software, (Sahoo et al., 2010) showed differences

between the types of defects encountered in several

popular open-source applications. Three open-source

GUI-driven applications were statically analysed in

(Molnar and Motogna, 2020), with the authors con-

cluding that over the lifetime of the target appli-

cations, defect prevalence and distribution remained

application-speciﬁc.

2.1 Problem Statement and

Formalisation

In the general SDP task formalised as a binary clas-

siﬁcation problem, two target classes are given: the

class of software defects (the positive class, labelled

as 1) and the class of non-defects (the negative class,

labelled by 0). The goal of a supervised defect predic-

tor is to build a classiﬁer (from a training data set of

software entities labelled with 0 or 1) capable of pre-

dicting if a certain software entity is likely to belong

to a positive or negative target class.

Let us consider an object-oriented software sys-

tem S having multiple versions S

...S

, with

, k ̸= 1 being the current software version un-

der development. We assume that for previous

software versions V

···V

k−1

historical data D =

...D

k−1

} is available, where D

, 1 ≤ j ≤

k −1 consists of software entities (application classes,

in our approach) from S

labelled with their class

(defective - positive or non-defective - negative). In

our approach, a label of 0 is assigned to the negative

class, while the positive class is labelled as 1.

In the data sets D

, ∀1 ≤ j ≤ k − 1, each applica-

tion class from the j-th version of S (S

) is charac-

terised by a set of m features considered to be relevant

for discriminating between the defective and non-

defective application classes. Thus, each application

class c ∈ D

is represented as an m-dimensional nu-

merical vector c = (c

,...,c

), where c

(∀1 ≤ i ≤ m)

represents the value of the feature f

computed for the

application class c.

Our approach consists of two main stages. During

the ﬁrst stage, a clustering algorithm is applied to soft-

ware defects from previous versions V

,··· ,V

k−1

of the analysed software system and the software de-

fects will be labelled according to the identiﬁed par-

tition {P

,...,P

}. Then, a supervised classiﬁer is

used to predict the proneness to errors of the soft-

ware entities in the current version V

of the software

project by employing the taxonomy of defects unsu-

pervisedly uncovered at the previous stage. There-

fore, the goal of our formalised SDP task as a multi-

class classiﬁcation problem is to determine (from the

available training data D ) an approximation

f of the

function f : D

→ {0,1,...,nc} that will assign for

each application class c ∈ S

the label 0 if c is not

error-prone and the label i (i ∈ [1,nc]) if c belongs to

the type of defects represented by the cluster P

2.2 Data Representation

As previously shown in Section 2.1, the application

classes are represented as vectors of high-dimensional

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

216

real-valued features. Existing approaches from the

SDP literature reveal that these feature vectors may

express structural characteristics of the software en-

tities through software metrics or conceptual charac-

teristics directly extracted from the source code.

The feature-based representation employed in our

approach is motivated by recent work (Miholca et al.,

2022) which emphasised that conceptual-based fea-

tures unsupervisedly learnt from the source code are

more informative than structural-based ones, those

based on software metrics. The extensive perfor-

mance evaluation conducted by (Miholca et al., 2022)

on all releases of Apache Calcite software revealed

that natural language-based models such as Doc2Vec

(Le and Mikolov, 2014) and Latent Semantic Index-

ing (LSI) (Deerwester et al., 1990) are capable of pro-

viding semantic features that are suitable for discrim-

inating between faulty and non-faulty software enti-

ties. Both models, Doc2Vec and LSI, are used by

(Miholca et al., 2022) to represent the source codes

as numeric vectors of a predeﬁned length to cap-

ture the semantic characteristics of the code. Fur-

thermore, the concatenation of Doc2Vec and LSI rep-

resentations (further denoted as Doc2Vec+LSI) pro-

vided higher predictive performance than when using

only Doc2Vec or LSI semantic representations.

Consequently, the vectorial representation

,...,c

ℓ+p

) of an application class c is obtained

by concatenating the ℓ-dimensional conceptual

vector representing c in the Doc2Vec space with the

p-dimensional conceptual vector representing c in

the LSI space.

2.3 Building the PreSTyDe Classiﬁer

As discussed in Section 2.1, PreSTyDe is a hybrid

classiﬁer which is trained on historical data D =

...D

k−1

} available in versions 1, 2, . ..,k −

1 of the software system under analysis in two stages,

an unsupervised classiﬁcation stage followed by a su-

pervised classiﬁcation one.

Unsupervised classiﬁcation stage. During this stage,

a k-means clustering algorithm is applied on the set

De f of defective instances from the set

k−1

[

i=1

. We

chose the k-means method for determining groups

of defective application classes due to its ability to

minimise the distance between the defective instances

within a group, but also due to its high ﬂexibility and

quick prototyping.

The number nc of clusters used in k-means is cal-

culated using the Elbow method (Shi et al., 2021). To

further validate the results we obtained with the El-

bow method, we also employed the Silhouette score

(Shi et al., 2021). After the optimal number of clus-

ters (nc) was determined in the defective instances

data set, the clustering step was performed using the

k-means algorithm. After determining the partition

,...,P

of software defects, the defective in-

stances of De f are relabelled with labels 1,2,...nc so

that all the defects from partition P

(1 ≤ j ≤ nc) will

be labelled with j. After this relabelling stage, the de-

fective application classes of D which initially were

labelled as 1 (defective) will now have multiple labels

(from 1 to nc). This multi-class labelling of software

defects suggests that instead of identifying all defects

as a whole, the defects are split into categories which

may represent speciﬁc types of faults. We note that

the relabelling procedure is applied only for the de-

fective instances; thus, all the non-defective instances

from the D remain with their initial label (i.e., 0).

Denoting by P

the set of non-defective applica-

tion classes, the unsupervisedly uncovered partition

of D is {P

,...,P

}, such that D =

[

i=0

. In this

partition, the set P

corresponds to non-faulty appli-

cation classes, while P

(1 ≤ i ≤ nc) each represent a

speciﬁc type of defect.

Supervised classiﬁcation stage. The goal of this

stage is to build a multi-class classiﬁcation model

classi f ier which, for a new application class ex-

pressed as a conceptual feature vector in the concate-

nated Doc2Vec and LSI space (as shown in Section

2.2) will be able to predict a class from {0, 1, . . . , nc}

(0 representing non-defect, and c ∈ {1,...,nc} repre-

senting a speciﬁc type of software defect).

The prediction model classi f ier may be: (1) a

multi-class classiﬁcation model (e.g., deep neural net-

work, support vector machine, etc) trained on the re-

labeled data set D in which the labels of the defec-

tive entities are updated according to the unsupervised

learning stage; and (2) the previously uncovered clus-

tering model itself (thought as an online clustering

model), in which the new instance (application class)

will be assigned to the cluster P

(1 ≤ i ≤ nc ) which is

the most similar to it (considering a speciﬁc distance

metric) and thus labeled with i.

The algorithmic description of the training step of

PreSTyDe is given in Algorithm 1.

3 CASE STUDY

The case study proposed in our work targets Apache

Calcite, an open-source framework for data manage-

ment. Its use as a case study target is relatively new,

being introduced by (Herbold et al., 2022) and em-

ployed in recent SDP researches (Ciubotariu et al.,

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults

217

Algorithm 1: The training stage of PreSTyDe.

Function Training(D) is

Input: D = D

...D

k−1

the training data set in the form (c, ℓ

) - c is an application class

represented as a conceptual numerical vector in the concatenated Doc2Vec and LSI space (see

Section 2.2) and ℓ

∈ {0, 1} is c’s label (0: non-defect; 1 - defect).

Output: the classiﬁcation model PreSTyDe

/* Stage 1. Unsupervised classification stage */

/* Determine the subset De f of defective application classes from D */

De f ← {(c|c ∈

k−1

[

i=1

, ℓ

= 1} /* Determine a partition P = {P

,...,P

c} of De f using

the k-means clustering method */

nc ←ElbowMethod(De f )

P ←k-means(De f , nc)

/* Extend the partition with the cluster of non-defective application classes */

← {(c|c ∈

k−1

[

i=1

, ℓ

= 0}

/* Recompute the labels of the defective instances from D according to the

identified partition P */

for i ← 1, nc do

for c ∈ P

ℓ

← i

end

/* Stage 2. Supervised classification stage */

/* Build a multi-class classifier on the training data set D */

PreSTyDe←buildClassiﬁer(D)

Training←PreSTyDe

EndFunction

2023) (Czibula et al., 2023) (Briciu et al., 2023). In

the original data set (Herbold et al., 2022), there are

16 releases of the Calcite software (from 1.0.0 to

1.15.0). Application classes from each version are

characterised by the values of 4189 software metrics

(such as static code metrics, metrics extracted from

the Abstract Syntax Tree representation of the source

code, code churn metrics), and a binary label indicat-

ing whether the class was identiﬁed as being defective

or not. This identiﬁcation was done via a modiﬁed

version of the SZZ algorithm, designed to work with

the JIRA issue tracking system, together with manual

data validation in order to establish the ground truth

(Herbold et al., 2022).

We note very high data imbalancement (number of

defective software entities vs. number of non-defects)

for all Calcite versions: the defective rates range from

0.033 (for version 1.15.0) to 0.166 (for version 1.0.0).

The total number of defective application classes in

all Calcite versions is 1577.

As highlighted in Section 2.2, the relevance of

the feature set employed for characterizing the ap-

plication classes is of major importance in our SDP

task, as these features will have an impact on the

PreSTyDe model and its ability to create a good sep-

aration boundary between the class of defective and

non-defective software entities. Starting from the re-

search ﬁndings of (Miholca et al., 2022), we aim to

use conceptual-based features for capturing semantic

characteristics of source code instead of the classical

features expressed through software metrics.

For a better understanding of the structural and

conceptual feature-based representation of the defects

from all the Calcite releases we depict in the 2D

visualization of a self-organizing map (SOM) (Cz-

ibula and Czibula, 2012) unsupervisedly trained on

the defective software application classes represented

as software metrics (left side image of Figure 1) and

conceptual features (right side image of Figure 1) pro-

posed by (Miholca et al., 2022).

For visualizing the SOMs from Figure 1 the U-

Matrix (L

otsch and Ultsch, 2014) method was em-

ployed. The adjacent neurons from the map are

coloured based on their distance in the 2D output

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

218

Figure 1: 2D visualisation of the set De f of software defects from all 16 Calcite releases visualized using a SOM: using the

software metrics-based features proposed by (Herbold et al., 2022) (left side image), and semantic vectors extracted using

Doc2Vec+LSI (right). Darker areas represent clusters of similar instances while lighter areas express cluster separators.

space. In our visualization, we employed a darker

colouring for neurons that correspond to input in-

stances whose vectors are close in the input space

and thus darker regions express clusters of similar in-

stances. Lighter areas represent larger distances be-

tween the neurons and thus express separation bound-

aries between the clusters. The visualizations from

Figure 1 highlight that the conceptual feature-based

representations in Doc2Vec+LSI spaces are the most

informative, being able to distinguish more than two

groups (types) of defective application classes. On

the SOM built considering the software metrics-based

representation (left image) we observe two clusters

of software defects, while the semantic representation

reveals about ﬁve groups. Based on this observation

and the literature results, the combined Doc2Vec and

LSI representations proposed and used by (Miholca

et al., 2022) for the software application classes will

be further employed in our study.

In order to help replicate our study, we made our

data set and analysis source code publicly available at

(Chelaru, 2024).

Experiment. We describe below the experiment used

as a proof of concept for assessing the predictive per-

formance of the PreSTyDe classiﬁer on the previously

described Calcite data set. Let us consider the data

set D of all (defective and non-defective) application

classes from all 16 Calcite releases, represented as

real-valued vectors in the Doc2Vec+LSI space. In our

experiment we opted for exactly the same represen-

tations as those employed in the related work paper

in order to allow a more accurate comparison with

the results of (Miholca et al., 2022). Thus, the con-

catenated Doc2Vec+LSI representation is of length 60

(a dimensionality of 30 for the vectors extracted us-

ing both Doc2Vec and LSI). As shown by (Miholca

et al., 2022), the Doc2Vec and LSI representations

were extracted using the Gensim library (

Reh

rek and

Sojka, 2010), while the corpora used for training the

Doc2Vec and LSI models consisted of the source code

with comments but without operators, special sym-

bols, English stop words or Java keywords.

The data set D = De f

NonDe f consists of

19571 application classes, from which 1577 are de-

fective (the set De f , with instances labeled with 1)

and 17994 are non-defective (the set NonDe f , with

instances labeled with 0). We note that the defective

class is severely outnumbered by the non-defective

one, thus revealing an extreme data imbalancement

(a defective rate of 8.06%).

Following the methodology introduced in Sec-

tion 2, we present the two main stages to build the

PreSTyDe classiﬁer: the unsupervised classiﬁca-

tion stage followed by the supervised classiﬁcation

stage which also includes the performance evaluation

methodology.

Unsupervised classiﬁcation stage. As described in

Section 2.3, the ﬁrst step before applying the k-means

classiﬁcation method to the set of defects De f of all

Calcite versions is to determine the optimal number

nc of clusters using the Elbow method (Shi et al.,

2021). Figure 2 is a visual representation of the elbow

score applied to the set of defective instances. This

method determines the optimal number of clusters

by identifying the point at which the sum of squares

within the cluster starts to decrease at a slower rate,

similar to an “elbow” in the plot.

To further validate these ﬁndings, we also em-

ployed the Silhouette method, as seen in Figure 3,

which measures how well deﬁned and separated the

clusters are, helping to identify an optimal number

of clusters when the elbow point in the within-cluster

sum of squares plot is ambiguous. Having the conﬁr-

mation of this second test, we move further with the

division into 5 clusters.

After determining the optimal number of clusters

in the defective instances data set (nc = 5), the cluster-

ing step was performed using the k-means algorithm

from the scikit-learn library, which partitions the data

set into k clusters, in our case 5, by iteratively assign-

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults

219

Figure 2: Elbow score applied to the set of bugs.

Figure 3: Silhouette score applied to the set of bugs.

ing data points to the nearest cluster centroid and up-

dating the centroids to minimise the sum of squares

within the cluster; this process continues until con-

vergence. Figure 4 presents a 2D visualisation of the

ﬁve clusters. The ﬁgure was constructed using the

Principal Component Analysis (PCA) dimensionality

reduction technique, which allowed us to reduce the

number of data features, but retain as much informa-

tion as possible. We observe that there is no clear

delimitation between all clusters, and this may be due

to the vectorial representation of the data.

After the set De f of defects was partitioned in 5

clusters using the k-means method, the labels of the

defective application classes from D are changed ac-

cording to the identiﬁed partition, i.e., the label of all

software defects from the i-th cluster (i = 1,2,3,4,5)

is changed to i.

Supervised classiﬁcation stage. We describe the su-

pervised classiﬁcation stage used to assess our initial

assumption that it would be more easy to differentiate

the non-defects from a speciﬁc type of software faults

than from the entire set of software faults.

Thus, for each cluster of defects cl ∈

,...P

}, our goal is to train a deep neu-

ral network (DNN) model M

able to differentiate

Figure 4: 2D visualisation of the 5 uncovered clusters. PCA

is applied for dimensionality reduction.

the speciﬁc class of defects P

from the set NonDe f

of non-defects. Inspired by the inherent process of

software evolution, M

will be trained only on the

application classes (from P

and NonDe f ) from the

ﬁrst 10 software releases (1.0.0 to 1.9.0) and tested

on those instances (from P

and NonDe f ) that belong

to the last software releases (1.10.0 to 1.15.0).

The DNN is based on the keras library and com-

prises an input layer of 60 input features, two hid-

den layers employing ReLU activation, and a sigmoid-

activated output layer. An additional dropout layer

with a dropout rate of 0.2 was incorporated for regu-

larisation. The loss function chosen is binary cross-

entropy, optimised with the Adam optimiser having a

learning rate of 0.001. Training spans 15 epochs with

a batch size of 16, accompanied by the use of class

weights to address class imbalance.

To ﬁnd the optimal conﬁguration for our model,

we performed hyperparameter tuning experiments

using GridSearchCV, exploring learning rates, the

number of hidden layers and neurons per layer, the

number of epochs, and batch sizes.

Performance evaluation. The performance of each

model trained as previously described will be com-

pared with the performance of a DNN model M

all

trained on releases 1.0.0 to 1.14.0 to differentiate the

entire set De f of defective application classes from

the set of non-defective ones NonDe f . The testing of

all

is performed, as for the models M

, on version

1.15.0. If the performance of M

will be higher than

the performance of M

all

then our initial hypothesis

stands. For a more accurate performance evaluation.

all

will be trained/tested using the same methodol-

ogy as the individual M

models.

Once the binary classiﬁcation models

all

,...,M

) were trained, their per-

formance on the testing set (Calcite version 1.15.0) is

evaluated using the following performance measures:

Sensitivity (Sens) - the true positive rate of the

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

220

classiﬁer (also known as recall or probability of

detection); Speciﬁcity (Spec) - the true negative rate

of the classiﬁer Area under the ROC curve (AUC) -

considered among the best metrics for performance

evaluation in SDP (Fawcett, 2006), is computed as

the average between the speciﬁcity and sensitivity

values; Weighted F1 score (W F1) - computed as

the weighted average of the F1-scores for the defect

and non-defect classes; Average Area under the

Precision-Recall curve (Avg AU PRC) - computed

as the average between the AUPRC value of the

positive and negative class. Matthews Correlation

Coefﬁcient (MCC) - an evaluation metric used in the

ML literature mainly for imbalanced classiﬁcation.

All the evaluation measures should be maximised

in order to obtain better defect predictors. MCC range

in [-1,1], while the other measures range in [0,1].

4 RESULTS AND DISCUSSION

We present the results of the experiment described in

Section 3 conducted on the Apache Calcite software

for answering RQ1 in what concerns the performance

of the PreSTyDe classiﬁer.

As presented in Section 3, for each cluster of de-

fects identiﬁed in all 16 Calcite releases, we train

a DNN-based model M

(1 ≤ i ≤ 5) for separating

from the set NonDe f of non-defects as well as a

DNN model M

all

trained to differentiate the entire set

of defects (De f ) from the set of non-defective ones

NonDe f . Each of the learning models is trained on all

application classes from Calcite releases 1.0.0-1.9.0,

while the testing set consists of the instances from

Calcite versions 1.10.0-1.15.0. The decision to divide

the data set as such is based on two facts: ﬁrst, we

wanted to follow the evolution of the project; there-

fore we trained our model on early releases and used

the latest versions possible for testing, but also had

to include entities from all the uncovered clusters in

both training and testing.

Table 1 presents, for each of the evaluated learn-

ing models (M

,...,M

and M

all

) the number of

defects from the testing data, the obtained confusion

matrix together with the values for the performance

metrics employed. We note that the testing data sets

for each of the models contains 7595 non-defective

instances (non-defects from releases 1.10.0-1.15.0).

A ﬁrst issue that should be noted (as revealed by

the results from Table 1) is that the difﬁculty of clas-

siﬁcation increases when attempting to distinguish a

speciﬁc class of defects (clusters P

-P

) from the class

of non-defective instances. This was to be expected,

due to a more severe data imbalancement for train-

ing the models M

-M

(signiﬁcantly smaller number

of defects, as depicted in the last column of the ta-

ble) which makes it more difﬁcult for the classiﬁer to

recognize the minority class.

To mitigate this issue of data imbalancement and

reduce the bias of the classiﬁer towards the non-

defective class, weights were used during the train-

ing. Nevertheless, classiﬁers M

-M

have a false neg-

ative rate (FNRate) higher with more than 20% than

the false negative rate of M

all

. The higher values of

the FNRate reveal that more defects are misclassiﬁed

by M

-M

than by M

all

which lead to lower positive

predictive values (PPVs) - precision values for the de-

fective class. This is also reﬂected in the sensitivity

of the classiﬁers: M

all

has a higher sensitivity value

(0.482) than all classiﬁers M

-M

Figure 5 depicts the variation of sensitivity with

respect to the number of defects from the testing data

and the imbalancement degree (ID - computed as the

number of non-defects divided to the number of de-

fects) for each of the employed classiﬁers (M

-M

all

). The Pearson correlation coefﬁcients (PCC)

were also computed for measuring the degree of linear

relationship between the speciﬁcity and ID/number of

defects. We note weak correlations between Spec and

ID (PCC of −0.123) and the number of defects (PCC

of 0.235). However, the PPV values are strongly cor-

related with ID (PCC of −0.738) and very strongly

correlated with the number of defects (PCC of 0.933)

suggesting that the precision of the classiﬁer in detect-

ing the defects are strongly dependent on the number

of defects from the training/testing data.

Figure 5: Variation of Spec with respect to the imbalance-

ment degree (ID) and the number of defects from the testing

data for the models M

-M

, M

all

. The values for the ID and

number of defects were scaled to [0.1, 0.5].

Even if the models M

-M

have lower probabil-

ity of detection values than M

all

(as revealed by the

previous analysis), it should be noted that the speci-

ﬁcity values for the models M

-M

are higher than the

speciﬁcity of M

all

with more than 30%. Thus, mod-

els M

-M

are able to better recognize the class of

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults

221

Table 1: Experimental results. Each of the evaluated learning models (M

,... , M

and M

all

) is trained on the application

classes from the ﬁrst 10 Calcite releases and tested on the ﬁve remaining versions.

Learning TD FD TN FN Sens (↑) S pec (↑) AUC (↑) W F1 (↑) Avg AU PRC (↑) MCC (↑) # of defects

model

3 723 6872 8 0.273 0.905 0.589 0.948 0.545 0.023 11

60 1479 6116 90 0.400 0.805 0.603 0.871 0.557 0.071 150

3 111 7484 158 0.019 0.985 0.502 0.962 0.502 0.005 161

19 847 6748 51 0.271 0.888 0.580 0.929 0.544 0.048 70

1 345 7250 3 0.250 0.955 0.602 0.976 0.552 0.023 4

all

191 3349 4246 205 0.482 0.559 0.521 0.675 0.512 0.018 396

non-defects than M

all

, meaning that the non-defective

instances are more easily distinguished from the spe-

ciﬁc types of defects (clusters P

-P

) than from the

whole set of defects. We also note that for all the

other metrics (AUC, W F1, Avg AUPRC, MCC), the

models M

, M

and M

have better values than

all

. In terms of AUC measure, which is considered

among the best metrics in SDP, there is an increase in

performance with more than 6% for the classiﬁcation

of speciﬁc type of defects.

It is surprising that the cluster P

is the worst sep-

arated from the set of non-defects, even if the num-

ber of defects from the cluster is the highest com-

pared to the other clusters. However, the speci-

ﬁcity of the model M

is the highest, which means

that non-defects are better recognized than the de-

fects. We analyzed the bug reports, the source code

and all the changes made to the misclassiﬁed in-

stances from P

for identifying common reasons for

the wrong classiﬁcations. In some cases the differ-

ence between the version containing the defect and

the version that ﬁxed the bug is very small and con-

sequently the generated feature representation in the

combined Doc2vec and LSI space did not manage

to meaningfully capture the change. For example

the class CassandraEnumerator from the package

org.apache.calcite.adapter.cassandra having

116 lines of code contained a bug ([CALCITE-1855])

in version 1.12.0, then the ﬁx was provided in version

1.13.0 by the commit 43e32fa5a and it only con-

sisted in merging two if statements and removing one

line. In other cases, for example for the bug report

[CALCITE-1569] where the ﬁx contains classes from

, there are lengthy discussions about the reported

issue suggesting that even the human experts are not

in agreement on the exact requirements and speciﬁ-

cation for a given class. The discussion around this

particular bug report contains multiple alternatives to

ﬁx the issue, potentially affecting a different set of

classes.

Overall, considering all six evaluation metrics, the

performance of the models M

-M

exceeds the perfor-

mance of M

all

in 73% of the cases (22 out of 30 com-

parisons). Thus, our hypothesis that the classiﬁcation

on speciﬁc types of defects is more performant than

the classiﬁcation of all software defects considered as

a whole is sustained and RQ1 is answered.

However, we acknowledge the poorer perfor-

mance of PreSTyDe in terms of probability of de-

tection (Spec). This may be due to a limitation of

the employed classiﬁer (DNN) which misclassiﬁes

the software defects (due to the severe imbalancement

of the data) and can be caused by inappropriate class

weights or the architecture itself. Other classiﬁcation

models, such as support vector machines or gradient

boosting methods may be more suitable for the con-

sidered SDP tasks and will be further explored. The

one-class classiﬁcation models may also be an alter-

native to address the low classiﬁcation recall.

Another possible cause for the poorer perfor-

mance in terms of Spec may be the representation

employed. It would be possible that the used rep-

resentation (doc2vec+LSI) is not able to distinguish

the faulty entities and should be enriched with other

features (e.g., software metrics or semantic features

learned from other software artifacts).

4.1 Comparison to Related Work

Despite of the vast research in the ﬁeld of SDP us-

ing machine learning and recently deep learning ap-

proaches, there are very few approaches in the liter-

ature addressing the problem of predicting speciﬁc

types of software defects. Existing SDP approaches

highlight the need to shift the SDP research towards

predicting speciﬁc types of software faults such as

security vulnerability issues (Dam et al., 2019), nu-

meric errors, complexity issues, pointer issues (Cz-

ibula et al., 2023), energy defects, performance de-

fects (Kamei and Shihab, 2016) as speciﬁc defect

types have their own speciﬁc behavior. Still, most

SDP approaches consider all defect types together and

try to propose a universal defect prediction model.

There are very few methods in the recent SDP litera-

ture addressing the problem of predicting speciﬁc de-

fect types.

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

222

(Xu et al., 2021) attempted to address the dif-

ﬁculty of predicting all types of defects using a

single model and introduced a graph representation

that allowed to extract defect region candidates for

speciﬁc defect types. Experiments were carried

out on the Software Assurance Reference Dataset

(SARD, 2023) and three different types of software

faults: improper validation of an array index vul-

nerability (CWE-129), unchecked input for loop con-

dition (CWE-606) and divide by zero (CWE-369).

Very good performance was obtained for the models

speciﬁcally developed for the three types of defects,

with Area Under the ROC curve values ranging be-

tween 0.846 and 0.995.

It is difﬁcult to provide a direct comparison be-

tween our approach and (Xu et al., 2021), as they

have different goals. First, we do not use predeﬁned

types of software defects, but we aim to detect de-

fect classes from the available releases of the software

and further employ these classes for predicting types

of defects in future software releases. Thus, our aim

is to adapt the types of defects to the analysed soft-

ware. Second, the current work is not focused on im-

proving the performance of prediction for the speciﬁc

types of defects, but to point out that the performance

of predicting speciﬁc defect types is higher than the

performance of predicting all defect types together.

And last, (Xu et al., 2021) do not perform their exper-

iments on Calcite.

4.2 Analysis from a Software

Engineering Perspective

We aim to answer RQ2 from a software engineering

point of view. We focus on the 1.10.0 - 1.15.0 ver-

sions, on which the M

models were tested. Changes

to these versions

included support for Java 10, nu-

merous improvements and bugﬁxes, extensions to the

supported SQL syntax as well as new functionalities

and API changes. We carried out a manual com-

parison between all consecutive version pairs (e.g.,

1.10.0 and 1.11.0). We observed that Calcite’s ﬁle

structure did not undergo signiﬁcant changes, which

was expected from already mature application ver-

sions (Molnar and Motogna, 2020).

While a detailed evaluation of the classiﬁer’s

per cluster performance is beyond the scope of our

present paper, we selected one defect from each of

the studied versions, in order to highlight the inherent

complexity of the SDP task.

CALCITE-1501

was a major bug affecting a

class in version 1.10, which PreSTyDe mislabelled

https://calcite.apache.org/news/releases/

https://issues.apache.org/jira/browse/CALCITE-1501

as non-defective. Fixing it required detailed knowl-

edge of the operators used within Calcite, proof being

the detailed discussion recorded on JIRA. CALCITE-

1949 was a major bug ﬁrst reported in version 1.11

that lead to potential memory leaks and which was re-

solved in version 1.17. Repairing the defect entailed

calling the close() method of a superclass belonging

to the Avatica external library. For version 1.12, we

selected CALCITE-1881, a major bug that induces an

error due to a defect in working with date and times-

tamp types. The ﬁx consisted in updating how Calcite

worked with data types interpretable as a timestamp.

We use source ﬁle DruidQuery.java

as an exam-

ple illustrative of the difﬁculties in pinpointing the

decisions taken by SDP approaches. Both the origi-

nal data set (Herbold et al., 2022) and our classiﬁer

marked the source ﬁle as defective in version 1.13;

our manual examination revealed the existence of sev-

eral overlapping issues that were ﬁxed either in ver-

sion 1.13 (CALCITE-1853) or one of the following

ones (CALCITE-2035, CALCITE-2094, CALCITE-

2101). CALCITE-2055 was a major bug present in

1.14, which was ﬁxed in version 1.15. It had to do

with correct parsing of invalid date components and

the responsible source ﬁle was not correctly classiﬁed

by PreSTyDe. The issue was resolved by adding the

required veriﬁcation code and raising an exception in

case invalid input was provided. For Calcite version

1.15, we selected CALCITE-2188, a major bug that

resulted in errors in speciﬁc use of date-type objects

in SQL queries. While ﬁxed in version 1.17, the asso-

ciated discussion on JIRA is a good illustration of the

complexities of source code that must remain compat-

ible with several relational database systems, each of

which handles issues such as timestamp differently.

Our examination reafﬁrmed the variance in soft-

ware defects. We conﬁrmed the ﬁndings of (Herbold

et al., 2022) that deﬁning defects is not practical for

most real-life software systems. In several of the ex-

amined cases, defect resolution involved the inclusion

of large sections of new code as the result of lengthy

discussion between subject matter experts.

5 THREATS TO VALIDITY

The research conducted in this paper followed the

guidelines stated in (Runeson and H

ost, 2009). Start-

ing from a research hypothesis and two well-deﬁned

research questions, we established the methodology

(all defect data and details are available on JIRA)

https://github.com/apache/calcite/blob/main/druid/src/

main/java/org/apache/calcite/adapter/druid/DruidQuery.

java

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults

223

of our study, the target application, collected, pro-

cessed and analyzed the data. In order to allow the

replication or extension of our study, the data used in

our experiments was published as an open-data pack-

age (Chelaru, 2024). Moreover, to address the inter-

nal, external and construct threats to our study’s valid-

ity the existing best practices in both machine learning

and software engineering research were adopted.

In what concerns internal threats we note that the

feature set employed for representing the application

classes is of major importance and may have an inﬂu-

ence on the obtained results. In addition, an analysis

of the source code of all Calcite project releases re-

vealed several situations which may cause noise in

the training data and affect the performance of our

approach. We found 56 application classes labelled

as non-defects (0) in a certain software version, label

that changed to defective (1) in a future release with-

out any change in the source code of the application

class. We also detected an application class initially

labelled as 1, with the label changed in a newer re-

lease to 0 without any source code changes. These

data points corroborate some of our earlier ﬁndings

(Czibula et al., 2023) regarding limitations in the anal-

ysis of large volumes of source code and associated

information. We note these instances were not re-

moved from the data sets and could have introduced

noise into the ML models’ building and evaluation.

We leveraged previous work (Herbold et al., 2022)

that combined automated tooling and manual valida-

tion in order to ensure the validity of our method’s in-

put data. However, this limited the scope of our study

to the 16 versions covered by the research. As such,

the most important external threat to our study’s va-

lidity regards its scope, which is currently limited to

a subset of all the versions of a single member of the

Apache suite.

We addressed construct threats by targeting a pop-

ular open-source application for which source code

and associated development data is available. We em-

ployed a combination of widely used performance

metrics together with complementing the results of

the PreSTyDe approach with an additional analysis

from a software engineering point of view. How-

ever, it is possible that additional, yet hidden software

defects exist which could confound our analysis and

skew obtained results. In addition, dividing the avail-

able versions into training (ﬁrst 10 versions) and test-

ing (next 6 versions) meant that the latter was carried

out using mature application versions. We aim to ex-

tend our work in order to cover multiple systems as

well as application development phases, in order to

capture the target application’s entire lifecycle.

6 CONCLUSIONS AND FUTURE

WORK

Starting from the hypothesis that the performance of

predicting software defects would improve by devel-

oping predictors for speciﬁc classes of software de-

fects, we proposed the PreSTyDe defect prediction

model. PreSTyDe consists of an unsupervised la-

belling of defects from the available versions of the

analysed software system followed by a supervised

defect predictor used to predict the error proneness of

the application classes in an upcoming version of the

software by employing the taxonomy of defects unsu-

pervisedly uncovered during the previous stage.

An initial case study was conducted on the Cal-

cite software as a proof of concept for assessing the

predictive performance of PreSTyDe; it conﬁrmed the

feasibility of predicting speciﬁc types of software de-

fects instead of developing one classiﬁer for predict-

ing the whole class of software faults. The research

questions stated in Section 1 have been answered. As

a response to the RQs, we highlighted that the perfor-

mance of predicting the defect-proneness of the appli-

cation classes in a speciﬁc version of a software could

be improved by employing a taxonomy of defects un-

supervisedly uncovered from the previous software

releases. We also carried out an initial analysis from

a software engineering perspective conducted on a se-

lected number of software defects. We used Apache

Calcite (Begoli et al., 2018), a mature real life soft-

ware project to indicate the potential and viability of

the proposed approach in an industrial context.

We believe that our current work on classifying

defects by their category should be the new line of

research in the SDP ﬁeld as, on one hand, it could

lead to a better understanding of software defects, and

from a practical perspective it may aid development

teams to early detect speciﬁc faults.

In what concerns future work, the immediate goal

is to increase the speciﬁcity of the proposed approach

by investigating other classiﬁers such as support vec-

tor machines or gradient boosting methods. Another

direction regards complementing existing data sets

such as (Herbold et al., 2022) with additional results

from static analysis tools such as SonarQube (Mol-

nar and Motogna, 2020) or PTIDEJ (Lenarduzzi et al.,

2019); we believe it is likely that having input data at a

higher level than class-based metrics could further im-

prove the precision of methods such as PreSTyDe. We

aim to complement these efforts by carrying out an

extensive evaluation of our classiﬁer’s accuracy from

a software engineering standpoint, by examining the

generated clusters and using a known defect taxon-

omy to evaluate cluster-speciﬁc errors.

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

224

Finally, we will focus on extending our investi-

gation to other systems within the Apache Software

Foundation due to their strict inclusion criteria, ma-

ture nature and comprehensive feature and software

defect information (Lenarduzzi et al., 2019).

REFERENCES

Begoli, E., Camacho-Rodr

ıguez, J., Hyde, J., and et al.

(2018). Apache Calcite: A Foundational Framework

for Optimized Query Processing Over Heterogeneous

Data Sources. In SIGMOD ’18, page 221–230. ACM.

Briciu, A., Czibula, G., and Lupea, M. (2023). A study

on the relevance of semantic features extracted us-

ing BERT-based language models for enhancing the

performance of software defect classiﬁers. Procedia

Computer Science, 225:1601–1610.

Chelaru, I.-G. (2024). PreSTyDe FigShare dataset. https:

//doi.org/10.6084/m9.ﬁgshare.25237600.

Chen, L., Fang, B., and Shang, Z. (2016). Software fault

prediction based on one-class SVM. In ICMLC 2016,

volume 2, pages 1003–1008.

Ciubotariu, G., Czibula, G., Czibula, I. G., and Chelaru, I.-

G. (2023). Uncovering behavioural patterns of one-

and binary-class SVM-based software defect predic-

tors. In ICSOFT 2023, pages 249–257. SciTePress.

Czibula, G., Chelaru, I.-G., Czibula, I. G., and Molnar, A.-

J. (2023). An UL-based methodology for uncovering

behavioural patterns for speciﬁc types of software de-

fects. Procedia Computer Science, 225:2644–2653.

Czibula, G. and Czibula, I. G. (2012). Unsupervised restruc-

turing of OO software systems using self-organizing

feature maps. IJICIC journal, 8(3(A)):1689–1704.

Dam, H. K., Pham, T., Ng, S. W., Tran, T., Grundy, J.,

Ghose, A., Kim, T., and Kim, C.-J. (2019). Lessons

Learned from Using a Deep Tree-Based Model for

SDPin Practice. In MSR 2019, pages 46–57.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas,

G. W., and Harshman, R. A. (1990). Indexing by latent

semantic analysis. JASIST journal, 41:391–407.

Fawcett, T. (2006). An introduction to ROC analysis. Pat-

tern Recognition Letters, 27(8):861–874.

Herbold, S., Trautsch, A., Trautsch, F., and Ledel, B.

(2022). Problems with szz and features: An empirical

study of the state of practice of defect prediction data

collection. Empirical Software Engineering, 27(2).

Kamei, Y. and Shihab, E. (2016). Defect Prediction: Ac-

complishments and Future Challenges. In SANER

2016, volume 5, pages 33–45.

Le, Q. V. and Mikolov, T. (2014). Distributed represen-

tations of sentences and documents. Computing Re-

search Repository (CoRR), abs/1405.4:1–9.

Lenarduzzi, V., Saarim

aki, N., and Taibi, D. (2019).

The Technical Debt Dataset. In Proceedings of

PROMISE’19, page 2–11. ACM.

Li, J., He, P., Zhu, J., and Lyu, M. R. (2017). Software

Defect Prediction via Convolutional Neural Network.

In QRS 2017, pages 318–328.

otsch, J. and Ultsch, A. (2014). Exploiting the structures

of the u-matrix. In Advances in Self-Organizing Maps

and Learning Vector Quantization, pages 249–257.

Springer Publishing.

Menzies, T., Krishna, R., and Pryor, D. (2017). The

SEACRAFT Repository of Empirical Software Engi-

neering Data.

Miholca, D.-L., Tomescu, V.-I., and Czibula, G. (2022). An

in-depth analysis of the software features’ impact on

the performance of deep learning-based software de-

fect predictors. IEEE Access, 10:64801–64818.

Molnar, A.-J. and Motogna, S. (2020). Long-Term Evalu-

ation of Technical Debt in Open-Source Software. In

ESEM 2020, New York, NY, USA. ACM.

Moussa, R., Azar, D., and Sarro, F. (2022). Investigating the

use of one-class support vector machine for software

defect prediction. CoRR, abs/2202.12074.

Pachouly, J., Ahirrao, S., Kotecha, K., Selvachandran, G.,

and Abraham, A. (2022). A systematic literature re-

view on SDP using AI: Datasets, Data Validation, Ap-

proaches, and Tools. volume 111, page 104773.

Reh

rek, R. and Sojka, P. (2010). Software framework for

topic modelling with large corpora. In LREC 2010,

pages 45–50. ELRA.

Runeson, P. and H

ost, M. (2009). Guidelines for conduct-

ing and reporting case study research in software en-

gineering. Empirical Softw. Engg., 14(2):131–164.

Sahoo, S. K., Criswell, J., and Adve, V. (2010). An Em-

pirical Study of Reported Bugs in Server Software

with Implications for Automated Bug Diagnosis. In

ICSE’10, page 485–494, USA. ACM.

SARD (2023). Software Assurance Reference Dataset.

https://samate.nist.gov/SARD/.

Shepperd, M., Qinbao, S., Zhongbin, S., and Mair, C.

(2018). NASA MDP Software Defects Data Sets.

Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., and Liu, J.

(2021). A quantitative discriminant method of elbow

point for the optimal number of clusters in clustering

algorithm. J. Wirel. Commun. Netw., 2021(1):31.

Wagner, S. (2008). Defect Classiﬁcation and Defect Types

Revisited. In Proc. of the 2008 Workshop on Defects

in Large Software Systems, page 39–40, New York,

NY, USA. Association for Computing Machinery.

Wang, S., Liu, T., and Tan, L. (2016). Automatically learn-

ing semantic features for defect prediction. In Pro-

ceedings of the 38th ICSE, pages 297–308. ACM.

Xu, J., Ai, J., and Shi, T. (2021). Software Defect Prediction

for Speciﬁc Defect Types based on Augmented Code

Graph Representation. In DSA 2021, pages 669–678.

Zhang, S., Jiang, S., and Yan, Y. (2022). A Software Defect

Prediction Approach Based on BiGAN Anomaly De-

tection. Scientiﬁc Programming, 2022:ID 5024399.

Zhou, C., He, P., Zeng, C., and Ma, J. (2022). Software

defect prediction with semantic and structural infor-

mation of codes based on graph neural networks. In-

formation and Software Technology, 152:107057.

PreSTyDe: Improving the Performance of within-project Defects Prediction by Learning to Classify Types of Software Faults

225