Feature Extraction, Learning and Selection in Support of Patch Correctness Assessment

Viktor Csuvik, Dániel Horváth and László Vidács

Department of Software Engineering, University of Szeged, Hungary

Keywords:

Automated Program Repair, Patch Assessment, Overﬁtting Patch, Code Features, PCC.

Abstract:

Automated Program Repair (APR) strives to minimize the expense associated with manual bug ﬁxing by

developing methods where patches are generated automatically and then validated against an oracle, such as

a test suite. However, due to the potential imperfections in the oracle, patches validated by it may still be

incorrect. A signiﬁcant portion of the literature on APR focuses on this issue, usually referred to as Patch

Correctness Check (PCC). Several approaches have been proposed that use a variety of information from

the project under repair, such as diverse manually designed heuristics or learned embedding vectors. In this

study, we explore various features obtained from previous studies and assess their effectiveness in identifying

incorrect patches. We also evaluate the potential for accurately classifying correct patches by combining

and selecting learned embeddings with engineered features, using various Machine Learning (ML) models.

Our experiments demonstrate that not all features are equally important, and selecting the right ML model

also has a huge impact on the overall performance. For instance, using all 490 features with a decision tree

classiﬁer achieves a mean F1 value of 64% in 10 independent trainings, while after an in-depth feature- and

model selection with the selected 43 features, the MLP classiﬁer produces a better performance of 81% F1.

The empirical evaluation shows that this model is able to correctly classify samples on a dataset containing

902 labeled patches with 100% precision and 97% recall at its peak, which is complementary performance

compared to state-of-the-art methods. We also show that independent trainings can exhibit varying outcomes,

and propose how to improve the stability of model trainings.

1 INTRODUCTION

The concept of Automated Program Repair (APR),

which involves the automated resolution of software

bugs, has gained signiﬁcant traction alongside the

growing prevalence of software usage. The pre-

dominant focus of APR research revolves around

Generate-and-Validate (G&V) approaches, wherein

patch candidates are generated (e.g., via genetic al-

gorithm, heuristics, or learned code transformations)

and subsequently verified against an oracle. If the oracle is the test suite (which is usually the case), the approach is

referred to as test-suite-based program repair. Despite

facing criticism on multiple occasions, these methods

still shape the trajectory of APR research (Kechagia

et al., 2022). A notable obstacle encountered in test-suite-based repair is the potential to create a patch that enables the entire test suite to pass, yet remains incorrect. This phenomenon is commonly referred to as the overfitting patch problem (Wang et al., 2021), and the goal of Patch Correctness Check (PCC) is to determine the actual correctness of a patch without additional manual effort. We call a patch overfitting if it passes the test suite but does not fix the program. The expectation from a correct patch is that (1) the test suite passes, (2) it fixes the original fault and (3) it does not introduce new bugs.

https://orcid.org/0000-0002-8642-3017

https://orcid.org/0000-0001-8855-921X

https://orcid.org/0000-0002-0319-3915

Overfitting patches have limited utility, which substantially affects the practical applicability of program repair. They also make developers less confident in APR tools, thus reducing their widespread use. The use of data-augmentation techniques (Xin and Reiss, 2017a; Ye et al., 2021) and repair operator curation (Wen et al., 2018) can lead to more correct patches, but at the time of writing this paper, the classification of generated patches is the most popular research direction. Among these, recent studies have introduced static methods for detecting overfitting patches, mainly because of their ease and speed of use. Xin et al. (Xin and Reiss, 2017b) defined 2 static features in ssFix, while 3 static features are leveraged in S3 (Le et al., 2017) and 202 in ODS (Ye et al., 2019). Another approach is the use of source code embeddings, either directly (Lin et al., 2022) or by calculating similarity between the embedded vectors (Csuvik et al., 2020). Most of these works operate on distinct datasets, and the approaches compete with each other rather than complement one another. In this work, our aim is to treat all of this crafted knowledge as features for a machine learning model and select which of these are the most useful in the PCC domain.

Csuvik, V., Horváth, D. and Vidács, L.

Feature Extraction, Learning and Selection in Support of Patch Correctness Assessment.

DOI: 10.5220/0012746900003753

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 19th International Conference on Software Technologies (ICSOFT 2024), pages 23-34

ISBN: 978-989-758-706-1; ISSN: 2184-2833

In our research, we used 902 patches generated by APR tools for Defects4J (Just et al., 2014) bugs and previously labeled by researchers (Wang et al., 2021). For these patches we mapped features achieving state-of-the-art results in previous works: hand-crafted features (Xin and Reiss, 2017b; Le et al., 2017; Wen et al., 2018), static code features (Ye et al., 2019), embedding vectors (Lin et al., 2022) and similarity metrics (Csuvik et al., 2020), thus forming a feature vector of 490 dimensions. On this set, we first performed a feature selection process, which resulted in a 43-dimensional feature vector. The selected features served as the basis for our model selection, in which we trained Machine Learning (ML) models and selected the Multi-Layer Perceptron (MLP), which yielded the overall best performance. In Deep Learning (DL) approaches, stability is essential because only reporting good DL performance may threaten the experimental conclusions (Liu et al., 2021a). To this end, we publish the training dataset along with fixed seeds and report the random effects of the initialization along with our results. We also propose a technique that can improve the stability of model trainings. The following points summarize our main contributions:

• Evaluation on a Uniﬁed Benchmark: previous

works are evaluated on distinct datasets, and now

we bring these to a common ground.

• Feature Selection and Combination: approaches are treated as complementary; the best features are selected from previous works.

• Reproducible ML Model Trainings: ML mod-

els are selected and trained in a reproducible man-

ner to achieve competitive performance.

2 RESEARCH OBJECTIVES

We organize our experiment around the following re-

search questions:

RQ1. Which features are the most useful for PCC?

RQ2. What ML models are the most promising?

RQ3. How can we further enhance the performance

and stability of predictions?

The source code and detailed experimental data

can be found in the online appendix of this paper (onl,

2024).

3 BACKGROUND

Traditionally, a patch is deemed correct if it successfully passes all the test cases. Unfortunately, practical test suites often fail to guarantee the correctness of

generated patches. Consequently, patches that pass

all tests (referred to as plausible patches) may incor-

rectly address the bug, fail to fully ﬁx the issue, or

disrupt intended functionalities, thus becoming over-

ﬁtting patches (Lutellier et al., 2019). This work

investigates the usefulness of features proposed in

previous works to tackle the problem of identifying

correct patches among incorrect and plausible APR-

generated patches. We ﬁrst provide the necessary

background on the used engineered and learned fea-

tures.

3.1 Features

Generally, within machine learning, a feature represents a distinct and measurable attribute or characteristic of a phenomenon, while a feature vector is a point in an n-dimensional space whose coordinates are the individual feature values (binary (0-1), integer or real numbers) (Seidel et al., 2017). Features are widely used in software

engineering for diverse tasks including just-in-time

quality assurance (Kamei et al., 2013), fault localiza-

tion (Kim et al., 2019), vulnerability prediction (Dam

et al., 2021) and others. In this work, we focus on

optimal feature selection in the domain of Patch Cor-

rectness Check.

3.1.1 Hand-Crafted Features

Manually crafting features so that a classifier can prioritize correct patches over overfitting ones is not uncommon in the literature. We adapt the features of ssFix (Xin and Reiss, 2017b), S3 (Le et al., 2017) and CapGen (Wen et al., 2018), as their implementation is available and because these values are available in the seminal work of Wang et al. (Wang et al., 2021). The features introduced in S3 quantify both syntactic and semantic disparities between a candidate solution and the original buggy code. Subsequently, these features are used to prioritize and discern correct patches. In ssFix, the authors employ a token-based syntax representation of code to pinpoint syntax-related code fragments, aiming to generate accurate patches. CapGen has proposed three context-aware models (the genealogy model, the variable model, and the dependency model) to prioritize correct patches over overfitting ones.

Table 1: PCC features used in this study.

| Category | Feature | Origin | Description | # dim |
|---|---|---|---|---|
| Hand-crafted | Token-Strct/Conpt | ssFix (Xin and Reiss, 2017b) | Structural & conceptual token similarity obtained from the buggy code and the generated patch. | 2 |
| Hand-crafted | AST/Variable-Dist | S3 (Le et al., 2017) | Number of the AST changes and distances between the vectors representing AST patch nodes. | 4 |
| Hand-crafted | Var/Syn/Sem-Simi | CapGen (Wen et al., 2018) | Similarity between variables / syntactic structures / contextual nodes affected by the change. | 3 |
| Engineered | Code descriptor | ODS (Ye et al., 2019) | Describe the characteristics of code elements (operators, variables, statements, AST operations). | 155 |
| Engineered | Repair pattern | ODS (Ye et al., 2019) | Repair patterns based on the work of Sobreira et al. (Sobreira et al., 2018) as binary features. | 38 |
| Engineered | Contextual Syntactic | ODS (Ye et al., 2019) | Describe the scope, parent and children's similarities of modified statements. | 24 |
| Dynamic | Embedding | Cache (Lin et al., 2022) | Dimensions of the embedded patch. | 256 |
| Dynamic | Similarity | (Csuvik et al., 2020) | Similarity metrics measured between the embeddings of the patch and the original program. | 8 |
| Total | | | | 490 |

3.1.2 Engineered Features

In this study, we used ODS (Ye et al., 2019) to ex-

tract metrics from the source code. It performs static

analysis of the differences in AST between the buggy

and patched programs; these differences are encoded

as feature vectors. The authors grouped the ODS

features into three categories: code description features, repair pattern features, and contextual syntactic features. We extracted these features using Coming, an open source commit analysis tool. Due to space limitations, we do not describe all of the features here (they are available in the original study) and only give an overview in Table 1.

3.1.3 Code Embeddings and Patch Similarity

Mapping source code into vector space can be beneficial in many ways: it can grasp aspects of the program that other metrics cannot. In the domain of patch

correctness assessment, these techniques assess patch

correctness by embedding token sequences extracted

from the changed code of a generated patch (Lin et al.,

2022). Then these embeddings can be used in various

ways; we consider two scenarios in our study.

1. Use the generated embeddings directly, as if a

dimension of the embedded vector is a feature.

Thus, each dimension is treated as a metric mea-

sured on the patch.

2. Measure the vector distance between the embedded vector of the patch and the original program. The intuition here is that correct patches are usually simpler (i.e. more similar to the original program) than overfitting ones. Note that in previous studies (Csuvik et al., 2021) the authors used a predefined threshold to classify patches as correct or overfitting, while in this study the threshold is learned dynamically.

https://github.com/SpoonLabs/coming

One can use diverse embedding architectures, some of which are more resource intensive than others. In this study we used Doc2Vec (Le and Mikolov, 2014) as implemented in Gensim (Rehurek and Sojka, 2010). The choice is motivated by several factors: (1) ease of use and availability; (2) its history in the PCC domain (Csuvik et al., 2021); and (3) more advanced embedding techniques did not surpass it by a large margin despite being more complicated and resource intensive (Csuvik et al., 2020). Doc2Vec is derived from Word2vec (Mikolov et al., 2013); it is an artificial neural network that can transform (embed) documents into vector space. Its main idea is that the hidden layer of the network has fewer neurons than the input and output layers, thus forcing the model to learn a compact representation.
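The second scenario of Section 3.1.3 (distances between the embedding of the patch and that of the original program) can be illustrated with a minimal sketch. The eight metric names follow the Distances column of Table 2; the use of SciPy, the random stand-in vectors, the Minkowski order p=3 and the unit variance vector for the standardized Euclidean distance are assumptions made for illustration only and are not taken from the authors' implementation.

```python
import numpy as np
from scipy.spatial import distance


def similarity_features(patch_vec, original_vec, variances):
    """Compute eight distance metrics between two embedding vectors."""
    u, v = patch_vec, original_vec
    return {
        "cosine_distance": distance.cosine(u, v),
        "braycurtis_distance": distance.braycurtis(u, v),
        "canberra_distance": distance.canberra(u, v),
        "chebyshev_distance": distance.chebyshev(u, v),
        "cityblock_distance": distance.cityblock(u, v),
        "euclidean_distance": distance.euclidean(u, v),
        # p=3 is an assumed order; the paper does not state it.
        "minkowski_distance": distance.minkowski(u, v, p=3),
        # Needs a per-dimension variance vector; unit variances are assumed.
        "seuclidean_distance": distance.seuclidean(u, v, variances),
    }


# Hypothetical 256-dimensional embeddings standing in for Doc2Vec output.
rng = np.random.default_rng(0)
patch_embedding = rng.normal(size=256)
original_embedding = rng.normal(size=256)

features = similarity_features(patch_embedding, original_embedding, np.ones(256))
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Each returned value becomes one entry of the feature vector, so a classifier can learn a decision boundary on the distances instead of relying on a predefined threshold.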

3.2 Dataset

Wang et al. (Wang et al., 2021) published a cu-

rated dataset comprising 902 labeled patches from

Defects4J bugs, generated by 19 repair tools. We used

this dataset, which encompasses 654 patches labeled

as overﬁtting and 248 patches labeled as correct by

the respective authors. The dataset is still actively

maintained, easily available, and popular to this day.

In their online appendix, the authors have also published

the measured hand-crafted features on this dataset

which we used directly from there. ODS features are

calculated using their tool, while dynamic features are

calculated using our implementation available in the

attached repository (onl, 2024).

https://github.com/claudeyj/patch_correctness/


[Figure 1. Panel (a), Naïve Vector Concatenation: hand-crafted, engineered, distance and code embedding features are concatenated, reduced by feature selection (l’ < l) and used for training models that output Correct / Overfitting. Panel (b), Deep Representation Model: selected embedding dims and selected static features each pass through dense layers, are concatenated, and pass through further dense layers that output the probability (0 - 1) of a patch being correct.]

Figure 1: A high level overview of the used features and their optimization for PCC. On part (a) all features are concatenated then the most descriptive ones are selected to teach several ML models. On (b) static features (Hand-crafted, Engineered and Distances) and embeddings are first fed into dense layers and the neural network concatenates them, allowing it to learn a dynamic representation.

4 EXPERIMENT SETUP

Figure 1 shows a comprehensive overview of our

study. Our goal is straightforward: ﬁnd the set of

features and ML models that most effectively detect

overfitting patches. In Figure 1 (a), features are concatenated as is and a subset of them is then selected, while part (b) depicts a deep representation model where features are fed into a neural architecture, allowing the network to learn the weights and biases of each feature. The features obtained from previous studies, described in Section 3.1, form a feature vector of l = 490 dimensions (comprising both static and dynamic features). Using feature selection techniques, we select l’ features from these (l’ < l), namely the ones that best explain the input data. ML models are trained and evaluated on this subset to determine which yields the best results. In our experiments we used a 32-bit Intel(R) Core(TM) i7-10510U

CPU of 1.80GHz to train and evaluate each model and

feature conﬁguration. All of our code runs in Python

3.11.6, using the scikit-learn library version 1.4.0 (Pe-

dregosa et al., 2011). The source code and detailed

experimental data can be found in the online appendix

of this paper (onl, 2024).

4.1 Feature Selection

Feature selection is the process of selecting a subset of

relevant features to be used in model training (Sarangi

et al., 2020). There are many available feature selec-

tion algorithms, from which we used the scikit-learn

implementation of RFECV (RFECV documentation,

2024) to achieve the goal depicted in Figure 1 (a). It

recursively eliminates features with cross-validation

to select the most important features. The number of

features selected is tuned automatically by ﬁtting an

RFE selector to the different cross-validation splits.

We used a RandomForestClassifier as the estimator that provides feature importance information, mainly because it is a preferred model in previous PCC studies (Wang et al., 2021; Ye et al., 2019) and because it exposes the feature importance scores required by the feature selection algorithm. Experiments were carried out in a Stratified K-Fold setting using 10 splits, and the minimum number of features to select was set to 10.
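The selection step described above can be sketched with scikit-learn as follows. The estimator, the 10 stratified splits and the minimum of 10 features come from the text; the synthetic data, the elimination step of 5, the number of trees, the F1 scoring and the seeds are assumptions chosen only to keep the sketch small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the patch feature matrix (hypothetical data;
# the study uses 490 features measured on labeled patches).
X, y = make_classification(
    n_samples=200, n_features=40, n_informative=8, random_state=0
)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=5,                      # assumed: drop 5 features per iteration
    min_features_to_select=10,   # from the text
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="f1",                # assumed scoring function
)
selector.fit(X, y)

# Column indices of the features kept by the selector.
selected = selector.get_support(indices=True)
print(f"{selector.n_features_} features kept: {selected.tolist()}")

# Reduced feature matrix that would be passed on to model training.
X_selected = selector.transform(X)
```

Changing `random_state` in the estimator or in the splitter may change the selected subset, which is the kind of run-to-run variation reported in Section 5.1.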

4.2 ML Models

The utilization of scikit-learn is motivated by its accessibility, robust performance, and inclusion of well-established, reliable models, which allowed us to execute our experiments easily and effectively. The following 9 models were used in part of our experiments (Figure 1 (a)): DecisionTreeClassifier, GaussianNB, KNeighborsClassifier, LinearDiscriminantAnalysis, LogisticRegression, MLPClassifier, RandomForestClassifier, SGDClassifier and SVC. The description of each of these models can be found in the official documentation of the scikit-learn library (Scikit-learn documentation, 2024). The concatenated features form the input for these models.
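A minimal sketch of training the nine listed classifiers repeatedly and summarizing the MIN, MEAN and MAX F1 values is given below. The model classes are the ones named above; the synthetic data, the 80/20 stratified split per run, the seeds and the hyperparameters (`max_iter`) are assumptions, since the exact training protocol is only available in the online appendix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_models(seed):
    """Fresh instances of the nine classifiers for one training run."""
    return {
        "DecisionTree": DecisionTreeClassifier(random_state=seed),
        "GaussianNB": GaussianNB(),
        "KNeighbors": KNeighborsClassifier(),
        "LDA": LinearDiscriminantAnalysis(),
        "LogRegression": LogisticRegression(max_iter=1000, random_state=seed),
        "MLPClassifier": MLPClassifier(max_iter=300, random_state=seed),
        "RandomForest": RandomForestClassifier(random_state=seed),
        "SGDClassifier": SGDClassifier(random_state=seed),
        "SVC": SVC(random_state=seed),
    }


# Hypothetical data: 43 columns mimic the selected feature set,
# label 1 stands for "overfitting" (the positive class).
X, y = make_classification(
    n_samples=300, n_features=43, n_informative=10, random_state=0
)

scores = {name: [] for name in make_models(0)}
for seed in range(10):  # 10 independent trainings
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    for name, model in make_models(seed).items():
        model.fit(X_tr, y_tr)
        scores[name].append(f1_score(y_te, model.predict(X_te)))

summary = {
    name: (min(vals), float(np.mean(vals)), max(vals))
    for name, vals in scores.items()
}
for name, (lo, mean, hi) in summary.items():
    print(f"{name:14s} MIN={lo:.2f} MEAN={mean:.2f} MAX={hi:.2f}")
```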

To further enhance the performance, in addition to built-in ML models, we built a neural network using PyTorch 2.1.2. It is able to combine features with learned embeddings as suggested by Tian et al. (Tian et al., 2023). The approach can be observed in Figure 1 (b). Note that this model operates on the already selected features, but treats embedding dimensions and numeric features separately and concatenates them dynamically. The gist of this approach is that the neural architecture can learn the weighting of each feature and is able to infer more complex relations compared to naive concatenation. The implementation of this model is available in the online appendix of this paper (onl, 2024).
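The idea of Figure 1 (b) can be sketched in PyTorch as a two-branch network. This is not the authors' architecture (which is only given in the online appendix): the class name `TwoBranchNet`, the layer sizes, the activations and the random inputs are all hypothetical. The split into 10 static and 33 embedding inputs follows the counts of the 43 selected features in Table 2.

```python
import torch
from torch import nn


class TwoBranchNet(nn.Module):
    """Hypothetical two-branch model: separate dense layers, then concatenation."""

    def __init__(self, n_static, n_embedding, hidden=16):
        super().__init__()
        # Dense layers for the selected static features.
        self.static_branch = nn.Sequential(nn.Linear(n_static, hidden), nn.ReLU())
        # Dense layers for the selected embedding dimensions.
        self.embedding_branch = nn.Sequential(
            nn.Linear(n_embedding, hidden), nn.ReLU()
        )
        # Dense layers applied to the concatenated representation.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # output in [0, 1]
        )

    def forward(self, static_x, embedding_x):
        merged = torch.cat(
            [self.static_branch(static_x), self.embedding_branch(embedding_x)],
            dim=1,
        )
        return self.head(merged).squeeze(1)


torch.manual_seed(0)
model = TwoBranchNet(n_static=10, n_embedding=33)

# A batch of 4 hypothetical patches.
static_x = torch.randn(4, 10)
embedding_x = torch.randn(4, 33)
probs = model(static_x, embedding_x)
print(probs)
```

Because each branch has its own weights before the concatenation, the network can scale the two feature groups differently, which plain vector concatenation in front of a single classifier does not do explicitly.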

4.3 Evaluation

Previous studies have underscored the importance of

PCC techniques in avoiding the dismissal of correct

patches (as they are quite expensive to generate in the

ﬁrst place) (Yu et al., 2019). Consequently, we pro-

pose that a PCC technique is deemed effective if it

produces minimal false positives while maintaining a

high recall rate. To quantify our results, we introduce

these metrics in the following:

• True Positive: An overﬁtting patch is accurately identi-

ﬁed.

• False Positive: A correct patch identiﬁed as overﬁtting.

• False Negative: An overﬁtting patch identiﬁed as correct.

• True Negative: A correct patch identiﬁed as correct.

Using the above items, precision, recall, and F-measure can be computed. Precision is the proportion of correctly classified overfitting patches among all patches classified as overfitting, while recall is the proportion of correctly classified overfitting patches among all overfitting patches. They are computed as:

\[ \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN} \]

The F-measure can be defined by the two metrics above:

\[ F_\beta = \frac{(\beta^2 + 1) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} \]

β signifies the relative importance of precision and recall. If we want precision and recall to weigh in with exactly the same importance, we simply assign the value 1 to β.
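The definitions above can be checked on a small worked example. In the sketch below, label 1 denotes an overfitting patch (the positive class) and label 0 a correct patch; the function name and the six example labels are made up for illustration.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Precision, recall and F-beta with 'overfitting' (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # overfitting, detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # correct, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # overfitting, missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    fbeta = (beta**2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Hypothetical labels for six patches: TP=3, FP=1, FN=1, TN=1.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]

precision, recall, f1 = precision_recall_fbeta(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here one correct patch is wrongly flagged as overfitting (a false positive), which is exactly the case a PCC technique should keep rare, since it dismisses a patch that was expensive to generate.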

5 RESULTS

5.1 RQ1: Feature Selection

To examine the effect of each feature, we first carried out 10 independent trainings and executed the feature selection algorithm described in Section 4.1. The results of the feature selection can be observed in Table 2. The table includes the feature set that yields the best results (cells of color gray) and the intersection of the 10 independent trainings (cells of color violet). In the performed experiments, we found that the output of the feature selection algorithm varies greatly due to the effect of random factors during the training phase. Despite these differences, it is clear that most features can be omitted, and that hand-crafted ones form the most important subset of such features. It is also interesting to observe that some embedding dimensions hold more valuable information than others.

Table 2: Features selected using the RFECV algorithm. Gray: features that yield the best performance for a single execution among the 10 feature selections. Violet: intersection of all the features that were selected in the 10 feature optimization turns. The last row gives, per category, the total number of features / the number in the best set / the number in the intersection.

| Hand-crafted | Engineered | Distances | Embeddings | Embeddings (cont.) | Embeddings (cont.) |
|---|---|---|---|---|---|
| s3-tool | patchedFileNo | cosine_distance | vec_dim_0 | vec_dim_62 | vec_dim_182 |
| AST-tool | addLineNo | braycurtis_distance | vec_dim_5 | ... | vec_dim_183 |
| Cosine-tool | rmLineNo | canberra_distance | vec_dim_6 | vec_dim_84 | vec_dim_184 |
| s3variable-tool | insertIfFalse | chebyshev_distance | vec_dim_7 | ... | vec_dim_185 |
| variable-tool | updIfFalse | cityblock_distance | vec_dim_8 | vec_dim_90 | vec_dim_186 |
| syntax-tool | ifFalse | euclidean_distance | ... | ... | ... |
| semantic-tool | dupArgsInvocation | minkowski_distance | vec_dim_12 | vec_dim_108 | vec_dim_192 |
| structural_score | removeNullinCond | seuclidean_distance | ... | vec_dim_109 | ... |
| conceptual_score | condLogicReduce | | vec_dim_21 | vec_dim_110 | vec_dim_200 |
| | insertBooleanLiteral | | ... | vec_dim_111 | ... |
| | insertNewConstLiteral | | vec_dim_27 | ... | vec_dim_208 |
| | UpdateLiteral | | ... | vec_dim_130 | vec_dim_209 |
| | wrapsTryCatch | | vec_dim_31 | ... | vec_dim_210 |
| | ... | | vec_dim_32 | vec_dim_144 | vec_dim_211 |
| | P4J_LATER_MEMBER_VF | | ... | vec_dim_145 | ... |
| | P4J_LATER_MODIFIED_SIMILAR_VF | | vec_dim_43 | ... | vec_dim_220 |
| | P4J_LATER_MODIFIED_VF | | ... | vec_dim_161 | vec_dim_221 |
| | P4J_LATER_NONZERO_CONST_VF | | vec_dim_51 | ... | ... |
| | P4J_LATER_OP_ADD_AF | | vec_dim_52 | vec_dim_167 | vec_dim_225 |
| | ... | | ... | ... | ... |
| | S6_METHOD_THROWS_EXCEPTION | | vec_dim_58 | | vec_dim_243 |
| 9 / 7 / 7 | 217 / 2 / 2 | 8 / 1 / 0 | 256 / 33 / 6 | | Overall: 490 / 43 / 15 |

Answer to RQ1. Based on our experiments with the used ML models and parameter configurations, some features are not beneficial: omitting them does not affect the results negatively; on the contrary, precision and recall improved in the best scenarios. Overall, 43 features have been selected by a single run and 15 joint features have been identified across all 10 independent trainings.
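The two feature sets discussed here, the best single run and the intersection across runs, can be derived from the per-run selections as sketched below. The feature names are taken from Table 2, but the three runs, their contents and their scores are invented for illustration; the study uses 10 runs.

```python
# Each entry: (features selected in one run, score of that run).
# Hypothetical selections and scores, for illustration only.
runs = [
    ({"s3-tool", "rmLineNo", "vec_dim_5", "cosine_distance"}, 0.78),
    ({"s3-tool", "rmLineNo", "vec_dim_7"}, 0.81),
    ({"s3-tool", "rmLineNo", "vec_dim_5", "AST-tool"}, 0.75),
]

# Features that every run agreed on (the "intersection" set).
joint_features = set.intersection(*(features for features, _ in runs))

# Feature set of the single best-scoring run (the "best" set).
best_features, best_score = max(runs, key=lambda run: run[1])

print(sorted(joint_features))  # ['rmLineNo', 's3-tool']
print(sorted(best_features), best_score)
```

The intersection can only shrink as more runs are added, which is consistent with it being much smaller (15 features) than the best single selection (43 features).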

To further investigate each feature subset, 9 classifiers have been trained on each. The results are listed in Table 3. What we can see is that the MIN and MAX values vary greatly in all of the feature sets. Despite the deviations, it is evident that some features can be omitted without any negative consequences and that on average RFECV_intersect yielded the best result in Precision, while RFECV_best did so in Recall and F1, thus the latter is more suitable for PCC. Certain embedding dimensions appear to contain valuable information; however, the model fails to encompass all necessary components. Sole reliance on embeddings led to a decrease in classifier performance. A future research direction could be to investigate which (if any) embedding dimension is equivalent to which hand-crafted or engineered feature.

The selected engineered features rmLineNo and P4J_LATER_NONZERO_CONST_VF seem to grasp an important aspect of PCC, as these are selected in all feature selection attempts. Together with the hand-crafted features, these form the most essential part of the features. Table 3 supports this observation, as the Engineered_plus subset yields only slightly lower F1 values than the optimized sets. However, it should be noted that Engineered features only bring an additional absolute growth of 1% on average, which can also be attributed to random factors. Random interplay is reflected in huge differences in performance in our experiments. This is not unique to PCC; for example, if we consider the subset of the distance metrics, it achieved 0% precision in the worst-case scenario and 100% precision in the best case (while being quite moderate on average). We did not explore the random effects on the embedding model but hypothesize that they might have a similar impact. By selecting alternative features, not limited to embeddings and distances, one may potentially mitigate this effect.

Table 3: Measures on various feature subsets.

| Group | Feature subset | F1 | Prec | Recall |
|---|---|---|---|---|
| MIN | All | .62 | .65 | .59 |
| MIN | RFECV_best | .65 | .69 | .59 |
| MIN | RFECV_intersect | .58 | .58 | .50 |
| MIN | Distances | .00 | .00 | .00 |
| MIN | Embeddings | .54 | .51 | .45 |
| MIN | Engineered | .56 | .52 | .55 |
| MIN | Engineered_plus | .63 | .63 | .61 |
| MIN | Hand-crafted | .17 | .35 | .10 |
| MEAN | All | .80 | .81 | .82 |
| MEAN | RFECV_best | .81 | .84 | .78 |
| MEAN | RFECV_intersect | .80 | .85 | .76 |
| MEAN | Distances | .18 | .59 | .11 |
| MEAN | Embeddings | .70 | .69 | .71 |
| MEAN | Engineered | .76 | .76 | .76 |
| MEAN | Engineered_plus | .77 | .77 | .78 |
| MEAN | Hand-crafted | .42 | .57 | .35 |
| MAX | All | .92 | .96 | .96 |
| MAX | RFECV_best | .91 | 1.00 | .97 |
| MAX | RFECV_intersect | .91 | 1.00 | .90 |
| MAX | Distances | .41 | 1.00 | .31 |
| MAX | Embeddings | .81 | .88 | .90 |
| MAX | Engineered | .89 | .95 | .93 |
| MAX | Engineered_plus | .90 | .96 | .93 |
| MAX | Hand-crafted | .59 | .89 | .61 |

The grouped rows (MIN, MEAN, MAX) indicate the minimum, average and maximum values of each metric we used in the 10 independent trainings. Each subset contains the following: All (all 490 features), RFECV_best (43 features from the feature selection algorithm), RFECV_intersect (the 15 joint features that were selected in all 10 runs), Distances (the 8 distance metrics), Embeddings (256 dimensions of the embedded code vectors), Hand-crafted (9 hand-crafted features from previous studies), Engineered (217 features from ODS) and Engineered_plus (static features comprising hand-crafted and engineered ones: Engineered_plus = Engineered ∪ Hand-crafted).

5.2 RQ2: Model Selection

Prior studies on PCC (Ye et al., 2019; Wang et al., 2021) have exhibited a predilection for Random Forest as the classifier of choice. Additionally, it has been widely adopted in addressing various Software Engineering-related issues due to its demonstrated efficacy across diverse tasks (Bludau and Pretschner, 2022; Bowes et al., 2018). These experiences drove our intuition to use Random Forest in feature selection, but they also raise the question of whether other classifiers might outperform it. In the subsequent experiment, we trained 9 classifiers 10 times each to obtain the results presented in Table 4. We used the RFECV_best feature set obtained in the previous section on all observed models. What we can see is that on average the MLPClassifier is the most balanced: the F1 metric reaches its highest mean value here. Also, apart from the GaussianNB classifier (which is insufficient in terms of Precision), MLP provides the highest Recall values.

Answer to RQ2. While GaussianNB consistently produced the highest Recall values, its precision in detecting overfitting patches is insufficient. On the other hand, as previous studies suggested, RandomForest consistently provides reliable results; however, our findings indicate that MLPClassifier outperforms it by a small margin.

Relying only on the MAX values would undermine the findings of our paper, thus we try to see the whole picture and look for a classifier that works well in real-life scenarios most of the time (even in the worst case). What we can see in Table 4 is that the MLPClassifier also performs reasonably well compared to other models in the MIN case. On the other hand, the motivation behind the use of Random Forest is understandable: it provides a well-explainable output with moderate training costs. The benefit of the MLPClassifier might lie in its flexibility. As the Multi-layer Perceptron is a built-in model within scikit-learn with limited possibilities for customization, building a neural network from scratch and including domain-specific knowledge might add additional value to this model. While in this section we treated every feature equally and combined them naively (i.e. forming a feature vector which includes all the features of a subset), in the following we explore the possibility of combining the selected features dynamically by expanding the MLPClassifier and implementing a Neural Network in PyTorch.

5.3 RQ3: Potential Improvements

As depicted in Figure 1 (b) and described in Section 4.2, we further try to enhance the performance. In the previous experiments we already identified the RFECV_best feature set as the 43 features worth training on, and the MLPClassifier as the model of choice due to its flexible nature and reliable outcomes. A Neural Network has been constructed that learns a deep representation of the input features; the measured results are shown in Table 5. Due to space limitations the exact architecture is not detailed here, but the interested reader can find it in the online appendix (onl, 2024). The network has been trained using k-fold cross validation; when the validation loss became smaller than a certain epsilon, the training process was stopped (early stopping), thus reducing the chance of overfitting. Having a Neural Network gives the possibility to weight input features: apart from filtering out unnecessary features, this approach can give different weights to features depending on how important they are. Similarly to the previous experiments, the model was trained and evaluated 10 times, thus we display the MIN, MEAN, and MAX values of each metric. It is evident that the metric values did not improve on average (or at least not to an extent that could not be attributed to random factors). On the other hand, the stability of the approach improved: the previous absolute deviation of 30% in the F1 score has been reduced to 16%, thus making the model much more reliable than before. These ML predictors are complementary to other state-of-the-art methods and similar to them in filtering out patches generated by APR tools (Tian et al. (Tian et al., 2023) 79%, Wang et al. (Wang et al., 2021) 87% F1 score).
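The stability figures quoted above follow from the MIN and MAX F1 values in Tables 4 and 5. The short calculation below reproduces them; reading "absolute deviation" as the difference between the MAX and MIN F1 over the 10 trainings is our interpretation, as the text does not define the term explicitly.

```python
# F1 values of the MLPClassifier on RFECV_best (Table 4).
mlp_f1_min, mlp_f1_max = 0.62, 0.92

# F1 values of the deep representation model (Table 5).
deep_f1_min, deep_f1_max = 0.7273, 0.8889

# Spread between the best and the worst of the 10 trainings.
mlp_spread = mlp_f1_max - mlp_f1_min
deep_spread = deep_f1_max - deep_f1_min

print(f"MLPClassifier spread: {mlp_spread:.2f}")         # 0.30
print(f"Deep representation spread: {deep_spread:.2f}")  # 0.16
```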

Ensemble Learning (Polikar, 2012) and Majority

Voting (Penrose, 1946) are both techniques employed

Table 4: Evaluation of the RFECV

best

feature set on 9 ML

classiﬁers.

F1 Prec Recall

DecisionTree .46 .45 .45

GaussianNB .44 .30 .82

KNeighbors .45 .43 .46

LDA .52 .46 .54

MIN LogRegression .62 .59 .62

MLPClassiﬁer .62 .65 .59

RandomForest .61 .70 .48

SGDClassiﬁer .55 .53 .52

SVC .60 .60 .52

DecisionTree .64 .64 .65

GaussianNB .51 .35 .94

KNeighbors .70 .69 .71

LDA .67 .60 .76

MEAN LogRegression .74 .69 .79

MLPClassiﬁer .80 .81 .82

RandomForest .77 .86 .71

SGDClassiﬁer .71 .68 .75

SVC .76 .74 .79

DecisionTree .79 .81 .89

GaussianNB .56 .39 1.00

KNeighbors .83 .85 .90

LDA .80 .72 .97

MAX LogRegression .88 .89 1.00

MLPClassiﬁer .92 .96 .96

RandomForest .91 1.00 .93

SGDClassiﬁer .84 .89 .93

SVC .89 .88 .97

Feature Extraction, Learning and Selection in Support of Patch Correctness Assessment

Table 5: Evaluation of Deep Representation Learning.

        F1        Precision   Recall
MIN     72.73%    69.70%      68.97%
MEAN    81.92%    80.77%      83.68%
MAX     88.89%    91.67%      96.55%

in ML to enhance predictive performance by combining the outputs of multiple (usually weak) individual models. Through the aggregation of predictions, majority voting leverages the collective wisdom of diverse models to make decisions. As the decision of a Random Forest is obtained by majority voting of the individual trees, it alone can be treated as a Majority Voting approach. However, several recent approaches integrate the learned models either by Ensemble Learning or by Majority Voting strategies. To investigate the performance of such approaches, we also combined the outputs of the nine observed ML models by weighting them. We call this method stacking, and it is based on a StackingRegressor (Scikit-learn StackingRegressor, 2024). Stacked generalization consists of stacking the outputs of the individual estimators and using a regressor to compute the final prediction.
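As an illustration of the voting idea described above, the following sketch applies hard majority voting to the labels that several models predict for the same patches. It is a toy example: the function name, the label encoding (1 = overfitting, 0 = correct) and the tie-breaking rule are our assumptions and do not come from the study.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample.

    predictions_per_model: list of equally long label lists, one per model.
    Ties are broken in favour of the smallest label (an arbitrary choice).
    """
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, n in counts.items() if n == top))
    return combined


# Three hypothetical models voting on four patches.
model_outputs = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
final_labels = majority_vote(model_outputs)
```

Stacking differs from this scheme in that the combination rule is itself learned from the base models' outputs instead of being a fixed vote count.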

The results of this approach can be observed in Table 6. Stacking allows us to use the strength of each individual estimator by using their outputs as the input of a final estimator. Although the F1 score and Precision improved on average, Recall decreased, making this method unsuitable for PCC. Another unfavorable observation is that the deviation of all three metrics has doubled compared to the previous measurement. During the experiment, we noticed that the results are close to the ones obtained with MLP. After further investigation, we found that the algorithm assigns most of its weights to the MLP (on average 26%), Random Forest (38%) and SVC (32%) classifiers and relies only negligibly on the other models. This observation suggests that the machine learning models are equally confused by the remaining incorrectly classified samples, whereas the correctly classified examples are largely identical across models. The underlying reason might be data quality, an inaccurate oracle (human error in classification), imbalanced data, suboptimal network architecture and parameterization, etc. Investigating this problem might be a separate research direction for future work.
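To make the weighting concrete, the sketch below combines per-model scores with the average weights reported above for the MLP, Random Forest and SVC classifiers. It is a simplification and not the StackingRegressor used in the study: the six remaining models are left out because their weights are negligible, the weights are renormalized to sum to one, and the 0.5 decision threshold and all names are assumptions made for illustration.

```python
# Average weights reported in the text; the other six models are ignored here.
WEIGHTS = {"mlp": 0.26, "random_forest": 0.38, "svc": 0.32}


def stacked_score(scores, weights=WEIGHTS):
    """Weighted average of the base models' scores for one patch."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


def classify(scores, threshold=0.5):
    """Return 1 if the combined score reaches the (assumed) threshold."""
    return int(stacked_score(scores) >= threshold)


# Hypothetical scores of the three base models for a single patch.
example = {"mlp": 0.9, "random_forest": 0.4, "svc": 0.7}
combined = stacked_score(example)
```

Because three models carry almost all of the weight, the combined score can only differ from theirs on samples where they disagree, which is consistent with the stacked results staying close to those of the MLP.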

Table 6: Results showcasing the stacked performance of the 9 ML models.

        F1        Precision   Recall
MIN     64.15%    66.67%      58.62%
MEAN    82.38%    85.83%      79.68%
MAX     93.10%    96.30%      93.10%

Answer to RQ3: Stability can be improved to a certain extent; however, the improvements may not suffice to ensure consistently reliable outcomes. Neither Deep Representation Learning nor Stacking improved the filtering of overfitting patches, although the former yields similar results with a smaller deviation across trainings.

6 DISCUSSION AND THREATS TO VALIDITY

In this study, our primary emphasis lay in optimizing the features utilized within machine learning models, with a secondary focus on enhancing performance and stability through deep learning techniques. We only examined the domain of PCC; however, the instability we observed might be a general problem in Software Engineering research using ML. As we have seen, features significantly influence both the performance and stability of the applied machine learning model; however, careful construction of a neural architecture may also enhance stability. Based on our method, we contend that improved practices can be established for the publication of machine learning applications and the assessment of their stability in APR and also in the wider domain of Software Engineering.

An interesting insight arises from the study of Wang et al. (Wang et al., 2021), where they achieved 87.01% Precision and 89.14% Recall using only hand-crafted features and a Random Forest classifier, contrasting sharply with our own results of 57% Precision and 35% Recall on average using the same features. While our figures closely approximate Wang et al.'s results in the best-case scenario (MAX), their reported results remain unreproducible. Another factor to consider is the choice of library; the aforementioned article utilized the Weka app (Weka Website, 2024), whose graphical interface inadvertently undermines reproducibility, unlike our use of the scikit-learn library. Reproducibility can be significantly improved by sharing a reproduction package; however, the applicability of the proposed model remains limited without evaluating its stability. Notably, the selected feature set RFECV_best yields more reliable results than previous iterations, and the constructed neural networks contribute to the stability and reproducibility of our study. Additionally, the online appendix of this paper offers full reproducibility of the experiments conducted (onl, 2024).

A possible threat to the validity of our research relates to the benchmark we used. We only utilized the dataset of Wang et al. (Wang et al., 2021) for evaluation; however, to the best of our knowledge, it is the most complete dataset in the literature for conducting such studies. We believe that this choice does not impact the assessment of feature selection. Another threat might be the effect of random initialization, which can have a huge impact on the resulting model. We tried to mitigate this effect by training the models in a cross-validation setting 10 times with different random seeds and averaging the results. Due to space constraints, the paper does not include the output of all the trainings, settings, and detailed experiments that were executed. However, our online appendix package (onl, 2024) ensures full reproducibility and provides detailed data on the research conducted.
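The mitigation described above, repeating cross-validation with different random seeds, can be sketched with the standard library alone. The number of folds, the seed values and the function name are assumptions made for illustration; the splits actually used in the study are defined in its online appendix.

```python
import random


def seeded_folds(n_samples, n_folds, seed):
    """Shuffle sample indices with a fixed seed and split them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


# 903 labeled patches, as in the dataset used in the paper; ten repetitions
# with hypothetical seeds 0..9 and an assumed 10-fold split.
runs = {seed: seeded_folds(903, 10, seed) for seed in range(10)}

# Re-running with the same seed reproduces exactly the same split.
assert seeded_folds(903, 10, 3) == runs[3]
```

Publishing the seeds alongside the code, as the authors do, is what makes each of the 10 trainings repeatable by others.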

7 RELATED WORK

Evaluating existing APR approaches is crucial, but assessing APR tools solely on plausible patches is inaccurate due to the overfitting issue inherent in test suite-based automatic patch generation. Identifying the correct patches among plausible ones requires additional developer effort. Recently, several approaches have been proposed to tackle the problem of Patch Correctness Check (Yang et al., 2023).

Feature-Based PCC

Liu et al. (Liu et al., 2021b) propose eight evaluation metrics for fairly assessing the performance of APR tools, in addition to providing a critical review of the existing evaluation of patch generation systems. The authors of (Wang et al., 2021) benchmark the state-of-the-art patch correctness techniques on the largest patch benchmark so far and gather the advantages and disadvantages of existing approaches. Tian et al. (Tian et al., 2020) use Doc2Vec, BERT, code2vec and CC2Vec to investigate the discriminative power of features. They claim that Logistic Regression with BERT embeddings scored 0.72 F-Measure and 0.8 AUC on a labeled dataset of 1,000 patches. Opad (Yang et al., 2017) (Overfitted Patch Detection) is another tool that aims to filter out incorrect patches. Opad uses fuzz testing to generate new test cases and employs two test oracles to enhance the validity checking of automatically generated patches. An anti-pattern-based correctness check is also a viable approach (Tan et al., 2016). Syntactic or semantic metrics such as cosine similarity and output coverage (Le et al., 2017) can also be applied to measure similarity, as in the tool named Qlose (D’Antoni et al., 2016). A recent study (Ismayilzada et al., 2023) presents a new lightweight specification method that enhances failing tests with preservation conditions, ensuring that the patched and pre-patched versions produce identical outputs under specific conditions. It also introduces a differential fuzzer for efficient patch classification that surpasses four state-of-the-art approaches, and a user study shows a preference for the semi-automatic patch assessment method over manual assessment.

Dynamic PCC

Numerous studies (Mechtaev et al., 2015; White et al., 2019; Tufano et al., 2019) emphasize the importance of simplifying the generated repair patches. A recent study (Wang et al., 2019) found that 25.4% (45/177) of correct patches generated by APR techniques differ syntactically from those provided by a developer. Other methods, such as learning from human-written code (Le et al., 2016; Kim et al., 2013), have shown promise but have faced criticism (Monperrus, 2014). In other works (Csuvik et al., 2021), candidate patches were ranked according to their similarity to the original program and assessed as a recommendation system. Others have also used embedding techniques: some focus not only on the changed code but also take the unchanged correlated part into consideration (Lin et al., 2022), some measure the similarity between the patched method name and the semantic meaning of the method body (Phung et al., 2022), and some base their approach on the fact that similar failing test cases should require similar patches (Tian et al., 2022). The reliability of automated annotations for patch correctness has also been studied (Le et al., 2019). The authors compared these annotations with a gold standard of correctness labels for 189 patches, finding that although independent test suites may not suffice as effective APR oracles, they can augment author annotations. Meanwhile, Xiong et al. (Xiong et al., 2018) proposed leveraging behavior similarity in test case executions to determine correct patches. By improving test suites with new inputs and using behavior similarity, they prevented 56.3% of incorrect patches from being generated. Ortin et al. (Ortin et al., 2020) achieved better performance than the existing systems for source code classification using embeddings, even though their approach has not been used for APR. Another work that is widely used these days in ML for source code is CodeBERT, which has been employed in many different domains, including APR (Feng et al., 2020).
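Several of the approaches above rank or filter patches by the similarity of embedding vectors. A minimal sketch of such a ranking with cosine similarity is given below; the vectors are made-up three-dimensional stand-ins for real code embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_patches(original, candidates):
    """Sort candidate patch names by similarity to the original, best first."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(original, candidates[name]),
        reverse=True,
    )


original_vec = [1.0, 0.0, 1.0]
candidate_vecs = {
    "patch_a": [1.0, 0.1, 0.9],
    "patch_b": [0.0, 1.0, 0.0],
    "patch_c": [0.5, 0.5, 0.5],
}
ranking = rank_patches(original_vec, candidate_vecs)
```

In a real setting the vectors would come from an embedding model such as Doc2Vec or BERT, and the ranking would be presented to a developer as a recommendation list.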

This Study

In this work, we mainly focused on optimizing the features used in ML models and partly on enhancing the performance with deep learning. We used data from the study of Wang et al. (Wang et al., 2021) and the engineered features from ODS (Ye et al., 2019). Our work is fundamentally different from these: (1) we do not concentrate on the definition of features but rather use existing ones, (2) all available features are treated as complements to each other, and (3) the goal was to obtain a cross-research feature set that is best suited to PCC. In a recent study, Tian et al. (Tian et al., 2023) already proposed a deep combination of features, but their approach (1) does not apply feature selection, (2) does not train diverse ML models, and (3) uses a static feature set that is not as thorough as the ones we have presented. We also publish the training dataset along with the fixed seeds and the source code of the experiments (which is only partially true for previous studies).

8 CONCLUSIONS

In this study, we acquired 490 features, comprising

both engineered features and code embeddings, to ad-

dress Patch Correctness Check. Initially, a feature se-

lection algorithm was used to extract 43 features from

the extensive feature set, indicating the limited infor-

mational contribution of most original features. Sub-

sequently, we conducted training and evaluation of

nine machine learning models to discern the optimal

performer. To counteract random factors, each model

underwent 10 training iterations using different ran-

dom seeds. Our ﬁndings suggest better performance

of the models on average when utilizing the selected

feature set in comparison to the entire feature set or

other subsets. Among the models examined, Multi-

Layer Perceptron (MLP) and Random Forest consis-

tently exhibited the most reliable results, achieving

average F1 scores of 0.8 and 0.77, respectively. How-

ever, due to random factors, the MLP score ﬂuctu-

ated to 0.62 in unfavorable cases or peaked at 0.92

in fortuitous circumstances. Employing a more com-

plex neural architecture that integrates learned em-

beddings with other features enabled us to mitigate

this variability, reducing the absolute ﬂuctuation in

the F1 score from 30% to 16%.

Our research underscores two major implica-

tions. First, the development of PCC classiﬁers re-

quires careful planning of both feature selection and

model construction; while hand-crafted features remain

paramount, embeddings may also contain useful

information. Second, machine learning methodolo-

gies must prioritize model stability, as it profoundly

inﬂuences the validity and signiﬁcance of results.

ACKNOWLEDGEMENTS

The research presented in this paper was supported

in part by the ÚNKP-23-3-SZTE-435 New National

Excellence Program of the Ministry for Culture and

Innovation from the source of the National Research,

Development and Innovation Fund, and by the Euro-

pean Union project RRF-2.3.1-21-2022-00004 within

the framework of the Artiﬁcial Intelligence National

Laboratory. The national project TKP2021-NVA-09

also supported this work. Project no TKP2021-NVA-

09 has been implemented with the support provided

by the Ministry of Culture and Innovation of Hun-

gary from the National Research, Development and

Innovation Fund, ﬁnanced under the TKP2021-NVA

funding scheme.

REFERENCES

(2024). Supplemental material for "feature extraction,

learning and selection in support of patch cor-

rectness assessment". https://github.com/AAI-USZ/

PCC-2024.

Bludau, P. and Pretschner, A. (2022). Feature sets in just-in-

time defect prediction: an empirical evaluation. Pro-

ceedings of the 18th International Conference on Pre-

dictive Models and Data Analytics in Software Engi-

neering.

Bowes, D., Hall, T., and Petrić, J. (2018). Software defect

prediction: do different classiﬁers ﬁnd the same de-

fects? Software Quality Journal, 26(2):525–552.

Csuvik, V., Horvath, D., Horvath, F., and Vidacs, L. (2020).

Utilizing Source Code Embeddings to Identify Cor-

rect Patches. In 2020 IEEE 2nd International Work-

shop on Intelligent Bug Fixing (IBF), pages 18–25.

IEEE.

Csuvik, V., Horváth, D., Lajkó, M., and Vidács, L.

(2021). Exploring plausible patches using source

code embeddings in javascript. 2021 IEEE/ACM In-

ternational Workshop on Automated Program Repair

(APR), pages 11–18.

Dam, H. K., Tran, T., Pham, T., Ng, S. W., Grundy, J., and

Ghose, A. (2021). Automatic feature learning for pre-

dicting vulnerable software components. IEEE Trans-

actions on Software Engineering, 47(1):67–85.

D’Antoni, L., Samanta, R., and Singh, R. (2016). QLOSE:

Program repair with quantitative objectives. Technical

report.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong,

M., Shou, L., Liu, T., Jiang, D., and Zhou, M. (2020).

Codebert: A pre-trained model for programming and

natural languages.

Ismayilzada, E., Rahman, M. M. U., Kim, D., and Yi,

J. (2023). Poracle: Testing patches under preserva-

tion conditions to combat the overﬁtting problem of

program repair. ACM Trans. Softw. Eng. Methodol.,

33(2).


Just, R., Jalali, D., and Ernst, M. D. (2014). Defects4J: A

database of existing faults to enable controlled test-

ing studies for Java programs. In 2014 International

Symposium on Software Testing and Analysis, ISSTA

2014 - Proceedings, pages 437–440. Association for

Computing Machinery, Inc.

Kamei, Y., Shihab, E., Adams, B., Hassan, A. E., Mockus,

A., Sinha, A., and Ubayashi, N. (2013). A large-

scale empirical study of just-in-time quality assur-

ance. IEEE Transactions on Software Engineering,

39(6):757–773.

Kechagia, M., Mechtaev, S., Sarro, F., and Harman, M.

(2022). Evaluating automatic program repair capa-

bilities to repair api misuses. IEEE Transactions on

Software Engineering, 48(7):2658–2679.

Kim, D., Nam, J., Song, J., and Kim, S. (2013). Auto-

matic patch generation learned from human-written

patches. In Proceedings - International Conference

on Software Engineering, pages 802–811. IEEE.

Kim, Y., Mun, S., Yoo, S., and Kim, M. (2019). Pre-

cise learn-to-rank fault localization using dynamic and

static features of target programs. ACM Trans. Softw.

Eng. Methodol., 28(4).

Le, D. X. B., Bao, L., Lo, D., Xia, X., Li, S., and Pasareanu,

C. (2019). On Reliability of Patch Correctness Assess-

ment. In Proceedings - International Conference on

Software Engineering, volume 2019-May, pages 524–

535. IEEE Computer Society.

Le, Q. V. and Mikolov, T. (2014). Distributed Representa-

tions of Sentences and Documents. Technical report.

Le, X.-B. D., Chu, D.-H., Lo, D., Le Goues, C., and Visser,

W. (2017). S3: syntax- and semantic-guided repair

synthesis via programming by examples. In Pro-

ceedings of the 2017 11th Joint Meeting on Founda-

tions of Software Engineering, ESEC/FSE 2017, page

593–604, New York, NY, USA. Association for Com-

puting Machinery.

Le, X. B. D., Lo, D., and Goues, C. L. (2016). History

Driven Program Repair. In 2016 IEEE 23rd Inter-

national Conference on Software Analysis, Evolution,

and Reengineering (SANER), pages 213–224. IEEE.

Lin, B., Wang, S., Wen, M., and Mao, X. (2022).

Context-aware code change embedding for better

patch correctness assessment. ACM Trans. Softw. Eng.

Methodol., 31(3).

Liu, C., Gao, C., Xia, X., Lo, D., Grundy, J., and Yang,

X. (2021a). On the reproducibility and replicability

of deep learning in software engineering. ACM Trans.

Softw. Eng. Methodol., 31(1).

Liu, K., Li, L., Koyuncu, A., Kim, D., Liu, Z., Klein, J.,

and Bissyandé, T. F. (2021b). A critical review on

the evaluation of automated program repair systems.

Journal of Systems and Software, 171:110817.

Lutellier, T., Pang, L., Pham, V. H., Wei, M., and Tan, L.

(2019). ENCORE: Ensemble Learning using Convo-

lution Neural Machine Translation for Automatic Pro-

gram Repair.

Mechtaev, S., Yi, J., and Roychoudhury, A. (2015). Direct-

Fix: Looking for Simple Program Repairs. In 2015

IEEE/ACM 37th IEEE International Conference on

Software Engineering, pages 448–458. IEEE.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean,

J. (2013). Distributed Representations of Words and

Phrases and their Compositionality. Technical report.

Monperrus, M. (2014). A critical review of "auto-

matic patch generation learned from human-written

patches": essay on the problem statement and the eval-

uation of automatic software repair. In Proceedings of

the 36th International Conference on Software Engi-

neering - ICSE 2014, pages 234–242, New York, New

York, USA. ACM Press.

Ortin, F., Rodriguez-Prieto, O., Pascual, N., and Garcia, M.

(2020). Heterogeneous tree structure classiﬁcation to

label java programmers according to their expertise

level. Future Gener. Comput. Syst., 105(C):380–394.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer,

P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,

A., Cournapeau, D., Brucher, M., Perrot, M., and

Duchesnay, E. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

Penrose, L. S. (1946). The elementary statistics of major-

ity voting. Journal of the Royal Statistical Society,

109(1):53–57.

Phung, Q.-N., Kim, M., and Lee, E. (2022). Identifying

incorrect patches in program repair based on meaning

of source code. IEEE Access, 10:12012–12030.

Polikar, R. (2012). Ensemble learning. Ensemble machine

learning: Methods and applications, pages 1–34.

Rehurek, R. and Sojka, P. (2010). Software Framework for

Topic Modelling with Large Corpora. Proceedings of

the LREC 2010 Workshop on New Challenges for NLP

Frameworks, pages 45–50.

RFECV documentation (2024). Rfecv documentation.

https://scikit-learn.org/stable/modules/generated/

sklearn.feature_selection.RFECV.html.

Sarangi, S., Sahidullah, M., and Saha, G. (2020). Optimiza-

tion of data-driven ﬁlterbank for automatic speaker

veriﬁcation. Digital Signal Processing, 104:102795.

Scikit-learn documentation (2024). Scikit-learn documen-

tation. https://scikit-learn.org/stable/user_guide.html.

Scikit-learn StackingRegressor (2024). Scikit-learn stack-

ingregressor. https://scikit-learn.org/stable/modules/

generated/sklearn.ensemble.StackingRegressor.html.

Seidel, E. L., Sibghat, H., Chaudhuri, K., Weimer, W., and

Jhala, R. (2017). Learning to blame: localizing novice

type errors with data-driven diagnosis. Proc. ACM

Program. Lang., 1(OOPSLA).

Sobreira, V., Durieux, T., Delﬁm, F. M., Martin, M.,

and de Almeida Maia, M. (2018). Dissection of

a bug dataset: Anatomy of 395 patches from de-

fects4j. 2018 IEEE 25th International Conference

on Software Analysis, Evolution and Reengineering

(SANER), pages 130–140.

Tan, S. H., Yoshida, H., Prasad, M. R., and Roychoudhury,

A. (2016). Anti-patterns in search-based program re-

pair. In Proceedings of the 2016 24th ACM SIGSOFT

International Symposium on Foundations of Software


Engineering - FSE 2016, pages 727–738, New York,

New York, USA. ACM Press.

Tian, H., Li, Y., Pian, W., Kaboré, A. K., Liu, K., Habib,

A., Klein, J., and Bissyandé, T. F. (2022). Predicting

patch correctness based on the similarity of failing test

cases. ACM Trans. Softw. Eng. Methodol., 31(4).

Tian, H., Liu, K., Kaboré, A. K., Koyuncu, A., Li, L., Klein,

J., and Bissyandé, T. F. (2020). Evaluating representa-

tion learning of code changes for predicting patch cor-

rectness in program repair. In 2020 35th IEEE/ACM

International Conference on Automated Software En-

gineering (ASE), pages 981–992.

Tian, H., Liu, K., Li, Y., Kaboré, A. K., Koyuncu, A.,

Habib, A., Li, L., Wen, J., Klein, J., and Bissyandé,

T. F. (2023). The best of both worlds: Combin-

ing learned embeddings with engineered features for

accurate prediction of correct patches. ACM Trans.

Softw. Eng. Methodol., 32(4).

Tufano, M., Pantiuchina, J., Watson, C., Bavota, G., and

Poshyvanyk, D. (2019). On learning meaningful code

changes via neural machine translation. In Proceed-

ings of the 41st International Conference on Software

Engineering, ICSE ’19, page 25–36. IEEE Press.

Wang, S., Wen, M., Chen, L., Yi, X., and Mao, X. (2019).

How Different Is It between Machine-Generated and

Developer-Provided Patches? : An Empirical Study

on the Correct Patches Generated by Automated Pro-

gram Repair Techniques. In International Sympo-

sium on Empirical Software Engineering and Mea-

surement, volume 2019-Septe. IEEE Computer Soci-

ety.

Wang, S., Wen, M., Lin, B., Wu, H., Qin, Y., Zou, D., Mao,

X., and Jin, H. (2021). Automated patch correctness

assessment: how far are we? In Proceedings of the

35th IEEE/ACM International Conference on Auto-

mated Software Engineering, ASE ’20, page 968–980,

New York, NY, USA. Association for Computing Ma-

chinery.

Weka Website (2024). Weka website. https://www.weka.io.

Wen, M., Chen, J., Wu, R., Hao, D., and Cheung, S.-

C. (2018). Context-aware patch generation for bet-

ter automated program repair. In Proceedings of the

40th International Conference on Software Engineer-

ing, ICSE ’18, page 1–11, New York, NY, USA. As-

sociation for Computing Machinery.

White, M., Tufano, M., Martinez, M., Monperrus, M., and

Poshyvanyk, D. (2019). Sorting and transforming pro-

gram repair ingredients via deep learning code sim-

ilarities. In Wang, X., Lo, D., and Shihab, E., edi-

tors, 26th IEEE International Conference on Software

Analysis, Evolution and Reengineering, SANER 2019,

Hangzhou, China, February 24-27, 2019, pages 479–

490. IEEE.

Xin, Q. and Reiss, S. P. (2017a). Identifying test-suite-

overﬁtted patches through test case generation. In

Proceedings of the 26th ACM SIGSOFT International

Symposium on Software Testing and Analysis, ISSTA

2017, page 226–236, New York, NY, USA. Associa-

tion for Computing Machinery.

Xin, Q. and Reiss, S. P. (2017b). Leveraging syntax-related

code for automated program repair. In 2017 32nd

IEEE/ACM International Conference on Automated

Software Engineering (ASE), pages 660–670.

Xiong, Y., Liu, X., Zeng, M., Zhang, L., and Huang, G.

(2018). Identifying patch correctness in test-based

program repair. In Proceedings - International Con-

ference on Software Engineering, pages 789–799.

IEEE Computer Society.

Yang, J., Wang, Y., Lou, Y., Wen, M., and Zhang, L. (2023).

A large-scale empirical review of patch correctness

checking approaches. In Proceedings of the 31st ACM

Joint European Software Engineering Conference and

Symposium on the Foundations of Software Engineer-

ing, ESEC/FSE 2023, page 1203–1215, New York,

NY, USA. Association for Computing Machinery.

Yang, J., Zhikhartsev, A., Liu, Y., and Tan, L. (2017). Better

test cases for better automated program repair. Pro-

ceedings of the 2017 11th Joint Meeting on Foun-

dations of Software Engineering - ESEC/FSE 2017,

pages 831–841.

Ye, H., Gu, J., Martinez, M., Durieux, T., and Mar-

tin, M. (2019). Automated classiﬁcation of over-

ﬁtting patches with statically extracted code fea-

tures. IEEE Transactions on Software Engineering,

48:2920–2938.

Ye, H., Martinez, M., and Monperrus, M. (2021). Auto-

mated patch assessment for program repair at scale.

Empirical Softw. Engg., 26(2).

Yu, Z., Martinez, M., Danglot, B., Durieux, T., and Mon-

perrus, M. (2019). Alleviating patch overﬁtting with

automatic test generation: a study of feasibility and ef-

fectiveness for the nopol repair system. Empir. Softw.

Eng., 24(1):33–67.

ICSOFT 2024 - 19th International Conference on Software Technologies