Improve Classiﬁcation of Commits Maintenance Activities with

Quantitative Changes in Source Code

Richard V. R. Mariano

1 a

, Geanderson E. dos Santos

2 b

and Wladmir Cardoso Brand

1 c

Department of Computer Science, Pontiﬁcal Catholic University of Minas Gerais (PUC Minas), Belo Hozizonte, Brazil

Department of Computer Science, Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil

Keywords:

Software Maintenance, Quantitative Changes, Classiﬁcation, Machine Learning.

Abstract:

Software maintenance is an important stage of software development, contributing to the quality of the soft-

ware. Previous studies have shown that maintenance activities spend more than 40% of the development effort,

consuming most part of the software budget. Understanding how these activities are performed can support

managers to previously plan and allocate resources. Despite previous studies, there is still a lack of accu-

rate models to classify software commits into maintenance activities. In this work, we deepen our previous

work, in which we proposed improvements in one of the state-of-art techniques to classify software commits.

First, we include three additional features that concern the size of the commit, from the state-of-art technique.

Second, we propose the use of the XGBoost, one of the most advanced implementations of boosting tree al-

gorithms, and tends to outperform other machine learning models. Additionally, we present a deep analysis

of our model to understand their decisions. Our ﬁndings show that our model outperforms the state-of-art

technique achieving more than 77% of accuracy and more than 64% in the Kappa metric.

1 INTRODUCTION

The present article is an extension of a work presented

in (Mariano et al., 2019). The development of a soft-

ware project is marked by several stages. Thus, the

software maintenance is important to keep certain lev-

els of software quality (Gupta, 2017). Furthermore,

previous studies (Lientz et al., 1978; Schach et al.,

2003; Levin and Yehudai, 2016; Gupta, 2017; Levin

and Yehudai, 2017b) have already discussed that soft-

ware maintenance is the most costly step in soft-

ware projects. Studies with different approaches have

been present (Swanson, 1976; Mockus and Votta,

2000), seeking to understand the maintenance activ-

ities. This information can be helpful to professionals

in the area to plan and allocate their resources, reduce

efforts and cost, and execute projects more efﬁciently

to improve maintenance tasks, allowing greater cost-

beneﬁt in the development of the project. In Soft-

ware Engineering, we usually use a version control

system (VCS) to manage changes and evaluations in

the project. To understand this process, it is necessary

https://orcid.org/0000-0003-0351-1159

https://orcid.org/0000-0002-7571-6578

https://orcid.org/0000-0002-1523-1616

to classify these maintenance activities based on the

history of commits. In the software industry and sci-

entiﬁc community, three classiﬁcation categories are

widely used, proposed by Mockus and Votta (Mockus

and Votta, 2000):

- Adaptive: new features are added to the system.

- Corrective: both functional and non-functional

issues are ﬁxed.

- Perfective: system and its design are improved.

Previous works that investigated commit classiﬁca-

tion into maintenance activities use only commit texts

(through text analysis, such as word frequency) and

reported an average accuracy of below 60% when

evaluated within a unique project. Furthermore, it

performed an average accuracy below 53% when the

model was evaluated within several projects (Amor

et al., 2006; Hindle et al., 2009). The state-of-art

technique to classify commits was presented by Levin

and Yehudai (Levin and Yehudai, 2017b). The au-

thors proposed the use of Gradient Boosting Machine

(GBM) (Friedman, 2001; Caruana and Niculescu-

Mizil, 2006) and Random Forest (Breiman, 2001;

Hindle et al., 2009) learning algorithms and take into

account the commit text and source code changes

(e.g., statement added, method removed) as the

Mariano, R., Santos, G. and Brandão, W.

Improve Classiﬁcation of Commits Maintenance Activities with Quantitative Changes in Source Code.

DOI: 10.5220/0010401700190029

In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 2, pages 19-29

ISBN: 978-989-758-509-8

models’ features. The GBM shows an average

accuracy of 72% and a Kappa coefﬁcient of 57%.

The Random Forest presents an average accuracy

of 76% and a Kappa coefﬁcient of 63%. Despite

many efforts towards ﬁnding the best approach of

the commit classiﬁcation technique, highly accurate

models are still lacking. Also, many possible features

regarding commits performed by developers are not

taken into account in current models. In this research

paper, we studied the following problem statement:

Problem Statement: Current models to clas-

sify commits into maintenance activities do

not use all available features regarding com-

mits, which could potentially improve the ac-

curacy of the models.

In this article, we investigate how a state-of-art tech-

nique classiﬁes commits into maintenance activities.

We improve the results of the state-of-art technique by

calculating quantitative changes in the source code.

In particular, our goal is to achieve a model with

high accuracy and Cohen’s Kappa coefﬁcient. The

last metric is relevant in cases when the classiﬁca-

tion categories are not balanced, i.e., when there are

many more occurrences of one class in comparison to

others. This scenario could mislead the results due

to high accuracy; however, these results are due to

imbalanced labels. We base our work on the study

of Levin and Yehudai (Levin and Yehudai, 2017b),

which is the current state-of-art method to classify

commits into maintenance activities.

Aiming at improving both accuracy and Kappa

metrics of current classiﬁcation models, we pro-

pose three modiﬁcations to classify commits: (i) in-

clude the following information regarding quantita-

tive changes in source code as additional features: to-

tal lines of code added total lines of code deleted, per

commit, and the number of ﬁles changed, per com-

mit; and (ii) use XGBoost as one of the learning algo-

rithms. We then analyze the accuracy and kappa met-

rics of XGBoost and Random Forest, in comparison

to the GBM and Random Forest, respectively, used in

the previous paper (Levin and Yehudai, 2017b). We

use XGBoost in our work due to its recent beneﬁts

and advantages over other implementations of boost-

ing learning algorithms (Chen and Guestrin, 2015;

Chen and Guestrin, 2016). Hopefully, including the

three additional features and using XGBoost can in-

crease the model’s accuracy and Kappa coefﬁcient.

Third (iii), we propose the application of a more ac-

curate analysis of the dataset and used an algorithm

to select the best features of the model, and evaluate

the algorithm XGBoost (proposed on ii) in this new

dataset.

The main contribution of this work is the im-

provement of the classiﬁcation model using new soft-

ware features. We believe our results can assist the

development of a tool to classify commits on VCS

platforms, such as GitHub. Thus, the software tool

could help managers and stakeholders to track defec-

tive commits.

The remainder of this article is organized as fol-

lows. In Section 2, we present the related works.

Section 3 exhibits the proposed approach. Section 4

shows the experimental setup and results. Section 5

discuss the possible applications of our results. Fi-

nally, Section 6 concludes our work and presents di-

rections for future work.

2 RELATED WORK

2.1 Commit Classiﬁcation

Classifying software maintenance commits is an al-

ways relevant challenge in the literature. The rele-

vance given to maintenance activities is due to the

large consumption of resources in this stage of soft-

ware development. Swanson (Swanson, 1976) names

maintenance as the “iceberg” of software develop-

ment. His studies suggest that maintenance activity

can consume up to 40% of the effort spent to develop

the software, and this value tends to grow. These

spent efforts can prevent organizations from develop-

ing new products since the main objective is to keep

the software running. So understanding and classi-

fying maintenance activities can help prevent future

problems, make maintenance efﬁcient, and cut costs.

Taking into account the importance of classiﬁ-

cation, many previous studies, with different ap-

proaches, have attempted to propose accurate models

to classify commits into maintenance activities. Most

works are based on the commit messages, using text

analysis, such as word frequency approaches, to ﬁnd

speciﬁc keywords (Mockus and Votta, 2000; Fischer

et al., 2003; Sliwerski et al., 2005; Hindle et al., 2009;

Levin and Yehudai, 2016). For instance, Mockus and

Votta (Mockus and Votta, 2000) reported an average

accuracy of approximately 60% within the scope of a

single project.

Levin and Yehudai (Levin and Yehudai, 2017b)

combined keywords from commit comments and

source code changes to classify commits. The authors

used Gradient Boosting Machine (GBM) and Ran-

dom Forest as underlying learning algorithms. Their

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

results show that both algorithms present higher accu-

racies than previous studies, with Random Forest per-

forming better than GBM. However, both algorithms

present regular Kappa coefﬁcient values.

As we can note, previous studies that attempt to

classify commits into maintenance activities have not

taken into account many properties of the commits,

such as information about the size of the commit. Pre-

vious studies (Herraiz et al., 2006; Hattori and Lanza,

2008) show some properties of the commit concern-

ing its size, and that it can be used to help classiﬁca-

tion. In this work, we improve the model’s accuracy

and Cohen’s Kappa coefﬁcient (Cohen, 1960) in rela-

tion to the previous study that achieved the best results

so far (Levin and Yehudai, 2017b).

2.2 Model Explainability

Machine learning classiﬁcation techniques and mod-

els are often “black boxes”, where we are unable

to understand the reasons behind predictions. Previ-

ous studies have already emphasized the importance

of understanding the nature of these predictions and

making our models more reliable. A series of dif-

ferent methods have been proposed to solve this is-

sue (Lipovetsky and Conklin, 2001; Strumbelj and

Kononenko, 2013; Bach et al., 2015; Ribeiro et al.,

2016; Datta et al., 2016; Shrikumar et al., 2017).

The SHAP values propose in (Lundberg and Lee,

2017) uniﬁed proposed approaches, providing a sin-

gle method more effective than existing ones.

In many applications understanding the prediction

provided by a model can be as signiﬁcant as the ac-

curacy. More than provide predictions, usually it’s

important to know why the model makes some de-

cisions and what factors inﬂuence that. In software

development, have this information allow the devel-

opers to know the action that brings more efﬁciency

to maintenance activities. For researchers, this infor-

mation provides better choices of features and more

transparency about how to improve the model. Thus,

we also conducted a study to understand our model

and the features that have the most impact.

3 PROPOSED APPROACH

We propose an empirical study consisting of three

phases to achieve the goal. In the ﬁrst phase we per-

form a Literature Review to ﬁnd related works on

classiﬁcation of commits into maintenance activities.

From previous studies (Mockus and Votta, 2000; Fis-

cher et al., 2003; Sliwerski et al., 2005; Hindle et al.,

2009; Levin and Yehudai, 2016) we select the cur-

rent state-of-the-art (SOTA) model (Levin and Yehu-

dai, 2017b). Here, in this work, we propose a new

model to overcome accuracy and kappa coefﬁcient

provided by the SOTA work by including three ad-

ditional features and an implementation of XGBoost

learning algorithm.

In the second phase we Collect the Additional

Features from GitHub

via http requests to the

GitHub GRAPHQL API

. These features bring more

information on commits performed by developers and

may help in the classiﬁcation of commits into main-

tenance tasks. In Section 3.1, we explain how the la-

beled dataset was obtained from a previous study, in

Section 3.2 we detail how the new features may be

useful in distinguishing the commit categories, and in

Section 3.3 we present a deeper analysis of the data.

In the third and last phase of Experiments and

Analysis we replicate the results obtained by the

SOTA using the original dataset and the same split

and evaluation methods. In addition, we analyze the

impact of our new features and follow the experiments

with them. We split the dataset into training and test.

Then, for each algorithm, the cross-validation method

was used to tune the hyperparameters (Claesen and

De Moor, 2015). We evaluate the training dataset us-

ing 10-fold cross-validation (Kohavi, 1995), and the

average accuracy metric was reported.

Lastly, we focus on understanding the most rel-

evant feature groups, evaluate each group separately

and their combination using the XGBoost algorithm.

Particularly, we evaluate the impact of the main fea-

tures separately.

3.1 Labeled Commit Dataset

In this work, the dataset is composed of repositories

hosted on GitHub, which were selected in a previous

study (Levin and Yehudai, 2017b). The criteria to se-

lect the repositories following: (i) Use the Java pro-

gramming language; (ii) Have more than 100 stars;

(iii) Have more than 60 forks; (iv) Have their code

updated since 2016-01-01; (v) Created before 2015-

01-01; (vi) Had size over 2MB.

By following the mentioned criteria, 11 reposi-

tories remained in the ﬁnal dataset, representing a

wide domain of software projects, such as IDEs, dis-

tributed database and storage platforms, and integra-

tion frameworks. The repositories are: RxJava, In-

tellij Community, HBase, Drools, Kotlin, HAdoop,

Elasticsearch, Restlet, OrientDB, Camel e Spring

https://github.com/

https://developer.github.com/v4/

Improve Classiﬁcation of Commits Maintenance Activities with Quantitative Changes in Source Code

FrameWork. A better descripition of these reposito-

ries can be seen on (Levin and Yehudai, 2017b).

Levin and Yehudai (Levin and Yehudai, 2017b)

download and analyze each repository with its com-

mit history. For each of two subsequent commits c1

and c2 (c2 has been done right after c1), the source

code changes between them were identiﬁed and reg-

istered. These changes were ﬁrst proposed by (Fluri

and Gall, 2006) and they add up to 48 different types

of changes.

Furthermore, for each commit, the authors in-

spected its comment and searched for a speciﬁc set

of 20 keywords (as detailed in (Levin and Yehudai,

2017b)). The keywords are also part of the features

of the model. They are represented by a binary array

(of size 20) where each coordinate corresponds to a

keyword and the value “1” indicates the occurrence

of that keyword, while “0” indicates the absence.

Now, there are 68 features (48 + 20), correspond-

ing to source code changes and keyword occurrences,

respectively. The labeling process was manually per-

formed by the authors from the previous study (Levin

and Yehudai, 2017b) in which we based this work.

Approximately 100 commits were randomly sampled

from the 11 repositories and the authors classiﬁed

them according to one of the three maintenance cate-

gories (corrective, adaptive, and perfective). When a

commit did not present sufﬁcient information to allow

classiﬁcation, the authors selected another one, until

ﬁnding one which was possible to safety classify.

The authors made efforts to prevent class starva-

tion (i.e., not having enough instances of a certain

class). In case they detected a considerable imbal-

ance in some projects classiﬁcation categories they

added more commits of the starved class from the

same project. This balancing was done by repeat-

edly sampling and manually classifying commits un-

til a commit of the starved class was found. The ﬁnal

dataset consisted of 1,151 manually classiﬁed com-

mits and was made open access by Levin and Yehu-

dai (Levin and Yehudai, 2017b), (Levin and Yehudai,

2017a).

3.2 Additional Features

In this work, we propose three additional features

for the 11 selected repositories to be incorporated in

the labeled commit dataset. As mentioned before,

the features were collected through HTTP requests to

the GitHub GRAPHQL API and refer to the changes

on the ﬁles, and lines of code (LOC) changes in the

source code, i.e., quantitative source code changes.

We propose the inclusion of the total ﬁles changed

by the commit, total LOC added by the commit, and

total LOC deleted by the commit, giving a total of

71 features for our model. Given the 3 categories

of maintenance activities, we hypothesize that the

three features may reﬂect each class. Therefore,

including these features may help in the separation of

the classes, as explained below.

Adaptive: This maintenance activity class refers to

adding new functionalities to the system, so it is more

likely that added LOC is considerably higher than

deleted LOC. Adding new functionalities can also

cause more ﬁles to receive changes.

Corrective: This maintenance activity class refers to

fault ﬁxing, whence it is more likely that added LOC

is very close to deleted LOC.

Perfective: This class refers to system design

improvements. In this case, there is no previous

knowledge regarding the relationship between added

LOC and deleted LOC. Therefore, we may have

different combinations of the features for this class.

The additional features may be extremely useful

for separating adaptive class from the others, how-

ever, they may not be conclusive regarding the other

classes. As we can see in Figures 1, 2 and 3, in fact,

added LOC is considerably higher in adaptive class,

changed ﬁles is less signiﬁcant in corrective, while

deleted LOC does not present a strong difference.

Figure 1: Added lines of code.

These boxplots were generated based on the values

of the features collected for the repositories that con-

stitute our dataset. We decided to include all three

features since their co-occurrences may be helpful to

separate the maintenance classes.

3.3 Dataset Overview

To better understand the behavior and characteristics

of the dataset, we present a deeper analysis of the

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

Figure 2: Deleted lines of code.

Figure 3: Changed Files.

data. Here we make the analysis and create two new

datasets to be used in our experiments. First, we

check all the commits, one by one, search for any

problem, and ﬁnd two appearances of duplicate com-

mits.

1. Rows 503 and 523 are identical with all their

columns, including the commitID and label, so we

decided to delete line 523 and keep line 503.

2. Rows 96 and 97 also have duplication in all

columns, including the commitID however the la-

bel for line 96 was provided as corrective and line

97 as perfective, we decided to exclude both lines

since the correct classiﬁcation of this commit is

not accurate.

Next step, to improve the analysis, we apply the Pan-

das Proﬁling

, a Python tool that generates a report

about a dataset. They show the column-by-column in-

formation interactively, allowing a more detailed and

https://pypi.org/project/pandas-proﬁling/

quantitative visualization of each feature in addition

to the existing correlation between them. Among the

main information, besides size and data type, we re-

ceive information like missing values, most frequent

values, descriptive and quantile statistics, data distri-

bution, and variance. Thus, we could identify mean-

ingful information about the data.

The feature “PARENT INTERFACE CHANGE”

presents in all rows a constant value equal to 0. In

this way, the feature was automatically rejected by the

tool. It is important to state that the feature is only

rejected concerning this dataset since none of the in-

stances present this information, which makes it use-

less in model production. However, for a more robust

dataset, with new instances that present the character-

istics of this feature, it could return to the model.

We observe that 80% (56) of the features have

more than 86% (1000) of that data are constant, more

speciﬁcally equal to zero. This occurred because the

keywords are binary features until each instance has

only a few source code changes, but this can have

many inﬂuences in the model. For example, The fea-

ture “ADDING CLASS DERIVABILITY”, have only

one instance with a value different than zero. Al-

though, in our analysis, we have not noticed a high

correlation with the features. After all the removals,

rows, and columns, now we have a second dataset,

that we call “Complete”, with 1,148 lines and 70 fea-

tures(47 source code changes + 20 keywords + 3 ad-

ditional).

Searching the features that were more important

to a model, we also apply a Supervised Tree-based

Feature Selection, and select the more impactful fea-

tures for the Classiﬁcation. We chose Random For-

est for this selection because it is the best perform-

ing algorithm in the previous work and is different

from the XGBoost algorithm used for analyzing. Se-

lecting the more impactful features can help ﬁnd the

true importance of these features in the classiﬁcation.

After applying this supervised feature Selection, we

have a third dataset, that we call “Select” with 1,148

lines and 21 features. Of these 21, we have 11 source

code changes, 7 keywords, and the 3 additional fea-

tures. It is important to note that the selection model

itself selected the additional features, which helps to

show that the addition of features with the quantita-

tive change can bring a signiﬁcant result. The compo-

sition of Both Complete and Select dataset is classi-

ﬁed as 43.5% (499 instances) were corrective, 35.1%

(403 instances) were perfective, and 21.4% (246 in-

stances).

That will be important to give a brief explanation

of each feature on the Select dataset to understand our

experiments. First, we have the 3 new features that we

Improve Classiﬁcation of Commits Maintenance Activities with Quantitative Changes in Source Code

propose, “additions”, “deletions” and “changedFiles”

that represent the number of lines added, number of

lines delete and a number of ﬁles change, respectively,

we can see more explanation on section 3.3. Second

we have the 7 keywords as described in Section 3.2

they indicate if the word appeared in the message,

they are the following words: “add”, “allow”, “ﬁx”,

“implement”, “remov”, “support” and “test”. Finally,

we have the 11 features of source code changes, a

more detailed explanation can be found on (Fluri and

Gall, 2006), follow:

- ADDITIONAL FUNCTIONALITY: add func-

tionality on code.

- REMOVED FUNCTIONALITY: remove func-

tionality on code.

- ADDITIONAL OBJECT STATE: add of at-

tributes that describe the state of an object.

- ALTERNATIVE PART INSERT: the inserting or

deleting that does not have any impact on the over-

all nested depth of a method.

- COMMENT INSERT: inserting comments.

- CONDITION EXPRESSION CHANGE: update

operation of a structured statement.

- DOC UPDATE: update on a document.

- STATEMENT DELETE: delete operation in the

entities control structure, loop structure, and state-

ment.

- STATEMENT INSERT: insert operation.

- STATEMENT UPDATE: update operation.

- STATEMENT PARENT CHANGE: move opera-

tion.

4 EXPERIMENTS AND RESULTS

4.1 Replicating Results

Before proceeding with the experiments, it is impor-

tant to check if the results acquired by the SOTA

are replicable. For this reason, we split the original

dataset into a training dataset and a test dataset. The

split is the same made by the SOTA, 85% on training,

and 15% on the test. We performed the split by us-

ing scikit-learn library (Pedregosa et al., 2011) from

Python language, which use the label to balance the

class distributions. The detail of the split label is in

Table 1, and the split by commits per project is on

Table 2.

Since we propose a new algorithm, we train the

model with the training dataset using XGBoost. Then,

Table 1: Split labels.

Corrective Perfective Adaptive

Train 424 345 209

Test 76 59 38

Table 2: Split commits.

Project Train Test

hadoop 100 12

drools 97 17

intellij-community 93 16

ReactiveX-RxJava 88 12

spring-framework 87 13

kotlin 86 15

hbase 86 23

elasticsearch 86 14

restlet-framework-java 86 15

orientdb 85 17

camel 84 19

the trained models were evaluated using the test

dataset. The test dataset did not take part in the model

training process. We do the training using the same

models, as proposed by the SOTA. The best perform-

ing compound model for each classiﬁcation algorithm

was evaluated on the test dataset.

Our results are very close to the results reported by

Levin and Yehudai (Levin and Yehudai, 2017b). We

obtained 76% of accuracy and 63% of Kappa from

Random Forest, and 73% of accuracy and 58% of

Kappa from XGBoost. We conclude that the SOTA

is replicable and XGBoost is an option for evaluation.

4.2 Impact of Features

Now we take our second dataset (complete) and an-

alyze the impact that our additional features have on

the model. For better visualization, we did an im-

pact test of each feature separately, and their com-

binations (Changed Files: CF, added LOC: AL, and

deleted LOC: DL). The XGBoost is the default algo-

rithm from the Shap value. We use the Shap to show

the importance of the features in section 4.7. So, we

decide to use also XGBoost algorithm for visualiza-

tion of the impact of the features. For this test, we use

a simple 10-fold cross-validation across all datasets

without using hyperparameters. The techniques were

made using the scikit learn library (Pedregosa et al.,

2011). We can see the difference in Table 3.

4.3 XGBoost

We run XGBoost on the training dataset and evaluate

it by using 10-fold cross-validation. A grid-search

of hyperparameters was passed to the evaluation

method in order to ﬁnd the best hyperparameters.

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

Table 3: Features Impact.

Accuracy Kappa

NaN 71,69 55,89

CF 71,60 55,74

AL 72,30 56,99

DL 71,62 55,57

CF, AL 72,50 57,26

CF, DL 71,36 55,23

AL, DL 72,66 52,55

CF, DL, AL 72,31 56,81

The main parameter to tune is the maximum depth

of a tree in the XGBoost. Table 4 shows a simple

example of some variations in the hyperparameters

colsample bytree, or simply colsample (subsample

ratio of columns), and max depth (maximum tree

depth), with 150 iterations. The best parameters are

in bolded text. We also veriﬁed the model perfor-

mance during the training by plotting some graphs

for several hyperparameters. We choose accuracy to

decide which parameters we should select.

Table 4: Example of hyperparameters grid for XGBoost.

max depth colsample Accuracy Kappa

1 0.4 0.6952 0.5258

1 0.7 0.7004 0.5334

3 0.4 0.7443 0.6011

3 0.7 0.7280 0.5766

6 0.4 0.7239 0.5683

6 0.7 0.7259 0.5721

As we can observe in the presented table 4, the best

hyperparameters were 0.4 (colsample bytree) and 3

(for max depth). Using these parameters, the model

presented 74.43% of accuracy and 60.22% of Kappa

in training. With these values in hand, the model was

evaluated in the test dataset. It is important to remind

that this dataset was not taken into account during

the training phase so that we can have a more precise

evaluation of the model. In Table 5, we can see

the predicted classes by the model in the confusion

matrix. This matrix presents the corrected and wrong

predictions, from which we are able to calculate the

accuracy and Kappa.

Table 5: XGBoost confusion matrix for test dataset.

True class

Prediction Adaptive Corrective Perfective

Adaptive 26 1 7

Corrective 5 63 13

Perfective 6 11 40

By inspecting the confusion matrix for XGBoost, our

model presented an Accuracy of 75.7% and a Kappa

Coefﬁcient of 60.7% regarding the commit classiﬁca-

tion into adaptive, corrective, and perfective mainte-

nance classes.

4.4 Random Forest

Similarly to XGBoost, we run Random Forest in the

training dataset and evaluated it by using 10-fold

cross-validation. We also use a grid-search to ﬁnd

the best hyperparameters. In particular, the main

parameter to tune is the number of predictors that

are randomly selected for each tree. This technique

is used to make sure that trees are uncorrelated.

The hyperparameter tuning is also done by the

scikit-learn library (Pedregosa et al., 2011). Table 6

presents a simple example of some variations in the

hyperparameter mtry (number of randomly selected

predictors). The best parameter is in bolded text in

Table 6. As for XGBoost, we also veriﬁed the model

performance during the training by plotting graphs

for different values of the mtry. We choose accuracy

to decide which parameters we should select.

Table 6: Example of hyperparameters grid for Random For-

est.

mtry Accuracy Kappa

2 0.6546 0.4384

35 0.7013 0.5355

69 0.6868 0.5123

Table 6 show that the best hyperparameter value was

35 (mtry). This means that for each tree in the for-

est, 35 features are randomly selected among the 70.

Using this parameter, the model presented 70.13% of

accuracy and 53.55% of Kappa in training. With this

parameter value in hands, the model was evaluated in

the test dataset. In Table 7, we can see the predicted

classes by the model in the confusion matrix.

Table 7: Random Forest confusion matrix for test dataset.

True Class

Prediction Adaptive Corrective Perfective

Adaptive 25 2 6

Corrective 3 61 7

Perfective 9 12 47

From the observe of confusion matrix for Random

Forest, our model presented an Accuracy of 77.32%

and a Kappa Coefﬁcient of 64.61% regarding the

commit classiﬁcation into adaptive, corrective, and

perfective maintenance classes.

Improve Classiﬁcation of Commits Maintenance Activities with Quantitative Changes in Source Code

4.5 Summary of Results

The models proposed in this work achieved better re-

sults in comparison to the SOTA technique to classify

commits into maintenance activities. This is indica-

tive that, in fact, there are ways to improve this task.

Table 8 reveals that, for each learning algorithm, how

the accuracies and Kappa coefﬁcients are achieved by

our models compare to the previous study.

Table 8: Comparison of SOTA and proposed improvements.

SOTA OURS

GBM RF XGBoost RF

Accuracy 0.7200 0.7600 0.7570 0.7732

Kappa 0.5700 0.6300 0.6070 0.6461

We note that XGBoost presented an increase in both

accuracy and Kappa coefﬁcient in comparison to

the Gradient Boosting Machine used in the previous

work. While Levin and Yehudai (Levin and Yehu-

dai, 2017b) achieved 72.0% of accuracy and 57.0%

of Kappa, our model achieved 75,7% of accuracy and

almost 61.0% of Kappa. These results show that in-

cluding added lines of code and deleted lines of code

as features may be useful for classifying commits. In

addition, using advanced implementations of boost-

ing algorithms, such as XGBoost, in fact, can improve

model performance.

Regarding Random Forest, we can observe that

the difference from our results to the past one was

smaller, but still, we achieved higher accuracy and

Kappa Coefﬁcient. While the previous model pre-

sented 76.0% of accuracy and 63.0% of Kappa, our

model reached an accuracy of 77.3% and Kappa of

64.6%.

In summary, our model presented higher accura-

cies and Kappa coefﬁcients for both algorithms. For

XGBoost, our model achieved an increase of almost

4% for accuracy and Kappa. Regarding the Random

Forest, we increased the accuracy by 1.3% and the

Kappa by almost 2%.

4.6 Dataset Measures

It is possible to observe that depending on the way we

evaluate the model, using keywords or source code

changes, our results may be different. Now we look

at the real impact using different datasets. First, we

evaluate the types of features separately, and then

combine them, and ﬁnally compare with our two new

datasets, “Complete” and “Select”. We can see all

comparisons in Table 9.

Table 9: Dataset Comparison.

Accuracy Kappa

Keywords 72.72 57.82

Changes 50.87 21.99

Quantitative 49.02 18.56

Keywords + Changes 73.64 59.16

Keywords + Quantitative 73.35 58.75

Changes + Quantitative 56.47 31.01

Combine 75.72 62.41

Complete 74.45 60.11

Select 71.16 54.78

4.7 Model Understandability

It Is important to analyze which group of features are

most important for the model. In Table 9, we observe

these features importance. The “Quantitative” group

of features proposed in this paper proves to be slightly

efﬁcient in the prediction. Only a group of 3 features

achieved accuracy and kappa very close of the 48 fea-

tures of Source code changes(“Changes”). Moreover,

we can see that “Keywords” are essential for high ac-

curacy. The “Select” dataset presented in this paper

has only 22 features, which is less than a third of the

original “Combine” dataset and still achieved relevant

results, and close to the dataset with all features.

It is possible to notice that some features are more

important than others in the model. Understanding

only the groups with the greatest impact is not enough

to understand the whole model. Comprehend the im-

pact of each feature, separately, can be done with a

model understandability.

To explain our model decisions, we use a tech-

nique development, proposed and present in (Lund-

berg and Lee, 2017), which uses SHAP values. This

technique is a uniﬁed measure of feature importance,

to unveil the “black boxes” in the classiﬁcation mod-

els. The SHAP values work with any tree-based

model, they calculate the average of contributions

across all permutation of the features. In summary,

they can show how each feature contributes, posi-

tively or negatively to a model. We use the “Select”

dataset for this analysis.

Figure 4 presents a SHAP summary plot, show-

ing the impact of each feature on the model. We

observe that 7 features are important for the model,

ranked in descending order, the most important fea-

ture “support” is the ﬁrst in the plot, and “ADDI-

TIONAL FUNCTIONALITY”(ADD FUNCT) is the

second. On the horizontal, we have the module of

the impact on the prediction model. The closer the

values of 0, the smaller its impact, In contrast, the

farther from 0 means a bigger impact. Therefore, an-

alyzing Figure 4, the “support” feature has a bigger

impact, with an average of 0.08 of magnitude. In this

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

Figure 4: The most important features for the model.

way, we can conclude that has the “support” word on

the commit message follows a signiﬁcant impact on

the classiﬁcation of the model. Besides that, corrob-

orating with Table 9, we note that keywords are the

most important group on the model, representing 4 of

the 7 best features. Following this rationale, we have

2 of our proposed features and 1 of the source code

changes between the most important features for the

prediction.

5 APPLICATIONS

Levin and Yehudai investigated the amount of a com-

mit that a given developer made in each of the main-

tenance activities and suggested the notion of a de-

veloper’s maintenance proﬁle (Levin and Yehudai,

2016). Their model to predict developer maintenance

proﬁle could be supported by our proposed model

to classify commits into maintenance tasks, possibly

with yield higher accuracy in their prediction.

Another application of our proposed model is in

the identiﬁcation of anomalies in the development

process. It is important to manage the maintenance

activities performed by developers, i.e., the volume of

commits made in each maintenance category. Mon-

itoring unexpected behavior in maintenance tasks

would assist managers to plan and allocate resources

in advance. For instance, lower adaptive activity may

indicate that the project is not evolving as it was ex-

pected, and lower corrective activity may suggest that

developers are neglecting fault ﬁxing. Identifying the

root causes of problems may aid the manager to over-

come them. Furthermore, recognizing maintenance

patterns in successful projects may be useful as guide-

lines for other projects.

Building a software team is a non-trivial task

given the diverse factors involved with it, such as

technological and human aspects (Gorla and Lam,

2004; dos Santos and Figueiredo, 2020). Com-

mit classiﬁcation may help to build a more reliable

and balanced developer team regarding the devel-

oper maintenance activity proﬁle (Levin and Yehudai,

2016; dos Santos and Figueiredo, 2020). This can be

done create an API or software linked to the version

control to identify and classify the commit automati-

cally. When a team is composed of more developers

with a speciﬁc proﬁle (e.g., adaptive) than others, the

development process may be affected, and the ability

of the team to meet usual requirements (e.g., develop-

ing new features, adhering to quality standards) could

also be negatively impacted.

6 CONCLUSIONS

Software maintenance is an extremely relevant task

for software projects, and it is essential for the whole

software life cycle and operation. Maintenance is

shown to consume most of the project budget. There-

fore, understanding how maintenance tasks are per-

formed is useful for practitioners and managers so

that they can plan and allocate resources in advance.

In the present article, we proposed improvements

on a SOTA approach used to classify commits into

software maintenance activities. In particular, we pro-

posed the adoption of three new features that measure

quantitative changes of source code, since the SOTA

do not use that type of features on their commit clas-

siﬁcation. We also proposed the use of the XGBoost

algorithm to perform commit classiﬁcations. Also,

we carried out experiments using an already labeled

commit dataset to evaluate the impact of our proposed

improvements on commit classiﬁcation.

Experimental results showed that our proposed

approach achieved 76% accuracy and 61% of Kappa

for the XGBoost, an increase of 4% in comparison

to the past study. Also, our Random Forest achieved

77.3% and 64.6% of accuracy and Kappa, respec-

tively, with an increase of 1.3% and approximately

2% related by SOTA. We notice that quantitative

changes can be helpful to understand the character-

istics of a commit. Furthermore, the use of additional

features have become extremely relevant in our ma-

chine learning model. Thus, in terms of model under-

standability, we could determine that the additional

features are on the top of the most important features

in the classiﬁcation model.

As future work, we intend to evaluate other fea-

tures related to software commits and use other

datasets with a larger number of commits to improv-

ing the accuracy of the classiﬁers. We also intend

to evaluate other classiﬁcation algorithms using dif-

ferent evaluation metrics, e.g., precision and recall,

to improve our understanding of the behavior of the

classiﬁer. Commits have more metadata that was not

Improve Classiﬁcation of Commits Maintenance Activities with Quantitative Changes in Source Code

analyzed in this article due to the scope of the appli-

cation and the intended comparison with the existing

approach. Moreover, we intend to use NLP (Natu-

ral Language Processing) to investigate the commit

text based on maintenance activities. Thus, we could

generate a classiﬁcation approachable to classify the

commits automatically based on the labels discussed

in this article. Finally, we intend to use other cat-

egories (i.e., activities) proposed in the literature to

generalize the results.

ACKNOWLEDGEMENTS

The present work was carried out with the support

of the Coordenac¸

ao de Aperfeic¸oamento de Pessoal

de N

ıvel Superior - Brazil (CAPES) - Financing

Code 001. The authors thank the partial support of

the CNPq (Brazilian National Council for Scientiﬁc

and Technological Development), FAPEMIG (Foun-

dation for Research and Scientiﬁc and Technological

Development of Minas Gerais), and PUC Minas.

REFERENCES

Amor, J., Robles, G., Gonzalez-Barahona, J., Gsyc, A., Car-

los, J., and Madrid, S. (2006). Discriminating de-

velopment activities in versioning systems: A case

study. In Proceedings of the 2nd International Con-

ference on Predictor Models in Software Engineering,

PROMISE’06.

Bach, S., Binder, A., Montavon, G., Klauschen, F., M

uller,

K., and Samek, W. (2015). On pixel-wise explana-

tions for non-linear classiﬁer decisions by layer-wise

relevance propagation. PLoS ONE, 10.

Breiman, L. (2001). Random forests. Machine Learning,

45(1):5–32.

Caruana, R. and Niculescu-Mizil, A. (2006). An empiri-

cal comparison of supervised learning algorithms. In

Proceedings of the 23rd International Conference on

Machine Learning, ICML’06, pages 161–168.

Chen, T. and Guestrin, C. (2015). XGBoost : Reliable large-

scale tree boosting system. In Proceedings of the NIPS

2015 Workshop on Machine Learning Systems, Learn-

ingSys’15.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree

boosting system. In Proceedings of the 22nd Interna-

tional Conference on Knowledge Discovery and Data

Mining, KDD’16, pages 785–794.

Claesen, M. and De Moor, B. (2015). Hyperparameter

search in machine learning. In Proceedings of the 11th

Metaheuristics International Conference, MIC’15.

Cohen, J. (1960). A coefﬁcient of agreement for nominal

scales. Educational and Psychological Measurement,

20(1):37–46.

Datta, A., Sen, S., and Zick, Y. (2016). Algorithmic trans-

parency via quantitative input inﬂuence: Theory and

experiments with learning systems. In Proceedings of

the IEEE Symposium on Security and Privacy, SP’16,

pages 598–617.

dos Santos, G. E. and Figueiredo, E. (2020). Commit

classiﬁcation using natural language processing: Ex-

periments over labeled datasets. In Proceedings of

the 23rd Iberoamerican Conference on Software En-

gineering, CIbSE’20.

Fischer, M., Pinzger, M., and Gall, H. (2003). Populat-

ing a release history database from version control

and bug tracking systems. In Proceedings of the

International Conference on Software Maintenance,

ICSM’03, pages 23–32.

Fluri, B. and Gall, H. C. (2006). Classifying change types

for qualifying change couplings. In Proceedings of

the 14th IEEE International Conference on Program

Comprehension, ICPC’06, pages 35–45.

Friedman, J. (2001). Greedy function approximation:

A gradient boosting machine. Annals of Statistics,

29(5):1189–1232.

Gorla, N. and Lam, Y. W. (2004). Who should work with

whom?: Building effective software project teams.

Communications of the ACM, 47(6):79–82.

Gupta, M. (2017). Improving software maintenance using

process mining and predictive analytics. In Proceed-

ings of the IEEE International Conference on Soft-

ware Maintenance and Evolution, ICSME’17, pages

681–686.

Hattori, L. P. and Lanza, M. (2008). On the nature of com-

mits. In Proceedings of the 33rd ACM/IEEE Interna-

tional Conference on Automated Software Engineer-

ing, ASE’18, pages 63–71.

Herraiz, I., Robles, G., Gonzalez-Barahona, J. M.,

Capiluppi, A., and Ramil, J. F. (2006). Compari-

son between slocs and number of ﬁles as size met-

rics for software evolution analysis. In Proceedings of

the Conference on Software Maintenance and Reengi-

neering, CSMR’06, pages 8 pp.–213.

Hindle, A., German, D., Godfrey, M., and Holt, R. (2009).

Automatic classiﬁcation of large changes into mainte-

nance categories. In Proceedings of the 17th IEEE In-

ternational Conference on Program Comprehension,

ICPC’09, pages 30–39.

Kohavi, R. (1995). A study of cross-validation and boot-

strap for accuracy estimation and model selection.

In Proceedings of the 14th International Joint Con-

ference on Artiﬁcial Intelligence, IJCAI’95, page

1137–1143.

Levin, S. and Yehudai, A. (2016). Using temporal and se-

mantic developer-level information to predict mainte-

nance activity proﬁles. In Proceedings of the IEEE In-

ternational Conference on Software Maintenance and

Evolution, ICSME’16, pages 463–467.

Levin, S. and Yehudai, A. (2017a). 1151 com-

mits with software maintenance activity labels

(corrective, perfective, adaptive). Available:

https://doi.org/10.5281/zenodo.835534.

Levin, S. and Yehudai, A. (2017b). Boosting automatic

commit classiﬁcation into maintenance activities by

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

utilizing source code changes. In Proceedings of the

13rd International Conference on Predictor Models in

Software Engineering, PROMISE’17, pages 97–106.

Lientz, B. P., Swanson, E. B., and Tompkins, G. E. (1978).

Characteristics of application software maintenance.

Communications of the ACM, 21(6):466–471.

Lipovetsky, S. and Conklin, M. (2001). Analysis of re-

gression in game theory approach. Applied Stochastic

Models in Business and Industry, 17:319 – 330.

Lundberg, S. M. and Lee, S.-I. (2017). A uniﬁed approach

to interpreting model predictions. In Proceedings of

the 31st International Conference on Neural Informa-

tion Processing Systems, NIPS’17, page 4768–4777.

Mariano, R. V. R., dos Santos, G. E., V. de Almeida, M.,

and Brand

ao, W. C. (2019). Feature changes in source

code for commit classiﬁcation into maintenance activ-

ities. In Proceedings of the 18th IEEE International

Conference on Machine Learning and Applications,

ICMLA’19, pages 515–518.

Mockus, A. and Votta, L. G. (2000). Identifying reasons

for software changes using historic databases. In Pro-

ceedings of the International Conference on Software

Maintenance, ICSM’00, pages 120–130.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer,

P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,

A., Cournapeau, D., Brucher, M., Perrot, M., and

Duchesnay, E. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”Why

Should I Trust You?”: Explaining the predictions of

any classiﬁer. In Proceedings of the 2016 Conference

of the North American Chapter of the Association for

Computational Linguistics, pages 97–101.

Schach, S., Jin, B., Yu, L., Heller, G., and Offutt, J.

(2003). Determining the distribution of maintenance

categories: Survey versus measurement. Empirical

Software Engineering, 8:351–365.

Shrikumar, A., Greenside, P., and Kundaje, A. (2017).

Learning important features through propagating ac-

tivation differences. In Proceedings of the 34th Inter-

national Conference on Machine Learning, ICML’17,

page 3145–3153.

Sliwerski, J., Zimmermann, T., and Zeller, A. (2005). When

do changes induce ﬁxes? Software Engineering Notes,

30(4):1–5.

Strumbelj, E. and Kononenko, I. (2013). Explaining pre-

diction models and individual predictions with feature

contributions. Knowledge and Information Systems,

41:647–665.

Swanson, E. B. (1976). The dimensions of maintenance. In

Proceedings of the 2nd International Conference on

Software Engineering, ICSE’76, pages 492–497.

Improve Classiﬁcation of Commits Maintenance Activities with Quantitative Changes in Source Code