Commit Classification into Maintenance Activities Using In-Context
Learning Capabilities of Large Language Models
Yasin Sazid¹, Sharmista Kuri¹, Kazi Solaiman Ahmed² and Abdus Satter¹
¹Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh
²Computer Science Department, University of New Mexico, Albuquerque, New Mexico, U.S.A.
Keywords:
Commit Classification, Commit Message, Maintenance Activity, Large Language Models, GPT, In-Context
Learning.
Abstract:
Classifying software changes, i.e., commits, into maintenance activities enables improved decision-making in
software maintenance, thereby decreasing maintenance costs. Commonly, researchers have attempted commit
classification using keyword-based analysis of commit messages. Source code changes and density data have also
been used for this purpose. Recent works have leveraged contextual semantic analysis of commit messages
using pre-trained language models. However, these approaches mostly depend on training data, making their ability
to generalize a matter of concern. In this study, we explore the possibility of using the in-context learning capa-
bilities of large language models for commit classification. In-context learning does not require training data,
making our approach less prone to data overfitting and more generalized. Experiments using GPT-3
achieve a best accuracy of 75.7% and a kappa of 61.7%, comparable to the performance of all but one of the
baseline models, highlighting the applicability of in-context learning to commit classification.
1 INTRODUCTION
Software maintenance constitutes a significant por-
tion of the overall costs in software development.
To increase cost-effectiveness, it is imperative to un-
derstand the different maintenance activities involved
with software development (Swanson, 1976). Cat-
egorization of these maintenance activities enables
decision-making on resource allocation, choice of
technology, and management of technical debt (Ghad-
hab et al., 2021), making it easier to manage costs.
To realize this benefit, many researchers have tried to
profile software projects in terms of maintenance ac-
tivities (Swanson, 1976) (Mockus and Votta, 2000)
(Levin and Yehudai, 2016). The very first task in
maintenance activity profiling is commit classifica-
tion. As commits keep track of technical changes in a
software, classifying them into maintenance activities
helps better understand and manage software mainte-
nance, thereby improving cost-effectiveness.
Different approaches have been proposed for com-
mit classification. The most common approach is an-
alyzing commit messages. Multiple studies have em-
ployed keyword-based analysis of commit messages
(Hindle et al., 2009) (Levin and Yehudai, 2017) (Mar-
iano et al., 2021). Aside from commit messages, stud-
ies also considered other data sources, such as source
code changes (Levin and Yehudai, 2017) (Meqdadi
et al., 2019) and code density (Hönel et al., 2020).
Other studies have tried contextual semantic analysis
of commit messages by fine-tuning pre-trained lan-
guage models (Ghadhab et al., 2021) (Zafar et al.,
2019). However, such approaches mostly depend
on training machine learning classifiers with commit
data, thus making them dependent on the quality of
training data.
In this study, we propose the use of in-
context learning capabilities of large language models
(LLMs) in the context of commit classification. As in-
context learning does not require training data, it can
be an excellent way to generalize commit classifica-
tion across commit data from different sources. We
previously achieved encouraging results in detecting
and classifying user interface (UI) dark pattern texts
using in-context learning (Sazid et al., 2023). We use
the same approach in this study to apply in-context
learning in the context of commit message classifica-
tion. In this approach, we first synthesize definitions
of maintenance activity categories from the existing
literature. These category definitions are then used to
engineer prompts for the large language model, along
with zero, one, or two examples per category.
Experimental results using GPT-3 present encour-
aging signs about the applicability of in-context learn-
ing in the context of commit classification. We
achieve a best accuracy of 75.7% and a kappa of
61.7%, which is comparable to all but one of the
baseline approaches. Most importantly, our approach requires
no training data. It only uses semantics of category
definitions to classify commit message texts. Thus, it
is less prone to data overfitting and offers increased
generalization over other approaches.
2 RELATED WORK
Commit classification is not a new field of research,
but recent years have seen a growing number of stud-
ies on this topic (Heričko and Šumak, 2023). As
the necessity to understand and organize software
changes keeps rising, researchers have tried to au-
tomate the commit classification process using rule-
based models (Amit and Feitelson, 2021) (Hassan,
2008) (Mauczka et al., 2012), supervised (Ghadhab
et al., 2021) (Hindle et al., 2009) (Levin and Yehu-
dai, 2017) (Hönel et al., 2020) (Zafar et al., 2019) and
semi-supervised (Fu et al., 2015) (Gharbi et al., 2019)
machine learning.
Hindle et al. used machine learning to classify
large changes into five maintenance categories - cor-
rective, adaptive, perfective, feature addition, and
non-functional changes (Hindle et al., 2009). They
used the commit message, author, and modified mod-
ules data in their work. This trend of analyzing com-
mit messages can be seen in other works as well. Fu
et al. and Yan et al. used topic modelling on com-
mit messages (Fu et al., 2015) (Yan et al., 2016).
But the most common technique in analyzing commit
messages is word frequency-based analysis (Heričko
and Šumak, 2023). Levin et al. used such an ap-
proach to extract keywords from commit messages
(Levin and Yehudai, 2017). They used the presence of
keywords and source code changes (number of state-
ments, methods, files changed etc.) to distinguish be-
tween commits belonging to three maintenance activ-
ities - corrective, adaptive, and perfective. Hönel et
al. extended the work of Levin et al. by incorporating
source code density with keywords and code changes
to improve the classification accuracy (Hönel
et al., 2020) (Levin and Yehudai, 2017). However,
keyword-based techniques do not recognize the con-
textual relationship between words. As a result, it is
necessary to train a context-aware model for commit
classification.
Multiple studies used pre-trained language models
for contextual analysis of commit messages. Ghad-
hab et al. used fine-grained code changes with a
pre-trained language model, BERT (Bidirectional En-
coder Representations from Transformers), to aug-
ment commit classification (Ghadhab et al., 2021).
Sarwar et al. and Trautsch et al. also used contextual
semantic analysis of commit messages (Sarwar et al.,
2020) (Trautsch et al., 2023). Zafar et al. presented a
set of rules for semantic analysis of commit messages
and trained a context-aware deep learning model by
fine-tuning BERT for bug-fix commit message clas-
sification (Zafar et al., 2019). These applications of
pre-trained language models generally focus on fine-
tuning models with commit message data. As a result,
they are only as generalized as the source of training
data.
Using the semantics of category definitions of
maintenance activities can provide better generaliza-
tion capabilities for language models. Large Lan-
guage Models (LLMs) like GPT (Generative Pre-
trained Transformer) have in-context learning capa-
bilities that can be useful in this regard. This capabil-
ity enables LLMs to perform tasks by conditioning on
only a few examples (Min et al., 2022). What sepa-
rates our work from existing works is that we leverage
this capability of LLMs for commit classification in-
stead of fine-tuning the pre-trained models with com-
mit message data.
3 BACKGROUND
We start this section with an overview of maintenance
activity categories considered in this work. Then, we
introduce in-context learning and GPT, a large lan-
guage model that has demonstrated remarkable
capabilities in different natural language processing
(NLP) tasks, especially text classification. Finally, we
explain statistical methods used in this study to eval-
uate classification performance.
3.1 Maintenance Activity Categories
Swanson proposed three maintenance activity cat-
egories - ‘Corrective’, ‘Perfective’, and ‘Adaptive’
(Swanson, 1976). These categories are briefly ex-
plained in this subsection.
3.1.1 Corrective
Corrective maintenance involves identifying and fix-
ing issues like bugs and faults in software to ensure
it behaves as intended. It addresses problems in pro-
cessing or performance, caused by errors in the appli-
cation software, hardware, or system software. It is
carried out in response to failures and includes tasks
like bug fixing, resolving processing issues, address-
ing performance problems, and correcting implemen-
tation failures.
3.1.2 Perfective
Perfective maintenance involves improving software
even when it is already well built and documented,
without any implementation problems. The goal is to
make the software easier to modify during corrective
or adaptive maintenance, not necessarily to reduce
program failures or handle environmental changes
better. This type of maintenance focuses on getting
rid of inefficiencies, enhancing performance, and im-
proving maintainability, ultimately striving for a more
perfect design.
3.1.3 Adaptive
Adaptive maintenance involves adjusting software in
response to changes in the data and processing en-
vironments. For instance, changes in data might in-
volve modifying classification codes or restructuring
a database, while changes in processing could result
from new hardware or operating system installations,
requiring updates to existing programs. It’s crucial to
anticipate these changes for timely and effective adap-
tive maintenance.
3.2 In-Context Learning
In-context learning is a prompt engineering strategy
for large language models that does not require
training in the traditional machine learning sense. Instead
of training or fine-tuning models with large datasets,
only a few examples are provided within the prompt.
As a result, models can learn tasks using inference
only, without updating underlying parameters (Min
et al., 2021).
3.3 Generative Pre-Trained
Transformer (GPT)
GPT (Generative Pre-trained Transformer) is a se-
ries of large language models developed by OpenAI
(https://openai.com).
They are based on a transformer architecture, which
is a type of deep learning model that uses attention
mechanisms to capture relationships between words
in a sentence. GPT excels in a variety of natural lan-
guage processing tasks, such as question answering,
translation, summarizing, classification, and text pars-
ing (Chiu et al., 2021).
3.4 Statistical Methods
We evaluate our classification performance using
common statistical measures like accuracy, precision,
recall and kappa. Short descriptions of the statistical
methods required in this study are provided below.
True Positive (TP) - Number of instances that
were correctly classified as belonging to a class.
False Positive (FP) - Number of instances that
were incorrectly classified as belonging to a class
when they actually belong to a different class.
True Negative (TN) - Number of instances that
were correctly classified as not belonging to a
class.
False Negative (FN) - Number of instances that
were incorrectly classified as not belonging to a
class when they actually belong to that class.
Precision - Ratio of correctly classified instances
of a class over all instances that were classified as
belonging to that class.
Precision = TP / (TP + FP)    (1)
Recall - Ratio of correctly classified instances of
a class over all instances of that class.
Recall = TP / (TP + FN)    (2)
Accuracy - Ratio of correctly classified instances
over all instances.
Accuracy = (# of correctly classified instances) / (# of all instances)    (3)
Kappa - Cohen’s kappa is a metric used to account
for uneven distribution of classification categories
or classes. It considers the possibility of the agree-
ment occurring by chance, therefore providing a
more reliable metric than accuracy.
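To make these measures concrete, the sketch below (an illustration, not part of the study's tooling) computes accuracy, per-class precision and recall, and Cohen's kappa for a small set of hypothetical labels using scikit-learn. Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance, kappa = (p_o - p_e) / (1 - p_e).

# Minimal sketch (not from the paper): computing the evaluation metrics
# for multi-class commit classification with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, cohen_kappa_score)

# Hypothetical ground-truth and predicted labels:
# 'c' = corrective, 'p' = perfective, 'a' = adaptive.
y_true = ['c', 'c', 'p', 'a', 'p', 'c', 'a', 'p']
y_pred = ['c', 'p', 'p', 'a', 'p', 'c', 'p', 'p']

accuracy = accuracy_score(y_true, y_pred)
# Per-class precision and recall, reported separately per label as in Table 2.
precision = precision_score(y_true, y_pred, labels=['c', 'p', 'a'],
                            average=None, zero_division=0)
recall = recall_score(y_true, y_pred, labels=['c', 'p', 'a'],
                      average=None, zero_division=0)
# Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is chance agreement.
kappa = cohen_kappa_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.3f}, Kappa: {kappa:.3f}")
print(f"Precision (c, p, a): {precision}")
print(f"Recall (c, p, a): {recall}")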
4 METHODOLOGY
Our main objective in this study is to measure the
applicability of in-context learning in the context of
commit classification. Previously, we executed such
an investigation in the context of automated detection
of dark pattern texts (Sazid et al., 2023). We fol-
low the same approach of applying in-context learn-
ing in this new context of commit classification. We
begin by synthesizing comprehensive definitions of
the three maintenance activity categories considered
in this study. This information is then utilized to en-
gineer prompts for large language models. We choose
GPT-3 as the large language model to be used in
this study because of its encouraging performance in
terms of in-context learning capabilities. Figure 1
depicts an overview of the commit classification ap-
proach used in this study.

Figure 1: Overview of Commit Classification Approach Using In-Context Learning.
4.1 Dataset of Commit Messages
We utilize the commit dataset provided by Levin et
al., which comprises 1,151 commit data from 11 Java
projects (Levin and Yehudai, 2017). Commits in this
dataset are categorized as ‘Adaptive’, ‘Corrective’, or
‘Perfective’. The distribution of commits is as fol-
lows: 22% ‘Adaptive’, 43% ‘Corrective’, and 35%
‘Perfective’. Aside from the commit message, the
dataset also includes source code changes and key-
words for each commit. But we do not use these other
features in our work. 1,145 of the 1,151 commits are
utilized solely to validate GPT-3 classification mod-
els. The remaining 6 commits, shown as examples in
Table 1, serve as in-prompt examples or remain in the
validation set, depending on the prompting technique.
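For readers who want to replicate this setup, a minimal loading sketch is given below. It assumes the dataset has been exported to a CSV file with 'commit_message' and 'label' columns; the file name, the column names, and the empty example-id list are our assumptions, not part of the original dataset release.

# Minimal sketch (assumed CSV export; file name and column names are hypothetical).
import pandas as pd

df = pd.read_csv("levin_commits.csv")  # 1,151 labelled commits from 11 Java projects

# Inspect the class distribution: roughly 22% adaptive, 43% corrective, 35% perfective.
print(df["label"].value_counts(normalize=True))

# Hold out the commits used as in-prompt examples (ids are a placeholder here),
# so they do not appear in the validation set for one-shot and few-shot runs.
example_ids = []  # the 6 example commits from Table 1 would be listed here
validation_df = df[~df.index.isin(example_ids)]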
4.2 Category Definition Synthesis
We must first synthesize the definitions of the mainte-
nance activity categories. We want to use this as con-
textual information in our classification process. As a
result, the meaning and phrasing of the category defi-
nitions will have a significant influence on the clas-
sification outcomes. We synthesize relevant phras-
ings and characteristics from the literature (Swanson,
1976) (Mockus and Votta, 2000) (Levin and Yehudai,
2017) (Heričko and Šumak, 2023) to provide a de-
tailed and comprehensive definition for each mainte-
nance activity category. We enumerate the different
phrasings to generate each category definition. The
resulting definitions are listed in Table 1.
4.3 Classification of Commit Messages
Similar to our previous work on in-context learning,
we utilize the ‘GPT for Sheets™ and Docs™’ exten-
sion available in ‘Google Sheets’ to classify all the
commit message texts in the dataset. We employ
the ‘GPT()’ function provided by this extension. It
sends a prompt to GPT and returns the result. We
select the ‘gpt-3.5-turbo’ model for this study. As
commit message classification necessitates precise re-
sults, we configure the ‘temperature’ parameter of the
model to 0 so that it prioritizes accuracy over cre-
ative results. After configuration, we send engineered
prompts to the model appending a single piece of
commit message text to classify that text. We call
the ‘GPT()’ function for all the commit message texts
in the dataset following this approach. The ’GPT()’
function call operates in a completely new session
each time, ensuring that GPT’s memory of the pre-
vious context does not influence the outcomes.
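The study issues these calls through the spreadsheet extension, but the same request can be expressed directly against the OpenAI API. The sketch below is a rough equivalent, assuming the openai Python package (v1.x) and an OPENAI_API_KEY environment variable; the classify_commit helper and its prompt handling are our own illustration rather than the paper's tooling.

# Rough equivalent of the GPT() spreadsheet call used in the study,
# written against the OpenAI Python client (v1.x). Illustrative only.
from openai import OpenAI

client = OpenAI()  # API key is read from the OPENAI_API_KEY environment variable

def classify_commit(prompt_template: str, commit_message: str) -> str:
    """Send one engineered prompt plus a single commit message to the model."""
    prompt = prompt_template + "\n" + commit_message
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # model selected in this study
        temperature=0,           # favour deterministic answers over creative ones
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt asks for a single letter: 'c', 'p' or 'a'.
    return response.choices[0].message.content.strip().lower()

# Each call is an independent request, so no memory of earlier
# classifications carries over between commit messages.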
4.4 Prompt Engineering
Prompt engineering is carried out to apply in-context
learning for the following prompting techniques -
zero-shot, one-shot, and few-shot. Procedures for
prompt engineering followed in this study are ex-
plained in this subsection.
4.4.1 Zero-Shot Prompting
In zero-shot prompting, no example for any of the
classification categories is provided to the GPT-3 clas-
sification model. A template for zero-shot learning
used in our study is given below -
Prompt:
Classify the commit message into ‘adaptive’, ‘perfec-
tive’ or ‘corrective’ maintenance activities.
<Definition of ‘Corrective’ Category>
<Definition of ‘Perfective’ Category>
<Definition of ‘Adaptive’ Category>
Now I will give you one commit message, and you
will have to say which category it belongs to. You
must reply with only one letter - ‘c’ for ‘corrective’,
‘p’ for ‘perfective’ or ‘a’ for ‘adaptive’.
<Text to be classified>
GPT-3 Response: <Predicted Category>

Table 1: Definitions and Examples of Classification Categories.

Corrective
Definition: ‘Corrective’ means ‘changes made to fix functional and non-functional faults, failures, and errors’.
Examples: 1. Percolator response now always returns the- ‘matches‘ key.–Closes -4881- 2. Wraps DoOnEach in a SafeObserver–This commit leverages the SafeObserver facility to get the desired-behavior in the face of exceptions. Specifically, if any of the-operations performed within the doOnEach handler raises an exception,-that exception will propagate through the observable chain.-

Perfective
Definition: ‘Perfective’ means ‘changes made to improve software quality attributes such as processing efficiency, performance, and maintainability’.
Examples: 1. Fixed typos in test.- 2. HBASE-6667 TestCatalogJanitor occasionally fails;- PATCH THAT ADDS DEBUG AROUND FAILING TEST–git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1379682 13f79535-47bb-0310-9956-ffa450edef68-

Adaptive
Definition: ‘Adaptive’ means ‘adding or introducing new features into the system’, and ‘changes made to adapt to changes in the data and processing environment’.
Examples: 1. plugins: tooltip for plugins with newer version- (IDEA-75998)– 2. Create from usage: Create constructor parameter- by reference in delegation specifier -KT-6601 Fixed–
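Assembled as a single string, the zero-shot prompt above can be built as in the following sketch (our illustration in Python; the function name and spacing are assumptions). The definition strings are the synthesized definitions from Table 1.

# Sketch of zero-shot prompt assembly; the definitions come from Table 1.
DEFINITIONS = {
    "corrective": "'Corrective' means 'changes made to fix functional and "
                  "non-functional faults, failures, and errors'.",
    "perfective": "'Perfective' means 'changes made to improve software quality "
                  "attributes such as processing efficiency, performance, and "
                  "maintainability'.",
    "adaptive": "'Adaptive' means 'adding or introducing new features into the "
                "system', and 'changes made to adapt to changes in the data and "
                "processing environment'.",
}

def build_zero_shot_prompt(commit_message: str) -> str:
    parts = ["Classify the commit message into 'adaptive', 'perfective' "
             "or 'corrective' maintenance activities."]
    parts.extend(DEFINITIONS.values())
    parts.append("Now I will give you one commit message, and you will have to "
                 "say which category it belongs to. You must reply with only one "
                 "letter - 'c' for 'corrective', 'p' for 'perfective' or 'a' for "
                 "'adaptive'.")
    parts.append(commit_message)
    return "\n\n".join(parts)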
4.4.2 One-Shot Prompting
In one-shot prompting, one example per category is
provided with the definitions. Each <Definition of
Category X> is followed by <Example 1 of Cate-
gory X>. The first example for each category shown
in Table 1 is used in one-shot prompting. Example
texts are selected randomly and removed from the
validation dataset.
4.4.3 Few-Shot Prompting
We use two examples per category in few-shot
prompting. Each <Definition of Category X>
is followed by <Example 1 of Category X> and
<Example 2 of Category X>. The example texts used
in one-shot are reused in few-shot, while the other
three example texts are randomly selected. All the
example texts used in prompting are shown in Table 1
and are removed from the validation dataset.
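One-shot and few-shot prompts differ from the zero-shot prompt only in that each category definition is followed by one or two of the Table 1 examples. A sketch generalizing the zero-shot builder above (the function name, argument shapes, and spacing are our own assumptions):

# Sketch of one-/few-shot prompt assembly. EXAMPLES would map each category to
# its example commit messages from Table 1; k = 1 gives one-shot, k = 2 few-shot.
def build_k_shot_prompt(definitions: dict, examples: dict,
                        commit_message: str, k: int) -> str:
    parts = ["Classify the commit message into 'adaptive', 'perfective' "
             "or 'corrective' maintenance activities."]
    for category, definition in definitions.items():
        parts.append(definition)
        parts.extend(examples.get(category, [])[:k])  # k examples follow each definition
    parts.append("Now I will give you one commit message, and you will have to "
                 "say which category it belongs to. You must reply with only one "
                 "letter - 'c' for 'corrective', 'p' for 'perfective' or 'a' for "
                 "'adaptive'.")
    parts.append(commit_message)
    return "\n\n".join(parts)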
5 EVALUATION
We use accuracy, precision, recall and kappa to assess
the performance of our approach for commit clas-
sification. Overall classification performance across
prompting techniques is listed in Table 2.
5.1 Result Analysis
Among the three prompting techniques used in this
study, zero-shot prompting produces the best results
with an overall accuracy of 75.7% and a kappa of
61.7%. It accurately classifies 871 out of 1,151 com-
mit messages. Zero-shot prompting works best for
the ‘Corrective’ category with a precision of 88% and
a recall of 80%. One-shot prompting sees a decrease
in performance with an overall accuracy of 70% and
a kappa of 52.5%. It accurately classifies 804 out
of 1,148 commit messages. Like zero-shot, one-shot
prompting also works best for the ‘Corrective’ cate-
gory with a precision of 86% and a recall of 73%,
even though both see a decrease compared to zero-
shot. Few-shot prompting further decreases the over-
all accuracy to 65.6% and the kappa to 45.6%. 751
out of 1,145 commit messages were accurately classi-
fied using few-shot prompting. The ‘Perfective’ cat-
egory has high recall scores and low precision scores
across all three prompting techniques. The ‘Adaptive’
category illustrates totally opposite results. It has high
precision scores and low recall scores across prompt-
ing techniques.
Figure 2 depicts a steady decrease in accuracy and
kappa as the number of examples is increased in the
prompt. This can be explained by the fact that com-
mit message texts are diverse in representation. De-
velopers from different companies, projects, or geo-
graphic locations may write commit messages in dif-
ferent ways. As a result, providing a few examples
makes the classifier more biased rather than general-
ized. Thus, in-context learning for commit classifica-
tion works best when relying only on the semantics of
the category definitions. More accurate and detailed
definitions of categories might further facilitate com-
mit classification using in-context learning.
Table 2: Experimental Results Using GPT-3.
Zero-Shot (Accuracy 75.7%, Kappa 61.7%): Corrective - Precision 88%, Recall 80%; Perfective - Precision 63%, Recall 83%; Adaptive - Precision 81%, Recall 55%.
One-Shot (Accuracy 70%, Kappa 52.5%): Corrective - Precision 86%, Recall 73%; Perfective - Precision 57%, Recall 87%; Adaptive - Precision 83%, Recall 36%.
Few-Shot (Accuracy 65.6%, Kappa 45.6%): Corrective - Precision 91%, Recall 62%; Perfective - Precision 51%, Recall 94%; Adaptive - Precision 90%, Recall 25%.
Figure 2: Performance Across Prompting Techniques.
5.2 Result Discussion
Table 3 compares our best performance, achieved us-
ing zero-shot prompting, with some baseline multi-
class commit classification models trained and vali-
dated on the same dataset used in this study. It is eas-
ily visible that our performance is similar to the other
models except the one using the LogitBoost classifier
(Hönel et al., 2020). Even though our approach fails
to beat the model using LogitBoost, it is important to
consider a significant difference between all of these
models and our approach. These models learned from
training data, making them prone to data overfitting,
whereas we use in-context learning. Our best perfor-
mance is achieved without any training, using only
the semantics of category definitions. As a result, our
approach is less prone to overfitting and more gener-
alized compared to other models.
As zero-shot is our best-performing setup, adding
more examples does not improve the performance of
our approach. However, there are two feasible di-
rections for improvement. Firstly, more detailed and
accurate category definitions in prompts can poten-
tially enhance performance. Secondly, integrating our
approach with other non-ML methods, such as key-
word analysis, source code change analysis, or topic
modelling, could further improve performance. Such
an amalgamation can leverage the generalization ca-
pabilities of in-context learning while also benefiting
from the classification capabilities of other methods.
Table 3: Performance Comparison With Baseline Models.
Commit Classification Model | Year | Accuracy | Kappa
Multi-class classification based on source code changes and keywords extracted from commit messages using LogitBoost (Hönel et al., 2020) | 2020 | 85% | 78%
Multi-class classification based on source code changes and keywords extracted from commit messages using Random Forest (Levin and Yehudai, 2017) | 2017 | 73.6% | 58.9%
Multi-class classification based on quantitative metrics and keywords extracted from commit messages using Random Forest (Mariano et al., 2021) | 2021 | 75.7% | 62.4%
Multi-class classification based on commit messages using In-Context Learning (proposed) | 2023 | 75.7% | 61.7%
6 THREATS TO VALIDITY
Validation with a single dataset poses an external va-
lidity threat. Classification performance may differ
in other commit datasets. However, our approach
does not rely on training data; it can only be as
generalized as the category definitions used in prompt engineering
for large language models. As a result, we believe this
approach can be useful in classifying commit mes-
sages from other sources as well. Random example
selection for prompt engineering poses a threat to the
internal validity of the study. However, in zero-shot
prompting, no example was used in the prompt. Thus,
the performance of zero-shot prompting, which is the
best-performing model in this study, does not suffer
from this issue.
7 CONCLUSION
In this study, we investigate the applicability of in-
context learning in the context of commit classifica-
tion. Our approach synthesizes definitions of main-
tenance activity categories from the existing litera-
ture, which are used as contextual information for
large language models to classify commit messages.
Experimental results using GPT-3 show encouraging
performance in zero-shot compared to other baseline
models. In-context learning decreases the risk of data
overfitting as no training data is used. Thus, our com-
mit classification approach is as generalized as the
category definitions used in prompt engineering. In
the future, we plan to combine this approach with
other commit classification approaches to further im-
prove the classification performance.
REFERENCES
Amit, I. and Feitelson, D. G. (2021). Corrective commit
probability: a measure of the effort invested in bug
fixing. Software Quality Journal, 29(4):817–861.
Chiu, K.-L., Collins, A., and Alexander, R. (2021). De-
tecting hate speech with gpt-3. arXiv preprint
arXiv:2103.12407.
Fu, Y., Yan, M., Zhang, X., Xu, L., Yang, D., and Kymer,
J. D. (2015). Automated classification of software
change messages by semi-supervised latent dirich-
let allocation. Information and Software Technology,
57:369–377.
Ghadhab, L., Jenhani, I., Mkaouer, M. W., and Messaoud,
M. B. (2021). Augmenting commit classification by
using fine-grained source code changes and a pre-
trained deep neural language model. Information and
Software Technology, 135:106566.
Gharbi, S., Mkaouer, M. W., Jenhani, I., and Messaoud,
M. B. (2019). On the classification of software change
messages using multi-label active learning. In Pro-
ceedings of the 34th ACM/SIGAPP Symposium on Ap-
plied Computing, pages 1760–1767.
Hassan, A. E. (2008). Automated classification of change
messages in open source projects. In Proceedings
of the 2008 ACM symposium on Applied computing,
pages 837–841.
Heričko, T. and Šumak, B. (2023). Commit classification
into software maintenance activities: A systematic lit-
erature review. In 2023 IEEE 47th Annual Computers,
Software, and Applications Conference (COMPSAC),
pages 1646–1651. IEEE.
Hindle, A., German, D. M., Godfrey, M. W., and Holt, R. C.
(2009). Automatic classification of large changes into
maintenance categories. In 2009 IEEE 17th Interna-
tional Conference on Program Comprehension, pages
30–39. IEEE.
Hönel, S., Ericsson, M., Löwe, W., and Wingkvist, A.
(2020). Using source code density to improve the ac-
curacy of automatic commit classification into main-
tenance activities. Journal of Systems and Software,
168:110673.
Levin, S. and Yehudai, A. (2016). Using temporal and se-
mantic developer-level information to predict mainte-
nance activity profiles. In 2016 IEEE International
Conference on Software Maintenance and Evolution
(ICSME), pages 463–467. IEEE.
Levin, S. and Yehudai, A. (2017). Boosting automatic com-
mit classification into maintenance activities by uti-
lizing source code changes. In Proceedings of the
13th International Conference on Predictive Models
and Data Analytics in Software Engineering, pages
97–106.
Mariano, R. V., dos Santos, G. E., and Brandão, W. C.
(2021). Improve classification of commits mainte-
nance activities with quantitative changes in source
code. In ICEIS (2), pages 19–29.
Mauczka, A., Huber, M., Schanes, C., Schramm, W., Bern-
hart, M., and Grechenig, T. (2012). Tracing your
maintenance work–a cross-project validation of an
automated classification dictionary for commit mes-
sages. In Fundamental Approaches to Software Engi-
neering: 15th International Conference, FASE 2012,
Held as Part of the European Joint Conferences
on Theory and Practice of Software, ETAPS 2012,
Tallinn, Estonia, March 24-April 1, 2012. Proceedings
15, pages 301–315. Springer.
Meqdadi, O., Alhindawi, N., Alsakran, J., Saifan, A., and
Migdadi, H. (2019). Mining software repositories
for adaptive change commits using machine learning
techniques. Information and Software Technology,
109:80–91.
Min, S., Lewis, M., Zettlemoyer, L., and Hajishirzi, H.
(2021). Metaicl: Learning to learn in context. arXiv
preprint arXiv:2110.15943.
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M.,
Hajishirzi, H., and Zettlemoyer, L. (2022). Rethink-
ing the role of demonstrations: What makes in-context
learning work? arXiv preprint arXiv:2202.12837.
Mockus and Votta (2000). Identifying reasons for software
changes using historic databases. In Proceedings 2000
International Conference on Software Maintenance,
pages 120–130. IEEE.
Sarwar, M. U., Zafar, S., Mkaouer, M. W., Walia, G. S.,
and Malik, M. Z. (2020). Multi-label classification
of commit messages using transfer learning. In 2020
IEEE International Symposium on Software Reliabil-
ity Engineering Workshops (ISSREW), pages 37–42.
IEEE.
Sazid, Y., Fuad, M. M. N., and Sakib, K. (2023). Automated
detection of dark patterns using in-context learning ca-
pabilities of gpt-3. In 2023 30th Asia-Pacific Soft-
ware Engineering Conference (APSEC), pages 569–
573. IEEE.
Swanson, E. B. (1976). The dimensions of maintenance.
In Proceedings of the 2nd international conference on
Software engineering, pages 492–497.
Trautsch, A., Erbel, J., Herbold, S., and Grabowski, J.
(2023). What really changes when developers in-
tend to improve their source code: a commit-level
study of static metric value and static analysis warning
changes. Empirical Software Engineering, 28(2):30.
Yan, M., Fu, Y., Zhang, X., Yang, D., Xu, L., and Kymer,
J. D. (2016). Automatically classifying software
changes via discriminative topic model: Supporting
multi-category and cross-project. Journal of Systems
and Software, 113:296–308.
Zafar, S., Malik, M. Z., and Walia, G. S. (2019). Towards
standardizing and improving classification of bug-fix
commits. In 2019 ACM/IEEE International Sympo-
sium on Empirical Software Engineering and Mea-
surement (ESEM), pages 1–6. IEEE.