Effectiveness of Data Augmentation and Ensembling Using
Transformer-Based Models for Sentiment Analysis:
Software Engineering Perspective

Zubair Rahman Tusar, Sadat Bin Sharfuddin, Muhtasim Abid, Md. Nazmul Haque and Md. Jubair Ibna Mostafa
Software Engineering Lab (SELab), Islamic University of Technology (IUT), Gazipur, Bangladesh
Department of Computer Science and Engineering, Islamic University of Technology (IUT), Gazipur, Bangladesh
Keywords:
Sentiment Analysis, Pre-Trained Transformer-Based Models, Ensembling, Data Augmentation.
Abstract:
Sentiment analysis for software engineering has been the subject of numerous studies aimed at efficiently developing tools and approaches for Software Engineering (SE) artifacts. State-of-the-art tools achieved better performance using transformer-based models like BERT and RoBERTa to classify sentiment polarity. However, existing tools overlooked the data imbalance problem and did not consider the efficiency of ensembling multiple pre-trained models on SE-specific datasets. To overcome those limitations, we applied context-specific data augmentation using SE-specific vocabularies and ensembled multiple models to classify sentiment polarity. Using four gold-standard SE-specific datasets, we trained our ensembled models and evaluated their performance. Our approach achieved improvements ranging from 1% to 26% in weighted-average and macro-average F1 scores. Our findings demonstrate that the ensemble models outperform the pre-trained models on the original datasets and that data augmentation further improves the performance of all the previous approaches.
1 INTRODUCTION
Sentiment analysis is a computational analysis of
people’s attitudes, emotions, and views regarding an
entity, which might be a person, an event, or perhaps
a topic (Medhat et al., 2014). It can be used to
determine the emotional tone of a body of writing.
For a given text unit, it can determine whether the text
expresses a positive, negative, or neutral sentiment.
In recent years, the software engineering community has devoted a substantial amount of effort to research on sentiment analysis. Since software development relies heavily on human effort and interaction, it is particularly sensitive to practitioners' emotions. Accurate sentiment analysis can help to improve various aspects of the software engineering domain that are affected by sentiment and emotion. It has been used to investigate the role of emotions in IT projects and software development (Wrobel, 2016) (Islam and Zibran, 2016), to find relations between developers' sentiment and software bugs (Huq et al., 2020), etc.
Numerous sentiment analysis techniques and tools are used in software engineering. These techniques mainly employ unsupervised, supervised, and transformer-based approaches. Firstly, the unsupervised approach is used by tools like SentiStrength-SE (Islam and Zibran, 2018b) and DEVA (Islam and Zibran, 2018a). Their authors leveraged domain-specific keyword dictionaries to achieve good performance on SE-specific texts, in which many words carry meanings that differ from general usage and are therefore valuable for determining sentiment. For example, words like bug and patch convey one kind of sentiment when used in SE-specific texts and another when used in a general context, and having such domain-specific keywords in their dictionaries made these tools perform better on SE-specific texts. Secondly, the supervised learning approach is used by tools like SentiCR (Ahmed et al., 2017) and Senti4SD (Calefato et al., 2018), which improved over the earlier unsupervised approaches because they learned from the training dataset instead of relying on manually defined word-wise sentiment polarities.
Despite these improvements, the success of
transformer-based approaches in other natural language domains has attracted many researchers to utilize pre-trained transformer-based models in the software engineering domain. Among these models, fine-tuned BERT, RoBERTa, and XLNet outperformed the existing unsupervised and supervised tools on SE-specific artifacts, achieving 6.5% to 35.6% improvement in terms of macro- and micro-averaged F1 scores (Zhang et al., 2020).
Although the pre-trained transformer models performed better than the previous approaches, small dataset sizes and class imbalance threaten those models' performance. Moreover, existing research did not examine whether ensembling multiple models offers any performance benefit over individual models. To address the class imbalance and small dataset size problems, we conducted a study on the effectiveness of data augmentation. We further ensembled the pre-trained transformer-based models to determine the impact on performance.
We evaluated the proposed approach using
four publicly available datasets with annotated
sentiment polarities. We chose three pre-trained
transformer-based models: BERT, RoBERTa, and
XLNet. We applied data augmentation with
SE-specific Word2Vec (Mishra and Sharma, 2021)
and EDA (Easy Data Augmentation) (Wei and Zou,
2019). Along with the original dataset, we used
two versions of augmented datasets. In the first
version, we did not consider the class imbalance
issue and augmented the datasets equally for all the
classes. In the second version, which we referred
to as controlled augmentation, we considered the
class imbalance issue and applied augmentation in
a way to balance the classes along with increasing
the size of the datasets. Finally, for all datasets,
we performed sentiment analysis by ensembling
the chosen transformer-based models and compared the obtained results with contemporary approaches.
Specifically, we investigated the following research
questions.
RQ1: Do the ensemble models outperform the
pre-trained models on the original datasets?
RQ2: Does data augmentation have any impact
on the performances of the models?
The experimental results demonstrate that the
ensemble models outperform all the pre-trained
transformer models in three out of the four original
datasets, i.e., datasets without any augmentation, in
terms of weighted- and macro-average F1 scores.
The results further demonstrate that the augmentation
approaches aid the performance of all the pre-trained
transformer model approaches as well as the
ensemble models in two out of the three datasets in
terms of weighted and macro average F1 scores.
The main contributions of this paper are:
- Use of a stacking ensemble of pre-trained transformer-based models to improve SE-specific sentiment analysis.
- A comparative study of the effect of controlled augmentation and the inclusion of SE-based data augmentation on existing small datasets.
Structure of the Paper: Section 2 discusses the related works. Section 3 describes the datasets in detail. The methodology adopted for our research is elaborated in Section 4. In Section 5, we evaluate, analyze, and discuss the results of our experiments and present the main findings. Limitations are mentioned in Section 6. Finally, we conclude by outlining future work in Section 7.
2 RELATED WORKS
We discuss related work on SE-based sentiment analysis from four perspectives: unsupervised approaches, supervised approaches, pre-trained NLP models for sentiment analysis, and data augmentation with ensembling of pre-trained models.
2.1 Unsupervised Approaches
Unsupervised approaches are used in SentiStrength and SentiStrength-SE.
SentiStrength: It is a lexicon-based sentiment analysis technique that uses dictionaries of both formal terms and informal text (e.g., slang and emoticons) (Thelwall et al., 2010). In the dictionary, every word was assigned a specific sentiment strength. Then, the tool categorized sentences into positive and negative emotions and determined the strength of the emotions based on the dictionaries and linguistic analysis. However, SentiStrength could not perform well on SE-specific data, as the dictionary did not contain words specific to software engineering.
SentiStrength-SE: To evaluate the performance of SentiStrength on software artifacts, 151 Jira issue comments were analyzed (Islam and Zibran, 2017), on which the tool achieved low accuracy. Investigating this low accuracy, the authors found 12 causes, of which domain-specific meanings of words was the most prevalent. They therefore built a modified version of SentiStrength by adding a domain-specific dictionary (Islam and Zibran, 2017). New sentiment words and negations were added to the dictionary. It was the first sentiment analysis tool in which the SE-specific context was considered. The dictionary, however, had very limited domain-specific words.
2.2 Supervised Approaches
Supervised approaches are discussed using Stanford
CoreNLP, SentiCR, and Senti4SD tools. Each of
these tools used supervision to classify sentiment
and/or polarity.
Stanford CoreNLP: Stanford CoreNLP (Socher et al., 2013) was introduced for single-sentence sentiment classification; the tool returned the polarity along with the sentiment value of a sentence. It was trained with a Recursive Neural Tensor Network on the Stanford Sentiment Treebank.
SentiCR: It was developed specifically for code review comments (Ahmed et al., 2017). It classified code review comments into two classes, negative and non-negative. The supervised classifier used in SentiCR was GBT (Gradient Boosting Tree) (Pennacchiotti and Popescu, 2011), as it gave the highest precision, recall, and accuracy among the eight evaluated classifiers.
Senti4SD: It was the first supervised learning-based tool to generate feature vectors. Senti4SD (Calefato et al., 2018) used three kinds of features: SentiStrength lexicons, n-gram keywords extracted from the dataset, and word representations in a distributional semantic model (DSM) trained exclusively on Stack Overflow data. Four prototype vectors, namely positive polarity word vectors, negative polarity word vectors, neutral polarity word vectors, and the sum of positive and negative word vectors, were calculated to extract semantic features. Finally, a Support Vector Machine (SVM) was used to identify sentiment polarities.
Although supervised approaches performed better than unsupervised approaches, they failed to properly predict sentiment in cases where sentiment-heavy words were not included in their training datasets.
2.3 Pre-Trained Models
Some popular transformer-based pre-trained models used for sentiment analysis are described here.
BERT: A deep learning model designed to learn contextual word representations from unlabeled text (Devlin et al., 2018). It is based on the transformer architecture but does not have the decoder; rather, it contains only a multi-layer bidirectional transformer encoder. The model is pre-trained by optimizing two tasks: masked language modeling (MLM) and next sentence prediction (NSP). Originally there were two implementations of BERT: BERT-Base with 12 layers, 12 self-attention heads, 110M parameters, and a hidden layer size of 768, and BERT-Large with 24 layers, 16 self-attention heads, 340M parameters, and a hidden layer size of 1024. Our work uses BERT-Base.
RoBERTa: A robustly optimized BERT that changed the pre-training procedure: it trains with larger mini-batches over more data for longer, removes the Next Sentence Prediction objective, trains on longer sequences, and uses dynamic masking. When Liu et al. (Liu et al., 2019) released it, it achieved state-of-the-art results on several benchmarks, surpassing BERT.
XLNet: Based on Transformer-XL, it uses the segment recurrence mechanism and relative positional encoding. To address the individual weaknesses of autoregressive language modeling and autoencoding, it combines their strengths (Yang et al., 2019). XLNet performed better than BERT on several tasks, including sentiment analysis, especially for long texts.
Although the size of the training dataset plays a crucial role in fine-tuning these large models, we observed fewer than one thousand samples even in some of the gold-standard datasets used in the community.
2.4 Data Augmentation and Ensembling
of Pre-Trained Models
Batra et al. (Batra et al., 2021) used data augmentation through lexicon-based substitution and back translation as a pre-processing step to help train and fine-tune BERT variants: BERT, RoBERTa, and ALBERT. Then, a weighted voting scheme was applied to the final Softmax-layer outputs of the BERT variants to ensemble the models and obtain the final weighted prediction. However, class imbalance was evident in the datasets used in these approaches: the sample sizes for the different sentiment polarities, i.e., positive, negative, and neutral, differed by a large margin. As a result, this issue caused supervised approaches to perform poorly.
There are several strategies in the community to tackle this issue. One line of work is the Easy Data Augmentation (EDA) approach proposed by Wei and Zou (Wei and Zou, 2019). The authors evaluated this technique and found that the improvement it brings grows as the dataset size shrinks, i.e., smaller datasets benefit more.
ICSOFT 2023 - 18th International Conference on Software Technologies
440
Table 1: Distribution of the datasets over the classes.
Dataset Total Positive Neutral Negative
App Review 341 186 25 130
Stack Overflow 1,500 131 1,191 178
Jira 926 290 0 636
Github 7,122 2,013 3,022 2,087
Table 2: Data samples from the datasets under study.
Dataset | Sample from dataset | Usage in natural language
Github | "Seems like something is leaking memory :(" | "The plumber fixed the leaking pipe."
Stack Overflow | "I am attempting to get my JUnit tests for an Android application running using Ant." | "Ants are all over my food."
Jira | "safety and scope beans to current thread." | "White thread should have been used here."
App Reviews | "android 4.2 jelly bean can't install error 909. . i'll give you five stars if u fix it" | "Jelly bean is one of my favorite pastime foods while watching movies."
3 DATASET
In our study, we used four gold-standard publicly
available datasets with annotated sentiment polarity.
The distribution of the datasets over the classes
(positive, neutral, and negative) is shown in Table 1.
Some samples from these datasets are highlighted in
Table 2 that demonstrate how software engineering
context can influence the meaning of different words.
Jira Issues. The original dataset (Ortu et al., 2015) had four emotion labels: love, joy, anger, and sadness, which Lin et al. (Lin et al., 2018) reduced to two labels by annotating sentences expressing love and joy with positive polarity and sentences expressing anger and sadness with negative polarity.
App Reviews. 3,000 reviews were presented by Villarroel et al. (Villarroel et al., 2016), from which 341 reviews were randomly selected by Lin et al. (Lin et al., 2018). The sample size corresponds to a 5% confidence interval at a 95% confidence level, making it statistically representative.
Stack Overflow Posts. There are 1,500 sentences in total. This dataset was gathered by Lin et al. (Lin et al., 2018) from a July 2017 Stack Overflow dump. They chose threads that are (i) labeled with Java and (ii) include one of the terms library, libraries, or API(s). After that, they chose 1,500 sentences at random and classified their sentiment polarities manually.
GitHub Data. The dataset has 7,122 sentences extracted from GitHub commit comments and pull requests. An iterative extraction was performed on the dataset of Pletea et al. (Pletea et al., 2014) by Novielli et al. (Novielli et al., 2020) to obtain annotated text units.
4 METHODOLOGY
In this section, we describe the proposed
methodology to detect the sentiment polarities.
It consists of three components. At first, the
implementation details of the adopted augmentation
approaches are discussed (4.1). Secondly, we
elaborate on the fine-tuning step (4.2). Finally,
we describe the implementation of the ensemble
approach (4.3). An overview of our methodology
is shown in Figure 1. The following subsections
describe these three components with further details.
4.1 Augmentation
Motivated by the work of Wei and Zou (Wei and Zou, 2019), we adopted their augmentation technique to mitigate the problems of small dataset size and class imbalance over the samples. The technique uses three operations to generate new samples from the given samples.
Random Synonym Replacement: It was used to replace random words in a given sample (sentence) with their synonyms. We applied two different techniques. One was to replace randomly selected words with synonyms found in the Natural Language Toolkit (NLTK) WordNet library (Miller, 1995). The other was to replace SE-specific words with the most similar words in the SE context provided by a word2vec model trained specifically on software engineering text by Mishra and Sharma (Mishra and Sharma, 2021). Some of the augmented results are shown in Table 4.
Random Deletion: We randomly deleted a word
from the given sentence. In addition, if the given
sample contained duplicated words, those words
were also discarded.
Random Insertion: Like the other two operations, we randomly inserted words into a given sample. To do this, we first picked a random word from the given sentence and found a synonym of that word. Finally, we inserted this synonym at a random position. A sketch of these three operations is given below.
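The following is a minimal, hedged sketch of how these three operations could be implemented in Python. It assumes NLTK's WordNet corpus is installed and that a gensim word2vec model trained on software engineering text (as in Mishra and Sharma, 2021) is available; the file name "se_w2v.model" and the function names are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the three augmentation operations; helper names and the
# SE word2vec path are illustrative assumptions, not the authors' implementation.
import random
from nltk.corpus import wordnet
from gensim.models import Word2Vec

se_w2v = Word2Vec.load("se_w2v.model")  # hypothetical path to the SE-specific word2vec model


def wordnet_synonyms(word):
    """Collect WordNet synonyms of a word, excluding the word itself."""
    lemmas = {l.name().replace("_", " ") for s in wordnet.synsets(word) for l in s.lemmas()}
    return sorted(lemmas - {word})


def synonym_replacement(words, n=1, use_se_vocab=False):
    """Replace up to n random words with a WordNet or SE-word2vec synonym."""
    new_words = words[:]
    candidates = [w for w in set(words) if w.isalpha()]
    random.shuffle(candidates)
    replaced = 0
    for word in candidates:
        if use_se_vocab and word in se_w2v.wv:
            synonym = se_w2v.wv.most_similar(word, topn=1)[0][0]
        else:
            options = wordnet_synonyms(word)
            if not options:
                continue
            synonym = random.choice(options)
        new_words = [synonym if w == word else w for w in new_words]
        replaced += 1
        if replaced >= n:
            break
    return new_words


def random_deletion(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]


def random_insertion(words, n=1):
    """Insert a synonym of a random word at a random position, n times."""
    new_words = words[:]
    for _ in range(n):
        options = wordnet_synonyms(random.choice(new_words))
        if options:
            new_words.insert(random.randint(0, len(new_words)), random.choice(options))
    return new_words
```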
We performed two types of augmentation only on the
training dataset to generate a new dataset from the
given raw dataset.
Basic Augmentation: It was used to tackle the issue of the dataset being small in size. We randomly picked a sample from the dataset and used one of the three aforementioned operations to generate a new sample. However, this did not solve the class imbalance issues of the datasets.
Figure 1: Overview of the proposed methodology.
Figure 2: Class frequency distributions for all datasets: (a) App Reviews, (b) Github, (c) Stack Overflow, (d) Jira.
Controlled Augmentation: Class imbalance can, in many cases, bias machine learning models. So, in controlled augmentation, we applied the augmentation operations to balance the class-wise frequency distribution: for each dataset, we augmented text until every class had an equal number of samples. The distribution of samples over the classes after applying the three operations and the two types of augmentation is shown in Figure 2.
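As a rough illustration, a controlled-augmentation driver along these lines could look as follows; `train_df` (a pandas DataFrame with "text" and "label" columns) and `augment_once` are assumed helpers, not the authors' implementation, and they reuse the operations sketched above.

```python
# Hedged sketch of controlled augmentation: oversample every class up to the
# size of the largest class using the operations sketched above.
import random
import pandas as pd


def augment_once(text):
    """Apply one randomly chosen augmentation operation to a sentence."""
    op = random.choice([synonym_replacement, random_deletion, random_insertion])
    return " ".join(op(text.split()))


def controlled_augmentation(train_df):
    """Augment minority classes until all classes have equal sample counts."""
    target = train_df["label"].value_counts().max()
    frames = [train_df]
    for label, group in train_df.groupby("label"):
        texts = group["text"].tolist()
        new_rows = [{"text": augment_once(random.choice(texts)), "label": label}
                    for _ in range(target - len(group))]
        if new_rows:
            frames.append(pd.DataFrame(new_rows))
    return pd.concat(frames, ignore_index=True)
```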
Table 3: Pre-trained transformer models and configurations.
Architecture Used Model Parameters Layers Hidden Heads
BERT bert-base-uncased 110M 12 768 12
RoBERTa roberta-base 120M 12 768 12
XLNet xlnet-base-cased 110M 12 768 12
4.2 Fine-Tuning Transformer-Based
Models
The second step in our research methodology was
to fine-tune the pre-trained transformer-based models
for the downstream tasks. To do this, we split
each dataset under our study into 70% for the
train-set and 30% for the test-set. We fine-tuned three transformer-based models, BERT, RoBERTa, and XLNet, separately with the train-sets split from the raw datasets as well as from the augmented datasets (generated using the basic and controlled augmentation approaches). We refer to these models collectively as pre-trained transformers (PTT). The configurations of these models are shown in Table 3.
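A hedged sketch of this fine-tuning step using the Hugging Face Transformers library is shown below; the model names follow Table 3 and the 70/30 split follows the text, while the training hyper-parameters (epochs, batch size) are illustrative assumptions, since the paper does not report them.

```python
# Hedged sketch of fine-tuning one pre-trained transformer (PTT) on a labeled
# SE dataset; hyper-parameters are assumptions, not values from the paper.
import torch
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


def fine_tune(model_name, texts, labels, num_labels=3):
    # Random 70/30 train/test split, as described in Section 4.2.
    train_x, test_x, train_y, test_y = train_test_split(texts, labels, test_size=0.30)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    train_ds = SentimentDataset(tokenizer(train_x, truncation=True, padding=True), train_y)
    test_ds = SentimentDataset(tokenizer(test_x, truncation=True, padding=True), test_y)
    args = TrainingArguments(output_dir="out-" + model_name.replace("/", "-"),
                             num_train_epochs=4, per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
    trainer.train()
    return trainer


# e.g. fine_tune("bert-base-uncased", texts, labels); likewise for
# "roberta-base" and "xlnet-base-cased" (Table 3).
```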
4.3 Ensemble
We proposed a stacking-based ensemble technique to aggregate the predictions of the different transformer-based models. Because each of the models that are part of the ensemble works in a distinct way, the ensemble captures a substantially broader range of predictions.
Table 4: Data augmentation: SE-specific synonym replacement utilizing word2vec (Mishra and Sharma, 2021).
Original | Augmented
"A perfect timing. Just a second before my commit :)" | "A perfect timing. Just the third during my rollback :)"
"don't pull. master is broken! will fix soon." | "don't pull. slave is broken! will fix soon."
Each model was pre-trained on a particular language
modeling task, such as BERT’s next sentence
prediction, XLNet’s auto-regressive approach, and
RoBERTa's dynamic masking. At first, we considered every sample as a bag of words and encoded it with positional encoding, as represented in Eq. 1.
$sample_i = [w_1, w_2, w_3, w_4, \ldots, w_n]$ (1)
Then, we fed these n samples into the m transformer-based pre-trained models and took the softmax outputs of the last layer of these models, as shown in Eqs. 2 and 3. These outputs were concatenated and treated as a feature vector (FV), which contains the contextual representation of the given input samples.
$FV = \bigoplus_{i=1}^{m} \bigoplus_{j=1}^{n} \mathrm{softmax}(model_i(sample_j)), \quad models = \{BERT, XLNet, RoBERTa\}$ (2)

where $\bigoplus$ denotes concatenation of the softmax output vectors.
$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$ (3)
Finally, we detected the sentiment polarities with two well-known classifiers, Random Forest (RF) and Logistic Regression (LR), using Eq. 4.

$\hat{Y}_i = classifier_i(FV), \quad classifiers = \{RF, LR\}$ (4)
However, as three of the datasets (Stack Overflow,
App review, and Github) under our study have three
classes and LR is designed for binary classification,
we adopted the one-vs-rest (OvR) method for
multi-class classification.
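A minimal sketch of this stacking ensemble (Eqs. 1-4) is given below, assuming the base models are passed as (model, tokenizer) pairs produced by the fine-tuning step; the helper names and hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the stacking ensemble: the softmax outputs of the fine-tuned
# base models are concatenated into a feature vector (FV, Eq. 2) and a Random
# Forest or one-vs-rest Logistic Regression meta-classifier is trained on it.
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


def predict_probabilities(model, tokenizer, texts):
    """Last-layer softmax scores (Eq. 3) of one fine-tuned base model."""
    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()


def build_feature_vectors(base_models, texts):
    """Concatenate the softmax outputs of all base models into FV (Eq. 2)."""
    return np.hstack([predict_probabilities(m, tok, texts) for m, tok in base_models])


def train_ensemble(base_models, train_texts, train_labels, kind="rf"):
    fv = build_feature_vectors(base_models, train_texts)
    if kind == "rf":
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
    else:
        # LR is binary by design, so wrap it in one-vs-rest for the 3-class datasets.
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(fv, train_labels)
    return clf


def predict_polarities(clf, base_models, texts):
    """Final sentiment polarity predictions (Eq. 4)."""
    return clf.predict(build_feature_vectors(base_models, texts))
```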
5 EVALUATION AND
DISCUSSION
In this section, we report, compare, and analyze the performance of the pre-trained transformer-based models and the ensemble models on the four datasets described in Section 3. It is evident from Figure 2 that three of the four datasets (App Reviews, Stack Overflow, and Jira) have a small number of samples and a significant class imbalance issue. Thus, we conducted augmentation on these three datasets, but not on the Github (GH) dataset.
we reported the performance of the approaches we
used on the original version, the basic augmented
version, and the controlled augmented version. For
comparison, we also showed the performance of
SentiStrength-SE as a representative of the lexicon
approach-based sentiment analysis tool and SentiCR
as a representative of the supervised approach-based
sentiment analysis tool only on the original version of
the datasets. We highlighted the best performance in
terms of the two main metrics (i.e., weighted-average
F1 scores and macro-average F1 scores) in bold.
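For reference, the two reported metrics can be computed with scikit-learn as in the hedged snippet below, where `y_true` and `y_pred` are toy placeholders for the test-set labels and model predictions.

```python
# Hedged illustration of the two evaluation metrics used throughout Section 5;
# y_true and y_pred are toy placeholders, not results from the paper.
from sklearn.metrics import f1_score

y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]

weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted-average F1
macro_f1 = f1_score(y_true, y_pred, average="macro")        # macro-average F1
```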
We answered the research questions based on the
experimental results as follows.
5.1 RQ1: Do the Ensemble Models
Outperform the Pre-trained Models
on the Original Datasets?
We answer RQ1 by analyzing the performance of
ensemble models and PTTs only on the original
version of the four datasets.
5.1.1 App-Review Dataset
The ensemble models outperform the PTT
approaches in both weighted and macro-average
F1 scores. The best-performing PTT approach is BERT for both F1 scores: it achieves weighted- and macro-average F1 scores of 63% and 44%, respectively (Table 5). The RF-based ensemble model achieves weighted- and macro-average F1 scores of 71% and 51%, respectively, while the LR-based ensemble model achieves 69% and 49%, respectively. Our results show that the RF-based ensemble model outperforms the PTT approaches by 8%-50% in terms of weighted-average F1 score and by 7%-33% in terms of macro-average F1 score, where the range is determined by the improvement over the best and worst scores of the PTT approaches.
dataset is small and the class distribution is highly
imbalanced, the ensemble models still perform better
compared to the other approaches.
5.1.2 Stack Overflow Dataset
The best-performing PTT approach for this dataset is RoBERTa, considering both weighted (89%) and macro-average (75%) F1 scores, which is better than the performance achieved by the ensemble models. The LR-based ensemble model performs better than the RF-based one and achieves weighted- and macro-average F1 scores of
Table 5: Results for the App Review dataset.
Dataset
Class Negative Neutral Positive Weighted-average Macro-average
Model P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%)
App Review
Sentistrength-SE 86 28 42 10 38 15 72 79 75 73 57 58 56 48 44
SentiCR 77 79 78 14 12 13 84 84 84 76 77 77 58 58 58
BERT 62 49 55 0 0 0 68 87 77 61 66 63 43 45 44
RoBERTa 38 100 55 0 0 0 0 0 0 14 38 21 13 33 18
XLNet 50 56 53 0 0 0 65 68 66 54 58 56 38 41 40
Ensemble (RF) 64 86 73 0 0 0 84 74 79 70 73 71 49 53 51
Ensemble (LR) 65 70 67 0 0 0 76 82 79 67 72 69 47 51 49
App Review
Basic
Augmentation
BERT 85 91 88 50 12 20 91 95 93 86 88 86 75 66 67
RoBERTa 77 100 87 0 0 0 98 89 93 83 87 84 58 63 60
XLNet 80 95 87 0 0 0 93 89 91 82 85 83 58 61 59
Ensemble (RF) 87 95 91 40 25 31 95 94 94 88 89 89 74 71 72
Ensemble (LR) 85 95 90 0 0 0 94 94 94 84 88 86 60 63 61
App Review
Controlled
Augmentation
BERT 86 88 87 43 38 40 92 92 92 86 87 87 74 73 73
RoBERTa 83 91 87 33 12 18 94 95 94 85 88 86 70 66 66
XLNet 95 86 90 33 25 29 87 95 91 86 87 86 72 69 70
Ensemble (RF) 86 88 87 60 38 46 92 95 94 88 88 88 80 74 76
Ensemble (LR) 84 88 86 80 50 62 94 95 94 89 89 89 86 78 81
Table 6: Results for the Stack Overflow dataset.
Dataset
Class Negative Neutral Positive Weighted-average Macro-average
Model P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%)
Stack Overflow
Sentistrength-SE 38 14 20 82 92 87 29 23 26 72 77 74 50 43 44
SentiCR 42 58 49 90 85 87 46 44 45 80 78 79 59 62 60
BERT 69 42 53 86 97 91 80 28 41 83 84 82 78 56 62
RoBERTa 82 71 76 91 97 94 86 42 56 89 89 89 86 70 75
XLNet 80 20 32 81 99 89 0 0 0 74 81 75 54 40 41
Ensemble (RF) 74 58 65 88 96 92 77 40 52 86 86 85 80 64 70
Ensemble (LR) 77 63 69 89 97 93 81 40 53 87 88 87 82 66 72
Stack Overflow
Basic
Augmentation
BERT 67 54 60 90 95 93 73 56 63 86 87 86 77 68 72
RoBERTa 67 69 68 92 92 92 59 56 57 86 86 86 73 73 73
XLNet 75 68 71 91 95 93 73 56 63 88 88 88 80 73 76
Ensemble (RF) 69 58 63 90 94 92 66 58 62 86 86 86 75 70 72
Ensemble (LR) 70 64 67 91 94 93 69 56 62 87 87 87 77 72 74
Stack Overflow
Controlled
Augmentation
BERT 70 68 69 91 92 91 58 51 54 85 86 85 73 70 72
RoBERTa 74 63 68 90 96 93 86 56 68 88 88 88 83 71 76
XLNet 75 46 57 88 97 92 83 47 60 86 86 85 82 63 70
Ensemble (RF) 71 57 65 89 96 91 78 48 61 86 85 86 80 68 73
Ensemble (LR) 77 63 69 90 96 93 81 51 63 88 88 88 83 70 75
89% and 72%, respectively. Although RoBERTa
outperforms the ensemble models, the ensemble
models outperform the other two PTT approaches.
More specifically, as illustrated in Table 6 the
LR-based ensemble model outperforms the other
two PTT approaches by 5%-12% in terms of
weighted-average F1 score and by 10%-31% in terms
of macro-average F1 score. This dataset also has a
high imbalance in class distribution.
5.1.3 Jira Dataset
The ensemble models outperform the PTT
approaches in both weighted- and macro-average
F1 scores here as well. The best-performing PTT approach is RoBERTa, which achieves an F1 score of 95% for both the weighted and macro averages. Both the RF- and LR-based ensemble models achieve a weighted-average F1 score of 96%, but the macro-average F1 score of the RF-based ensemble model is 96%, which is better than the 95% of the LR-based ensemble model. Table 7 clearly shows that
the RF-based ensemble model outperforms the
best-performing PTT approach by 1%-6% in terms of
weighted-average F1 score and by 1%-8% in terms
of macro-average F1 score.
5.1.4 GitHub Dataset
This dataset is the largest in size and also has the most balanced class distribution, which is reflected in our results as well. All the approaches perform relatively well on this dataset. The ensemble models bring
minor improvements to the already well-performing
PTT approaches. Here, the best-performing PTT
approach is BERT, which can achieve an F1 score of
91% for both weighted and macro-average (Table 8).
The LR-based ensemble model also can achieve an F1
score of 91% for both weighted- and macro-average.
But the RF-based ensemble model outperforms them
slightly. It can achieve an F1 score of 92% for
both weighted and macro-average. So the RF-based
ensemble model outperforms the best performing
PTT approach by 1%-2% for both weighted- and
macro-average F1 scores.
Table 7: Results for the Jira dataset.
Dataset
Class Negative Positive Weighted-average Macro-average
Model P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%)
Jira
Sentistrength-SE 99 72 84 62 99 76 88 81 81 81 86 80
SentiCR 95 98 97 96 90 92 95 95 95 95 94 95
BERT 94 97 95 93 85 89 93 93 93 93 91 92
RoBERTa 98 96 97 91 95 93 96 95 95 94 95 95
XLNet 89 100 94 99 72 83 92 91 90 94 86 88
Ensemble (RF) 96 99 97 97 92 94 96 96 96 96 95 96
Ensemble (LR) 96 98 97 96 91 93 96 96 96 96 94 95
Jira
Basic
Augmentation
BERT 98 99 98 97 96 96 98 98 98 97 97 97
RoBERTa 97 99 98 98 94 96 97 97 97 98 96 97
XLNet 99 99 99 97 97 97 98 98 98 98 98 98
Ensemble (RF) 98 99 99 98 96 97 98 98 98 98 97 97
Ensemble (LR) 98 100 99 99 96 97 98 98 98 99 98 98
Jira
Controlled
Augmentation
BERT 96 100 98 100 92 96 97 97 97 98 96 97
RoBERTa 98 99 98 98 95 96 98 98 98 98 97 97
XLNet 98 95 96 89 96 92 95 95 95 94 95 94
Ensemble (RF) 99 99 99 97 98 97 98 98 98 98 98 98
Ensemble (LR) 98 99 99 98 96 97 98 98 98 98 97 98
Table 8: Results for the Github dataset.
Dataset
Class Negative Neutral Positive Weighted-average Macro-average
Model P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%)
Github
Sentistrength-SE 78 73 76 77 85 81 86 77 81 80 79 79 80 79 79
SentiCR 89 67 76 78 92 84 87 85 86 83 83 82 84 81 82
BERT 90 89 89 91 91 91 93 93 93 91 91 91 91 91 91
RoBERTa 89 87 88 91 90 90 90 94 92 90 90 90 90 90 90
XLNet 87 88 88 93 88 90 89 95 92 90 90 90 89 90 90
Ensemble (RF) 91 90 90 92 91 91 92 94 93 92 92 92 92 92 92
Ensemble (LR) 91 89 90 91 91 91 92 94 93 91 91 91 91 91 91
RQ1 Findings:
The ensemble models outperform the pre-trained transformer-based models by a significant margin for three out of the four datasets, showing that ensemble models can perform better than individual transformer models despite class imbalance.
In the Stack Overflow dataset, RoBERTa,
which is a PTT approach, performs better
than the ensemble models.
5.2 RQ2: Does Data Augmentation
Have any Impact on the
Performances of the Models?
We answer RQ2 by comparing and analyzing the
performance of PTT and the ensemble models on the
augmented versions of the datasets.
5.2.1 App-Review Dataset
In the basic augmentation approach, all the PTT
and the ensemble approaches outperform the results
achieved by training the models using the original
train-sets in terms of both weighted-average and
macro-average F1 scores. The best performance
improvement is achieved for the BERT-based
PTT approach with a 23% improvement on
weighted-average and macro-average F1 scores.
The RF ensemble approach, however, achieves the best results compared to the PTT approaches, with weighted-average and macro-average F1 scores of 89% and 72%, respectively, as shown in Table 5. The LR ensemble model also improves compared to when trained on the original train-set but does not achieve the best results. The most noticeable improvement is for the neutral class: all the approaches fail to predict any neutral samples on the original dataset, and augmentation helps in this case, with BERT and the RF ensemble approach improving in neutral sample detection. Overall, the results show that data augmentation enhances the performance of the models on this dataset.
We observe the best results for the controlled augmentation approach, especially for the classes with the fewest samples. In the controlled augmentation approach, all the PTT approaches achieve improved weighted-average and macro-average F1 scores. The LR ensemble approach achieves the best result, with weighted-average and macro-average F1 scores of 89% and 81%, respectively. The number of samples for the neutral class was significantly lower than for the other classes in the original train-set, which explains the poor performance for that class across all the approaches in Table 5. The controlled augmentation approach enables all the models to improve their performance on the neutral class.
5.2.2 Stack Overflow Dataset
In this dataset, we observe a different trend from the App Review dataset when the basic as well as the controlled augmentation approaches are applied. We do not observe a general improvement from either the basic or the controlled augmentation approach. This is because the class imbalance is higher than in the Jira and App Reviews datasets, so when data augmentation is applied multiple times to oversample the under-represented classes, the quality of the generated samples degrades. The best-performing approach in terms of macro-average F1 score is the XLNet-based PTT, but this represents only a slight 1% improvement, obtained when basic augmentation is applied to the XLNet-based PTT approach. The ensemble approaches do not improve upon the results of the PTT approaches either.
5.2.3 Jira Dataset
Compared to when trained on the original train-set, we observe that augmentation helps improve the results for all the PTT as well as ensemble approaches. The best improvement, 8% in weighted-average F1 score and 10% in macro-average F1 score, is achieved by the xlnet-base-cased PTT model. We get the best results for the LR ensemble approach, which achieves weighted-average and macro-average F1 scores of 98%. Similar to the App Review dataset, we observe significant improvement through augmentation for this dataset as well. Unlike the App Review dataset, however, when the controlled augmentation approach is applied to the Jira dataset, we do not observe further improvement across the PTT and ensemble approaches. The likely reason, based on our observation, is that basic augmentation had already increased the dataset size, and the class imbalance was not as significant as in the App Review dataset to begin with; for this reason, we observe better performance with the basic augmentation approach.
RQ2 Findings:
The augmentation approaches aid the performance of all the PTT approaches as well as the ensemble approaches in two out of three datasets in terms of weighted- and macro-average F1 scores, with the most significant improvements for the undersampled classes.
6 LIMITATIONS
One of the limitations was that we only used datasets
that were publicly available from previous works. As
a result, we were not able to ensure the quality of the
manual annotations. This limitation also extended
to our data augmentation technique. As an example,
the text “Can’t compile X64 under vs2010, returnng
this error : 5>d:\mangos\mangos\src\game
\spellmgr.cpp(1226): error C4716:
‘DoSpellProcEvent::AddEntry’ : must return a
value” from the datasets under study may incline
more towards a negative sentiment but was annotated
as neutral. As mentioned in Section 4, we augmented
existing data by adopting various approaches. So the
quality of the augmented data relied on the quality of
the original dataset.
Another potential limitation of our work is related to the random splitting of data in our experimental setup. We split each dataset randomly into a 70-30 ratio, where 70% was used to train and the remaining 30% was used to test. As each split is random, the results might vary between runs. This can be addressed by using more rigorous techniques like k-fold cross-validation, which we plan to do in the future.
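A minimal sketch of such a k-fold setup is shown below, assuming scikit-learn and using a stratified variant (an assumption on our part, chosen because it also respects the class distribution); the toy data and the number of splits are illustrative only.

```python
# Hedged sketch of the k-fold evaluation mentioned as future work; texts and
# labels are toy placeholders, and n_splits=3 is used only so the toy data runs
# (k=5 or 10 would be typical on the real datasets).
import numpy as np
from sklearn.model_selection import StratifiedKFold

texts = np.array(["sample 1", "sample 2", "sample 3", "sample 4", "sample 5", "sample 6"])
labels = np.array([0, 1, 0, 1, 0, 1])

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels)):
    # Fine-tune the PTTs and the ensemble on texts[train_idx],
    # then report F1 scores on texts[test_idx]; average over folds.
    print(fold, train_idx, test_idx)
```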
7 CONCLUSION AND FUTURE
WORKS
We provided a comparative study of the performance of ensembled models and pre-trained transformer-based models in terms of weighted- and macro-average F1 scores. We also investigated the need for and effect of data augmentation in fine-tuning ensembled and pre-trained models. The fine-tuning and evaluation were performed on four gold-standard datasets. Experimental
results revealed that ensemble models perform better
than individual pre-trained models in three out of
the four original datasets, i.e., datasets without any
augmentation. Our results also demonstrated that the
augmentation approaches aid the performances of all
the pre-trained transformer models along with the
ensemble ones in two out of three datasets.
Ensemble models show potential for better performance from a software engineering perspective; thus, more ensemble techniques can be explored in the future. Besides, the existing datasets in recent studies have noticeable class imbalance issues, so we also aim to explore other approaches for handling dataset imbalance. The quality and size of the datasets are also important factors, and further experiments can be conducted on generating a larger dataset with high-quality annotation.
REFERENCES
Ahmed, T., Bosu, A., Iqbal, A., and Rahimi, S. (2017). Sen-
ticr: a customized sentiment analysis tool for code re-
view interactions. In 2017 32nd IEEE/ACM Interna-
tional Conference on Automated Software Engineer-
ing (ASE), pages 106–111. IEEE.
Batra, H., Punn, N. S., Sonbhadra, S. K., and Agarwal, S.
(2021). Bert-based sentiment analysis: A software
engineering perspective. In International Conference
on Database and Expert Systems Applications, pages
138–148. Springer.
Calefato, F., Lanubile, F., Maiorano, F., and Novielli,
N. (2018). Sentiment polarity detection for soft-
ware development. Empirical Software Engineering,
23(3):1352–1382.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Huq, S. F., Sadiq, A. Z., and Sakib, K. (2020). Is developer
sentiment related to software bugs: An exploratory
study on github commits. In 2020 IEEE 27th Inter-
national Conference on Software Analysis, Evolution
and Reengineering (SANER), pages 527–531. IEEE.
Islam, M. R. and Zibran, M. F. (2016). Towards under-
standing and exploiting developers’ emotional varia-
tions in software engineering. In 2016 IEEE 14th In-
ternational Conference on Software Engineering Re-
search, Management and Applications (SERA), pages
185–192. IEEE.
Islam, M. R. and Zibran, M. F. (2017). Leveraging au-
tomated sentiment analysis in software engineering.
In 2017 IEEE/ACM 14th International Conference on
Mining Software Repositories (MSR), pages 203–214.
IEEE.
Islam, M. R. and Zibran, M. F. (2018a). Deva: sensing emo-
tions in the valence arousal space in software engi-
neering text. In Proceedings of the 33rd annual ACM
symposium on applied computing, pages 1536–1543.
Islam, M. R. and Zibran, M. F. (2018b). Sentistrength-se:
Exploiting domain specificity for improved sentiment
analysis in software engineering text. Journal of Sys-
tems and Software, 145:125–146.
Lin, B., Zampetti, F., Bavota, G., Di Penta, M., Lanza, M.,
and Oliveto, R. (2018). Sentiment analysis for soft-
ware engineering: How far can we go? In Proceed-
ings of the 40th international conference on software
engineering, pages 94–104.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach. arXiv preprint arXiv:1907.11692.
Medhat, W., Hassan, A., and Korashy, H. (2014). Sentiment
analysis algorithms and applications: A survey. Ain
Shams engineering journal, 5(4):1093–1113.
Miller, G. A. (1995). Wordnet: a lexical database for en-
glish. Communications of the ACM, 38(11):39–41.
Mishra, S. and Sharma, A. (2021). Crawling wikipedia
pages to train word embeddings model for software
engineering domain. In 14th Innovations in Soft-
ware Engineering Conference (formerly known as In-
dia Software Engineering Conference), pages 1–5.
Novielli, N., Calefato, F., Dongiovanni, D., Girardi, D., and
Lanubile, F. (2020). Can we use se-specific sentiment
analysis tools in a cross-platform setting? In Proceed-
ings of the 17th International Conference on Mining
Software Repositories, pages 158–168.
Ortu, M., Adams, B., Destefanis, G., Tourani, P., Marchesi,
M., and Tonelli, R. (2015). Are bullies more produc-
tive? empirical study of affectiveness vs. issue fixing
time. In 2015 IEEE/ACM 12th Working Conference on
Mining Software Repositories, pages 303–313. IEEE.
Pennacchiotti, M. and Popescu, A.-M. (2011). Democrats,
republicans and starbucks afficionados: user classifi-
cation in twitter. In Proceedings of the 17th ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 430–438.
Pletea, D., Vasilescu, B., and Serebrenik, A. (2014). Secu-
rity and emotion: sentiment analysis of security dis-
cussions on github. In Proceedings of the 11th work-
ing conference on mining software repositories, pages
348–351.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning,
C. D., Ng, A. Y., and Potts, C. (2013). Recursive
deep models for semantic compositionality over a sen-
timent treebank. In Proceedings of the 2013 confer-
ence on empirical methods in natural language pro-
cessing, pages 1631–1642.
Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., and Kap-
pas, A. (2010). Sentiment strength detection in short
informal text. Journal of the American society for in-
formation science and technology, 61(12):2544–2558.
Villarroel, L., Bavota, G., Russo, B., Oliveto, R., and
Di Penta, M. (2016). Release planning of mobile apps
based on user reviews. In 2016 IEEE/ACM 38th Inter-
national Conference on Software Engineering (ICSE),
pages 14–24. IEEE.
Wei, J. and Zou, K. (2019). Eda: Easy data augmentation
techniques for boosting performance on text classifi-
cation tasks. arXiv preprint arXiv:1901.11196.
Wrobel, M. R. (2016). Towards the participant observa-
tion of emotions in software development teams. In
2016 Federated Conference on Computer Science and
Information Systems (FedCSIS), pages 1545–1548.
IEEE.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R. R., and Le, Q. V. (2019). Xlnet: Generalized au-
toregressive pretraining for language understanding.
Advances in neural information processing systems,
32.
Zhang, T., Xu, B., Thung, F., Haryono, S. A., Lo, D., and
Jiang, L. (2020). Sentiment analysis for software en-
gineering: How far can pre-trained transformer mod-
els go? In 2020 IEEE International Conference on
Software Maintenance and Evolution (ICSME), pages
70–80. IEEE.