used to determine the best model for sentiment anal-
ysis as well as to improve the performance bench-
marks. Previous works primarily focus on feeding
numerical vectors from pre-trained word embeddings
into the LSTM models. This work modifies the
LSTM architecture to learn the numerical represen-
tations directly in the embedding layer. Unlike earlier
research, in this study, we offer a full statistical analy-
sis to back up the conclusions using statistical testing.
The impact of class-balancing strategies on datasets to
develop more accurate models is looked into, as well
as which class-balancing methodology is best suited
to software engineering. We have used Friedman test
to confirm the observations drawn from the results of
the different models trained. The objective is to prove
which LSTM model is best suited to sentiment anal-
ysis of software engineering artifacts for research or
use in industry.
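As a minimal sketch of this idea, assuming a Keras-style implementation (the vocabulary size, embedding dimension, padding size, and layer sizes below are illustrative assumptions, not the configuration evaluated in this paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumed vocabulary size
EMBED_DIM = 128      # assumed embedding dimension
MAX_LEN = 100        # assumed padding size for input sequences

# The Embedding layer is trained jointly with the LSTM, so word
# representations are learned from the sentiment task itself rather
# than copied from pre-trained word vectors.
model = tf.keras.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),  # binary sentiment output
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Inputs are integer-encoded comments padded/truncated to MAX_LEN, e.g.:
# x = tf.keras.preprocessing.sequence.pad_sequences(encoded, maxlen=MAX_LEN)
```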
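Similarly, a minimal sketch of how the Friedman test can compare several models evaluated on the same datasets, using SciPy (the scores are made-up placeholders, not results from this study):

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracy scores for three models on the same five
# evaluation datasets (one list per model, one entry per dataset).
model_a = [0.81, 0.79, 0.84, 0.77, 0.82]
model_b = [0.78, 0.75, 0.80, 0.74, 0.79]
model_c = [0.85, 0.83, 0.88, 0.81, 0.86]

# Null hypothesis: all models perform equally across the datasets;
# a small p-value suggests at least one model differs significantly.
stat, p_value = friedmanchisquare(model_a, model_b, model_c)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```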
Sections 2-5 of this work are organized as follows: Section 2 presents a literature survey of several approaches to sentiment analysis for our target applications. Section 3 describes the dataset, experimental setup, and study design, while Section 4 presents a comparative analysis of the research outcomes, comparing the LSTM models with other RNN structures as well as the class-balancing strategies. Finally, Section 5 summarizes the research findings and suggests directions for future work in this area.
2 RELATED WORK
2.1 Sentiment Analysis Techniques
Shen et al. (Shen et al., 2019) analyzed the factors of dataset composition and study design that contribute to the insufficient accuracy of sentiment prediction. The authors noted a jump in accuracy from 0.758 to 0.934 on binary classification of emotions in comments from the Stack Overflow dataset when the models were trained on a domain-specific dataset rather than a non-technical one. According to the study, another factor contributing to sub-par results is class imbalance in the training dataset, a flaw from which almost all previous evaluation works have suffered.
Novielli et al. (Novielli et al., 2020) evaluated the performance of four existing sentiment analysis tools. They reported that retraining these tools did not produce satisfactory results when the training and test samples were retrieved from different sources, and concluded that retraining domain-specific sentiment analysis tools is not a sound approach in such cross-source settings. When retraining is not feasible owing to the lack of a gold standard, they recommend lexicon-based techniques instead. They further found that supervised tools retrained on a small, balanced training dataset of roughly 1,000 documents outperformed the lexicon-based tools, provided the training data exhibited high inter-rater agreement.
2.2 Lexicon-Based Methods vs. Machine Learning-Based Methods
Jurado et al. analyzed the sentiment of developers' issues on GitHub using lexicon-based methods, first classifying whether the text was objective or subjective and then classifying it as carrying positive or negative sentiment (Jurado and Rodriguez, 2015). The authors used four different lexicons: ANEW, OpinionFinder (OF), SentiStrength (SS), and WordNet-Affect (WA). NLTK was used for pre-processing and the Snowball stemmer for stemming. For each issue, the number of words under each emotion identified by WA (anger, disgust, fear, joy, sadness, and surprise) was counted, while the other lexicons provided a polarity analysis of the issue as negative or positive. Notably, the positive and negative polarity scores obtained with these lexicons were positively associated, which introduces undesired uncertainty and affects the possible interpretations of the polarity. The authors found that polarity analysis with the specified lexicons was unsuitable for their corpus, although the lexicons were effective for emotion analysis.
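As a rough illustration of this style of lexicon-based polarity scoring (a minimal sketch with a toy hand-made lexicon; the word lists and scoring rule are our own assumptions and stand in for full resources such as ANEW or OpinionFinder):

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Toy polarity lexicon keyed by word stems (illustrative only;
# real lexicons are far larger and richer).
POSITIVE = {"fix", "great", "improv", "work"}
NEGATIVE = {"bug", "crash", "fail", "broken"}

stemmer = SnowballStemmer("english")

def polarity(text: str) -> str:
    """Stem the tokens and return a coarse polarity label from lexicon hits."""
    stems = [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    pos = sum(s in POSITIVE for s in stems)
    neg = sum(s in NEGATIVE for s in stems)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(polarity("Great, the fix works now"))         # -> positive
print(polarity("The build is broken and crashes"))  # -> negative
```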
Calefato et al. (Calefato et al., 2018) developed the Senti4SD model, which was trained and tested on over 4,000 Stack Overflow posts manually annotated with sentiment polarity.
Batra et al. (Batra et al., 2021) used pre-trained transformer-based models in an attempt to achieve higher performance than existing tools such as SentiCR and SentiStrength-SE. The research proposed three distinct ways of applying BERT-based models: the first method fine-tunes the already existing pre-trained BERT-based models,