Text Mining Studies of Software Repository Contents
Bartosz Dobrzyński and Janusz Sosnowski
Warsaw University of Technology, Institute of Computer Science, ul. Nowowiejska 15/19, Poland
Keywords: Issue Tracking System, Issues and Comment Classification, Software Repository, Text Mining.
Abstract: Issue tracking systems comprise data which are useful in evaluating or improving software development
processes. Revealing and interpreting this information is a challenging problem which needs appropriate
algorithms and tools. For this purpose, we use text mining schemes adapted to the specificity of the software
repository. They are based on a detailed analysis of the used dictionaries, which comprise Natural Language Words (NLW) enhanced with specialized entities appearing in issue descriptions (e.g., emails, code snippets, technical names). These entities are defined with specially developed regular expressions. The pre-processed texts are submitted
to original text mining algorithms (machine learning). This approach has been verified in commercial and
open-source projects and showed possible development improvements.
1 INTRODUCTION
Software development is managed with the use of
Issue Tracking Systems (ITS) supported with
Software Version Control (SVC) and other systems.
They provide data repositories comprising rich data
which document software engineering activities
during the project life cycle (Vidoni, 2021;
Sosnowski et al., 2017; Huang et al., 2019). These activities are specified as issues in the ITS, created by project stakeholders (actors): project analysts, developers, testers, and users. Issues are specified in some structural
form and include various fields targeted at specific
features, e.g., title, summary and description of the
relevant problem, issue type, priority, reporter id,
status and history of processing, links to other files.
Recently, text mining techniques have gained significant interest (e.g., Li et al., 2022; Yahav et al.,
2019). Typical analysis goals are information
extraction, data mining and knowledge discovery,
text categorization, sentiment analysis, document
summarization, etc. For this purpose, various
algorithms have been developed based on Natural
Language Processing (NLP) which involve
information retrieval and extraction, lexical,
structural and semantical analysis, data mining and
visualization, etc. Unfortunately, texts used in
software repositories differ significantly from those
analysed in classical text mining.
The developed text mining scheme has been enhanced with a deep insight into the used dictionary, considering words from the NL thesaurus and other entities. Non-NL entities are identified and classified according to their semantic meaning. They extend the space of data mining as opposed to classical text mining, which covers only NL words. The contribution of the paper relates to three aspects:
i) Taxonomy and extensive analysis (statistical
and semantical) of textual contents in ITS repositories
supported with developed regular expressions
identifying non-NL entities.
ii) Investigating the impact of feature selection on
the text classification efficiency in machine learning
schemes adapted to searched issue properties.
iii) Assessing the quality of issue documentation
in relevance to project stakeholders’ activities.
The structure of the paper is as follows. Section 2
outlines the background of our research in relevance
to other publications in the literature. Section 3
presents an original analysis of text features in
repositories and provides illustrative statistics.
Section 4 outlines text mining methodology and
algorithms illustrated with some experimental results.
Section 5 discusses possible extensions of our approach, and Section 6 concludes the paper.
2 LITERATURE REVIEW
Software development processes are documented in diverse repositories supported by issue tracking (ITS), software version control (SVC) and other systems. Various automation approaches have been proposed in the literature, targeted at specific aspects facilitating development optimization.
Umer, Liu & Sultan (2019) deal with the problem of approving or rejecting issues for further processing. They use natural language processing techniques for sentiment analysis (positive, negative) of the issue summary, based on frequently used words in the reports. Nadeem et al. (2021) classified issues into three categories: bug, enhancement, and question. The proposed approach uses a RoBERTa neural network fine-tuned to issue reports and admits multi-label settings.
In practice, issues are supplemented with diverse
textual comments added during the issue handling by
project participants. They provide abundant
information about the associated issues useful in their
processing and monitoring the project progress,
activities and competence of project participants, etc.
Wenting et al. (2019) propose techniques to acquire information interesting for stakeholders. They traced issues from GitHub projects containing numerous comments related to discussions involving many participants (on average 10). Hence, they formulated 15 categories of relevant discussions, which included expected and observed behaviour, bug reproduction, solution discussion, task progress, testing, new issues and requests, social conversation, etc. This study gives only some general insight into stakeholder discussions, the level of issue understanding, etc.
Herbold and Trautsch (2020) analyse the possibility of classifying bug and non-bug issues based on a combined analysis of issue title and description with trained models. This problem has been raised due to the observation that the reported issue types often do not match the description of the issue. Ferreira Gomes et al. (2019) provide a
comprehensive survey on issue severity prediction.
Similarly, we can identify security bugs. Most
described methods use unstructured text features,
machine learning and text mining techniques.
Nagvani & Verma (2012) propose the bug
classification algorithm CLUBAS which combines
text clustering, frequent term calculation and term
mapping techniques. It is used to search similar
groups of bugs (with cosine similarity), groups of
developers relevant to categorized bug groups (for
optimizing fixing times), etc.
Text mining is also useful in resolving some
specific problems of issue handling, e.g., related to
bug diagnosis and triaging, identifying duplicated
issues. In most projects new issues or bugs are manually triaged by an expert developer. This may incur excessive time costs due to inappropriate assignment of a developer to fix the problem and subsequent reassignment to other persons. This can be improved by automatic issue triaging based on text classification techniques (Banerjee et al., 2017) to recommend appropriate developers. It is based on extracting and matching bug and developer categories, taking into account diverse attributes, e.g., developers' expertise scores correlated with reported bugs.
Zhang et al. (2016) combine issue triaging (fixer
assignment) with severity identification.
Fan et al. (2018) provide a survey of automated bug report management techniques, which include bug triaging, detection of duplicate bug reports, and bug severity/priority assignment. It is extended with a method of discriminating valid and invalid bugs based on the random forest algorithm. Zhang et al. (2019) propose a tool for method-level fine-grained bug localization. It uses semantic similarity, temporal proximity, and call dependency scores.
An important issue is the automatic identification of duplicated reports, to eliminate redundant actor activities and reduce the amount of time a triager spends analysing the incoming reports. Depending upon the project, duplicate issues constitute from a few percent up to 30% of all issues. We can use semantic and lexical similarities when checking an incoming report against other existing reports in the repository to assess the probability of it being a duplicate. Hindle & Onuczko (2019) survey various deduplication methods; they are based on diverse techniques using TF-IDF term weighting, machine learning, topic analysis, or deep learning. Banerjee et al. (2017) assess issue text and summary similarities with the cosine similarity metric. Most approaches to automatic detection of duplicate bugs use natural language processing; only a few (Ebrahimi et al., 2019) also consider execution information (stack traces).
The presented literature review confirms the practical significance of text mining techniques supporting issue handling in the project lifecycle. The relevant publications are targeted at algorithms tuned to specific problems, and they lack detailed studies of the semantic and structural features of issue reports and comments. Typically, repositories of many projects comprise a significant percentage of non-natural language terms. Having introduced a taxonomy of these terms, we developed regular expressions defining their diverse categories. Replacing these terms in the texts with tags relevant to different term categories enhances the semantic impact of the original texts
and provides a new dimension of text classification and interpretation. Deriving various issue report features with text mining, we can combine them with other statistics and issue handling processes, which is also neglected in the literature. These features can be interpreted in a wider context of issue handling dependencies, so the analyses are more project-assessment oriented. In this paper we present a deeper study of these problems, which augments the space of assessing issue handling and documentation.
3 TEXT FEATURE ANALYSIS
Having analysed software repositories of many
projects we found that they constitute a mixture of NL
phrases and sentences and other included objects.
Syntactic features of these records are not standard; they partially retain grammatical rules and follow specific reporting styles, sometimes imposed by the company. In this situation the crucial point is a deeper analysis of the object and word classes, which can be considered as a dictionary taxonomy. Objects are complex text entities of dedicated meaning and specified structure, e.g., code, panel, test outcome. Within word classes (continuous character strings) we distinguish natural language words (NLW) consistent with a specified thesaurus (e.g., English), functional words (FW) and non-classified words (NCW). In most cases the NLW class relates to the English thesaurus; however, sometimes reporters may include words of other languages (e.g., Polish), which can be considered separately or included in the NCW class (if they appear sporadically). The NCW class may comprise words with undefined meaning (which can be processed to include them in an appropriate FW class or a new one) or erroneous words, e.g., due to typos.
Functional words are project dependent; nevertheless, in general we can distinguish some typical ones: references to attachments of different types (e.g., graphical, textual, logs), external links and pull request references, repository user identifiers, names of code classes or packages, and email addresses.
Recognition of diverse types of objects and words
within the considered classes can be performed with
the use of the language thesaurus and regular
expressions defining diverse objects and FW types.
For illustration we present a list of such expressions (specified in POSIX standard notation); a short tagging sketch follows the list:
- Expressions identifying graphical, textual, and log attachments: \!\S+[.png|.jpg|.gif]\!; \!\S+.txt\! and \!\S+.log\!, respectively (attachment with any extension: \!\S+\!).
- Pull request references: (\[(.)*?pull-request?.*\])|(https:\/\/.*?pull-request[\S]*)|(\[(.)*?\/pull\/?.*\])|(https:\/\/.*?\/pull\/[\S]*.?)|(https:\/\/.*?\/commit\/[\S]*.?)
- External references: (\[\S+\|\S+\])|(\[http.*?\])|(\[https:\/\/\S+)|(https:\/\/\S+)|(http:\/\/\S+)
- Panel and code sections: {panel(.|\n)*?{panel} and ({code(.|\n)*?{code})|({noformat}(.|\n)*?{noformat})|({code:java}(.|\n)*?{code})
- Names of classes and code packages (including .js files): (([a-zA-Z_$][a-zA-Z\d_$]*\.){1,}[a-zA-Z_$][a-zA-Z\d_$]*)|(([a-z]|[A-Z])+(_([a-z]|[A-Z])*)+.js)|(\b([a-z]|[A-Z])+(_([a-z]|[A-Z])*)+\b)
- CamelCase names of classes or methods: \w*[A-Z]\w*[A-Z]\w* and \b[a-z]+[A-Z]+\S*\b
- Email addresses: identified with the regular expression given at https://stackoverflow.com/a/201378
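As a minimal illustration of how such expressions can be applied, the following Python sketch replaces matched entities with generalized tags. The patterns below are simplified variants of those listed above, and the tag names are illustrative only:

import re

# Simplified variants of the expressions listed above; each entry maps
# a tag name to a compiled regular expression.
PATTERNS = {
    "IMG_ATTACH": re.compile(r"!\S+\.(?:png|jpg|gif)!"),
    "TXT_ATTACH": re.compile(r"!\S+\.txt!"),
    "PULL_REQ":   re.compile(r"https://\S*?/pull/\S*"),
    "EXT_LINK":   re.compile(r"https?://\S+"),
    "EMAIL":      re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CAMELCASE":  re.compile(r"\b[a-z]+[A-Z]\S*\b"),
}

def tag_special_words(text):
    # Replace every recognized non-NL entity with its class tag.
    for tag, pattern in PATTERNS.items():
        text = pattern.sub("<" + tag + ">", text)
    return text

print(tag_special_words("Please see !screen.png! and login with user@example.com"))
# -> Please see <IMG_ATTACH> and login with <EMAIL>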
They can be used to derive the structure of the issue dictionary, which gives a general view of the complexity of the further text mining algorithms and the quality of reports. An important issue is also tracing dictionary features in short-term (e.g., monthly) and long-term perspectives. This is illustrated in Fig. 1 for the analysed commercial project C1.
Figure 1: NL dictionary size (number of words) for project
C1: monthly (lower plot) and accumulated statistics (upper
plot) related to subsequent development months.
The presented plots show the sizes of dictionaries
in two perspectives: accumulated (taking into account
issues from 24 months) and monthly (for issues
registered within subsequent months). It is interesting
that the cardinality of NL dictionary (unique words)
within months is relatively stable (700-1127, average
870); the long-term dictionary (accumulated)
increases slowly up to 4205 for 24 months and 4763
for 36 months. Similar statistics have been derived for open-source projects. For example, in the case of MongoDB the NL dictionary is richer: the monthly dictionary size was in the range 1916-2700, and the accumulated size for 24 months reached 8779. Nevertheless, here we also observe a relatively small
increase in time. This allows a deeper analysis of the words or phrases characterising issues. The monthly cardinality of non-NL word classes is significantly lower; however, long-term statistics showed a linear increase. Dealing with these statistics, it is also reasonable to relate them to the number of recorded issues and included tags. For MongoDB we have 119-307 issues per month (average 228) and 12864-31563 tokens (average 21385), so the average issue size is 94 tokens. For the commercial project C1 we have 73-163 issues per month (average 116) and 1418-9382 tokens, with an average of 50 tokens per issue.
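These dictionary-size statistics can be reproduced with a simple script. The sketch below (assuming issues are available as pairs of a month index and pre-processed NL text) computes the monthly and accumulated numbers of unique words, as plotted in Fig. 1:

from collections import defaultdict

def dictionary_growth(issues):
    # issues: iterable of (month_index, pre-processed NL text)
    monthly = defaultdict(set)
    for month, text in issues:
        monthly[month].update(text.lower().split())
    monthly_sizes, accumulated, seen = {}, {}, set()
    for month in sorted(monthly):
        monthly_sizes[month] = len(monthly[month])   # monthly dictionary (lower plot)
        seen |= monthly[month]
        accumulated[month] = len(seen)                # accumulated dictionary (upper plot)
    return monthly_sizes, accumulated

sizes, acc = dictionary_growth([(1, "login page fails"), (2, "login button missing")])
print(sizes, acc)   # {1: 3, 2: 3} {1: 3, 2: 5}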
Complex words correspond to class names, program variables, etc.; they are usually specified with appropriate naming conventions, e.g., CamelCase, snake_case, PascalCase. In most analysed projects we found CamelCase notation, so such words are classified as CCW. Classes specified in relation to a program package, e.g., java.lang.String (class String in the java.lang package), are denoted as CPW.
Non-classified words accounted for on average 5.3% and 9.5% of all words for the C1 and MongoDB projects, respectively. These statistics for subsequent months fluctuated in the ranges 3.8%-8.1% and 8.2%-11.0%, respectively. Here, it is also important to analyse the cardinality of unique non-classified words. For the commercial project it was 0-90 per month, which over the whole period of 3 years resulted in a dictionary of 875 words in total. For MongoDB it was 172-320 per month, and 1780 in total for 2 years.
Hence, it can be verified manually and refined by
introducing other word categories, e.g. technical
acronyms. Such statistics can also be derived
separately for diverse issue types. For the commercial
project the description size of user story, task, new
function and bug issues was 146, 40, 112 and 49
words/issue, respectively. The used NL word
dictionary in new function was about 1.4 times bigger
than for other issues. The ratio of non-NL elements per issue was 1.3-1.5 for external links and 1.5-3.0 for class names (with a maximum of 3 for user stories); the ratios of binary attachments for new functions and bugs were 0.4 and 0.7, respectively, while for the remaining issue types it was 0.16. The rate of other non-NL elements was in the range 0.03-0.05.
Some statistics of non-NL objects included in issues are presented in Tab. 1 for the commercial (C1) and MongoDB (MDB) projects. They relate to email addresses, code snippets, classes, binary and image appendices, external and code change references, and panel sections, respectively. Tab. 1 shows monthly ranges followed by average values.
Issue descriptions may comprise diverse technical
words (TW), which are often relevant to the project
domain and implementation. Hence, it is reasonable
to identify and interpret them. They can also appear
in NLW or other word classes. Using TW words, we
can correlate the considered text with specific
problems, e.g., performed functionality, performance
drawbacks. TW words can be extended to technical phrases (n-grams), which play a role similar to keywords and are useful in characterizing the topics of issues. These word
sets and relevant n-grams can be derived iteratively
and updated, they can also be a subject for team
discussions to improve and standardize descriptions
of issues, including sets of representative keywords.
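A possible way to derive such candidate technical words and phrases is sketched below with sklearn's CountVectorizer; the example corpus and the frequency threshold are illustrative only:

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus of pre-processed issue descriptions.
docs = [
    "payment gateway timeout on checkout page",
    "checkout page freezes after payment gateway error",
    "user profile picture upload fails",
]

# Keep uni- and bi-grams occurring in at least two issues as candidate
# technical words/phrases to be reviewed (and refined) by the team.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
vectorizer.fit(docs)
print(sorted(vectorizer.get_feature_names_out()))
# ['checkout', 'checkout page', 'gateway', 'page', 'payment', 'payment gateway']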
Table 1: Monthly distribution of non-NL objects in issues (monthly range / average).

Project | Adr            | Code          | Class         | ApBin
C1      | 53-271 / 119.1 | 0-10 / 0.9    | 0-199 / 22.7  | 10-114 / 47.1
MDB     | 0-7 / 1.1      | 30-117 / 35.5 | 23-249 / 63.2 | 0.176.9

Project | ApIm           | ExtRef        | ChRef         | Pan
C1      | -              | 6-211 / 56    | 0-1 / 0.03    | -
MDB     | 0-29 / 0.19    | 5-21 / 10.9   | 0-11 / 3.2    | 0-12 / 1.7
Dealing with special words (SW) defined by regular expressions, it is worth identifying their context, e.g., the phrases preceding them. For example, in project C1 an external link in issue descriptions is preceded by “Go to” or “open”, in comments by “please see” or “verified”; emails are preceded by “login with”. Attachments are usually preceded by the phrase “Please see”. Sometimes issue descriptions comprise complex sections, e.g., code snippets. Nevertheless, less formalized sections (not well structured) can also
appear. Text feature statistics are useful in assessing
informative value of reports, competence of reporters
and improvement suggestions.
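A simple sketch of such context identification is given below; it assumes that special words have already been replaced with tags (e.g., <EXT_LINK>) and counts the word pairs directly preceding a chosen tag:

from collections import Counter

def preceding_phrases(tagged_texts, tag="<EXT_LINK>", window=2):
    # Count the phrases (of `window` words) that directly precede the tag.
    contexts = Counter()
    for text in tagged_texts:
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token == tag and i >= window:
                contexts[" ".join(tokens[i - window:i]).lower()] += 1
    return contexts

texts = ["Go to <EXT_LINK> and check the panel", "please see <EXT_LINK>"]
print(preceding_phrases(texts))   # Counter({'go to': 1, 'please see': 1})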
4 CLASSIFICATION SCHEMES
Software development and maintenance processes
are documented in diverse repositories supported by issue tracking (ITS), software version control (SVC), and other systems. In our studies we focused our
attention on repository issues and relevant comments.
In the first case we consider issue textual elements:
title, summary, or description; comments are treated
as indivisible entities. We have decided to pre-
process the original texts according to classical text
mining recommendations enhanced with our original
special word transformations and derived text feature
statistics. The transformed text is submitted to
classification schemes adapted to the semantic
analysis goals (section 4.1). Some illustrative results
of this original approach are presented in section 4.2.
4.1 Algorithms
The developed text mining of software repository
textual contents is performed in two phases: i) text
pre-processing, ii) classification. The first phase is composed of the following steps of Algorithm 1 (a short sketch follows the list):
1. Extracting reports from the software repository for the analysis, using a relevant API (e.g., the Jira API); they can relate to a specified time period.
2. Creating the set of original textual entities (OTE) labelled with issue/comment ids.
3. Transforming the OTE set into a signature form by replacing objects, FW and NCW words with special word tags (compare Section 3). Here, we can use two conversions: generalized and distinctive. In the first case we use a general tag for all words in the considered class (e.g., email, code, reporter); in the second one, different words are attributed different tags (e.g., reporter#1, reporter#2, etc.). Depending upon the analysis goal we can also admit a mixed approach with some word classes tagged distinctively (e.g., to trace issues generated by specific reporters, or referring to specific code commits).
4. Text reduction, e.g., using lemmatization of
words, removing stop words, removing
numerical words.
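A minimal sketch of steps 3-4 follows. It assumes the generalized conversion, uses NLTK for lemmatization and stop-word removal, and lists only two example patterns; the full pattern set corresponds to the expressions of Section 3.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Generalized conversion: every entity of a class gets the same tag; a
# distinctive variant would number the tags instead (e.g., <EMAIL#1>, <EMAIL#2>).
PATTERNS = {
    "<EXT_LINK>": re.compile(r"https?://\S+"),
    "<EMAIL>": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

LEMMATIZER = WordNetLemmatizer()                  # requires nltk.download("wordnet")
STOP_WORDS = set(stopwords.words("english"))      # requires nltk.download("stopwords")

def preprocess(text):
    # Step 3: replace special words with class tags.
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(tag, text)
    # Step 4: remove stop words and numbers, lemmatize; case is NOT unified.
    kept = []
    for token in text.split():
        if token in PATTERNS:                     # keep the inserted tags as they are
            kept.append(token)
        elif token.lower() in STOP_WORDS or token.isdigit():
            continue
        else:
            kept.append(LEMMATIZER.lemmatize(token))
    return " ".join(kept)

print(preprocess("Login fails on https://example.org after 3 attempts"))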
As opposed to classical text mining we do not
unify upper- and lower-case characters (used in
program variables, classes, etc.). The extracted text
reports are manually analysed to derive the training
set of text reports for the assumed classification.
Here, we try to ensure a balanced representation of the considered class categories. This set is pre-processed according to Algorithm 1; it can be enhanced with additional derived features (Section 3) and submitted to the classifier training block.
The developed classification scheme uses available text processing and machine learning packages from the sklearn library. The set of considered text features includes information from the fields of reported issues (e.g., description, title of the issue, and diverse derived attributes/factors) or comment contents. It is also extended by a derived sentiment factor (positive, negative, neutral) obtained with the NLTK Sentiment Intensity Analyser module. Some numerical properties, e.g., text length, can be defined using the t-shirt sizing method used in Agile story point estimation. The developed classification involves the following steps of Algorithm 2 (a sketch follows the list):
1. Prepare the input data for classification: textual
data pre-processed with Algorithm 1 and other
features (e.g., text sentiment, diverse numerical
properties), create learning and testing subsets
with labelled classes.
2. Define sets of input data configurations (batches) for the further analysis.
3. For each data configuration, perform machine learning transformations with packages from sklearn for the specified classification models.
4. For each classification model, perform cross-validation using the sklearn GridSearchCV package and select the best model, which can then be used in classification.
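A condensed sketch of steps 3-4 is given below; the model set matches the classifiers discussed in Section 4.2, while the training data and parameter grids are illustrative placeholders:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import RidgeClassifier

# Illustrative training data: pre-processed descriptions with issue-type labels.
texts = ["user story text", "bug text", "task text", "new feature text"] * 10
labels = ["user story", "bug", "task", "new feature"] * 10

# Candidate models with (minimal) parameter grids for cross-validation.
candidates = {
    "LSV": (LinearSVC(), {"clf__C": [0.1, 1, 10]}),
    "MLP": (MLPClassifier(max_iter=500), {"clf__hidden_layer_sizes": [(50,), (100,)]}),
    "RR": (RidgeClassifier(), {"clf__alpha": [0.5, 1.0]}),
}

best_name, best_score, best_model = None, -1.0, None
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
    search.fit(texts, labels)                     # steps 3-4: train and cross-validate
    if search.best_score_ > best_score:
        best_name, best_score, best_model = name, search.best_score_, search.best_estimator_

print(best_name, round(best_score, 3))            # the model kept for classification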
The training process is based on a cross-validation technique with labelled data for training and for
validation. For each classifier we evaluate its
accuracy and select the best one to be used for future
classification of transformed texts.
Examples of data configuration sets in step 2 for issue classification can be: title and description; or description + sentiment + number of attachments (compare Section 4.2). In step 3 the text processing is performed with the sklearn ColumnTransformer, which combines input data properties into a unified matrix of numerical values. Textual components (e.g., issue description, title) are processed with the sklearn TfidfVectorizer. We can use accuracy as the scoring metric, as our training data is evenly distributed.
Labelled features (e.g., text sentiment, description length range) are processed by OneHotEncoder, which generates a sparse matrix representation. Numerical features (e.g., the number of email references) are standardized with StandardScaler. In practice, it is reasonable to limit the number of features, so an important issue is their selection. Based on our experience, we decided to take into account the description, the title, and up to 4 additional features, which resulted in diverse combinations submitted to classification experiments, facilitating identification of the ones with a high impact on classification quality.
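The sketch below illustrates how such heterogeneous features can be combined; the data frame and column names are invented for the example, while the transformers follow the scheme described above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC

# Invented feature frame: pre-processed texts plus labelled and numerical features.
frame = pd.DataFrame({
    "description": ["login page fails <EXT_LINK>", "add export button",
                    "crash in <CLASS> on save", "update user guide"],
    "title": ["login failure", "export feature", "save crash", "guide update"],
    "sentiment": ["negative", "neutral", "negative", "neutral"],
    "n_emails": [1, 0, 2, 0],
    "issue_type": ["bug", "new feature", "bug", "task"],
})

features = ColumnTransformer([
    ("desc", TfidfVectorizer(), "description"),                      # text -> TF-IDF
    ("title", TfidfVectorizer(), "title"),
    ("sent", OneHotEncoder(handle_unknown="ignore"), ["sentiment"]), # labelled feature
    ("nums", StandardScaler(), ["n_emails"]),                        # numerical feature
])

model = Pipeline([("features", features), ("clf", LinearSVC())])
model.fit(frame.drop(columns=["issue_type"]), frame["issue_type"])
print(model.predict(frame.drop(columns=["issue_type"]).head(1)))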
It is important to note that step 3 of Algorithm 2
should be adapted to the classification goal and
analysed text specificity, which can be refined by an
expert and the gained practice. It is possible to use
diverse text feature combinations (e.g. original title,
transformed description). Hence, it is reasonable to
check classification efficiency (accuracy) of diverse
combinations of the transformed text for the analysis
and select the one with the best accuracy. In this
selection, the expert's intuition is advantageous for choosing features with the highest impact on classification. The standard classification accuracy
(ACC) and F1 score are used here.
4.2 Illustrative Results
The effectiveness of our approach is illustrated for
two classification tasks: issue types and comment
categories. We distinguish 4 issue types: user story,
task, new feature, bug. We verified the classification
accuracy (ACC) for 6 classification models and
diverse combinations of input data stream: pre-
processed (according to algorithm 1) issue
description (D), issue title (T), sentiment of title or
description, the number of special words (|SW|),
emails (|Em|), references to external appendices
(|EA|) or to code changes (|ChanR|), classes,
percentage of technical words (TW%), sizes of
description or title in bytes (denoted with |..|
brackets). This allowed us to assess the impact of
divers input data features.
Table 2: Issue type classification for C1 project.
Data confi
g
uration LSV MLP RR
D
D+|EA| +|Sw| +|Em|
D+T+TW%
D+Ts++|Sw|
TW%+|Sw|+|T|+|D|
D+T+
|
D
|
0.778
0.783
0.931
0.756
0.772
0.922
0.842
0.861
0.925
0.831
0.794
0.897
0.786
0.767
0.903
0.772
0.758
0.889
Table 2 presents an excerpt of the issue type classification results for the three best classifiers: LSV (linear support vector), MLP (Multi-Layer Perceptron), and RR (Ridge Regression). The highest accuracy (ACC) relates to the Description + Title + TW% data composition. For other classifiers the best results were achieved for the same data composition: NB (Naïve Bayes) - 0.872, Passive Aggressive (PA) - 0.894, KNN (k-nearest neighbours) - 0.803. For data based only on the issue description text, the lowest values were 0.542, 0.531, and 0.661 for the NB, PA and KNN classifiers, respectively. For other data combinations, in most cases the results were lower than the best ones by up to about 0.1, except a few cases with drops of up to 0.4 (PA, NB). We can notice that the pre-processed issue description and title have the dominant impact on accuracy. Adding numerical features needs proper selection, e.g., replacing TW% by 4 other numerical features reduced ACC to 0.892. The calculated F1 scores were slightly lower than ACC (typically 0.95·ACC).
To check the impact of the introduced input data features, we verified the classification score taking into account only the original description text (without replacing non-NL words with tags), and we obtained lower ACC values. Issue type classification into four categories was a rather simple task with respect to the description and title. Repository reports in project C1 were systematically improved (in particular titles). Such classification for open-source projects is less accurate. Classifications targeted at other goals, e.g., quality of description or diagnostic capabilities, may show a higher advantage of using transformed data and additional features. Here, it is worth noting that high accuracy of issue type classification can also be considered as some metric of issue reporting quality.
Classifying comments, we distinguished 4
categories: positive, response, question, fix. This
classification is more demanding than the issue type.
Here, we also checked the impact of diverse input
data configurations: content of the comment text (pre-processed according to Algorithm 1), sentiment, and the numbers of change references, emails, external appendix references, classes, and code snippets. Other features (e.g., content size) had negligible impact.
Table 3: Comment classification for project C1.
Data configuration | LSV | MLP | KNN
Cont | 0.798 | 0.818 | 0.766
Cont+|ChangR| | 0.805 | 0.805 | 0.785
Cont+Sent+|ChangR| | 0.885 | 0.881 | 0.901
Cont+|ChangR|+|Email| | 0.795 | 0.798 | 0.772
Cont+|ChangR|+|EA| | 0.801 | 0.801 | 0.762
Cont+|Code|+|class| | 0.798 | 0.815 | 0.766
Table 3 shows selected results for project C1. For all classifiers the best results related to the third data combination. We have also assessed comment classification for the open-source project Groovy. The best accuracy (ACC) was 0.810 (F1 = 0.785) for the LSV classifier and the configuration Cont + |ChanR| + |?|, where |?| denotes the number of question marks in the text and Cont is the pre-processed comment text. The better results for the commercial project can be explained by our knowledge of this project, its stable team, and the systematic assessment of its quality, including report ratings and critiques provided by an external company.
5 DISCUSSION
The presented analysis of textual entities within software repositories confirmed that they comprise a significant percentage of words/phrases in natural language, which are mixed with diverse non-NL elements. Replacing these elements with labelled tags facilitates classification and interpretation of reports (issues, comments). The set of non-classified entities can be further investigated to find uncovered ones and possibly create additional new classes. The class categories must be defined in relation to the scope
of the analysis. Classes beyond the analysis scope can be excluded from the input data to concentrate on the considered ones and avoid blurring the analysis goal.
In the case of long-term projects, the contents of dictionaries can change, as shown in Section 3 (Fig. 1). Dictionaries for subsequent months are usually quite small and they increase for longer periods; however, some saturation is observed. Hence, it is recommended to periodically refine the classification processes using upgraded training sets.
developed text pre-processing (Algorithm 1) can be
used also in other analysis problems, e.g., based on
finding similarities or clustering analysed
reports/comments. It can be combined with other text
mining/machine learning algorithms and tools.
The textual repository analysis can be extended
considering other features of reports, e.g., specified in
issue fields (priority, reporter id, software
component) or timing properties. This hybrid
approach may provide deeper insight into project
problems. For an illustration we present the problem
of so-called debt of defects (Zabardast et al., 2020).
In practice, some reported issues are considered negligible, and their solution/handling is postponed. This may become critical when the number of such defects becomes significant. An abrupt increase may appear due to programmer fatigue with resolving similar defects. We have developed Algorithm 3 for identifying this situation, which includes two steps (a sketch follows the list):
1) identifying a significant non-linear increase of postponed defects (creating a list PD of these defects),
2) selecting suspicious defects in PD with text mining based on recognizing some critical words in the description or using classification algorithms and training sets.
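A simplified sketch of both steps is given below; the growth-factor threshold and the critical-word list are illustrative assumptions, not values used in the study:

CRITICAL_WORDS = {"crash", "data loss", "security", "timeout", "corrupt"}  # illustrative

def detect_abrupt_increase(monthly_counts, factor=2.0):
    # Step 1: flag months where the number of postponed defects grows
    # by at least `factor` compared with the previous month.
    flagged = []
    for (_, prev), (month, count) in zip(monthly_counts, monthly_counts[1:]):
        if prev > 0 and count / prev >= factor:
            flagged.append(month)
    return flagged

def select_suspicious(postponed_defects):
    # Step 2: keep postponed defects whose description mentions a critical word.
    return [d for d in postponed_defects
            if any(w in d["description"].lower() for w in CRITICAL_WORDS)]

months = [(1, 8), (2, 9), (3, 13), (4, 31), (5, 63)]
print(detect_abrupt_increase(months))   # [4, 5]
defects = [{"id": 7, "description": "Report export crash on large files"},
           {"id": 9, "description": "Minor label typo"}]
print(select_suspicious(defects))       # keeps only defect 7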
In the considered commercial project C1, one postponed negligible defect appeared per month in the first development phase; after project stabilization we observed 8 such defects per month for the next 8 months, and then a rapid increase from 13 to 63 defects appeared in the subsequent 6 months. We also observed similar effects for some open-source projects. Using text mining based on predefined critical words (10-30 words) allowed us to identify typically 10% of the postponed defects as critical. The selected suspicious defects have been confirmed as critical in 82% of cases. Some of these defects were identified as fixed, with the completed state not registered in the issue state history (a deficiency of the reporters).
In many projects some issues are labelled as wontfix, which denotes that they will not be handled (Panichella et al., 2021). Typical reasons for this are:
feature request/enhancement already implemented or
not needed, feature fixed in the context of previous
ones or too expensive, etc. Such issues can be
identified sooner with classification based on text
mining and rejected without processing.
The quality of issue reports depends upon the recommendations imposed by project managers and the competence and experience of the relevant stakeholders. Good reports facilitate problem identification, diagnosis, and resolution. It is important to match what developers need and what issue reporters provide (Zimmermann et al., 2010). Hence, monitoring the quality of issue reporting is needed; it can be based on some measurements and on experience exchanged between the project team and users. This may result in deriving diverse improvement recommendations targeted at better readability, comprehension,
illustration, automatic classification, etc. Issue
descriptions are enhanced with generated comments
during their analysis and resolution. The quality of
these comments is also important. Tracing comment
sequences is useful in this process.
Table 4: Comment sequence features in C1 project.
Comment sequence | Delay1 | CDur
Fix, Pos, Res, Fix, Res, Pos | 5d 23h | 4d 2h
Fix, Que, Pos | 5d 1h | 5d 2h
Fix, Fix, Pos | 0.1h | 24d 2h
Fix, Que | 7d 1h | 4d 21h
Res, Que, Res, Que, Pos | 0.3h | 62d 2h
Analysing classified comments, we traced their sequences and timing properties. Table 4 presents some results for the commercial project. For the presented comment sequences (2-6 comments) we give the delay (in days and hours) between issue opening and the first comment (Delay1), followed by the commenting duration (CDur). Some sequences are bizarre, e.g., lacking a response (Res) after a question (Que). The Res comment was skipped (negligence of the developer); however, the problem was resolved and the issue closed. The initial Res comment in the last sequence refers to the issue description.
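Such sequence and timing features can be derived with a few lines of code; the sketch below assumes comments are available as classified, time-stamped records:

from datetime import datetime

def sequence_timing(issue_opened, comments):
    # comments: list of (timestamp, category) pairs sorted by time.
    sequence = [category for _, category in comments]
    delay1 = comments[0][0] - issue_opened          # delay to the first comment (Delay1)
    cdur = comments[-1][0] - comments[0][0]         # commenting duration (CDur)
    return sequence, delay1, cdur

opened = datetime(2022, 3, 1, 9, 0)
comments = [(datetime(2022, 3, 6, 8, 0), "Fix"),
            (datetime(2022, 3, 7, 10, 0), "Que"),
            (datetime(2022, 3, 10, 10, 0), "Pos")]
print(sequence_timing(opened, comments))
# sequence ['Fix', 'Que', 'Pos'] with Delay1 of 4 days 23 h and CDur of 4 days 2 h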
For some classification problems we can perform
combined text mining involving textual features of
issue and comment elements extended with other
selected numerical attributes. This seems to be valuable for assessing reporting quality or predicting the issue resolution form. Moreover, cross-sectional analysis is possible by restricting the considered issues or comments to specified types, analysis periods, involved reporters, etc.
6 CONCLUSIONS
The analysis of reported issues and relevant
comments showed that a deeper study of the impact
of non-NL elements is needed to explore semantical
aspects of reports. This extends the space of the report
analysis. The introduced text pre-processing and
derived text features facilitate understanding the
classification decisions and assure better accuracy.
A validity threat to our study is the restriction of results to a few projects. Nevertheless, the presented methodology is universal due to similarities in the created software repositories (contents and structure).
The main text mining is targeted at classification or clustering of the considered textual objects; here we can use diverse statistical and machine learning techniques, which can be combined and adapted to project specificity and the searched properties. This can be enhanced with contextual and correlation analysis. Distinguishing between texts generated by bots and those authored by users, developers or testers could narrow semantic searches and extend the space of repository studies. Further research is targeted at correlating issue handling process schemes and times with semantic aspects of textual descriptions and other issue features (based on our previous experience: Sosnowski et al., 2017; Polaczek & Sosnowski, 2021). This can be enhanced with questionnaire studies involving project participants.
REFERENCES
Banerjee, S. et al. (2017). Automated triaging of very large
bug repositories. In Information and Software
Technology, 89.
Ebrahimi, N. et al. (2019). An HMM-based approach for automatic detection and classification of duplicate bug reports. In Information and Software Technology, 113, 98–109.
Fan, Y., Xia, X., Lo, D., Hassan, A.E. (2018). Chaff from
the wheat: characterizing and determining valid bug
reports. In IEEE Transactions on Software
Engineering, August 2018.
Ferreira Gomes, L.A., et al. (2019). Bug report severity
level prediction in open source software: A survey and
research opportunities. In Information and Software
Technology 115 (2019) 58–78.
Herbold, S., Trautsch, A., Trautsch, F. (2020). On the
feasibility of automated prediction of bug and non-bug
issues. In Empirical Software Engineering, 25, 5333–
5369.
Hindle, A., Onuczko, C. (2019). Preventing duplicate bug
reports by continuously querying bug reports. In
Empirical Software Engineering, vol. 24, no. 2.
Huang, Y., et al. (2019). An empirical study on the issue
reports with questions raised during the issue resolving
process. In Empirical Software Engineering 24, 718–
750.
Li, Q. et al. (2022). A survey on text classification: From
traditional to deep learning. In ACM Transactions on
Intelligent Systems and Technology vol. 13, no. 2.
Nadeem, A., Usman Sarwar, M., Zubair Malik, M. (2021).
Automatic issue classifier: a transfer learning
framework for classifying issue reports. In IEEE
International Symposium on Software Reliability
Engineering Workshops (ISSREW), October.
Nagvani, N.K., Verma, S. (2012). CLUBAS: An algorithm
and Java based tool for software bug classification
using bug attributes similarities. In Journal of Software
Engineering and Applications, vol.5, no. 6, 436-447.
Panichella, S., Canfora, G., Di Sorbo, A. (2021).
‘Won’t we fix this issue?’ Qualitative characterization
and automated identification of wontfix issues on
GitHub. In Information and Software Technology, Vol.
139, Nov., 106665.
Polaczek, J., Sosnowski, J. (2021). Exploring the software
repositories of embedded systems: An industrial
experience. In Information and Software Technology,
vol. 131.
Sosnowski, J., Dobrzyński, B., Janczarek, P. (2017).
Analysing problem handling schemes in software
projects. In Information and Software Technology, vol.
91.
Umer, Q., Liu, H., Sultan, Y. (2019). Sentiment based approval prediction for enhancement reports. In Journal of Systems and Software, 155, 57-69.
Vidoni, M. (2021). A systematic process for Mining
Software Repositories: Results from a systematic
literature review. In Information and Software
Technology, vol 4 December.
Wenting, D.A, et al. (2019). Analysis and detection of
information types of open source software issue
discussions. In ICSE, IEEE/ACM 41st International
Conference on Software Engineering.
Yahav, I., Shehory, O, Schwartz, D. (2019). Comments
mining with TF-IDF: The inherent bias and its removal.
In IEEE Transactions on Knowledge and Data
Engineering, vol. 31, no. 3, pp. 437-450.
Zabardast, E., Gonzalez-Huerta, J., Šmite, D. (2020).
Refactoring, bug fixing, and new development effect on
technical debt: An industrial case study. In 46th
Euromicro SEAA Conference, pp. 376-384.
Zhang, T., et al. (2016). Towards more accurate severity
prediction and fixer recommendation of software bugs.
In Journal of Systems and Software, 117, 166–184.
Zhang, W. et al. (2019). FineLocator: A novel approach to
method-level fine-grained bug localization by query
expansion. In Information and Software Technology
110 (2019) 121–135.
Zimmermann, T. et al. (2010). What makes a good bug
report? In IEEE Transactions on Software Engineering,
vol. 36, no. 5, pp. 618-643.