User Story Tutor (UST) to Support Agile Software Developers

Giseldo da Silva Neo 1 (https://orcid.org/0000-0001-5574-9260), José Antão Beltrão Moura 2 (https://orcid.org/0000-0002-6393-5722), Hyggo Almeida 2 (https://orcid.org/0000-0002-2808-8169), Alana Viana Borges da Silva Neo 2 (https://orcid.org/0009-0000-1910-1598) and Olival de Gusmão Freitas Júnior 3 (https://orcid.org/0000-0003-4418-8386)

1 Campus Viçosa, Federal Institute of Alagoas, Viçosa, Brazil
2 Center for Electrical and Computer Engineering, Federal University of Campina Grande, Campina Grande, Brazil
3 Institute of Computing, Federal University of Alagoas, Maceió, Brazil
Keywords:
User Story, Story Points, Education, Recommendation, Readability.
Abstract:
User Stories record what must be built in projects that use agile practices. User Stories serve both to estimate
effort, generally measured in Story Points, and to plan what should be done in a Sprint. Therefore, it is essential
to train software engineers on how to create simple, easily readable, and comprehensive User Stories. For that
reason, we designed, implemented, applied, and evaluated a web application called User Story Tutor (UST).
UST checks the description of a given User Story for readability, and if needed, recommends appropriate
practices for improvement. UST also estimates a User Story effort in Story Points using Machine Learning
techniques. As such, UST may support the continuing education of agile development teams when writing and
reviewing User Stories. UST's ease of use was evaluated by 40 agile practitioners according to the Technology
Acceptance Model (TAM) and AttrakDiff. The TAM evaluation averages were good in almost all considered
variables, and application of the AttrakDiff evaluation framework produced similarly good results, suggesting
that UST can be used with good reliability. Applying UST to assist in the construction of User Stories is a
viable technique that, at the very least, can be used by agile development teams to complement and enhance
current User Story creation.
1 INTRODUCTION
Every year, the Standish Group International Organization publishes a report, the Chaos Report (StandishGroup, 2015), that consolidates data on technology projects from different companies in several countries. This report indicates that less than 1/3 of the surveyed software technology projects were completely successful; most were not completed within their planned schedule and estimated budget.
Software development is a complex process that involves many variables and is prone to failure (Sommerville, 2011). Much of this failure is related to the specification of what should be built, among other factors. To minimize this problem, methodologies and frameworks, such as Agile methods, can provide a conceptual structure for conducting software engineering projects.
Agile can be understood as a set of behaviors, processes, practices, and tools used to create products and subsequently make them available to end users (Cohn, 2005). One of the best-known representatives of Agile methods is SCRUM (Sutherland, 2014), which focuses primarily on what must be done. In SCRUM, requirements must be specified at an adequate level of clarity, neither too complex nor too rigid. An important part of this method is writing, interpreting, and implementing what is called a User Story.
A User Story is a short and simple sentence about a feature (written from the perspective of the user who wants it) and is used to define the scope of a software project (Cohn, 2004). It is a requirements analysis technique that captures the "who", "what", and "why" concisely and simply, usually limited in detail so that it can be written by hand on a small paper note card. These User Stories are generally stored in software that manages the entire project life cycle (Jadhav et al., 2023). By analyzing the raw data recorded in these tools, we can extract information for various lines of software engineering research (Tawosi et al., 2022a).
However, writing a good User Story can be difficult. The User Story can be very shallow and lack adequate detail to understand the expected final result, or, conversely, it can be too comprehensive. For example, a stakeholder may confuse the level of detail of a User Story and write the scope of an entire module or system, which is not appropriate. The quality of User Stories can also have a big impact when the agile team makes estimates. The writing of good user stories that can be used to estimate effort appears among the 5 most important agile problems reported by 119 developers (Andrade, 2021).
Improving the creation of User Stories is crucial for better planning and, consequently, for the success of the project. Therefore, the objective of this study is to assist the User Story creation process by recommending improvements, using natural language processing in an intelligent learning environment, i.e., a pedagogical agent that assists the learning process by acting as a tutor of content or of adapted strategies. The hypothesis is that this environment can help agile teams build better user stories; helping development teams that use agile practices to do so is the expected contribution.
The User Story Tutor (UST) tool, proposed in this article, receives as input a User Story text in English and presents personalized recommendations for improving it, with the support of a large language model (LLM). LLMs are very large deep learning models that are pre-trained on vast amounts of data. The tool also presents a prediction in Story Points, generated by a machine learning algorithm trained with data from other projects. The user is also presented with the User Story's readability indexes, which serve as an indicator of text clarity.
We used the Design Science Research methodology, with 3 phases (problem identification, solution design, and evaluation), to build the proposed tool (Wieringa, 2014). For evaluation, a survey was carried out with the support of the Technology Acceptance Model (TAM) framework (Davis et al., 1989) and the AttrakDiff evaluation framework (Hassenzahl et al., 2003). 40 agile practitioners who did not engage in UST's development evaluated the proposed solution and responded to the survey. The TAM and AttrakDiff evaluation results indicate that UST meets the established objectives, with good acceptance from participants.
The remainder of the paper is organized as follows. Section 2 summarizes the main concepts that facilitate the understanding of subsequent sections. Section 3 discusses the methodology and methodological artifacts adopted in UST's R&D. Section 4 brings more technical details of UST. Section 5 addresses UST's evaluation results. Section 6 is devoted to related work. Section 7 explores threats to the validity of our investigation and its results. Lastly, Section 8 offers the final considerations.
2 BACKGROUND
2.1 User Stories
User Stories are a central piece in the development of requirements for teams that use agile development. A characteristic of agile methods is their focus on fast, value-added deliveries in short periods, dealing with changes as quickly as possible (Dybå and Dingsøyr, 2008). This approach has shown results and has been used in project management (PMI, 2017), both in industry (Trimble et al., 2016; Rigby et al., 2018) and in government (Mergel, 2016).
SCRUM is an agile method based on fixed time
cycles called sprints, where teams work to achieve
well-defined objectives; these objectives are repre-
sented in the Product Backlog, a list of things to do
that is constantly updated and re-prioritized (Suther-
land, 2014). Software requirements are usually stored
in User Stories. These artifacts describe the activities
that the development team will estimate and build and
are written in natural language.
Most teams that use User Stories, or even the agile method, also use software tools to manage the project and, mainly, to keep a record of their user stories (Jadhav et al., 2023). By analyzing the data recorded by these tools, we can extract information for various lines of software engineering research, including research on how to improve these same User Stories (Jiménez et al., 2023).
GitLab is one of these management tools used by agile teams to record User Stories (Choudhury et al., 2020). It allows software engineers to automate many actions during the development cycle, including recording and changing User Stories (Dimitrijević et al., 2015). In GitLab, the User Story is registered as an Issue, and for each User Story various pieces of information are stored, such as the title, the task description, and its estimate in Story Points.
The data stored in these management tools can support decision-making in various software engineering scenarios, such as: assigning User Stories (Mani et al., 2019), improving the description of User Stories (Chaparro et al., 2017), iteration planning (Choetkiertikul et al., 2018), sentiment analysis of developers who write User Stories (Ortu et al., 2015; Ortu et al., 2016; Valdez et al., 2020), effort estimation of User Stories (Porru et al., 2016; Soares, 2018; Dragicevic et al., 2017; Choetkiertikul et al., 2019; Tawosi et al., 2022b), and prioritization of User Stories (Gavidia-Calderon et al., 2021; Huang et al., 2021; Umer et al., 2020).
2.2 Recommendation Systems
Recommendation systems are software tools and
techniques that provide suggestions for items that are
more likely to arouse the interest of a particular user
(Ricci et al., 2010). Recommendations are important
for the learning process by allowing teachers and stu-
dents to find content more appropriately, according to
their profile and needs.
The evolution of Recommendation Systems is
moving towards a set of hybrid techniques, which
combine two or more different Recommendation
techniques, to resolve the limitations and obtain the
advantages of each of them (Bobadilla et al., 2013).
A hybrid Recommendation system is any Recommendation system that combines several Recommendation techniques to produce an output. Burke (2007) cites the following types: collaborative, content-based, demographic, and knowledge-based.
Recommendation systems have gained a lot of
popularity in the educational field, generating various
types of recommendations for students, teachers, and
schools. They can reduce student information over-
load by recommending the “right” information at the
right time and in the right format of interest to the
student (Odilinye and Popowich, 2020).
More recently, LLM-type models (for example, ChatGPT from OpenAI) can also be used for recommendations. OpenAI is the company behind the LLM called ChatGPT; it offers this large-scale multimodal model to third parties for situations where reliability is not critical (OpenAI, 2023).

The LLM recommendation approach has some known problems: responses from an LLM are not completely reliable, the context window is limited, and the model does not learn from use (OpenAI, 2023). Even so, LLMs can be useful in specific domains. They are suitable in situations where there are well-defined intentions, for example, opening a bank account, scheduling an air ticket, or improving a User Story, given their controlled structure and predictable output. In this study, OpenAI is used as a synonym for the OpenAI API, or even ChatGPT, and in the context of this paper, a recommendation is a text to improve the User Story, which the system suggests to the user.
2.3 Readability Indexes
There are some readability indexes commonly used in the literature to predict the reading ease of a text. They are used to determine the readability of an English passage and have already been applied in fake news and opinion spam detection. This section describes the 4 most common of them.

A readability index can be interpreted as a numerical indicator of how easy it is for other people to read a written text (DuBay, 2004). To extract this numerical information, some of the algorithms use counts of words, characters, sentences, and syllables, and a list of complex words in their formulas.
Readability indexes have been used by educators
since 1920. In 1980 there were already 200 known
calculation formulas (DuBay, 2004). They have al-
ready been criticized by researchers, who point out
their limitations (Koenke, 1971). However, empirical
experiments confirmed the relationship between these
indexes and the readability of the text (Bogert, 1985).
Gunning’s Fog Index is the most frequently used
and studied index and has been extensively used to
analyze text (Bogert, 1985). It is a numerical number
assigned to a given text that uses words, sentences,
and a list of complex words in their formula. The
higher the value, the more complex the text. It was
created by Robert Gunner in 1954. The Greater the
percentage of complex words, the harder the text is to
read. The Higher the index, the lesser the readabil-
ity. His algorithm and method of calculation are well
documented (Gross and Sadowski, 1985). It can be
computed by adding the average sentence length and
the percent of complex words (words of three or more
syllables) and multiplying that sum by 0.4. Like in
the formula presented in Equation 1.
$$0.4 \cdot \left( \frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{complex words}}{\text{words}} \right) \quad (1)$$
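To make the formula concrete, the sketch below is a direct implementation of Equation 1. The sentence splitting, word matching, and the vowel-group heuristic for syllable counting are simplifying assumptions of this sketch, not UST's actual tokenizer.

```python
# Minimal sketch implementing Equation 1 (Gunning's Fog Index).
import re

def fog_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Complex words: three or more syllables, approximated here by
    # counting vowel groups (a naive heuristic).
    complex_words = [w for w in words
                     if len(re.findall(r"[aeiouy]+", w.lower())) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(fog_index("As a user, I want to filter reports by date."))
```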
Another index of text readability is the Flesch Reading Ease (Textstat, 2023). Its maximum value is 121.22; there is no minimum value, and negative scores are also valid. Equation 2 presents its calculation. It is one of the oldest and most widely used tests and depends on only two factors: the greater the average sentence length, the harder the text is to read; and the greater the average number of syllables per word, the harder the text is to read. Unlike the other indexes described here, the higher the score, the greater the readability.
$$206.835 - 1.015 \cdot \frac{\text{words}}{\text{sentences}} - 84.6 \cdot \frac{\text{syllables}}{\text{words}} \quad (2)$$
The Coleman-Liau Index is another complexity index (Textstat, 2023), computed with Equation 3, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.

$$CLI = 0.0588 \cdot L - 0.296 \cdot S - 15.8 \quad (3)$$
Finally, the Automated Readability Index is calculated with Equation 4 (Textstat, 2023).

$$4.71 \cdot \frac{\text{characters}}{\text{words}} + 0.5 \cdot \frac{\text{words}}{\text{sentences}} - 21.43 \quad (4)$$
3 METHODOLOGY
The research methodology used was Design Science Research (Wieringa, 2014). We designed a web application called User Story Tutor (UST) that uses Natural Language Processing, Readability Indexes, and Machine Learning prediction as a proof of concept to improve User Story writing. We used a survey, supported by a questionnaire in Google Forms, to evaluate UST. The development was carried out in the following stages:
- Literature review;
- Design of a predictive model that predicts the number of Story Points from the User Story using Machine Learning;
- Definition of the basic text readability indexes that can be extracted from User Stories;
- Design of a recommendation module via querying the OpenAI API;
- Implementation of all 3 modules as a web application;
- Internal evaluation using many User Stories from real projects;
- Evaluation with participants in a survey;
- Qualitative and quantitative analysis of the survey results.
To evaluate UST, we carried out a survey based on the Technology Acceptance Model (TAM) framework (Davis et al., 1989) and the AttrakDiff evaluation framework (Hassenzahl et al., 2003). TAM is an information systems theory that models how users accept and use technology (please see Figure 1). For the TAM statistical test, Cronbach's alpha was used. The AttrakDiff test (Hassenzahl et al., 2003) presents quality factors (hedonic and pragmatic) that can help to better evaluate the proposal, complementing the TAM framework. The survey collected participants' perceptions and suggestions regarding UST, whose objective is to assist agile practitioners in building better User Stories. The questions used in the survey are presented in Section 5.
UST’s architecture was divided into 3 modules
(Recommender, Estimator, and Readability) and for
each module, certain procedures were selected and
executed. These selected general procedures will be
detailed in the following paragraphs.
The Recommender module is responsible for recommending improvements to User Stories by returning text in natural language. The LLM used (made available by OpenAI) is already trained with a large amount of text extracted from the internet (Torrent et al., 2023). However, prompt customization is necessary to personalize the response. To customize the response from the OpenAI LLM, we send a prompt that configures the output according to the needs of the recommendation system. Following OpenAI's guidelines for building effective prompts, 3 main recommended techniques were used in the prompt design.
1. Clarity in instruction. We sought a clear and precise prompt, so as not to generate ambiguity in the returned recommendation. The probability of a good recommendation depends on the objectivity of the hidden prompt sent along with the User Story.

2. Split complex tasks into simpler tasks. Intending to limit the task, the prompt text restricts the set of possible responses, as complex tasks generally have a higher error rate than simpler requests.

3. Test changes. Several iterations were necessary to arrive at the prompt used, following the good practices recommended by OpenAI itself.
The Estimator module uses a supervised learning algorithm, selected for its prediction capacity, to predict the Story Points from the given User Story text. To build the predictor module we followed these steps: data collection; data exploration; data preparation; model creation, training, and validation; hyperparameter adjustment; and deployment of the model.
Figure 1: TAM. Adapted from (Alomary and Woolard, 2015).

The User Story Estimator predictor was trained with User Stories from other agile projects. This dataset was collected from real open-source projects extracted from an open-source repository; more details of the dataset are presented in Section 4.1. The metric used to select the best algorithm and to adjust hyperparameters was the Mean Absolute Error (MAE) with cross-validation. In the end, the predictor model was trained on all the data and made available to the proposed application.
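The sketch below illustrates this training procedure under stated assumptions: a hypothetical neodataset.csv file with "description" and "storypoint" columns, TF-IDF vectorization, and an SVM regressor evaluated by cross-validated MAE (the algorithm and metric named in Section 4.2 and above); the exact preprocessing used by UST may differ.

```python
# Minimal sketch of the Estimator training pipeline described above.
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Hypothetical file and column names for NEODATASET (see Section 4.1).
df = pd.read_csv("neodataset.csv")

vectorizer = TfidfVectorizer()                   # bag-of-words with TF-IDF
X = vectorizer.fit_transform(df["description"])
y = df["storypoint"]

model = SVR()   # SVM regressor: Story Points treated as a regression target

# Model selection metric: Mean Absolute Error with cross-validation.
mae = -cross_val_score(model, X, y, cv=5,
                       scoring="neg_mean_absolute_error").mean()
print(f"Cross-validated MAE: {mae:.2f}")

# The final model is trained on all the data and persisted for the web app.
model.fit(X, y)
joblib.dump(model, "model.joblib")
joblib.dump(vectorizer, "vectorizer.joblib")
```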
Finally, the Readability module presents 4 text readability indexes: Gunning Fog, Automated Readability Index, Coleman-Liau Index, and Flesch Reading Ease (please refer to Section 2). It is important to highlight that, to facilitate the interpretation of the readability indexes, a variable called Final Result was also created; it is the arithmetic mean of the 4 selected indexes.
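A minimal sketch of this Final Result computation is shown below, using the textstat package (Textstat, 2023) for the 4 indexes; UST itself extracts the indexes with the textdescriptive library (see Section 4.2), so treat this as an approximation rather than the exact implementation.

```python
# Minimal sketch of the Final Result: the arithmetic mean of the
# 4 readability indexes selected by UST, computed here with textstat.
import textstat

def final_result(text: str) -> float:
    indexes = [
        textstat.gunning_fog(text),
        textstat.automated_readability_index(text),
        textstat.coleman_liau_index(text),
        textstat.flesch_reading_ease(text),
    ]
    return sum(indexes) / len(indexes)
```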
4 THE USER STORY TUTOR
This section presents an overview of the technologies
used to build the User Story Tutor (aka UST), its high-
level architecture, the dataset used, and its application
interface.
Our idea is that UST supports agile teams during the construction of User Stories, assisting the development process during the User Story preparation phase and in estimating task effort. UST consists basically of a web application that can be accessed using a browser on mobile devices, PCs, and notebooks. UST uses an LLM provided by OpenAI to recommend improvements and also presents readability indexes. The main parts of the tool are discussed below.
The application was designed around 3 modules
with well-defined functions. The Recommender mod-
ule responds to User Story recommender requests. It
is responsible for maintaining the prompt and com-
bining it with the text of the new User Story, query-
ing OpenAI via API, and preparing the return for
presentation to the user. The module that performs
User Story estimates in Story Points uses a predictive
model, already trained with historical data to assist
developers in their estimates, acting as a reference for
the team responsible for estimating effort. Readabil-
ity indexes are extracted from text with basic natural
language processing techniques. An image of the ar-
chitecture is presented in Figure 2.
For coding, we used StreamLit (https://streamlit.io), a library for building open-source applications for machine learning and data science. The programming language was Python (https://www.python.org/), which has been increasing its market share, mainly in applications that use machine learning. The Recommender Module performs a query to OpenAI, and the scikit-learn library (https://scikit-learn.org) was also used. All source code of the project is available on GitHub (https://github.com/giseldo/userstory), and UST can be tested on StreamLit Cloud (https://userstoryteach.streamlit.app).
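A minimal sketch of how such a StreamLit front end might be wired is shown below; recommend(), final_result(), and estimate() stand for the three modules' entry points and are hypothetical names, not UST's actual code.

```python
# Minimal sketch of a StreamLit front end for the 3 UST modules.
# recommend(), final_result(), and estimate() are hypothetical entry
# points of the Recommender, Readability, and Estimator modules.
import streamlit as st

st.title("User Story Tutor (UST)")
story = st.text_area("Enter your User Story (in English)")

if st.button("Analyze") and story:
    tab1, tab2, tab3 = st.tabs(["Readability", "Recommendation", "Estimate"])
    with tab1:
        st.metric("Final Result", f"{final_result(story):.1f}")
    with tab2:
        st.write(recommend(story))
    with tab3:
        st.metric("Story Points", f"{estimate(story):.1f}")
```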
4.1 Dataset
We have made a new dataset (aka NEODATASET) available together with UST. This dataset encompasses data from 34 software development projects, with 40,014 User Stories taken from GitLab repositories, totaling 163,897 Story Points. It is made available on GitHub (https://github.com/giseldo/neodataset) so that the entire interested community can contribute, similarly to what happens with other datasets.
This dataset was mined between January 2023 and April 2023. The mining process targeted GitLab's top open-source projects. The selected projects employ agile software development methodologies and record the size of their tasks in Story Points. To mine information from GitLab, we created an extraction tool, implemented in Python, that connects to GitLab via its API.
Figure 2: UST Architecture.
Only Tasks with the State attribute equal to Closed
and that have the weight attribute filled in were col-
lected. The weight field is used in GitLab to record
the effort in Story Points. More information about the
projects included in the dataset is also available di-
rectly from GitLab.
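A minimal sketch of this collection step is shown below, assuming the python-gitlab package; the project path and token are placeholders, and the filtering mirrors the State/weight rule just described.

```python
# Minimal sketch of the mining step, assuming the python-gitlab package.
# The project path and token are placeholders, not the actual values used.
import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="YOUR_TOKEN")
project = gl.projects.get("some-group/some-project")  # hypothetical path

stories = []
for issue in project.issues.list(state="closed", all=True):
    # Keep only closed issues whose effort (weight) attribute is filled in.
    if issue.weight is not None:
        stories.append({
            "title": issue.title,
            "description": issue.description,
            "storypoint": issue.weight,
        })
```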
The projects in the dataset have different charac-
teristics and cover different programming languages,
different business domains, and different geographic
locations of the team. The main entity of the dataset
is the User Story (or Issue), which contains the main
information. The dataset has more than 70 attributes
and is stored in JSON and CSV format, given the sim-
plicity of dealing with both formats.
The dataset presented here includes projects that were not used by previous studies. There are previous studies that extracted data from the Jira management tool to build predictive models (Tawosi et al., 2022a; Choetkiertikul et al., 2019; Porru et al., 2016; Scott and Pfahl, 2018), but projects extracted from GitLab are rarer.

Just as (Tawosi et al., 2022a) did, we are sharing all the data collected. This contrasts with the more common practice of sharing only the data considered in the study itself, as done, for example, in (Choetkiertikul et al., 2019).
The expected contribution is that this dataset can assist education and research on agile software development. Although it was initially designed for training and research on Story Points and User Story estimation, it also includes information relevant to other software engineering aspects, in addition to providing the possibility of reproducing findings from other studies.
4.2 User Interface
The first screen of the UST (Figure 3) is where a de-
veloper from the agile team informs their User Story.
Any User Story from any real project can be used.
Then, after the developer enters the description of the
User Story in text format, he clicks on the “Analyze”
button. The UST then initializes the necessary threads
that trigger the responses of the existing modules. The
language of the UST interface was English.
Figure 3: Home Screen.
The Recommender module returns suggestions based on the User Story description text entered by the user (Figure 4). The model provided by OpenAI was used, specifically gpt-3.5-turbo. The parameters used by the Recommender module are sent via a hidden prompt and are presented in Table 1. The prompt followed a refinement process similar to creating a search key in a systematic literature review: it was revised and adapted, with the support of the dataset produced in Section 4.1, until we reached the final version presented here.
Table 1: OpenAI personalized prompt.

role | content
system | You are a scrum master, skilled in creating better User Stories for agile software projects
user | How can I improve this user story: + User Story text informed by the user in English

Figure 4: Recommendation Module Example.

For the Recommender module, a valid input from the user would be this imagined User Story: "As a UI designer, I want to redesign the Resources page, so that it matches the new Broker design styles." The recommendation text that UST would return to the user is: "To improve this user story, you can consider adding more details and making the objective clearer. Here's an improved version of the user story: As a UI designer, I want to redesign the Resources page to improve the user experience and align it with the new Broker design styles, enhancing the overall consistency and visual appeal of the application. Additionally, you can further refine the user story by specifying the specific changes or improvements you plan to make to the Resources page".
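A minimal sketch of the Recommender module's query is shown below. It combines the roles and contents of Table 1 with the legacy (pre-1.0) openai Python client; the exact client code used by UST may differ, and the API key is a placeholder.

```python
# Minimal sketch of the Recommender query, using the prompt roles of
# Table 1 and the legacy openai client (pre-1.0 API).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def recommend(user_story: str) -> str:
    """Ask gpt-3.5-turbo for suggestions to improve a User Story."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are a scrum master, skilled in creating "
                        "better User Stories for agile software projects"},
            {"role": "user",
             "content": "How can I improve this user story: " + user_story},
        ],
    )
    return response["choices"][0]["message"]["content"]
```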
For the User Story readability module (Figure 5), the readability indexes of the text are extracted using the textdescriptive library, available in Python (https://pypi.org/project/textdescriptives), and presented on the screen with StreamLit's "metrics" component in the first tab. The purpose of the readability module is to let the creator of the User Story see a quantitative measure of how easy the text of their User Story is to read.
Figure 5: Readability Indexes Module.
Figure 6: Estimation Module.

Finally, the Effort Estimation module (Figure 6) performs an effort estimate in Story Points based on the User Story description. The predictive model and vectorizer used are loaded with the Joblib library. The selected algorithm was SVM, and the User Story text is transformed into a bag-of-words using the TF-IDF technique. In production, both the vectorizer and the model are loaded into memory for prediction. After the User Story text is transformed into a matrix by the vectorizer, it is passed to the model predictor, which returns the estimate in Story Points. The loaded model was previously trained with data from NEODATASET.
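The inference step might look like the sketch below, assuming the vectorizer and model were persisted with Joblib under the hypothetical file names used in the training sketch of Section 3.

```python
# Minimal sketch of the Estimation module's inference step.
import joblib

vectorizer = joblib.load("vectorizer.joblib")  # hypothetical file names
model = joblib.load("model.joblib")

story = ("As a UI designer, I want to redesign the Resources page, "
         "so that it matches the new Broker design styles.")

X = vectorizer.transform([story])   # text -> TF-IDF matrix
estimate = model.predict(X)[0]      # regression output in Story Points
print(f"Estimated effort: {estimate:.1f} Story Points")
```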
5 EVALUATION
This section presents a qualitative and quantitative
evaluation of the tool with the support of the TAM
framework and AttrakDiff and discusses the results.
The survey was conducted in December 2023 with an online questionnaire in Google Forms. The questionnaire was first examined for comprehensiveness, quality, and adequacy to the investigation at hand by a panel of 6 experts who had, amongst them, 7 years of experience in agile development. The questionnaire used a 5-level Likert scale to gauge the respondent's agreement with statements made concerning UST, from none (level 1), through little (2), neutral (3), and somewhat (4), to full (level 5). The experts' comments and suggestions led to the adjustment of the questionnaire, which was then applied to a sample of respondents.
5.1 Sample Characterization
Our sample of survey respondents is made up of 40 Brazilian participants. 70% of those who responded to the survey had worked directly with agile methodologies. More than half of these (56%) had worked in a software factory and had already worked as a member of a software development team. Of these, 20% have been Scrum Masters and 10% have been Product Owners. The other participants had generally participated in academic activities related to software engineering. On average, our sample was made up of professionals with 3 years of experience in agile methodologies.
5.2 TAM
Each of the 4 constructs of the Technology Acceptance Model is analyzed below: perception of usability, perception of ease of use, external variables, and attitude.
The perception of usability is the level at which a person believes that using UST improves the performance of their tasks. To analyze it (Table 2), the mean, median, and standard deviation of the Likert scale responses were computed. If the mean or median is above the threshold, which we chose to be 3 (neutral) in our experiments, this possibly indicates that the participants have a positive attitude towards the perception of using the tool (Dantas et al., 2019). Evaluating the responses, it is possible to infer that participants generally have a positive attitude toward the perceived usability of the tool (Table 2).
Table 2: Perception of usability.

Definition | Mean | Med | SD
V1 Using the tool is useful to improve my User Stories | 4.45 | 5.00 | 0.80
V2 I learned how to build better User Stories after using the tool | 4.07 | 4.00 | 1.10
The perception of ease of use is the level at which a person perceives the tool in terms of ease of learning and operation. Table 3 describes the mean, median, and standard deviation of the responses related to perceived ease of use. All averages are above the adopted threshold, also indicating a good perception of UST's ease of use. A standard deviation above one indicates a high dispersion in the responses.
Table 3: Perceived ease of use.

Definition | Mean | Med | SD
V3 Learning to use the tool was easy for me | 3.70 | 4.00 | 1.28
V4 Searching for information in this tool was simple | 3.72 | 4.00 | 1.09
V5 Accessing the tool is simple | 4.07 | 4.00 | 1.14
An analysis of external variables, which provides
a better understanding of what influences perceived
utility and ease of use, is presented in Table 4. A
median above 4 is a good indicator that the external
characteristics were well accepted by users.
Table 4: External variables.

Definition | Mean | Med | SD
V6 The application's navigation attributes (menu, icons, links, and buttons) are clear and easy to find | 4.02 | 4.00 | 1.17
V7 The tool has a good interface | 3.90 | 4.00 | 1.20
The data characterized as Attitude, which is the intention of the individual to use the tool, are presented in Table 5. In the same way as with the other constructs, the means are above the threshold.
Table 5: Attitude.

Definition | Mean | Med | SD
V8 I believe it is better to use the tool to help create the user story than not to use it | 4.25 | 4.00 | 0.88
V9 I intend to use the tool to create better user stories and to plan my tasks better | 3.85 | 4.00 | 1.15
For statistical confirmation, Cronbach's alpha test (Gliem and Gliem, 2003) was used for the Likert scale questionnaire, the same technique used by (Dantas et al., 2019). Cronbach's alpha is a way to measure the internal consistency of a questionnaire or survey. It ranges between 0 and 1, with higher values indicating that the survey or questionnaire is more reliable. An interpretation of Cronbach's alpha is presented in Table 6.
Table 6: Internal consistency of the survey. Adapted from (Zach, 2023).

Cronbach's Alpha | Internal consistency
α ≥ 0.9 | Excellent
0.8 ≤ α < 0.9 | Good
0.7 ≤ α < 0.8 | Acceptable
0.6 ≤ α < 0.7 | Questionable
0.5 ≤ α < 0.6 | Poor
α < 0.5 | Unacceptable
The limit adopted in this research is a Cronbach's alpha greater than 0.7 for the variables analyzed, with a 95% confidence interval. From the reported values, shown in Table 7, we understand that almost all constructs analyzed are above the established limit. This leads to the conclusion that the internal consistency of this survey is acceptable.
Table 7: Cronbach's alpha of the TAM constructs.

Construct | Cronbach's alpha | 95% CI
Usability | 0.81 | [0.64, 0.90]
Ease of use | 0.92 | [0.87, 0.95]
External variables | 0.87 | [0.77, 0.93]
Attitude | 0.73 | [0.49, 0.85]
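For reference, each construct's alpha and its 95% confidence interval, as reported in Table 7, can be computed as in the sketch below, assuming the pingouin package and a DataFrame with one column per Likert item of a construct; the answer values shown are illustrative placeholders, not our survey data.

```python
# Minimal sketch of the reliability check behind Table 7.
import pandas as pd
import pingouin as pg

# One column per Likert item of the "Ease of use" construct (V3, V4, V5).
# These responses are illustrative, not the actual survey answers.
responses = pd.DataFrame({
    "V3": [4, 5, 3, 2, 4],
    "V4": [4, 4, 3, 3, 4],
    "V5": [5, 5, 4, 3, 4],
})

alpha, ci = pg.cronbach_alpha(data=responses, ci=0.95)
print(f"Cronbach's alpha = {alpha:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```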
5.3 AttrakDiff
In Figure 7 we present the portfolio of results of
the AttrakDiff test (Hassenzahl et al., 2003). This
test presents factors that can help better evaluate the
proposal, complementing what the TAM framework
presents. The AttrakDiff short test type was used,
which presents 10 questions to users and infers the
metrics reported below.
In Figure 7, the vertical axis of the portfolio view displays the hedonic quality (bottom = low extent) and the horizontal axis shows the pragmatic quality (left = low extent). Depending on the dimension values, the product will lie in one or more character regions. A small confidence rectangle is an advantage, because it means that the investigation results are more reliable and less coincidental; the bigger the confidence rectangle, the more variable the evaluation ratings and the less sure one can be about the region to which the product belongs (Hassenzahl et al., 2003). The answers point to a small confidence rectangle in the upper right quadrant, characterizing UST as a task-oriented tool.
In Figure 8 we present the diagram of average val-
ues. The average values of the AttrakDiff dimensions
for the evaluated product are plotted on the diagram.
In this presentation, hedonic quality distinguishes be-
tween the aspects of stimulation and identity. Further-
more, the rating of attractiveness is presented (Has-
senzahl et al., 2003).
In Figure 9 we present the description of word pairs, with the mean values of each word pair. Of particular interest are the extreme values, as they show which characteristics are particularly critical or particularly well resolved (Hassenzahl et al., 2003). Better results are placed in the positive quadrant, and, as can be seen from the consolidated results in Figure 9, almost all evaluated items were in the positive quadrant, except the pair (cheap-premium).
Figure 7: Portfolio of results.

Figure 8: Diagram of average values.

Figure 9: Description of word-pairs.

6 RELATED WORK

Improving the quality of user stories is a line of research that is gaining momentum due to advances in artificial intelligence. Generally, the most classic approaches use the transformation to an intermediate model, such as a use case (Elallaoui et al., 2018), or some other natural language processing technique, with a presentation of reports that can be interpreted (e.g., AQUSA) (Lucassen et al., 2016).
Using an intermediate model to represent the User Story adds complexity to the use of the solution, which can be seen as an unwanted feature. Our approach does not use an intermediate model; instead, it makes use of text readability indexes, a technique already widely used for text analysis in other areas (e.g., economics, literature), and of OpenAI's LLM with personalized recommendations. In addition, the estimator module uses machine learning with natural language processing techniques.
The USQA proposal uses natural language processing techniques to analyze usefulness, completeness, and polysemes in user story creation (Jiménez et al., 2023). Our proposal brings additional techniques, such as recommendation and readability analysis of the User Story, that can help even more. Table 8 compares UST to the AQUSA and USQA proposals, illustrating UST's contribution compared to that of existing related work on User Story writing.
Table 8: Comparison with related work.

Tool | Intermediate Model | Recommendation Report
UST | No | Yes
AQUSA | No | No
USQA | No | Yes
7 LIMITATIONS AND THREATS
There is some criticism in the literature regarding the numerical interpretation of a Likert scale questionnaire (for example, calculating the mean of Likert scale responses) (Fávero and Belfiore, 2017). To minimize this point, we used another framework for analyzing software quality, the AttrakDiff.
The use of an LLM made available by companies via API (e.g., OpenAI's ChatGPT) ties the UST solution to a commercial vendor. In future work, we intend to use and validate an open-source LLM.
Readability indexes must be interpreted with cau-
tion, as their formulae use only two variables: com-
plex words and long sentences. Therefore, they are
not able to measure the cohesion and coherence of a
business User Story, which covers semantic, syntac-
tic, and pragmatic factors.
Estimation in Story Points generally follows the Fibonacci scale. In our proposal, the estimator returns a real number between 0 and 100, since the problem was treated as a regression problem and not a classification one. However, we could probably obtain greater interpretability by using the Fibonacci scale instead of real numbers.
8 CONCLUSIONS AND
ONGOING WORK
This paper presented the proposal and evaluation of a tool that recommends good practices in writing User Stories with an LLM, in addition to a User Story estimation module based on Machine Learning and the presentation of readability indexes for the User Story description. The proposed tool was evaluated by 40 software engineering practitioners. The evaluation was conducted with the TAM and AttrakDiff frameworks. Results indicate that UST meets the established objectives, with good acceptance from its intended users.
From this investigation, one may conclude that a tool to assist the construction of User Stories is viable and, at the very least, can be used to educate teams on writing better User Stories. In fact, from the evaluation experiment, one may say that UST could help improve User Stories by providing feedback to the agile practitioner.
The paper also presented a dataset with data from
projects mined from GitLab that were used to train
the predictive model for Story Points. This dataset
can be used in other research related to agile software
development. Work on named entity recognition to
extract entities from the User Story text is ongoing.
Independent future work could entail additional vali-
dation experiments including integration and evalua-
tion of UST with computer-based education platforms
for agile software development methods.
REFERENCES
Alomary, A. and Woolard, J. (2015). How is technology
accepted by users? a review of technology acceptance
models and theories. IRES 17th International Confer-
ence.
Andrade, A. F. M. (2021). Uma Abordagem Baseada em Gamificação para Estimativa de Esforço em Desenvolvimento Ágil de Software. PhD thesis, Universidade Federal de Campina Grande.
Bobadilla, J., Ortega, F., Hernando, A., and Gutiérrez, A. (2013). Recommender systems survey. Knowledge-based Systems, 46:109–132.
Bogert, J. (1985). In defense of the fog index. The Bul-
letin of the Association for Business Communication,
48(2):9–12.
Burke, R. (2007). Hybrid web recommender systems. The
adaptive web: methods and strategies of web person-
alization, pages 377–408.
Chaparro, O., Lu, J., Zampetti, F., Moreno, L., Di Penta, M.,
Marcus, A., Bavota, G., and Ng, V. (2017). Detecting
missing information in bug descriptions. Proceedings
of the ACM SIGSOFT Symposium on the Foundations
of Software Engineering, Part F130154:396–407.
Choetkiertikul, M., Dam, H. K., Tran, T., Ghose, A., and
Grundy, J. (2018). Predicting Delivery Capability in
Iterative Software Development. IEEE Transactions
on Software Engineering, 44(6):551–573.
Choetkiertikul, M., Dam, H. K., Tran, T., Pham, T., Ghose,
A., and Menzies, T. (2019). A Deep Learning Model
for Estimating Story Points. IEEE Transactions on
Software Engineering, 45(7):637–656.
Choudhury, P., Crowston, K., Dahlander, L., Minervini,
M. S., and Raghuram, S. (2020). GitLab: work where
you want, when you want. Journal of Organization
Design, 9(1).
Cohn, M. (2004). User stories applied: For agile software
development. Pearson Education.
Cohn, M. (2005). Agile Estimating and Planning. Pearson
Education.
Dantas, E., Costa, A. A. M., Vinicius, M., Perkusich, M. B.,
de Almeida, H. O., and Perkusich, A. (2019). An ef-
fort estimation support tool for agile software devel-
opment: An empirical evaluation. In SEKE, pages
82–116.
Davis, F. D., Bagozzi, R. P., and Warshaw, P. R. (1989).
User acceptance of computer technology: A compar-
ison of two theoretical models. Management science,
35(8):982–1003.
Dimitrijević, S., Jovanović, J., and Devedžić, V. (2015). A comparative study of software tools for user story management. Information and Software Technology, 57:352–368.
Dragicevic, S., Celar, S., and Turic, M. (2017). Bayesian
network model for task effort estimation in agile soft-
ware development. Journal of Systems and Software,
127:109–119.
DuBay, W. H. (2004). The principles of readability: A brief
introduction to readability research. Impact Informa-
tion, (949):1–72.
Dybå, T. and Dingsøyr, T. (2008). Empirical studies of agile software development: A systematic review. Information and Software Technology, 50(9-10):833–859.
Elallaoui, M., Nafil, K., and Touahni, R. (2018). Automatic
transformation of user stories into uml use case dia-
grams using nlp techniques. Procedia computer sci-
ence, 130:42–49.
Fávero, L. P. and Belfiore, P. (2017). Manual de análise de dados: estatística e modelagem multivariada com Excel®, SPSS® e Stata®. Elsevier Brasil.
Gavidia-Calderon, C., Sarro, F., Harman, M., and Barr, E. T.
(2021). The Assessor’s Dilemma: Improving Bug Re-
pair via Empirical Game Theory. IEEE Transactions
on Software Engineering, 47(10):2143–2161.
Gliem, J. A. and Gliem, R. R. (2003). Calculating, inter-
preting, and reporting cronbach’s alpha reliability co-
efficient for likert-type scales. Midwest Research-to-
Practice Conference in Adult, Continuing, and Com-
munity Education.
Gross, P. P. and Sadowski, K. (1985). FOGINDEX: A read-
ability formula program for microcomputers. Journal
of Reading, 28(7):614–618.
Hassenzahl, M., Burmester, M., and Koller, F. (2003). AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität. Mensch & Computer 2003: Interaktion in Bewegung, pages 187–196.
Huang, Y., Wang, J., Wang, S., Liu, Z., Wang, D., and
Wang, Q. (2021). Characterizing and predicting good
first issues. International Symposium on Empirical
Software Engineering and Measurement.
Jadhav, D., Kundale, J., Bhagwat, S., and Joshi, J. (2023).
A Systematic Review of the Tools and Techniques in
Distributed Agile Software Development. Agile Soft-
ware Development: Trends, Challenges and Applica-
tions, pages 161–186.
Jiménez, S., Alanis, A., Beltrán, C., Juárez-Ramírez, R., Ramírez-Noriega, A., and Tona, C. (2023). USQA: A user story quality analyzer prototype for supporting software engineering students. Computer Applications in Engineering Education.
Koenke, K. (1971). Another practical note on readability
formulas. Journal of Reading, 15(3):203–208.
Lucassen, G., Dalpiaz, F., van der Werf, J. M. E., and
Brinkkemper, S. (2016). Improving agile require-
ments: the quality user story framework and tool. Re-
quirements engineering, 21:383–403.
Mani, S., Sankaran, A., and Aralikatte, R. (2019). Deep-
triage: Exploring the effectiveness of deep learning
for bug triaging. ACM International Conference Pro-
ceeding Series, pages 171–179.
Mergel, I. (2016). Agile innovation management in govern-
ment: A research agenda. Government Information
Quarterly, 33(3):516–523.
Odilinye, L. and Popowich, F. (2020). Personalized rec-
ommender system using learners’ metacognitive read-
ing activities. In Methodologies and Intelligent Sys-
tems for Technology Enhanced Learning, 10th Inter-
national Conference, pages 195–205. Springer.
OpenAI (2023). GPT-4 Technical Report. 4:1–100.
Ortu, M., Destefanis, G., Adams, B., Murgia, A., March-
esi, M., and Tonelli, R. (2015). The JIRA repository
dataset: Understanding social aspects of software de-
velopment. ACM International Conference Proceed-
ing Series, 2015-October.
Ortu, M., Murgia, A., Destefanis, G., Tourani, P., Tonelli,
R., Marchesi, M., and Adams, B. (2016). The emo-
tional side of software developers in JIRA. Proceed-
ings - 13th Working Conference on Mining Software
Repositories, MSR 2016, pages 480–483.
PMI (2017). Success Rates Rise - 2017 9th Global Project
Management Survey. Technical report, PMI.
Porru, S., Murgia, A., Demeyer, S., Marchesi, M., and
Tonelli, R. (2016). Estimating story points from issue
reports. ACM International Conference Proceeding
Series.
Ricci, F., Rokach, L., and Shapira, B. (2010). Introduction
to recommender systems handbook. In Recommender
systems handbook, pages 1–35. Springer.
Rigby, D. K., Sutherland, J., and Noble, A. (2018). Agile
Scale: How to go from teams to hundreds. Havard
Business Review, May-June(June):1–3.
Scott, E. and Pfahl, D. (2018). Using developers’ features to
estimate story points. ACM International Conference
Proceeding Series, (106):106–110.
Soares, R. G. F. (2018). Effort Estimation via Text Classification and Autoencoders. In 2018 International Joint Conference on Neural Networks (IJCNN), volume 2018-July, pages 1–8. IEEE.
Sommerville, I. (2011). Software engineering. 9th edition.
StandishGroup (2015). The chaos report. http://www.standishgroup.com. Accessed 2023.
Sutherland, J. (2014). Scrum: the Art of Doing Twice the
Work in Half the Time. Random House.
Tawosi, V., Al-Subaihin, A., Moussa, R., and Sarro, F.
(2022a). A Versatile Dataset of Agile Open Source
Software Projects, volume 1. Association for Com-
puting Machinery.
Tawosi, V., Moussa, R., and Sarro, F. (2022b). Deep Learning for Agile Effort Estimation: Have We Solved the Problem Yet? pages 1–17.
Textstat (2023). textstat/textstat: python package to calculate readability statistics of a text object - paragraphs, sentences, articles. Accessed 2023.
Torrent, T. T., Hoffmann, T., Almeida, A. L., and Turner,
M. (2023). Copilots for linguists: Ai, constructions,
and frames. Elements in Construction Grammar.
Trimble, J., Shirley, M. H., and Hobart, S. G. (2016). Agile:
From software to mission system. 14th International
Conference on Space Operations, 2016, pages 1–8.
Umer, Q., Liu, H., and Illahi, I. (2020). CNN-Based Au-
tomatic Prioritization of Bug Reports. IEEE Transac-
tions on Reliability, 69(4):1341–1354.
Valdez, A., Oktaba, H., Gomez, H., and Vizcaino, A.
(2020). Sentiment analysis in jira software reposito-
ries. Proceedings - 2020 8th Edition of the Interna-
tional Conference in Software Engineering Research
and Innovation, CONISOFT 2020, pages 254–259.
Wieringa, R. J. (2014). Design science methodology
for information systems and software engineering.
Springer.
Zach, A. (2023). Statology.
https://www.statology.org/internal-consistency/.
Accessed on 01.01.2023.