Automated Social Media Feedback Analysis for Software Requirements
Elicitation: A Case Study in the Streaming Industry
Melissa Silva 1 and João Pascoal Faria 1,2
1 Faculty of Engineering, University of Porto, Porto, Portugal
2 INESC TEC, Porto, Portugal
Keywords: Requirements Engineering, Social Media, Requirements Mining.
Abstract:
Requirements Engineering (RE) is crucial for product success but challenging for software with a broad user
base, such as streaming platforms. Developers must analyze vast user feedback, but manual methods are im-
practical due to volume and diversity. This research addresses these challenges by automating the collection,
filtering, summarization, and clustering of user feedback from social media, suggesting feature requests and
bug fixes through an interactive platform. Data from Reddit, Twitter, iTunes, and Google Play is gathered via
web crawlers and APIs and processed using a novel combination of natural language processing (NLP), ma-
chine learning (ML), large language models (LLMs), and incremental clustering. We evaluated our approach
with a partner company in the streaming industry, extracting 66,168 posts related to 10 streaming services and
identifying 22,847 as relevant with an ML classifier (75.5% precision, 74.2% recall). From the top 100 posts,
a test user found 89 relevant and generated 47 issues in 80 minutes—a significant reduction in effort compared
to a manual process. A usability study with six specialists yielded a SUS score of 83.33 (“Good”) and very
positive feedback. The platform reduces cognitive overload by prioritizing high-impact posts and suggesting
structured issue details, ensuring focus on insights while supporting scalability.
1 INTRODUCTION
Requirements Engineering (RE) is crucial in software
development, encompassing activities such as elicita-
tion, analysis, negotiation, documentation, and vali-
dation of requirements. A software product’s success
largely depends on this process’s effectiveness, as re-
quirements form the foundation of software quality.
According to (Li et al., 2018), approximately 60% of
errors in software development projects originate dur-
ing the RE phase, and rectifying these errors later is
costly.
Despite its importance, RE is often plagued by
challenges stemming from poorly defined require-
ments and limited user involvement (Ali and Hong,
2019). User feedback is essential for success, align-
ing the RE process with end-users’ visions and needs.
However, involving end-users can be difficult due to
their unavailability, uncertainty about their identities
(particularly in market-driven development), and the
high costs associated with traditional RE techniques,
which are particularly burdensome for small startups
(Ali and Hong, 2019).
Recent advancements in data-driven RE and
Crowd-based RE (CrowdRE) have highlighted the po-
tential of leveraging social media and other crowd-
sourced data to address these challenges. Studies have
explored using natural language processing (NLP)
and machine learning (ML) to classify and analyze
feedback, laying a foundation for integrating social
media insights into RE workflows (Maalej et al.,
2016b); (Groen et al., 2017). However, many of these
studies remain theoretical, with limited practical ap-
plications or interaction with industry partners. There
is a lack of end-to-end approaches applied in indus-
trial environments.
This research extends these foundations by intro-
ducing an interactive platform that bridges the gap
between feedback extraction and actionable require-
ments. Unlike prior works, our platform not only col-
lects and processes social media feedback but also in-
tegrates directly with tools like GitLab, automating
issue creation and pre-filling all relevant fields. Users
can easily refine these fields, streamlining the transi-
tion from feedback to implementation.
In this case study, we applied the platform to
gather feedback on various streaming platforms, pro-
viding our industrial partner with a clear picture of
user expectations to improve their Over-The-Top TV
platform. By analyzing social media feedback, the
platform helped the company understand which fea-
tures users are requesting and the most common is-
sues they encounter, enabling them to make informed
decisions to remain competitive in the streaming in-
dustry.
Contributions to the state of the art include an
innovative end-to-end workflow and interactive plat-
form for processing social media feedback, validated
by industry professionals. This workflow uses ad-
vanced techniques such as Large Language Models
(LLMs) and introduces new features like automated
issue creation and requirement suggestions.
In addition to the introduction, this document is structured as follows: Section 2 discusses the state of the art regarding social media-based requirements gathering. Section 3 outlines the development of the feedback processing workflow. Section 4 explains the creation of the visualization platform. Section 5 presents the experiments to evaluate the platform. Section 6 discusses limitations, and Section 7 summarizes the findings. For further details, please consult (Silva, 2024).
2 STATE OF THE ART
This study seeks to address the lack of tools that
not only extract and classify requirements from social
media but also allow stakeholders to actively interact
with and act upon these requirements. Our goal is to
bridge the gap between academic research and practi-
cal application by creating an interactive platform that
integrates automation and actionable workflows.
To develop this platform, we first needed to un-
derstand the current use of social media for RE. This
exploration was essential to identify gaps and guide
the design of our approach.
We started by asking:
RQ1: What works exist to extract require-
ments from social networks?
Several studies have explored mining user feed-
back from social media platforms, such as app stores,
Twitter, Reddit, and Facebook (Iqbal et al., 2021);
(Nayebi et al., 2018); (Oehri and Guzman, 2020);
(Williams and Mahmoud, 2017). These studies high-
light the potential of social media feedback to provide
profound insights into user needs, including bug re-
ports and feature requests (Iqbal et al., 2021). Most
studies collect feedback from the product’s page (in
app stores), posts on the product’s forum (like Red-
dit), or posts where users directly address the prod-
uct’s accounts (such as on Twitter). However, ap-
proaches that rely on manual intervention lack scal-
ability or fail to capture diverse sources effectively,
highlighting the need for automatic extraction, which
led us to the following question:
RQ2: What works exist about automatically
classifying candidate requirements extracted
from social networks?
Initial efforts in this field were manual (Kanchev
and Chopra, 2015); (Scanlan et al., 2022), but the vol-
ume of social media data required the development of
automated or semi-automated mechanisms. Common
workflows include data extraction (via scrapers or
APIs), pre-processing, and classification using algo-
rithms such as Decision Trees (DT), Random Forest
(RF), Support Vector Machines (SVM), Naive Bayes
(NB), and Multinomial Naive Bayes (MNB). These
algorithms categorize feedback into relevant and ir-
relevant categories. The relevant feedback is divided
into three typical groups: feature requests, bug re-
ports, and other (Guzman et al., 2017); (Williams and
Mahmoud, 2017).
Some studies introduced specific preprocessing
steps to enhance performance, such as stemming,
lemmatization, verb tense detection, sentiment analy-
sis, and handling data imbalance. Others added post-
classification techniques like clustering, topic model-
ing and summarization to improve the organization
and interpretation of classified data (Guzman et al.,
2017); (Williams and Mahmoud, 2017); (Iqbal et al.,
2021); (Oehri and Guzman, 2020); (Ebrahimi and
Barforoush, 2019); (McIlroy et al., 2016); (Panichella
et al., 2016); (Maalej et al., 2016a); (Di Sorbo et al.,
2017); (Villarroel et al., 2016).
However, many approaches stop at classification,
failing to integrate workflows that connect extracted
feedback to actionable tasks in real-world software
development. This leads to the need for interactive
interfaces, as addressed in our third research question:
RQ3: What works exist about the visualiza-
tion of candidate requirements extracted from
social networks?
While many tools focus on classifying and prior-
itizing feedback, fewer address interactive visualiza-
tion. We identified five key works in this area:
ARdoc: Offers a simple GUI for importing
and classifying reviews into categories like “Fea-
ture Request”, “Problem Discovery”, “Informa-
tion Seeking” and “Information Giving” with
category-specific color highlights. It lacks au-
tomated feedback gathering, filtering and navi-
gation, among other features (Panichella et al.,
2016);
OpenReq Analytics: Displays feedback trends
via heat maps, sentiment analysis, and keyword
searches, organized into problem reports and in-
quiries. ML models classify feedback, allow-
ing refinement through user interaction. While
comprehensive, it lacks dynamic feedback navi-
gation and integration with social media platforms
(Stanik and Maalej, 2019);
CLAP: Automatically clusters and prioritizes
user feedback into “Bug Report”, “Feature Re-
quest” and “Other”. However, reviews must be
manually imported, and the tool does not include
search or advanced filtering options (Villarroel
et al., 2016);
SURF: A command-line tool for summarizing
user feedback stored in XML files, categorized by
topic. It generates bar charts for intent distribu-
tion, but lacks built-in visualization or navigation
capabilities, requiring external tools to display re-
sults (Di Sorbo et al., 2017);
The Automatic Classification Tool: Retrieves
app reviews automatically and visualizes trends
via bar and pie charts but it is limited to app stores
(Maalej et al., 2016a).
While these tools provide valuable functionality,
they often lack dynamic interaction, task integration,
or validation in industrial contexts.
The limitations identified in RQ1, RQ2, and RQ3
reveal a critical need for tools that integrate extrac-
tion, classification, and interactive visualization into
actionable workflows validated in industrial settings.
Building on these insights, our work addresses three
key gaps:
A lack of practical, end-to-end workflows that
transition seamlessly from data extraction to ac-
tionable tasks;
Limited interactive features to allow stakeholders
to explore and refine extracted requirements;
A disconnect between academic tools and their
application in industrial settings.
To address these gaps, we developed and validated
an interactive platform that integrates requirement ex-
traction, classification, interaction, and GitLab issue
creation. By bridging these gaps, we aim to enable
stakeholders to transition smoothly from raw feed-
back to actionable tasks, improving decision-making
and fostering collaboration.
3 SOCIAL MEDIA FEEDBACK
PROCESSING WORKFLOW
Our goal is to automatically collect user feedback
from diverse social media platforms related to a do-
main or product of interest (in this case, streaming ser-
vices of interest to our industrial partner), and present
processed user feedback to stakeholders in a way that
facilitates the derivation of actionable issues (new fea-
tures, bug fixes, etc.) for product evolution.
As such, our solution comprises two main compo-
nents: (i) a social media feedback processing work-
flow implemented by modular Python scripts (de-
scribed in this Section) and (ii) a Web-based user in-
terface for generating actionable issues from the pro-
cessed feedback (described in Section 4).
Figure 1 depicts the system and user workflows
in our approach, distinguishing manual or interac-
tive user tasks (pink), system tasks implemented by
Python scripts (blue), and system tasks that rely on
OpenAI’s GPT-3.5 Turbo model (green). The section
of the paper where each task is described is indicated
in parentheses.
Figure 1: BPMN Diagram of the Processing Workflows.
We distinguish three sub-processes, which may
occur with different frequencies:
Social Media Processing Workflow: comprises
the steps performed automatically by Python
scripts to extract posts from selected social me-
dia platforms based on predefined keywords and a
user-defined timeframe, clean the extracted data,
classify the extracted posts regarding their rele-
vance and type, and summarize and cluster the
posts that were classified as potentially relevant;
Model Training and Evaluation: comprises the
steps conducted to generate classification mod-
els using supervised machine learning techniques;
model training can be repeated as new labelled
data is generated in the next sub-process;
From Processed Feedback to Actionable Items:
comprises the steps performed by the user, with
system support, to generate actionable issues in
an issue tracking system (currently GitLab) from
the processed feedback.
3.1 Data Extraction
To capture a broad and representative sample of user
feedback, we selected four key sources: iTunes,
Google Play, Twitter, and Reddit. While existing lit-
erature typically relies on at most two sources, our ap-
proach reflects a strategic effort to combine structured
app store reviews, real-time social media interactions,
and in-depth forum discussions.
The selection of these platforms was guided by
three main criteria:
Textual Feedback Availability: We prioritized
platforms with substantial user-generated textual
content, ensuring suitability for NLP analysis;
Public Accessibility: Only platforms with pub-
licly available data, accessible via APIs or
custom-built scrapers, were included;
Relevance to Streaming Services: The chosen
platforms provide rich datasets related to user ex-
periences, feature requests, and issue reporting.
To further ensure the relevance and diversity of the
collected feedback, we analyzed ten streaming ser-
vices, selected based on the following considerations:
Service Relevance: Four services were chosen
due to their direct association with our industrial
partner, aligning with their strategic interests;
Market Popularity: The remaining six services
were selected based on their global user base,
subscription numbers, social media activity, and
market presence at the time of the study. This
approach ensured a balanced representation of
widely used and emerging platforms.
Our data collection was limited to posts from Jan-
uary 2023 to March 2024, to focus on recent insights
and maintain a manageable yet representative dataset.
Feedback from app stores was collected directly from
the apps’ pages using APIs and custom scrapers.
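As an illustration of the app store extraction step, the sketch below pulls recent Google Play reviews for one service and keeps only those in the study timeframe. The paper does not name the libraries used, so the google-play-scraper package and the app id shown here are assumptions.

# Sketch of app store review extraction (not the authors' actual script).
from datetime import datetime
from google_play_scraper import Sort, reviews

result, _token = reviews(
    "com.netflix.mediaclient",   # hypothetical app id
    lang="en",
    country="us",
    sort=Sort.NEWEST,
    count=200,
)

# Keep only reviews within the study timeframe (January 2023 to March 2024)
start, end = datetime(2023, 1, 1), datetime(2024, 3, 31)
posts = [r for r in result if start <= r["at"] <= end]
print(len(posts), posts[0]["content"] if posts else "")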
On Twitter, we performed targeted searches combining the streaming service’s name with keywords such as “feature”, “option”, “fix”, “issue”, and “bug” to extract posts specifically discussing features or problems. A similar approach was applied to Reddit (sketched in the code example after the list below), where we analyzed posts from relevant subreddits identified through prior research. These subreddits included:
Official Subreddits dedicated to the streaming
platforms, where available;
Popular Alternatives with high user engagement
for platforms lacking official subreddits;
Subreddits focused on devices frequently used to
access streaming services (e.g., smart TVs and
streaming devices) to capture device-related user
experiences.
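The sketch below illustrates the Reddit part of this step: fetching recent submissions from one of the selected subreddits. The paper does not name the client library, so PRAW, the credential placeholders, and the subreddit name are assumptions.

# Sketch of the Reddit extraction step using PRAW (placeholders, not the authors' code).
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="feedback-miner/0.1",
)

posts = []
for submission in reddit.subreddit("netflix").new(limit=500):
    posts.append({
        "title": submission.title,
        "text": submission.selftext,
        "created_utc": submission.created_utc,
        "num_comments": submission.num_comments,
        "score": submission.score,
    })
print(len(posts))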
In total, 73,214 posts were extracted across the
four platforms. These included app store reviews and
social media posts, with replies on Twitter and Red-
dit counting toward the total. Table 1 summarizes
the number of posts extracted per streaming service
and social media platform. The last two rows are dis-
cussed in the next subsections.
Table 1: Statistics of Extracted, Cleaned and Relevant Posts.
Streaming Service | iTunes | Google Play | Twitter | Reddit | Total
ACL Player | NA | NA | 69 | 428 | 497
Gas Digital Network | 52 | NA | 64 | 99 | 215
Itaú Cultural Play | 5 | 3000 | 300 | 4 | 3309
RTP Play | 15 | 199 | 1029 | 6157 | 7400
Netflix | 1013 | 3000 | 2222 | 6636 | 12871
(HBO) Max | 302 | 3000 | 1838 | 5432 | 10572
Hulu | 1004 | 3000 | 1845 | 3032 | 8881
Prime Video | 1008 | 3000 | 1492 | 4440 | 9940
Peacock | 1008 | 3000 | 1788 | 3343 | 9139
Disney Plus | 1008 | 3000 | 1109 | 5251 | 10368
Total Extracted | 5415 | 21199 | 11756 | 34822 | 73214
Cleaned Total | 5413 | 14230 | 11706 | 34819 | 66168
Classified Relevant (N_i) | 2591 | 8623 | 2888 | 8745 | 22847
3.2 Preprocessing
We standardized the data formats and ensured unifor-
mity across datasets using the following steps:
1. Date Formatting: Unified all date formats to “month day year time” (e.g., “May 01 2023 04:43:01”);
2. Column Standardization: Harmonized column
names across datasets and removed irrelevant
columns;
3. Data Concatenation: Merged titles and feedback
text into a single string for iTunes and Reddit
datasets, following a similar approach as (Oehri
and Guzman, 2020).
Further data cleaning involved:
1. Null and Duplicate Removal: Eliminated null
content, duplicate entries, and bot-generated con-
tent. The latter was primarily identified through
explicit self-declarations (e.g., posts containing
the phrase “I am a bot”);
2. Keyword and Content Filtering: Removed ex-
tra whitespaces, unnecessary keywords, mark-
down tags, and content from official accounts;
3. Connect Tweets with Their Replies: Linked
tweets with their replies to facilitate conversation
analysis.
To standardize text and prepare it for classification
models, we performed the following text processing
steps:
1. Text Normalization: Converted text to lowercase
and removed special characters;
2. URL Replacement: Replaced URLs with the marker “link”, as recommended by (Guzman et al., 2017);
3. Word and Abbreviation Expansion: Expanded contractions (e.g., “couldn’t” to “could not”) using a list of common English contractions (https://www.yourdictionary.com/articles/contractions-correctly) (McIlroy et al., 2016). Furthermore, we replaced common abbreviations (e.g., “op” to “original poster”) to improve readability and understanding;
4. Verb Tense Identification: Following (Tizard
et al., 2019) and (Maalej et al., 2016a), we ana-
lyzed verb tense to understand the temporal nature
of user feedback. Past tense often indicates re-
ported experiences (e.g., bugs), while future tense
frequently highlights feature requests or hypothet-
ical scenarios. We used the Python library nltk
to identify verb tenses by tokenizing sentences
and applying a POS tagger. For composite future
tenses and irregular verbs, we implemented man-
ual rules, and for non-English posts, we translated
them into English using googletrans. Out of
66,168 posts, 5,795 were non-English and trans-
lated for analysis. Additionally, for posts with
multiple sentences, we calculated the ratio of past, present, and future tenses to gain insights into temporal patterns of user feedback (a code sketch of this step is given at the end of this subsection);
5. Sentiment Analysis: We employed the
PySentiStr library with the SentiStrength
algorithm to perform sentiment analysis, discern-
ing polarity (positive, negative, neutral) in user
feedback. Sentiment polarity provides insights
into user satisfaction, with negative sentiment
often highlighting bugs or complaints, and
positive sentiment indicating praise or successful
features. Following (Villarroel et al., 2016), we
averaged sentiment scores for each author’s posts
and systematically removed negations to improve
accuracy;
6. Typo Correction: To address typos in social media posts, we used the SpellChecker library, verifying that posts are in English before applying corrections. While this improves text accuracy, it is not flawless due to context-dependent errors, such as confusing “loose” with “lose”.
After these processing steps, we obtained refined
datasets that were ready for further analysis, totaling
66,168 posts (second row from the bottom in Table 1).
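The following sketch illustrates the verb-tense step (item 4 above) with nltk: tokens are POS-tagged and verbs are bucketed into past, present, and future, returning per-post tense ratios. The tag sets and the handling of “will”/“shall” are simplifications; the paper’s composite-future and irregular-verb rules are not reproduced here.

# Sketch of verb-tense detection with nltk (simplified relative to the paper's rules).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

PAST_TAGS = {"VBD", "VBN"}
PRESENT_TAGS = {"VB", "VBG", "VBP", "VBZ"}

def tense_ratios(text):
    """Return the share of past / present / future verb forms in a post."""
    counts = {"past": 0, "present": 0, "future": 0}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag == "MD" and word.lower() in {"will", "shall"}:
            counts["future"] += 1          # simple future marked by a modal
        elif tag in PAST_TAGS:
            counts["past"] += 1
        elif tag in PRESENT_TAGS:
            counts["present"] += 1
    total = sum(counts.values()) or 1
    return {k: round(v / total, 2) for k, v in counts.items()}

print(tense_ratios("The app crashed yesterday, and I hope you will add an offline mode."))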
3.3 Classification
To prepare the data for supervised classification, we
manually labeled a statistical sample of 1% of the
dataset (picking a random sample of 1% of the posts
extracted from each social media platform). Posts
were marked as relevant or irrelevant based on their
potential utility to the industrial partner. Posts were
considered relevant if they discussed features, bugs,
or user experiences and excluded if they solely fo-
cused on content (e.g., specific shows).
We trained and tested classification models using
supervised learning to predict the relevance of posts,
with the target variable being “relevant”. We used var-
ious algorithms, including DT, RF, SVM, NB, and
MNB, alongside preprocessing techniques like stop-
word removal, stemming, and lemmatization. We se-
lected widely used algorithms based on their preva-
lence in the literature and proven effectiveness.
For tokenization, stemming, and lemmatization,
we used the nltk library. Feature engineering in-
volved TfidfVectorizer, CountVectorizer (CV), post
length, and word/sentence embeddings (Word2Vec,
FastText, SBERT, USE). We tested models un-
der different data balance scenarios using SMOTE,
GridSearchCV, and RandomSearch for hyperparam-
eter tuning and evaluated performance using metrics
like precision (P), recall (R), F1-measure (F), accu-
racy (A), and area under the curve (AUC).
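As a concrete example of one evaluated combination (SVM + CountVectorizer + SMOTE, the best performer for iTunes in Table 2), the sketch below assembles an imbalanced-learning pipeline with scikit-learn and imbalanced-learn. The file and column names are assumptions about the labeled 1% sample, not the authors' actual artifacts.

# Minimal sketch of one model/technique combination (SVM + CV + SMOTE).
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("itunes_labeled_sample.csv")  # hypothetical labeled 1% sample
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["relevant"], test_size=0.2, stratify=df["relevant"], random_state=42
)

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),  # bag-of-words features
    ("smote", SMOTE(random_state=42)),                      # balance relevant/irrelevant classes
    ("classifier", LinearSVC()),                            # SVM classifier
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))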
The original scraped data, detailed variations, and corresponding results can be found on GitHub (https://tinyurl.com/yrhd4k68).
We selected the best model for each social media
platform by comprehensively analyzing the values of
F1-measure, accuracy, and AUC. Based on these met-
rics, we identified the combination of the model and
technique that produced the best results. It is essential
to acknowledge that our results might be subject to
biases, such as overfitting, and that real-world perfor-
mance could be lower than the training results due to
these factors. Table 2 showcases the algorithms (mod-
els and techniques) chosen to predict the relevance of
posts on each social media platform.
Table 2: Best Supervised Learning Algorithm per Platform.
Platform (i) | Algorithm | P_i | R_i | F_i | A_i | AUC_i
iTunes | SVM+CV+SMOTE | .80 | 1.00 | .89 | .91 | .96
Google Play | RF+CV+SMOTE | .90 | .90 | .90 | .86 | .90
Reddit | NB+USE | .65 | .57 | .60 | .76 | .81
Twitter | NB+CV+SMOTE | .60 | .67 | .63 | .78 | .81
The last row in Table 1 summarizes the number
of posts classified as (potentially) relevant from the
cleaned dataset, totalling 22,847 posts.
Since we use distinct models for distinct datasets, the overall precision (P) and recall (R) for the union of the datasets can be calculated from the number of retrieved instances from each dataset (N_i) and the precision (P_i) and recall (R_i) of each model, as follows:

P = Σ_i (N_i · P_i) / Σ_i N_i
R = Σ_i (N_i · P_i) / Σ_i (N_i · P_i / R_i)

Using the precision (P_i) and recall (R_i) metrics from Table 2 and the number of retrieved instances (N_i) from Table 1, we estimate the overall precision and recall of our classification models for the combined dataset as 75.5% and 74.2%, respectively.
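The weighted estimate can be reproduced directly from Tables 1 and 2, as in the short check below.

# Check of the overall precision/recall estimate from per-platform values.
platforms = {
    "iTunes":      {"N": 2591, "P": 0.80, "R": 1.00},
    "Google Play": {"N": 8623, "P": 0.90, "R": 0.90},
    "Twitter":     {"N": 2888, "P": 0.60, "R": 0.67},
    "Reddit":      {"N": 8745, "P": 0.65, "R": 0.57},
}

retrieved_relevant = sum(v["N"] * v["P"] for v in platforms.values())          # estimated true positives
retrieved_total    = sum(v["N"] for v in platforms.values())                   # all retrieved posts
existing_relevant  = sum(v["N"] * v["P"] / v["R"] for v in platforms.values()) # estimated relevant posts overall

overall_precision = retrieved_relevant / retrieved_total     # ~0.755
overall_recall    = retrieved_relevant / existing_relevant   # ~0.742
print(f"P = {overall_precision:.1%}, R = {overall_recall:.1%}")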
3.4 Summarization
Summarization condenses individual posts to high-
light their key points, simplifying the visualization of
user feedback. This enables stakeholders to quickly
identify critical information, such as bugs, feature re-
quests, or user experiences, without reading lengthy
texts.
To achieve this, we adopted an LLM-based approach using GPT-3.5 Turbo via the OpenAI API (https://platform.openai.com/).
The summarization was performed using the follow-
ing prompt, which was refined over several iterations
(where text represents the input to be summarized):
prompt = f"Concisely summarize this feedback,
capturing its main idea: ’{text}’. Maintain the
original language style; avoid third-person.
Keep it under 20 words."
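A minimal sketch of how this prompt can be sent to the model through the OpenAI Python client (v1-style interface) is shown below; the helper function and temperature setting are assumptions, and only the prompt text comes from the paper.

# Sketch: summarize one post with GPT-3.5 Turbo using the prompt above.
# Assumes the openai Python package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text: str) -> str:
    prompt = (
        f"Concisely summarize this feedback, capturing its main idea: '{text}'. "
        "Maintain the original language style; avoid third-person. Keep it under 20 words."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumption: deterministic output; the paper does not state this
    )
    return response.choices[0].message.content.strip()

print(summarize("The app keeps crashing whenever I try to resume a show on my smart TV."))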
GPT-3.5 Turbo was chosen because it demon-
strated superior performance compared to tradi-
tional summarization methods like TextRank, T5,
and LexRank, and provided a better cost-benefit
ratio compared to other LLMs. Table 3 shows that it achieved significantly higher ROUGE (Recall-Oriented Understudy for Gisting Evaluation, https://huggingface.co/spaces/evaluate-metric/rouge) scores for precision and recall across all metrics.
Table 3: Evaluation of Summarization Algorithms (P/R per ROUGE metric).
Algorithm | ROUGE-1 P/R | ROUGE-2 P/R | ROUGE-L P/R | ROUGE-Lsum P/R
Spacy | .24/.40 | .14/.22 | .21/.35 | .21/.35
Pegasus | .26/.34 | .13/.18 | .24/.30 | .24/.30
XLNet | .37/.37 | .20/.21 | .35/.34 | .35/.34
GPT2 | .37/.37 | .20/.21 | .35/.34 | .35/.34
Txtai | .40/.42 | .23/.23 | .36/.37 | .36/.37
SumBasic | .32/.59 | .16/.32 | .27/.51 | .27/.51
TextRank | .31/.66 | .18/.38 | .27/.57 | .27/.57
LexRank | .34/.67 | .20/.39 | .30/.58 | .30/.58
BART | .37/.59 | .22/.35 | .33/.53 | .33/.53
T5 | .39/.58 | .24/.35 | .35/.51 | .35/.51
GPT-3.5 Turbo | .81/.84 | .76/.80 | .80/.83 | .80/.83
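For reference, per-summary ROUGE precision and recall of the kind reported in Table 3 can be computed as sketched below. We use the rouge-score package here because it exposes precision and recall directly; the reference and candidate summaries are hypothetical.

# Sketch of per-summary ROUGE precision/recall computation.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

reference = "App crashes when resuming playback on smart TVs."        # hypothetical gold summary
candidate = "The app keeps crashing when resuming a show on my TV."   # hypothetical model output

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F={s.fmeasure:.2f}")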
3.5 Clustering
Given the large number of user feedback posts, clus-
tering is important to organize user feedback into the-
matic groups, revealing critical trends and recurring
issues. This helps stakeholders identify high-priority
topics, such as frequently reported bugs or common
feature requests. By grouping similar posts, cluster-
ing improves data visualization and facilitates more
efficient analysis of large datasets.
We initially experimented with DBSCAN and K-
Means clustering methods, leveraging Principal Com-
ponent Analysis, but these approaches did not pro-
duce optimal results. Consequently, we developed
two custom algorithms inspired by ROUGE, priori-
tizing sentence similarity to group posts effectively.
The first algorithm assigns each post to the most
semantically similar cluster using cosine similarity on
sentence embeddings. A post is assigned to the clus-
ter with the highest similarity, provided it exceeds a
threshold of 0.7—determined experimentally by test-
ing a selection of posts from the dataset to ensure co-
herent initial clusters reflecting underlying themes.
The second algorithm iteratively refines these
clusters by re-evaluating post assignments. If a post
has a higher similarity to another cluster (above a
threshold of 0.5, also determined experimentally), it is
reassigned. This process, involving threshold-driven
reassignment and finite iteration, maintains stability
and convergence, ultimately enhancing the overall
clustering quality.
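A compact sketch of this two-pass, threshold-based clustering is given below. The embedding model and centroid-based similarity are assumptions (the paper mentions SBERT embeddings elsewhere but does not specify the model used for clustering); the 0.7 and 0.5 thresholds follow the text.

# Sketch of the two-pass similarity-threshold clustering described above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster(posts, assign_thr=0.7, refine_thr=0.5):
    emb = model.encode(posts)
    clusters = []  # each cluster is a list of post indices
    # Pass 1: assign each post to the most similar cluster centroid above the
    # threshold, otherwise start a new cluster
    for i, e in enumerate(emb):
        best, best_sim = None, assign_thr
        for c, members in enumerate(clusters):
            sim = cos(e, emb[members].mean(axis=0))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            clusters[best].append(i)
    # Pass 2: move a post if another cluster is more similar than its current one
    # and that similarity exceeds the refinement threshold
    for c, members in enumerate(clusters):
        for i in list(members):
            own = cos(emb[i], emb[members].mean(axis=0))
            for c2, other in enumerate(clusters):
                if c2 == c or not other:
                    continue
                sim = cos(emb[i], emb[other].mean(axis=0))
                if sim >= refine_thr and sim > own:
                    members.remove(i)
                    other.append(i)
                    break
    return [c for c in clusters if c]

print(cluster(["Subtitles out of sync", "Captions lag behind audio", "Add offline downloads"]))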
We used GPT-3.5 Turbo to generate titles for each
cluster, in order to improve visualization. The prompt
crafted was:
prompt = f"Suggest a suitable title for this
cluster of user complaints or feedback:{text}"
With this approach, the 22,847 posts classified as
relevant in our dataset were grouped into 1,184 clus-
ters, with an average of 19.3 posts per cluster. This
represents a significant reduction in the number of
top-level items that require user attention. However,
as this number remains substantial, the filtering and
sorting features of our interactive platform, detailed
in the next section, are crucial in helping users focus
on the most relevant content.
4 FROM PROCESSED
FEEDBACK TO ACTIONABLE
ISSUES
To enable users to visualize feedback data generated
by the processing workflow described in the previ-
ous section and transform it into actionable issues, we
developed a web-based application comprising three
main components: a PostgreSQL database for storing
processed data, a Vue.js frontend for user interaction,
and a FastAPI backend to facilitate communication
between the database and the client.
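As an illustration of this three-tier architecture, the sketch below exposes a single FastAPI endpoint that serves processed feedback from PostgreSQL to the frontend. The endpoint path, table, and column names are assumptions, not the authors' actual schema.

# Minimal sketch of the backend layer (assumed schema and endpoint names).
import psycopg2
import psycopg2.extras
from fastapi import FastAPI

app = FastAPI()

def get_connection():
    return psycopg2.connect("dbname=feedback user=app password=secret host=localhost")

@app.get("/feedback/suggested")
def list_suggested_feedback(limit: int = 50):
    # Default ordering mirrors the UI: negative sentiment first, then most-liked posts
    query = """
        SELECT id, summary, category, streaming_service, source, sentiment, likes
        FROM posts
        WHERE status = 'suggested'
        ORDER BY sentiment ASC, likes DESC
        LIMIT %s
    """
    with get_connection() as conn, conn.cursor(
        cursor_factory=psycopg2.extras.RealDictCursor
    ) as cur:
        cur.execute(query, (limit,))
        return cur.fetchall()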
4.1 Dashboard Page
The application landing page features a Dashboard
for navigating feedback collections at various pro-
cessing stages (Figure 2):
Suggested Feedback: Contains all new and unre-
viewed feedback. Users can evaluate these posts
for relevance and either move them to Accepted
Feedback or Deleted Feedback. Filtering and
sorting features help prioritize content to review.
Accepted Feedback: Includes posts reviewed
and deemed relevant. Users can create issues from
these posts or reclassify them as irrelevant, mov-
ing them to Deleted Feedback.
Deleted Feedback: Contains posts marked as ir-
relevant, which users can still reclassify as rele-
vant and move to Accepted Feedback.
Exported Feedback: An archive of all posts that
have been converted into issues, including links to
their corresponding GitLab issues.
Figure 2: Dashboard Page.
4.2 Accepting or Deleting Suggested
Feedback
The Suggested Feedback page presents unseen feed-
back in two views: View All, which lists all posts (Fig-
ure 3), and Clustered View, which groups posts into
clusters (Figure 4). Each post is displayed in a card
format, containing a summary and details such as cat-
egory, streaming service, social media source, rating,
likes, reposts, sentiment, and comments.
Figure 3: Suggested Feedback Page (Normal View).
Figure 4: Suggested Feedback Page (Clustered View).
Users can classify posts as relevant or irrelevant
using the green check and red cross buttons on each
card. The user is prompted to confirm the category of posts marked as relevant (“Bug Report”, “Feature Request” or “User Experience”), a feature inspired by (Villarroel et al., 2016), and to indicate the reason for marking posts as irrelevant (“Already Solved”, “Repeated” or “Out of Scope”), to improve the system’s classification accuracy. Posts are then moved to Accepted or Deleted Feedback based on the user input.
Figure 5: Example of Issue Creation.
To enhance navigation, users can search, filter, and
sort posts. The search bar allows keyword searches
within post text, while filters enable refinement by
date range, category, sentiment, social media, and
streaming service. Sorting options include date, sen-
timent, rating, likes, comments, reposts, and a default criterion prioritizing negative sentiment.
4.3 Generating Issues from Accepted
Feedback
The Accepted Feedback page offers two visualization options, akin to the Suggested Feedback page, but with a different set of actions.
To generate one or more issues (new features, bug
fixes, etc.) based on a single post or a group of related
posts, the user has to perform the following steps (Fig-
ure 5):
1. Select complete clusters or individual posts as the
source for the new issue(s), using the check mark
on each post or cluster;
2. Press the “Create Issue” button and choose the ex-
portation method via an issue-tracking system
API (currently GitLab) or as a CSV file;
3. Select one or more candidate requirements from
a list suggested by the platform with the help of
GPT-3.5 Turbo (based on the selected posts);
4. For each requirement selected in the previous
step, review the suggested issue description (title,
description, acceptance criteria, etc.) generated
by the system with the help of GPT-3.5 Turbo and
confirm exportation.
This way, users can combine multiple options into a single issue or create multiple issues from one or more posts, allowing flexibility in issue management.
The system-generated form for issue creation con-
tains fields for Title and Description, and ad-
ditional fields like Private Token, Project Name,
Labels, and Due Date when connecting to GitLab.
The application suggests content for the first two fields, using Markdown templates (https://tinyurl.com/md-templates) tailored for different post categories (bug, feature, or user experience).
Bug templates explain the bug and the expected be-
havior, while feature templates include user stories
and tasks. Furthermore, the full text of the selected
post(s) and general description and acceptance crite-
ria generated by the GPT-3.5 Turbo prompt are in-
cluded in both templates to provide additional context
for developers.
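The GitLab export step could look like the sketch below, using the python-gitlab package (the paper only states that the GitLab API is used); the project path, token, and issue fields are placeholders mirroring the form described above.

# Sketch of the GitLab export step (placeholder values, not the authors' code).
import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="YOUR_PRIVATE_TOKEN")
project = gl.projects.get("group/streaming-app")  # hypothetical project path

issue = project.issues.create({
    "title": "Subtitles out of sync on smart TVs",
    "description": (
        "## Bug report\n"
        "Generated from accepted social media feedback.\n\n"
        "**Observed behavior:** subtitles lag behind audio.\n"
        "**Expected behavior:** subtitles stay in sync.\n\n"
        "### Acceptance criteria\n- Subtitles stay within 200 ms of the audio track.\n"
    ),
    "labels": ["bug", "user-feedback"],
    "due_date": "2024-06-30",
})
print(issue.web_url)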
In summary, our platform improves upon existing
solutions by analyzing feedback from multiple social
media platforms, offering advanced filtering and sort-
ing, and enabling issue generation and export to an
issue tracking system (GitLab).
5 EVALUATION
Since the performance of the classification and sum-
marization algorithms used in our approach was al-
ready presented in Section 3, this section focuses only
on the user studies carried out to evaluate our interac-
tive platform (described in Section 4) using two com-
plementary methods:
User Feedback Evaluation: Conducted through
a structured usability and relevance questionnaire,
along with qualitative discussions, involving 13
employees of the industrial partner;
Performance in Use Evaluation: Monitored
platform usage by an employee from the industrial
partner specialized in requirement engineering to
assess relevance identification and issue creation
processes in different scenarios.
5.1 User Feedback
5.1.1 Design
This evaluation aimed to capture diverse perspectives
on usability and relevance, employing:
Demographics Survey: Collected participants’
roles, tenure, and interaction frequency with user
feedback to contextualize their responses;
Usability Questionnaire: Participants rated 10
statements of the standard System Usability Scale
(SUS) questionnaire (Brooke, 1996) on a 5-point
Likert scale (“Strongly Disagree” to “Strongly
Agree”);
Relevance Questionnaire: Used a similar 5-point Likert scale to measure participants’ perceptions of platform relevance and features.
Figure 6: Platform Usability Survey Results.
5.1.2 Execution
Thirteen participants were selected by the industrial
partner based on their interaction with user feedback
and experience in roles related to RE or customer in-
teraction, including:
Four participants with managerial roles: Chief
Marketing Officer (1 year experience), VP of Dig-
ital Strategy (over 2 years experience), Sales Man-
ager (2 years experience), and Sales Director (2
years experience);
Nine participants with technical roles: five Soft-
ware Developers (varied tenure), two Communi-
cation Specialists (1 and 2 years experience), and
two Designers (1 and 3 years experience).
Participants were given a hands-on demonstration
of the platform in an interactive workshop, allow-
ing them to explore its features before completing the
questionnaire.
5.1.3 Results
Demographics. We received responses from 6 par-
ticipants. Regarding the frequency of interacting with
user feedback, we used a 5-point Likert scale, where
1 means “Never” and 5 means “Very Often”. The re-
sponses were equally distributed between 3, 4 and 5.
Usability. Figure 6 displays participant responses to
the usability questionnaire. The platform achieved a
SUS score of 83.3, categorizing it as Good according
to established benchmarks (Bangor et al., 2009).
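For context, the SUS score is computed with the standard scoring rule (odd items contribute their score minus 1, even items contribute 5 minus their score, the sum is scaled by 2.5, and scores are averaged over respondents), as sketched below with hypothetical responses.

# Standard SUS scoring; the response matrix is hypothetical, not the study data.
def sus_score(responses):
    """responses: list of 10 Likert answers (1-5), item 1 first."""
    odd = sum(responses[i] - 1 for i in range(0, 10, 2))   # items 1, 3, 5, 7, 9
    even = sum(5 - responses[i] for i in range(1, 10, 2))  # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5

participants = [
    [5, 2, 4, 1, 5, 2, 5, 1, 4, 2],  # hypothetical respondent
    [4, 1, 5, 2, 4, 1, 5, 2, 5, 1],
]
print(sum(sus_score(r) for r in participants) / len(participants))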
Relevance. Figure 7 displays participant responses to the platform relevance questionnaire. Relevance was assessed through positive statements evaluated on a 5-point Likert scale. Scores were averaged for each statement, with all items scoring 3 (“Agree”) or higher. Statements on the platform’s helpfulness in creating issues and recommendation likelihood scored the highest, averaging 3.2.
Figure 7: Platform Relevance Survey Results.
5.1.4 Discussion
Participants consistently rated the platform highly,
underscoring its potential to streamline feedback
management and issue creation processes. Qualita-
tive feedback highlighted:
Strengths: Effective aggregation of user feed-
back, actionable insights, intuitive interface, and
potential for use in other domains;
Improvement Suggestions: Enhanced correla-
tion between feedback and platform changes, ex-
panded audience segmentation, and features to an-
alyze feedback trends over time.
5.2 Performance in Use
5.2.1 Design
We conducted three experiments with a RE expert
to evaluate the platform’s ability to identify relevant
posts and generate actionable issues. More specifi-
cally, the objectives were as follows:
1. Objective 1: Assess the platform’s impact on pro-
ductivity;
2. Objective 2: Assess the effectiveness of the plat-
form’s sorting and filtering mechanisms for prior-
itizing relevant posts.
The participant was tasked with:
Reviewing posts on the Suggested Feedback
page and classifying them as relevant or not;
Creating issues from relevant posts using the Ac-
cepted Feedback page, leveraging the Clustered
View to assess clustering effectiveness.
Each experiment targeted different sets of posts:
Experiment 1: Posts related to streaming ser-
vices that utilize the industrial partner’s platform;
Experiment 2: The top 100 posts related to all
streaming services, sorted by “Default Sort” (neg-
ative sentiment and descending likes);
Experiment 3: 100 randomly selected posts re-
lated to all streaming services.
5.2.2 Results
The main results of these experiments are summa-
rized in Table 4.
Table 4: Results of Platform Usage Experiments.
Dataset | Posts Analyzed | Posts Accepted | Precision (Relevance) | Issues Created | Duration (approx.)
1. Filtered set | 55 | 47 | 85.4% | 24 | 35 min
2. Top 100 sorted | 100 | 89 | 89.0% | 47 | 80 min
3. Random sample | 100 | 68 | 68.0% | 42 | 60 min
5.2.3 Discussion
Objective 1. Regarding the impact on productivity,
the results demonstrate the potential for a significant
reduction in the time required to process user feed-
back with our platform as compared to a manual pro-
cess. In an initial feasibility study, we manually col-
lected and classified hundreds of social media posts,
creating detailed issues for some of the posts deemed
relevant. The manual process required 2–3 minutes
per post for collection and classification, plus an ad-
ditional 8–10 minutes to create an issue for each rel-
evant post. With only about a third of posts classified as relevant, this amounts to 14–19 minutes per generated issue in a manual process, an order of magnitude higher than the 1.4–1.7 minutes per generated issue observed with our platform across the three experiments.
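The per-issue figures quoted above follow directly from Table 4, as the short check below shows.

# Quick check of the per-issue effort figures, using Table 4.
experiments = {
    "Filtered set":   (35, 24),
    "Top 100 sorted": (80, 47),
    "Random sample":  (60, 42),
}
for name, (minutes, issues) in experiments.items():
    print(f"{name}: {minutes / issues:.1f} min per generated issue")

# Manual baseline: ~3 posts reviewed (2-3 min each) plus 8-10 min of issue writing
# per relevant post, i.e. roughly 14-19 minutes per generated issue.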
Objective 2. The effectiveness of the platform’s sorting and filtering mechanisms for prioritizing relevant posts is demonstrated by comparing the percentage of posts accepted as relevant in experiments 1 (85.4%) and 2 (89.0%) to the baseline in experiment 3 (68.0%). Experiment 3 used a random selection of posts, while in experiment 2 posts were sorted by negative sentiment and the number of likes (default sorting criteria), yielding 21 percentage points more relevant posts than the baseline. Experiment 1 used a filtered collection of posts, focusing on the streaming services of greatest interest to our industrial partner, yielding 17.4 percentage points more relevant posts than the baseline.
The user found the application intuitive and partic-
ularly useful in creating issues and providing a clear
overview of feedback. The user also used the platform
to create issues from clusters and manually grouped
posts that addressed similar topics but used different
terminology, a task the clustering algorithm could not
accomplish due to its reliance on sentence similarity.
This highlighted a limitation in the algorithm, which
lacks the context awareness needed to group seman-
tically related but differently phrased posts. Address-
ing this requires models capable of deeper semantic
understanding, such as LLMs, which can capture con-
textual relationships beyond surface-level wording.
Overall, the experiments demonstrated the plat-
form’s ability to efficiently identify relevant posts and
generate actionable issues. However, future studies
should focus on evaluating the quality of generated
issues and developer satisfaction to ensure that issue
trackers are not overwhelmed.
6 LIMITATIONS AND THREATS
TO VALIDITY
While the platform has shown promising results in
aggregating and visualizing social media feedback,
some limitations and threats to the validity of our re-
search should be acknowledged.
6.1 Platform Limitations
Contextual Understanding: The current cluster-
ing algorithm relies on sentence similarity, which
can miss posts that are contextually related but use
different terminology. This limitation affects the
precision of issue clustering and may require the
integration of more advanced models like LLMs
to improve context-awareness;
Domain-Specific Adaptation: Although the
platform is modular, adapting it to new do-
mains requires modifications to data extraction
scripts. This could pose challenges in domains
with highly specialized or nuanced terminology;
Longitudinal Impact: The platform’s impact
over extended periods has not been fully evalu-
ated. Long-term studies are needed to understand
how continuous use affects feedback management
and product development processes;
Cost and Dependence on Proprietary APIs:
The platform uses GPT-3.5 Turbo for summariza-
tion and issue generation, enhancing automation
but raising concerns about cost and reliance on
proprietary technology. Future work aims to ex-
plore open-source LLMs and optimize API usage
with caching and batching strategies.
6.2 Threats to Validity
Construct Validity: The classification and clus-
tering metrics used in our evaluation are standard
in natural language processing. However, these
metrics may not perfectly reflect the relevance or
usefulness of the results to stakeholders. Incorpo-
rating qualitative feedback from users into future
evaluations could provide a more holistic view;
External Validity: The datasets used in our study
were sourced from a specific set of social media
platforms (e.g., Reddit, Twitter) and may not gen-
eralize to other feedback sources. Exploring other
platforms and data types is essential for validating
the platform’s scalability and versatility;
Bias in Data Selection: While we aimed to gather
diverse feedback, the data collection process may
inadvertently introduce bias. For example, some
posts may have been excluded due to platform re-
strictions, potentially skewing the results;
Temporal Validity: The platform was evaluated
over a short timeframe, limiting the ability to
study its long-term impact on the industrial part-
ner’s workflows. Metrics on real-world improve-
ments, such as productivity gains or changes in
decision-making processes, remain unmeasured.
Long-term evaluations are necessary to assess the
platform’s sustained utility in industrial settings.
By addressing these limitations and threats, future
research can improve the robustness of the platform
while deepening our understanding of its long-term
impact on RE practices.
7 CONCLUSIONS
This paper presented an approach and platform de-
signed to leverage social media for gathering user
feedback, offering stakeholders an intuitive visualiza-
tion tool to interpret and act on this feedback. Vali-
dated through collaboration with an industry partner
in the streaming domain, the platform demonstrated
its potential to streamline feedback management, en-
hance decision-making, and provide actionable in-
sights.
Key contributions of this work include:
Industry Evaluation: The platform has been
evaluated in a real-world industry context, offer-
ing insights into its practical utility and adapt-
ability. This study contributes a focused explo-
ration of integrating indirect social media feed-
back into actionable workflows, with direct stake-
holder feedback affirming its relevance;
Automated Social Media Feedback Processing:
The platform automates data extraction across
multiple social media platforms, utilizing innova-
tive techniques such as LLM-based summariza-
tion and custom clustering algorithms. This ap-
proach enhances efficiency and scalability, align-
ing with but extending beyond traditional methods
found in the literature;
Integrated Feedback Visualization, Issue Cre-
ation, and Requirement Suggestion: Our plat-
form integrates data aggregation, and actionable
workflows, enabling stakeholders to move seam-
lessly from feedback interpretation to task im-
plementation. It suggests requirements and ac-
ceptance criteria, directly aiding issue creation in
tools like GitLab. This end-to-end approach rep-
resents a significant step toward bridging the gap
between feedback collection and practical imple-
mentation;
Indirect Feedback Acquisition: This platform
captures user feedback on various streaming ap-
plications by searching for posts referencing prod-
ucts rather than requiring users to directly engage
with specific accounts. This method provides a
comprehensive view of multiple related products
on the market, providing insights that traditional
academic approaches often overlook.
The results of this work highlight the potential
of leveraging social media as a scalable source of
indirect feedback for requirements engineering. By
demonstrating how automation and visualization can
transform unstructured user posts into actionable in-
sights, the study contributes to the growing body of
research on data-driven and crowd-based RE.
Our findings suggest that platforms like ours can
complement traditional feedback channels by inte-
grating broader user perspectives from public discus-
sions, which is particularly valuable for market-driven
development. The platform’s features, such as in-
tegrated issue creation and requirement suggestions,
show promise for reducing cognitive load and im-
proving team productivity.
Future enhancements include improving classifi-
cation and clustering algorithms, adding post-refresh
options, user notifications, and adapting the platform
to other domains. Thanks to its modular design,
adapting the platform requires only redefining search
parameters to collect domain-specific data. Once con-
figured, the platform follows standardized processing
steps to analyze relevant user posts and extract action-
able requirements, regardless of the application do-
main. Initially applied to a streaming platform, the
methodology is scalable and generalizable to other in-
dustries where social media feedback is available.
Long-term studies are required to assess the plat-
form’s impact on feedback management and product
development through qualitative user feedback analy-
sis and usability testing.
In conclusion, this work addresses the need for au-
tomated, scalable tools to interpret social media feed-
back, enhancing the RE process and helping stake-
holders meet evolving user expectations. Its contri-
butions to integrating feedback workflows into issue
management tools and expanding feedback sources
highlight its relevance in dynamic software develop-
ment environments.
REFERENCES
Ali, N. and Hong, J. E. (2019). A bird’s eye view on
social network sites and requirements engineering.
In ICSOFT 2019 - Proceedings of the 14th Interna-
tional Conference on Software Technologies, pages
347–354.
Bangor, A., Kortum, P. T., and Miller, J. T. (2009). De-
termining what individual sus scores mean: adding
an adjective rating scale. Journal of Usability Stud-
ies archive, 4:114–123.
Brooke, J. B. (1996). Sus: A ’quick and dirty’ usability
scale.
Di Sorbo, A., Panichella, S., Alexandru, C. V., Visaggio,
C. A., and Canfora, G. (2017). Surf: Summarizer
of user reviews feedback. In Proceedings - 2017
IEEE/ACM 39th International Conference on Soft-
ware Engineering Companion, ICSE-C 2017, pages
55–58.
Ebrahimi, A. M. and Barforoush, A. A. (2019). Prepro-
cessing role in analyzing tweets towards requirement
engineering. In ICEE 2019 - 27th Iranian Conference
on Electrical Engineering, pages 1905–1911.
Groen, E. C., Seyff, N., Ali, R., Dalpiaz, F., Doerr, J., Guz-
man, E., Hosseini, M., Marco, J., Oriol, M., Perini, A.,
and Stade, M. (2017). The crowd in requirements en-
gineering: The landscape and challenges. IEEE Soft-
ware, 34(2):44–52.
Guzman, E., Alkadhi, R., and Seyff, N. (2017). An ex-
ploratory study of twitter messages about software
applications. Requirements Engineering, 22(3):387–
412.
Iqbal, T., Khan, M., Taveter, K., and Seyff, N. (2021). Min-
ing reddit as a new source for software requirements.
In Proceedings of the IEEE International Conference
on Requirements Engineering, pages 128–138.
Kanchev, G. M. and Chopra, A. K. (2015). Social media
through the requirements lens: A case study of google
maps. In 1st International Workshop on Crowd-Based
Requirements Engineering, CrowdRE 2015 - Proceed-
ings, pages 7–12.
Li, C., Huang, L., Ge, J., Luo, B., and Ng, V. (2018). Au-
tomatically classifying user requests in crowdsourc-
ing requirements engineering. Journal of Systems and
Software, 138:108–123.
Maalej, W., Kurtanović, Z., Nabil, H., and Stanik, C.
(2016a). On the automatic classification of app re-
views. Requirements Engineering, 21(3):311–331.
Maalej, W., Nayebi, M., Johann, T., and Ruhe, G. (2016b).
Toward data-driven requirements engineering. IEEE
Software, 33(1):48–54.
McIlroy, S., Ali, N., Khalid, H., and Hassan, A. E. (2016).
Analyzing and automatically labelling the types of
user issues that are raised in mobile app reviews. Em-
pirical Software Engineering, 21(3):1067–1106.
Nayebi, M., Cho, H., and Ruhe, G. (2018). App store min-
ing is not enough for app improvement. Empirical
Software Engineering.
Oehri, E. and Guzman, E. (2020). Same same but dif-
ferent: Finding similar user feedback across multiple
platforms and languages. In Proceedings of the IEEE
International Conference on Requirements Engineer-
ing, volume 2020-August, pages 44–54.
Panichella, S., Di Sorbo, A., Guzman, E., Visaggio, C. A.,
Canfora, G., and Gall, H. (2016). Ardoc: App re-
views development oriented classifier. In Proceedings
of the ACM SIGSOFT Symposium on the Foundations
of Software Engineering, volume 13-18-November-
2016, pages 1023–1027.
Scanlan, J., de Salas, K., Lim, D., and Roehrer, E. (2022).
Using social media to support requirements gathering
when users are not available. In Proceedings of the
Annual Hawaii International Conference on System
Sciences, pages 4227–4236.
Silva, M. (2024). Automated user feedback mining for
software requirements elicitation - a case study in the
streaming industry. Master’s thesis, Faculty of Engi-
neering of University of Porto. Available at: https:
//repositorio-aberto.up.pt/handle/10216/161054.
Stanik, C. and Maalej, W. (2019). Requirements intelli-
gence with openreq analytics. In Proceedings of the
IEEE International Conference on Requirements En-
gineering, volume 2019-September, pages 482–483.
Tizard, J., Wang, H., Yohannes, L., and Blincoe, K. (2019).
Can a conversation paint a picture? mining require-
ments in software forums.
Villarroel, L., Bavota, G., Russo, B., Oliveto, R., and
Di Penta, M. (2016). Release planning of mobile apps
based on user reviews. In Proceedings - International
Conference on Software Engineering, volume 14-22-
May-2016, pages 14–24.
Williams, G. and Mahmoud, A. (2017). Mining twitter
feeds for software user requirements. In Proceedings
- 2017 IEEE 25th International Requirements Engi-
neering Conference, RE 2017, pages 1–10.