From Descriptive to Predictive: Forecasting Emerging Research Areas in
Software Traceability Using NLP from Systematic Studies
Zaki Pauzi (https://orcid.org/0000-0003-4032-4766) and Andrea Capiluppi (https://orcid.org/0000-0001-9469-6050)
Bernoulli Institute, University of Groningen, The Netherlands
Keywords:
Natural Language Processing, Software Traceability, Time Series, Systematic Review.
Abstract:
Systematic literature reviews (SLRs) and systematic mapping studies (SMSs) are common in any discipline: they describe and classify past works, and inform a research field of potential new areas of investigation. This last task is typically achieved by observing gaps in past works, and hinting at the possibility of future research in those gaps. Using an NLP-driven methodology, this paper proposes a meta-analysis to extend the current systematic methodologies of literature reviews and mapping studies. Our work leverages a Word2Vec model pre-trained in the software engineering domain, combined with a time series analysis. Our aim is to forecast the future trajectories of research outlined in systematic studies, rather than just describing them. Using the dataset from our own previous mapping study, we were able to go beyond descriptively analysing the gathered data, or merely 'guessing' future directions. In this paper, we show how recent advancements in the field of our SMS, together with the use of time series, enabled us to forecast future trends in the same field. Our proposed methodology sets a precedent for exploring the potential of language models coupled with time series in the context of systematically reviewing the literature.
1 INTRODUCTION
In the field of software engineering, systematic re-
views have gained considerable popularity since the
introduction of Evidence-Based Software Engineer-
ing (EBSE) guidelines in 2004 (Kitchenham et al.,
2004); a framework to assess the quality of pri-
mary studies through aggregating all empirical stud-
ies on a particular topic. A simple search in Google Scholar for “systematic reviews in software engineering” showed a growth trend over the past years, with 2021 returning 300% more results than 2012 (search performed in incognito mode, filtered by year range and review articles only; e.g., for 2021: https://scholar.google.com/scholar?q=systematic+reviews+in+software+engineering&hl=en&as_sdt=0%2C5&as_rr=1&as_vis=1&as_ylo=2021&as_yhi=2021).
Despite the increasing popularity behind system-
atic reviews in software engineering, a key challenge
in conducting these still remains: balancing between
methodological rigour and required effort (Zhang and
Ali Babar, 2013). This paper addresses this challenge
by leveraging recent language models and time series
analysis to conduct a predictive analysis of trends in research focus. The approach that we show has the potential to reduce the manual effort required by researchers while enhancing the methodological rigour of reviewing the literature. Our proposed methodology is a response to the following lessons outlined from applying the systematic reviewing process (Brereton et al., 2007):
Lesson 15: Software engineering systematic reviews are likely to be qualitative in nature. Descriptive qualitative analysis may not be enough to fully answer the research questions, and recommendations are typically based on manual analysis and interpretation. Using NLP tools and techniques supplements the reviewing process by providing a predictive analysis to answer the research questions at hand.
Lesson 18: Review teams need to keep a detailed record of decisions made throughout the review process. Transparency in decision-making improves explainability and supports an effective reviewing process. Our proposed methodology involves NLP models, such as word embeddings, that are agnostic to individual perceptions of semantics and are instead representative of
(by virtue of their training) large amounts of corpus text from multiple sources.
The work that we present in this paper is based on
the data gathered from our own systematic mapping
study (Pauzi and Capiluppi, 2023). In that review, we
studied which trends have emerged in NLP techniques
to support or automate the tasks of software traceabil-
ity. The analysis was conducted within the software
development life cycle (SDLC) framework, and var-
ious research trends emerged from the evaluation of
past works.
By joining predictive analytics with our own systematic study (but potentially, any other systematic review), we observe how the importance of certain terms has evolved over the past years, specifically for requirement and design. These two terms were chosen because they represent two core phases of the SDLC that account for a major share of traceability efforts using NLP (Pauzi and Capiluppi, 2023). We also analyse the terms semantically related to these, and forecast the importance of the two terms for the next two years.
2 RATIONALE AND RESEARCH
QUESTIONS
2.1 Real-World Applications of
Predictive Analytics
Identifying and forecasting emerging trends is a part
of predictive analytics, enabling stakeholders (such
as policymakers, executives, etc.) to make better-
informed decisions. We borrow from some recent real-world applications and case studies to propose an innovative approach to using the results of a(ny) systematic literature review.
In the energy industry, predictive analytics is a particularly crucial component, due to global efforts to reduce greenhouse gas emissions by shifting to sustainable energy sources. Predictive analysis is done by mining text from patents, scientific publications, and Twitter data. Given the disruptive impact it has on the energy industry, forecasting emerging trends is undoubtedly critical for government R&D strategic planning, social investment, and enterprise practices. In the photovoltaic industry, social media posts (tweets) were mined to construct an evolution map of Twitter users' sense of, and response to, perovskite solar cell technology (Li et al., 2019).
For analysing patents, pyGrams (https://github.com/datasciencecampus/pygrams) is an open-source tool to discover emerging terminology in large text datasets. The quantitative review of emerging terminology from patent applications is an important objective for the Intellectual Property Office (IPO). A similar concept is demonstrated in this paper.
2.2 Research Questions
Given the applicability of predictive analytics in di-
verse industries, we aim to replicate these applica-
tions, but in the context of systematically reviewing
the literature. We formulate the following research
questions:
RQ1: How have specific terms emerging from a
systematic review evolved in the past?
RQ2: What are the closest terms (in terms of se-
mantic similarity) to these?
RQ3: What is the forecast trajectory for the im-
portance of these specified terms?
3 BACKGROUND RESEARCH
Text mining on publications has been explored for
a variety of reasons; mainly to automate the man-
ual process of reviewing relevant literature (Thomas
et al., 2011). Examples include an automated screen-
ing of studies for selection (O’Mara-Eves et al.,
2015), improving search strategy (Ananiadou et al.,
2009; Stansfield et al., 2017), automatically extract-
ing an ontology of relevant topics (Osborne et al.,
2019), and automating the selection strategy such
as facilitating the identification, description or cate-
gorisation, and the summarisation of relevant litera-
ture (Thomas et al., 2011). Besides text mining as
a part of the systematic process, analysis of the use
of text mining in systematic reviews has also been
done previously, such as for the citation screening
stage (Olorisade et al., 2016).
Many challenges faced by systematic reviewers
are similar across different disciplines, with software
engineering being no different (Marshall et al., 2015).
To support reviewers in this arduous quest, readily available tools were developed that provide text mining as part of their core capabilities, such as the Systematic Review Toolbox (Marshall and Brereton, 2015). By
combining NLP and time series, we aim to achieve
a more prescriptive-focused standpoint to expand the
current systematic review process in software engi-
neering. A similar work was based on the ACL Anthology (Asooja et al., 2016): although the premise was similar, that approach used the proceedings of only one conference, without a specific topic in question.
4 METHODOLOGY
Figure 1 shows a simplified workflow diagram to rep-
resent the sequence of steps in our proposed method-
ology. The full notebook can be found online (https://github.com/zakipauzi/enase2023/blob/main/enase2023.ipynb) and used freely for further queries and results.
Figure 1: Sequence of steps taken in our methodology.
4.1 Dataset
Our dataset of publications is based on the selection criteria of a previous mapping study (Pauzi and Capiluppi, 2023): a total of 96 papers were obtained, all dealing with NLP techniques to support (or automate) software traceability. The complete list of selected papers can be found online (https://github.com/zakipauzi/enase2023/blob/main/papers.csv), covering the years 2013 to 2021.
4.2 Text Processing
Text content from every paper in scope was processed
through the pipeline shown in Figure 1. The cleaned
text from every paper (document) is then tokenised as
unigrams, bi-grams, and tri-grams. The steps taken
were the following:
1. Convert words into tokens (tokenisation)
2. Remove noise: keep only alphanumeric charac-
ters
3. Convert all to lowercase
4. Remove stopwords (using a standard NLTK pack-
age with additional words identified)
5. Convert tokens to their lemma form (SpaCy lemmatizer: https://spacy.io/api/lemmatizer)
6. Remove letterhead or front page matter
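As an illustration of these steps, the following is a minimal Python sketch of the cleaning pipeline, assuming NLTK stopwords and the standard small spaCy English model; the extra stopword set is a hypothetical stand-in for the additional words identified during the review:

```python
import re
import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Hypothetical additions to the standard NLTK stopword list
EXTRA_STOPWORDS = {"et", "al", "fig", "table"}
STOPWORDS = set(stopwords.words("english")) | EXTRA_STOPWORDS

def clean(raw: str) -> list[str]:
    """Tokenise, keep alphanumerics only, lowercase, drop stopwords, lemmatise."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return [tok.lemma_ for tok in nlp(text)
            if tok.lemma_ not in STOPWORDS and not tok.is_space]
```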
4.3 Term Frequency - Inverse Document Frequency (TFIDF)
These N-grams (collections of N adjacent tokens: a unigram for N=1, a bi-gram for N=2, and so on) are vectorised using Term Frequency - Inverse Document Frequency (TFIDF). The TFIDF Vectorizer is a feature extraction technique for large text corpora offered by scikit-learn (Pedregosa et al., 2011). The number of times a term t (an extracted N-gram; for example, machine learning is a bi-gram term) occurs in a given document d (the term frequency, tf) is multiplied by the inverse document frequency idf: tfidf(t, d) = tf(t, d) × idf(t). This technique enables a more effective term weighting scheme than the sole count of terms, where very frequent terms would overshadow the frequencies of rarer yet more important terms. The popularity of this term-weighting technique is vast: a survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use TFIDF (Beel et al., 2016). In this paper, the TFIDF score of a term t (N-gram) extracted from the dataset is denoted as tfidf_t.
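As a minimal sketch of this vectorisation step, using scikit-learn's TfidfVectorizer; the sample documents are placeholders for the cleaned, space-joined token strings of each paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents: one cleaned, space-joined token string per paper
docs = [
    "requirement traceability link recovery nlp",
    "design architecture uml model traceability",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # unigrams, bi-grams, tri-grams
X = vectorizer.fit_transform(docs)                # shape: (n_docs, n_terms)

# tfidf_t of the term "requirement" in the first document
idx = vectorizer.vocabulary_["requirement"]
print(X[0, idx])
```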
4.4 Filtering and Visualising
We provide a historical descriptive trend of tfidf_t for top-ranked terms (by N-gram tokenisation) from two viewpoints: the trend over the years (RQ1) and semantic similarity (RQ2). Every paper has a publication date, which is used to group the tfidf_t of each term by the year it is extracted from. This is then plotted on a line graph to show how the tfidf_t of these top-ranked terms changes across the years.
From the viewpoint of semantic similarity, we use a language model pre-trained in the software engineering domain (Efstathiou et al., 2018). We vectorise the corpus tokens (N-grams) using a Word2Vec (Mikolov et al., 2013) model trained on a Stack Overflow vocabulary dataset. This technique allows us to represent words in a vector space, based on a model trained with a vocabulary from the context of software engineering. This is necessary to disambiguate polysemous words and detect synonyms by interpreting them in the context of
software engineering. The terms are then ranked according to their similarity scores to the user input's term, calculated using cosine similarity: a metric based on the angle between two vectors, irrespective of their magnitude; the smaller the angle between the vectors, the higher the similarity.
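A minimal sketch of this ranking step using gensim follows, assuming the pre-trained Stack Overflow vectors from Efstathiou et al. (2018) are available locally in word2vec binary format (the file name is an assumption):

```python
from gensim.models import KeyedVectors

# Assumed local copy of the Stack Overflow word2vec vectors (Efstathiou et al., 2018)
wv = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)

user_input = "requirement"
corpus_terms = ["regulation", "constraint", "architecture"]  # placeholder N-grams

# Rank corpus terms by cosine similarity to the user input term
ranked = sorted(
    ((term, wv.similarity(user_input, term)) for term in corpus_terms if term in wv),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```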
4.5 Time Series Analysis for Emerging
Terminologies
The core feature of our methodology is the transition from descriptive analysis (typically, what systematic reviews focus on) to predictive analysis (i.e., what importance will a term have in the future). Because publications carry timestamps, we are able to derive time series data that adds a time variate to the equation. In answering RQ3, we present the following key steps:
4.5.1 Grouping Datapoints
Firstly, we sort the papers by publication date in ascending order. The timestamp of every paper becomes the index in our table. The tfidf_t of the user input's term is then grouped by timestamp. Where there are multiple values of tfidf_t within the same period (month), we calculate their average.
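A minimal sketch of this grouping with pandas; the column names and values are placeholders for illustration:

```python
import pandas as pd

# Placeholder rows: each paper's publication date and the tfidf_t of the input term
df = pd.DataFrame({
    "published": pd.to_datetime(["2018-03-10", "2018-03-22", "2019-07-01"]),
    "tfidf_t": [0.12, 0.08, 0.15],
})

# Average tfidf_t per month, indexed by timestamp
series = (df.set_index("published")["tfidf_t"]
            .resample("M").mean()
            .dropna())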
4.5.2 Check for Stationarity
Prior to fitting our model, we need to check for stationarity: the property that the mean, variance and autocorrelation structure do not change over time. In other words, a flat-looking series, without trend and with constant variance over time. To check this, we run the rolling statistics test and the Augmented Dickey-Fuller (ADF) test (Bilgili, 1998). If stationarity is not confirmed, we have to transform the series to remove patterns of trend and seasonality (for example: log transformation, square root, proportional change, etc.). In our paper, we use the statsmodels package offered in Python (https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html).
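A minimal sketch of both checks, reusing the monthly series from the previous step:

```python
from statsmodels.tsa.stattools import adfuller

# Rolling statistics (12-month window) on the monthly tfidf_t series from 4.5.1
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

# Augmented Dickey-Fuller test; null hypothesis: the series is non-stationary
adf_stat, p_value, usedlag, nobs, crit_values, icbest = adfuller(series)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
print("Critical values:", crit_values)
if p_value < 0.05:
    print("Reject the null hypothesis: the series is stationary.")
```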
4.5.3 Fitting the Model
We use the auto-regressive integrated moving average (ARIMA) model to fit the datapoints reflecting the tfidf_t of the user input across the series. Due to space limitations, we only explore the ARIMA model using auto_arima from the pmdarima package (https://pypi.org/project/pmdarima). auto_arima operates like a grid search, in that
it tries various sets of p and q parameters (p being the number of autoregressive terms, and q the number of lagged forecast errors in the prediction equation), selecting the model that minimises the Akaike Information Criterion (AIC), an estimator of prediction error that quantifies the goodness of fit of a statistical model.
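A minimal sketch of the fit, reusing the monthly series from the previous steps:

```python
import pmdarima as pm

# Stepwise search over (p, d, q) orders, minimising AIC
model = pm.auto_arima(series, seasonal=False, stepwise=True,
                      suppress_warnings=True, error_action="ignore")
print(model.summary())

# Forecast tfidf_t for the next 24 periods (months)
forecast = model.predict(n_periods=24)
```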
5 RESULTS
Our pipeline accepts any user input that is within the vocabulary of the Word2Vec model; however, for space limitations, we will focus here on the two core phases of the SDLC, i.e., requirement and design. The user inputs requirement and design represent a user interested in knowing how these topic terms have played a part in the use of NLP for software traceability over the past years.
5.1 RQ1
We calculate the mean TFIDF scores (mean tfidf_t) of requirement and design throughout the years in scope. requirement and design are highly ranked unigrams (in terms of tfidf_t) in our dataset; the changes in their mean tfidf_t over the years are shown in Figure 2.
Figure 2: Mean tfidf_t of requirement and design over the past years.
5.2 RQ2
Based on the user inputs, we also show the terms ranked by semantic similarity (dictionary filtering), with their respective sum and mean tfidf_t. Tables 1
and 2 show the top 10 terms for requirement and
design respectively.
Table 1: Top 10 terms ranked by semantic similarity to requirement.

Rank  Term            sum tfidf_t  mean tfidf_t  similarity
1     hipaa           0.4448       0.0046        0.4147
2     regulation      0.1010       0.0011        0.4070
3     constraint      0.1511       0.0016        0.3814
4     recommendation  0.1113       0.0012        0.3766
5     design          0.2311       0.0024        0.3412
6     feature         0.3203       0.0033        0.3404
7     concept         1.0397       0.0110        0.3363
8     architecture    0.4086       0.0043        0.3340
9     case            0.2604       0.0027        0.3218
10    knowledge       0.1159       0.0012        0.3102
Table 2: Top 10 terms ranked by semantic similarity to design.

Rank  Term            sum tfidf_t  mean tfidf_t  similarity
1     architectural   0.3855       0.0040        0.6994
2     architecture    0.4086       0.0043        0.5899
3     modularization  0.1881       0.0020        0.5181
4     approach        0.7675       0.0080        0.4818
5     compositional   0.1013       0.0011        0.4669
6     concept         1.0397       0.0108        0.4568
7     uml             0.1896       0.0020        0.4494
8     functional      0.1430       0.0015        0.4489
9     consistency     0.3496       0.0036        0.4479
10    sdlc            0.1016       0.0011        0.4335
5.3 RQ3 (User Input: Requirement)
To calculate the forecast trajectory of importance (using the tfidf_t of the user input's term), we first check for stationarity. By grouping the tfidf_t by month instead of by year, we are able to collect more datapoints, and this series of tfidf_t values is used to check for stationarity and to fit the ARIMA model for forecasting.
Figure 3 shows the rolling mean and rolling standard deviation (over 12 periods/months) across the series for the user input requirement. We can see that stationarity is confirmed, as there is no indication of an upward or downward trend. Results from the ADF test also confirm this:
- The ADF statistic (-5.514) is lower than the critical values at 99% (-3.544), 95% (-2.911) and 90% (-2.593).
- The p-value is also lower than 0.05, so we can reject the null hypothesis and conclude that our data series is stationary.
Figure 3: Forecast of tfidf_t values of requirement for the next 24 periods (months).
Fitting the ARIMA model to our series, we can see the trajectory of requirement for the next 24 periods (months). Results in the same Figure 3 show that we would expect a uniform horizontal tfidf_t within the range -0.054 < tfidf_t < 0.207. As tfidf_t cannot be below zero, we disregard the negative values in the range.
5.4 RQ3 (User Input: Design)
We present Figure 4 for design. This figure shows the rolling mean and rolling standard deviation (over 12 periods/months) across the series. We can see that stationarity is confirmed, as there is no indication of an upward or downward trend.
Figure 4: Forecast of tfidf_t values of design for the next 24 periods (months).
Results from the ADF test also confirm this:
- The ADF statistic (-6.851) is lower than the critical values at 99% (-3.537), 95% (-2.908) and 90% (-2.591).
- The p-value is also lower than 0.05, so we can reject the null hypothesis and conclude that our data series is stationary.
Results in the same Figure 4 show that we would expect a uniform horizontal tfidf_t for design within the range -0.018 < tfidf_t < 0.0614. As tfidf_t cannot be below zero, we disregard the negative values in the range.
6 DISCUSSION
At the text processing stage, the content of every pa-
per had to be cleaned to remove noise. The cleaned
text was then lemmatised: a process of transforming
the morphology of each word to its root form (e.g.
best -> good). This allows a more accurate tfidf_t
representation of our terms because multiple forms
of the same root word should only be considered as
one term. Stemming, on the other hand, is a process
of truncating suffixes (e.g. computer -> comput).
This step was not included in the pipeline for this pa-
per as truncated versions of the terms were not valid
in our pre-trained language model vocabulary for se-
mantic filtering. One key limitation of lemmatisers,
however, is that not all suffixes are removed. Table 2 shows architectural and architecture as the two terms with the highest similarity to design, although both refer to the same root form architecture. In this scenario, stemming would help, as it would truncate both terms to architectur.
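To illustrate this trade-off, a small sketch follows, comparing spaCy lemmas with NLTK's Porter stemmer (the spaCy model name is the standard small English model, an assumption here):

```python
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stemmer = PorterStemmer()

for word in ["architectural", "architecture"]:
    lemma = nlp(word)[0].lemma_
    stem = stemmer.stem(word)
    print(f"{word}: lemma={lemma}, stem={stem}")

# Lemmatisation keeps the two terms distinct, while stemming
# truncates both to "architectur", collapsing them into one term.
```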
From the tfidf_t of both requirement and design, we can see the trend of importance throughout recent years, as shown in Figure 2. For requirement, we can see a spike in 2018 and 2019. design, on the other hand, shows a steadily increasing importance in terms of tfidf_t. Both terms are top-ranked unigrams in our dataset and, more importantly, this shows that both of these SDLC phases play a critical role in traceability using NLP.
Table 1 shows the top 10 semantically similar terms for requirement traceability, and Table 2 for design traceability. These semantically related terms allow us to analyse which terms play a part in these main topics and what their importance is, through their sum tfidf_t and mean tfidf_t.
The top 3 ranked terms for requirement are hipaa, regulation and constraint. It appears that, in recent years, traceability efforts relating to HIPAA (the Health Insurance Portability and Accountability Act of 1996) regulations (Atchinson and Fox, 1997) have played a major role in requirements traceability. Critical software and systems in the healthcare industry necessitate mandatory traceability, and the use of NLP in these traceability efforts is identified as important. The top-ranked terms related to
design in Table 2 are about architecture and some re-
lated to object-oriented programming concepts, such
as modularization and compositional.
Looking beyond the top 5 terms in Tables 1 and 2, we can observe some ambiguous terms that may be part of a compound term (a bi-gram or tri-gram) rather than standing on their own under the user input term. For example, case in requirement traceability may refer to a "use case" as a requirement, or to a "test case" in the SDLC testing phase. There are also overlapping terms between these two SDLC phases, which confirms that traceability efforts using NLP typically involve multiple SDLC phases.
In answering RQ3, we present Figures 3 and 4, showing the forecast trajectory of tfidf_t for requirement and design respectively. Both figures show a horizontal trajectory, which is no surprise, given the stationarity of the series in its original form. Despite the spikes in some historical datapoints, the tfidf_t for requirement is expected to stay within a maximum boundary of 0.2 for the next 2 years. For design, the maximum expected tfidf_t is 0.06 for the next 2 years.
6.1 Implications and Outlook
Our proposed solution has practical and direct im-
plications for the research community, particularly in
mining publications to harness value through predic-
tive analytics. We can also target publication venues (conferences, etc.): for example, mining papers published in previous conference proceedings (including past editions of ENASE) to forecast which research areas are expected to emerge next in the field of novel approaches to software engineering. This would be particularly useful for the organising committee and reviewers to better understand the research trajectory in the specified areas of interest, and how these efforts are distributed. It also gives an insight into how a program committee could be formed based on trends and expertise. Since systematic reviews already provide wide coverage in summarising past papers (albeit with a time lag), this would be particularly useful in ensuring comprehensive content coverage.
For practitioners in the software engineering community, datasets that are rich in free text would benefit greatly from our proposed approach. Examples include mining forum posts and Q&A websites (Stack Overflow, etc.) to analyse and predict the importance of topics of interest. In the context of systematic reviews (i.e., what was presented in this paper), the connection between systematic reviews and software engineering practice is arguably lacking
and insufficient (Cartaxo et al., 2016; da Silva et al.,
2011). This was also investigated in a recent paper by
analysing how well systematic reviews help to answer
questions posted online through data from Q&A web-
sites (Cartaxo et al., 2017). With the vast amount of
free text being made available to the masses (coupled
with timestamps), our position paper proposes a com-
bination of NLP and time series that can benefit prac-
titioners by improving the connection (through anal-
ysis of historical and future predictions) between sys-
tematic reviews and real-world problems in the soft-
ware engineering community.
6.2 Threats to Validity
External validity: We acknowledge the limited number and narrow topic scope of the datapoints. However, we present this work to showcase the applicability of industry practices (such as the case studies presented in Section 2) to forecasting emerging trends, albeit within the scope of systematically reviewing the literature.
Internal validity: Producing any forecast requires extensive analysis of the multiple factors that may influence the expected results. A key limitation of ARIMA is that it relies on the assumption that past values of the series alone can predict future values. Being aware of this does not, however, negate the usefulness of ARIMA for forecasting.
Conclusion validity: The extracted raw data (PDF text) has a non-uniform structure, which may result in less accurate observations and conclusions. There is also some degree of non-uniformity in the way the published text represents meaning. For example, the use of synonymous terms relating to one concept will dilute the tfidf_t of the concept we are interested in (by user input).
7 CONCLUSION AND FUTURE
WORK
In this position paper, we introduced a meta-analysis of publications in the form of predictive analysis, forecasting emerging research areas from systematic studies. We combined NLP techniques with time series analysis to detect and identify the trend of a specific research area, denoted by its term.
Our approach leverages recent advancements in
language models and time series. It expands the anal-
ysis of systematic reviews in a particular research fo-
cus, and its applicability can be used for other re-
search areas as well. Based on the papers from our
own SMS, we showed that our proposed approach
can (i) detect trends of importance within the terms
emerging from the literature, and (ii) forecast trends
in research based on the results of a specific system-
atic study. This paper is a glimpse into the prospects
of leveraging data science for software engineering
academic purposes.
An extension of our methodology would be to go
further in analysing semantically related terms. We
can observe that not all terms derived are of similar hi-
erarchy (in terms of the topic scope), and constructing
a taxonomy based on existing defined ontology (such
as the ACM Computing Classification System (Cassel
et al., 2013)) will allow us to match and understand
directions and categories of research efforts more ef-
fectively.
With regard to the time series analysis, future work involves testing model results to validate predictions: using a data series from the previous decade and validating the results against the current one. This helps to assess how well the solution identifies and detects the trend of importance for terms that were gaining traction in the training set. For example, in deep learning NLP, BERT was only introduced and published in 2018 (Devlin et al., 2018). If we were to look into previous papers on the topic of deep learning and transformers, we would expect relevant important terms to emerge, such as neural network and transfer learning.
In the software engineering community, emerging solutions from the field of data science have recently taken the limelight for researchers and practitioners. Our paper explores this by analysing leading indicators rather than the lagging ones that systematic reviews typically focus on. Language models and data analytics have made giant leaps of progress in recent years, and capitalising on these would be the holy grail.
REFERENCES
Ananiadou, S., Rea, B., Okazaki, N., Procter, R., and
Thomas, J. (2009). Supporting systematic reviews
using text mining. Social Science Computer Review,
27(4):509–523.
Asooja, K., Bordea, G., Vulcu, G., and Buitelaar, P. (2016).
Forecasting emerging trends from scientific literature.
In Proceedings of the Tenth International Conference
on Language Resources and Evaluation (LREC’16),
pages 417–420, Portorož, Slovenia. European Language Resources Association (ELRA).
Atchinson, B. K. and Fox, D. M. (1997). From the field:
The politics of the health insurance portability and ac-
countability act. Health Affairs, 16(3):146–150.
Beel, J., Gipp, B., Langer, S., and Breitinger, C. (2016).
Research-paper recommender systems : a literature
survey. International Journal on Digital Libraries,
17(4):305–338.
Bilgili, F. (1998). Stationarity and cointegration tests: Comparison of Engle-Granger and Johansen methodologies. Erciyes Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi, (13):131–141.
Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M.,
and Khalil, M. (2007). Lessons from applying the sys-
tematic literature review process within the software
engineering domain. Journal of Systems and Software,
80(4):571–583.
Cartaxo, B., Pinto, G., Ribeiro, D., Kamei, F., Santos, R. E.,
da Silva, F. Q., and Soares, S. (2017). Using Q&A websites as a method for assessing systematic reviews.
In 2017 IEEE/ACM 14th International Conference on
Mining Software Repositories (MSR), pages 238–242.
Cartaxo, B., Pinto, G., Vieira, E., and Soares, S. (2016).
Evidence briefings: Towards a medium to transfer
knowledge from systematic reviews to practitioners.
In Proceedings of the 10th ACM/IEEE International
Symposium on Empirical Software Engineering and
Measurement, ESEM ’16, New York, NY, USA. As-
sociation for Computing Machinery.
Cassel, L. N., Palivela, S., Marepalli, S., Padyala, A., Deep,
R., and Terala, S. (2013). The new ACM CCS and a computing ontology. In Proceedings of the 13th
ACM/IEEE-CS Joint Conference on Digital Libraries,
JCDL ’13, page 427–428, New York, NY, USA. As-
sociation for Computing Machinery.
da Silva, F. Q., Santos, A. L., Soares, S., França, A. C. C.,
Monteiro, C. V., and Maciel, F. F. (2011). Six years of
systematic literature reviews in software engineering:
An updated tertiary study. Information and Software
Technology, 53(9):899–913. Studying work practices
in Global Software Engineering.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Efstathiou, V., Chatzilenas, C., and Spinellis, D. (2018).
Word embeddings for the software engineering do-
main. In 2018 IEEE/ACM 15th International Confer-
ence on Mining Software Repositories (MSR), pages
38–41.
Kitchenham, B., Dyba, T., and Jorgensen, M. (2004).
Evidence-based software engineering. In Proceed-
ings. 26th International Conference on Software En-
gineering, pages 273–281.
Li, X., Xie, Q., Jiang, J., Zhou, Y., and Huang, L. (2019).
Identifying and monitoring the development trends of
emerging technologies using patent analysis and twit-
ter data mining: The case of perovskite solar cell
technology. Technological Forecasting and Social
Change, 146:687–705.
Marshall, C. and Brereton, P. (2015). Systematic review
toolbox: A catalogue of tools to support systematic re-
views. In Proceedings of the 19th International Con-
ference on Evaluation and Assessment in Software En-
gineering, EASE ’15, New York, NY, USA. Associa-
tion for Computing Machinery.
Marshall, C., Brereton, P., and Kitchenham, B. (2015).
Tools to support systematic reviews in software engi-
neering: A cross-domain survey using semi-structured
interviews. In Proceedings of the 19th International
Conference on Evaluation and Assessment in Software
Engineering, EASE ’15, New York, NY, USA. Asso-
ciation for Computing Machinery.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. In Bengio, Y. and LeCun, Y., editors, 1st In-
ternational Conference on Learning Representations,
ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013,
Workshop Track Proceedings.
Olorisade, B. K., de Quincey, E., Brereton, P., and Andras,
P. (2016). A critical analysis of studies that address
the use of text mining for citation screening in system-
atic reviews. In Proceedings of the 20th International
Conference on Evaluation and Assessment in Software
Engineering, EASE ’16, New York, NY, USA. Asso-
ciation for Computing Machinery.
O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., and
Ananiadou, S. (2015). Using text mining for study
identification in systematic reviews: a systematic re-
view of current approaches. Syst. Rev., 4(1):5.
Osborne, F., Muccini, H., Lago, P., and Motta, E. (2019).
Reducing the effort for systematic reviews in software
engineering. Data Science, 2(1-2):311–340.
Pauzi, Z. and Capiluppi, A. (2023). Applications of nat-
ural language processing in software traceability: A
systematic mapping study. Journal of Systems and
Software, 198. Publisher Copyright: © 2023 The Au-
thor(s).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Stansfield, C., O'Mara-Eves, A., and Thomas, J. (2017).
Text mining for search term development in system-
atic reviewing: A discussion of some methods and
challenges. Research Synthesis Methods, 8(3):355–
365.
Thomas, J., McNaught, J., and Ananiadou, S. (2011). Ap-
plications of text mining within systematic reviews.
Research Synthesis Methods, 2(1):1–14.
Zhang, H. and Ali Babar, M. (2013). Systematic reviews
in software engineering: An empirical investigation.
Information and Software Technology, 55(7):1341–
1354.