to understand business insights that can be used for
decision-making in companies.
Up to now, the existing literature lacks comprehensive studies that combine state-of-the-art topic modeling techniques with NER to understand the evolution of topics together with their corresponding semantic information (i.e. key persons, organizations, or regions). This semantic information identifies the principal actors responsible for the generation of topics in the business market. To fill this gap, this article focuses on the application of Bidirectional Encoder Representations from Transformers (BERT)-based topic modeling, which allows us to train the model on our latest news articles dataset to find emerging semantic topics in the changing business world. In addition, the NER technique will help us uncover semantic information regarding the dominant topics in the business industry.
This paper is organized as follows. Section 2
reviews the background on topic modeling and NER.
Section 3 introduces the proposed method. Section 4 presents the discussion, and Section 5 concludes the paper.
2 BACKGROUND
Topic modeling methods are techniques that uncover the hidden latent semantic representations in a given set of documents (Mann, 2021; Grootendorst, 2022). Topic models process text documents to extract the underlying themes (topics) they cover and how these topics are linked with each other (Mann, 2021). Here, a topic is the set of most probable words in a cluster. Dynamic topic models (Grootendorst, 2022) analyse the evolution of topics over time in a set of documents and quantify their trends in real time. These techniques are applied to pieces of text, e.g. online news articles. Topic detection produces two types of output: 1) cluster output and 2) term output (Milioris, 2015). In the first approach, referred to as document-pivot, a cluster of documents represents a topic; in the latter, referred to as feature-pivot, a cluster of terms is produced (Milioris, 2015).
In the literature (Mann, 2021), there are several popular topic detection methods that fall into one of these two categories, including Latent Dirichlet Allocation (LDA), Graph-Based Feature-Pivot Topic Detection (GFeat-p), and Frequent Pattern Mining (FPM). Srivastava and Sahami (2009) discussed LDA for understanding textual data through summarization, classification, clustering, and trend analysis. Hall et al.
(2008) discussed trend analysis using temporal data
and generated visualizations using LDA topic
models. In addition, Vayansky and Kumar (2020)
clustered the documents using topic modeling and
found the topics for each cluster. Schofield et al.
(2017) discussed the impact of pre-processing on
topic modeling applications. They found that removing stop words from the text corpus has little impact on topic inference, while stemming can reduce the effectiveness of the resulting topic model. Moreover,
Guo and Diab (2011) mentioned that traditional topic
models treat words as strings without considering
predefined knowledge about word sense. They
perform inference to extract topics by calculating
word co-occurrence. However, co-occurring words may not be semantically related to the topics.
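As a concrete illustration of this co-occurrence-based inference, the sketch below (a minimal example on an invented four-document corpus, not the setup of any cited study) fits LDA on raw word counts; the model sees only which word strings co-occur, with no predefined knowledge of word sense, so "bank" is one string whether it refers to finance or a river.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "bank raised interest rates on loans",
    "central bank policy and interest rates",
    "the river bank flooded after heavy rain",
    "heavy rain caused the river to flood",
]

# LDA infers topics purely from word-count co-occurrence patterns.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic mixtures

# Each row is a distribution over the two topics and sums to 1.
print(doc_topics.round(2))
```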
To capture semantics, there have been numerous recent efforts to advance topic modeling with machine learning techniques. For instance, Deng et al. (2020) proposed
a semi-supervised learning method by applying topic
modeling and deep learning to establish a better
understanding of the customer’s voice using textual
data. In addition, Sahrawat et al. (2020) used BERT
embeddings for extracting key phrases from textual
scholarly articles. Their approach used a BERT-based
classification algorithm with a Probabilistic and
Semantic Hybrid Topic Inference (PSHTI) model to
automate the process of recognizing main topics from
the data. Moreover, Grootendorst (2022) presented a
topic model that extracts coherent topic
representation using a class-based variation of Term
Frequency-Inverse Document Frequency (TF-IDF).
More precisely, it generates document embeddings
using a pre-trained transformer-based language
model, performs clustering on the embeddings, and
consequently generates topic representations with the
class-based TF-IDF mechanism. In the literature, BERTopic remains competitive with a variety of classical topic modeling architectures, as it generates coherent topics. For this reason, it was selected for this research.
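The class-based TF-IDF step can be sketched in a few lines. In the sketch below, the toy clusters are invented for illustration; in BERTopic itself they would come from clustering transformer-based document embeddings. Each term's frequency within a cluster is re-weighted by how common the term is across all clusters, so cluster-specific words dominate the topic representation.

```python
import math
from collections import Counter

def class_tfidf(classes):
    """Class-based TF-IDF (c-TF-IDF) as in Grootendorst (2022):
    W[t, c] = tf(t, c) * log(1 + A / f(t)), where A is the average
    number of words per class and f(t) is the frequency of term t
    across all classes."""
    counts = {c: Counter(toks) for c, toks in classes.items()}
    freq = Counter()                       # f(t) across all classes
    for cnt in counts.values():
        freq.update(cnt)
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)
    return {c: {t: tf * math.log(1 + avg_words / freq[t])
                for t, tf in cnt.items()}
            for c, cnt in counts.items()}

# Toy clusters: terms shared across clusters get down-weighted.
clusters = {
    "markets": "stocks bonds stocks market company".split(),
    "tech":    "ai chips software company market".split(),
}
w = class_tfidf(clusters)
top_markets = max(w["markets"], key=w["markets"].get)
print(top_markets)  # the cluster-specific term outranks shared ones
```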
Once topic modeling is complete, NER (Shelar
et al., 2020) is used to identify named entities in text.
In our case, we have chosen three named entities,
which are persons, organizations, and geographical
regions. There are different Java and Python-based
libraries available for executing NER on text. Some
of them are spaCy, Apache OpenNLP, and