Topic Modelling: A Comparative Study for Short Text
Sara Lasri and El Habib Nfaoui
LISAC Laboratory, Sidi Mohammed Ben Abdellah University, Fez, Morocco
Keywords: Topic Modelling, Latent Dirichlet Allocation, Biterm Model, LDA2Vec, WNTM.
Abstract: Massive amounts of short text are collected every day. The challenging goal is therefore to find the information
we are looking for, which requires organizing, searching, classifying and understanding this large quantity of data. Topic
modelling is a well-performing technique for this problem: it provides methods to organize, understand and summarize
short, categorical text, and it is an intuitive approach to detecting the most essential topics in a short text.
1 INTRODUCTION
Topic modelling is the task of identifying which
underlying concepts are discussed within a collection
of documents and determining which topics each
document is addressing (Andra, Pietsch, Stefan,
2019).
Topic modelling is a method to uncover the hidden
semantic topics (politics, sports, business, etc.)
from the observed documents in a text corpus
(Chris Bail, 2012).
Topic modelling provides methods for
automatically organizing, understanding, searching,
and summarizing a corpus (Bhagyashree Vyankatrao
Barde, A. M. Bainwad, 2017).
Figure 1: Topic Modelling.
In general, documents are modelled as mixtures of
topics, where a topic is a probability
distribution over words (Hamed, Yongli, Chi, Xia
Xinhua, Yanchao, Liang, 2019). Statistical techniques
are then utilized to learn the topic components and
the mixture coefficients of each document (Hamed,
Yongli, Chi, Xia Xinhua, Yanchao, Liang, 2019).
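In standard notation, this mixture assumption means that the probability of observing word w in document d decomposes as

\[
  p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d),
\]

where K is the number of topics, p(z = k | d) are the mixture coefficients of document d, and p(w | z = k) is the word distribution of topic k.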
Detecting topics within short texts, such as
tweets, has become a challenge: directly applying
conventional topic models to such texts performs
poorly, because each short document provides very
little word co-occurrence information (Hamed, Yongli,
Chi, Xia Xinhua, Yanchao, Liang, 2019).
In this paper, we present different methods for
topic modelling, and we compare them to find the
most efficient one for uncovering the hidden themes
in tweets.
2 TOPIC MODELLING METHODS
2.1 Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is an unsupervised
generative probabilistic method; it is the most popular
topic modelling technique (Hamed, Yongli, Chi, Xia
Xinhua, Yanchao, Liang, 2019). The basic idea is that
documents are represented as random mixtures over
latent topics, where each topic is characterized by a
distribution over words (Hamed, Yongli, Chi, Xia
Xinhua, Yanchao, Liang, 2019).
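As an illustration only (not the implementation evaluated in this study), the following minimal Python sketch fits an LDA model with the gensim library; the toy documents, the number of topics and the priors are assumptions chosen purely for demonstration.

from gensim import corpora
from gensim.models import LdaModel

# Toy, pre-tokenized short texts (illustrative only).
texts = [
    ["election", "vote", "party"],
    ["match", "goal", "team"],
    ["market", "stock", "price"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts

# K = 2 latent topics; alpha and eta are the Dirichlet priors over the
# document-topic and topic-word distributions, respectively.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=10, random_state=0)

# Each topic is printed as a weighted list of its most probable words.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

The per-document mixture coefficients can then be inspected with lda.get_document_topics(bow) for any bag-of-words vector bow.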