Bayes and Support Vector Machines (SVM). These
methods typically rely on manually designed features
and statistical models for text classification. Although
such methods can accomplish sentiment analysis to some extent, they often perform poorly on complex text data and struggle to capture deep semantic information in text. In contrast, deep learning-based techniques have gained popularity over the past decade and have become mainstream in the field of sentiment analysis; in particular, LSTM and CNN models have achieved significant results in sentiment classification tasks. These methods better capture the semantic and contextual information of text data, thereby improving the precision and effectiveness of sentiment analysis (Alsini, 2023; Contreras, 2023).
The objective of this project is to develop an LSTM-based sentiment analysis model for Weibo comments. Comments are preprocessed by segmenting them into individual characters, which improves the model's grasp of linguistic subtleties. Leveraging
pre-trained word embeddings from Tencent Artificial
Intelligence (AI) Lab enriches the model's
comprehension of context and sentiment. A
bidirectional LSTM architecture is employed to
capture both past and future context, improving
sentiment classification accuracy. The model's predictive performance is compared with that of other deep learning architectures and conventional machine learning techniques. The experiments demonstrate the LSTM model's effectiveness in analyzing the sentiment of Weibo comments, underscoring its value for social sentiment analysis and decision-making.
2 METHODOLOGIES
2.1 Dataset Description and
Preprocessing
The study utilizes a dataset comprising Weibo
comments, aiming to analyze sentiment through an
LSTM-based model. The dataset consists of text
comments which have been annotated for sentiment,
with the intention of facilitating a comprehensive
understanding of public sentiment on various topics
discussed on Weibo (ChineseNlpCorpus, 2018). Each comment is associated with a sentiment label indicating its overall polarity (positive or negative).
These comments have been meticulously collected to
ensure a diverse representation of topics, linguistic
styles, and sentiments, providing a robust foundation
for analyzing sentiment nuances in social media text.
Preprocessing the dataset is a critical step in preparing the raw text data for the LSTM model. The preprocessing pipeline involves several key stages. Character Segmentation: Given the nature of
the Chinese language, which does not use spaces to
separate words, this paper adopts a character-level segmentation approach, breaking each comment down into individual characters to capture linguistic features more effectively. Vocabulary Construction: A
vocabulary index is created from the segmented
dataset, with a maximum size set to 10,000 unique
tokens. Special tokens such as <UNK> for unknown
characters and <PAD> for padding are included to
handle out-of-vocabulary characters and to maintain uniform comment lengths, respectively. Sequence
Padding and Truncation: To ensure uniform input
sizes for the LSTM model, comments are either
padded or truncated to a fixed length of 50 characters,
as defined by the pad_size parameter. This step
ensures that each input tensor to the model maintains
a consistent shape. Tokenization and Indexing: Each
character in a comment is replaced with its
corresponding index from the vocabulary, converting
the textual data into a numerical format that can be
processed by the model. Characters not found in the
vocabulary are replaced with the index for <UNK>.
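A minimal sketch of this character-level preprocessing is given below, assuming a plain Python implementation; the function names (build_vocab, encode) and the min_freq threshold are illustrative assumptions rather than details reported by the paper.

UNK, PAD = '<UNK>', '<PAD>'
MAX_VOCAB_SIZE = 10000   # maximum vocabulary size used in this study
PAD_SIZE = 50            # fixed comment length used in this study

def build_vocab(texts, max_size=MAX_VOCAB_SIZE, min_freq=1):
    """Count character frequencies and keep the most frequent characters."""
    freq = {}
    for text in texts:
        for ch in text:                       # character-level segmentation
            freq[ch] = freq.get(ch, 0) + 1
    chars = sorted((c for c, n in freq.items() if n >= min_freq),
                   key=lambda c: freq[c], reverse=True)[:max_size]
    vocab = {c: i for i, c in enumerate(chars)}
    vocab[UNK] = len(vocab)                   # index for out-of-vocabulary characters
    vocab[PAD] = len(vocab)                   # index used to pad short comments
    return vocab

def encode(comment, vocab, pad_size=PAD_SIZE):
    """Segment a comment into characters, pad/truncate it, and map it to indices."""
    chars = list(comment)[:pad_size]                      # truncate long comments
    chars += [PAD] * (pad_size - len(chars))              # pad short comments
    return [vocab.get(ch, vocab[UNK]) for ch in chars]    # tokenization and indexing

Applying encode to every comment yields fixed-length index sequences of length 50 that can be fed directly to the embedding layer of the model.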
The preprocessed dataset is split into training, validation, and test sets in a 60%/20%/20% ratio. This split allows the model to be trained on a substantial portion of the data, tuned and validated on a second portion, and finally evaluated for generalization on unseen data.
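As a sketch of this split, assuming scikit-learn is available and that encoded_comments and labels hold the indexed comments and their sentiment labels (both names are illustrative), a 60%/20%/20% partition can be obtained in two steps:

from sklearn.model_selection import train_test_split

# Hold out 40% of the data first, then divide that portion equally into
# validation and test sets, giving a 60%/20%/20% distribution overall.
X_train, X_rest, y_train, y_rest = train_test_split(
    encoded_comments, labels, test_size=0.4, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

The fixed random seed and stratified sampling are illustrative choices that keep the class balance similar across the three subsets.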
2.2 Proposed Approach
This study explores the sentiment orientation in
Weibo comments by leveraging a model based on
LSTM networks. LSTM, recognized for its capacity to process sequential data and model long-term dependencies, is well suited to Natural Language Processing (NLP) tasks, particularly sentiment analysis. The methodology
unfolds in several pivotal phases: data preprocessing,
involving character-level segmentation and sequence
normalization; model construction, which
incorporates pre-trained word embeddings to bolster
the understanding of textual sentiment; and a series of
training, fine-tuning, and evaluation steps, employing
diverse performance metrics such as F1 scores and
accuracy to ascertain the efficacy of the model.
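A minimal PyTorch sketch of such a bidirectional LSTM classifier with pre-trained embeddings follows; the hidden size, number of layers, and dropout rate are illustrative assumptions, and pretrained_embeddings is assumed to be a float tensor whose rows align with the character vocabulary (for example, vectors drawn from the Tencent AI Lab embeddings).

import torch
import torch.nn as nn

class BiLSTMSentiment(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_size=128,
                 num_layers=2, num_classes=2, dropout=0.5):
        super().__init__()
        # Initialise the embedding layer from pre-trained vectors and
        # allow it to be fine-tuned during training.
        self.embedding = nn.Embedding.from_pretrained(
            pretrained_embeddings, freeze=False)
        self.lstm = nn.LSTM(pretrained_embeddings.size(1), hidden_size,
                            num_layers=num_layers, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # The bidirectional LSTM concatenates forward and backward states.
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):                  # x: (batch, pad_size) index tensor
        emb = self.embedding(x)            # (batch, pad_size, emb_dim)
        out, _ = self.lstm(emb)            # (batch, pad_size, 2 * hidden_size)
        return self.fc(out[:, -1, :])      # classify from the final time step

The resulting logits can be trained with a standard cross-entropy loss, and accuracy and F1 scores on the validation and test sets can then be computed with, for example, sklearn.metrics.accuracy_score and sklearn.metrics.f1_score.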
Aimed at conducting a thorough analysis of sentiment
orientations in Weibo comments, the research