of this kind of software is also to execute the data
capture program, but the front end provides the user
with a friendly and simple operation interface. Using
the mature software, we collected 1980 reviews of a
certain mobile phone from the Huawei official flagship
store, 1980 reviews of a certain mobile phone from the
Xiaomi official flagship store, and 1980 reviews of a
certain mobile phone from the Apple Store, all posted
between January 6, 2021 and May 12, 2021. In total,
5940 reviews were collected, which we then cleaned
using Excel and Python.
The Latent Dirichlet Allocation (LDA) model is a
classical three-layer Bayesian topic model based on a
probabilistic graph, proposed by David Blei and
colleagues in 2003 (Blei, Ng, & Jordan, 2003). It is an
unsupervised text mining technique that extracts
topics from a collection of documents. The LDA model
treats a document as a collection of words with no
order between them. It assumes that a document has
multiple topics and that each topic corresponds to a
probability distribution over words. In the generative
process of a document, a topic is first selected with a
certain probability, and a word is then selected under
this topic with a certain probability, which generates
the first word of the document; this process is
repeated until the whole document is generated. Using
LDA is the reverse of this generative process: given an
observed document, the model infers the topics of the
document and the words corresponding to these
topics. The LDA model is widely used in text mining,
for example in text topic recognition, text
classification, and text similarity calculation (Wang,
Zou, & Liu, 2018).
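The generative process described above can be
illustrated with a minimal sketch. It is not the
implementation used in this paper: the toy vocabulary,
the two hand-written topics, the Dirichlet parameter
`doc_alpha`, and the helper names `sample_dirichlet`
and `generate_document` are all assumptions made for
illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocabulary, doc_alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic proportions.
    theta = sample_dirichlet(doc_alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        # Select a topic with probability given by the topic proportions.
        k = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Select a word with probability given by that topic's distribution.
        w = rng.choices(vocabulary, weights=topic_word_probs[k], k=1)[0]
        topics.append(k)
        words.append(w)
    return theta, topics, words


# Toy vocabulary and two hypothetical topics (illustrative values only).
vocabulary = ["battery", "screen", "camera", "price", "delivery", "service"]
topic_word_probs = [
    [0.35, 0.30, 0.30, 0.03, 0.01, 0.01],  # topic 0: hardware
    [0.02, 0.02, 0.01, 0.35, 0.30, 0.30],  # topic 1: purchase experience
]

rng = random.Random(0)
theta, topics, words = generate_document(
    topic_word_probs, vocabulary, doc_alpha=[0.5, 0.5], n_words=8, rng=rng
)
print([round(p, 3) for p in theta])
print(list(zip(topics, words)))
```

Fitting an LDA model runs in the opposite direction:
only the words are observed, and the topic proportions
and the topic-word distributions are estimated from
them.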
2.2 Data Analysis
As shown in Fig. 1, data cleaning is divided into
five steps:
Step 1: deal with irrelevant data, mainly the various
emojis in sentences, by deleting them or replacing
them with text of similar meaning;
Step 2: delete blank data, mainly entries without a
comment, for which the system automatically fills in
“this user did not fill in the comments!”;
Step 3: data deduplication, that is, delete completely
duplicated data;
Step 4: sentence filtering, that is, remove sentences
with a length of less than 4, because such sentences
are not of practical significance;
Step 5: delete repeated words within a sentence; for
example, in “I like, like, like, like this phone”, the
repeated “like” is eliminated.
After data cleaning, 1856, 1979, and 1965 reviews
remained for Apple, Xiaomi, and Huawei,
respectively.
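The five cleaning steps can be sketched in Python as
follows. This is a minimal sketch under stated
assumptions, not the script used in this paper: the
emoji ranges are an approximation, sentence length is
assumed to be counted in characters, the placeholder
string is the English rendering quoted above, and the
repeated-word rule only handles repeats separated by
spaces or commas, as in the quoted example (unsegmented
Chinese text would need a different rule). The names
`clean_reviews`, `EMOJI_RE`, `REPEAT_RE`,
`DEFAULT_COMMENT`, and `MIN_LENGTH` are hypothetical.

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
# Placeholder the platform inserts for empty reviews (English rendering).
DEFAULT_COMMENT = "this user did not fill in the comments!"
# A word repeated consecutively, separated by spaces or commas.
REPEAT_RE = re.compile(r"\b(\w+)(?:[\s,，、]+\1\b)+")
# Minimum sentence length, assumed here to be counted in characters.
MIN_LENGTH = 4


def clean_reviews(reviews):
    """Apply the five cleaning steps in order and return the kept reviews."""
    cleaned, seen = [], set()
    for text in reviews:
        # Step 1: remove emojis (deletion only; no replacement text).
        text = EMOJI_RE.sub("", text).strip()
        # Step 2: drop blank reviews and the automatic placeholder.
        if not text or text == DEFAULT_COMMENT:
            continue
        # Step 3: drop exact duplicates.
        if text in seen:
            continue
        seen.add(text)
        # Step 4: drop sentences shorter than the minimum length.
        if len(text) < MIN_LENGTH:
            continue
        # Step 5: collapse consecutive repeated words into one.
        text = REPEAT_RE.sub(r"\1", text)
        cleaned.append(text)
    return cleaned


raw = [
    "I like, like, like, like this phone",
    "Great screen 😀",
    "Great screen 😀",
    "this user did not fill in the comments!",
    "ok",
    "",
]
print(clean_reviews(raw))
# -> ['I like this phone', 'Great screen']
```

Deduplication is applied after emoji removal here, so
two reviews that differ only in their emojis count as
duplicates; whether the original pipeline treats them
that way is not stated.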
[Figure 1 content. Data collection: reviews of a
certain phone from the Huawei, Xiaomi, and Apple Store
official flagship stores; search on the Taobao website;
Descendant collector; storage in Excel format. Data
cleaning: deal with irrelevant data; delete blank data;
data deduplication; sentence filtering; delete the
repeated words in the sentence.]
Figure 1: Flow chart of data collection and data cleaning.
Fig. 2 shows the overall experimental design of
this paper.
Firstly, review data sets are collected and
processed for the mobile phones of the various
brands.
Secondly, Chinese word segmentation is the most
critical step in Chinese text processing, and the
quality of word segmentation directly affects the
results of text mining. In this paper, Jieba is used for
Chinese word segmentation. At the same time, a