The second step is demand mining and demand
analysis. The STM model is used to perform text
mining on the pre-processed review data. The ratings
given by the reviewers are used to divide the
reviews into positive and negative reviews, and this
polarity is introduced into the STM as a topical
prevalence covariate before topic extraction. Topic
labels are then assigned manually according to the
words in each topic. Word clouds are used to
visualize the words that appear in key topics, and
relevant comments on certain key topics are
extracted and analyzed. The TF-IDF algorithm is also
used to extract keywords from the positive and
negative reviews. These keywords are compared with
the topic words extracted by STM to explain the
difference between STM and the conventional word
frequency method for demand mining, which
demonstrates the necessity of using STM for demand
mining.
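The TF-IDF keyword step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this article: the tokenizer, the function names `tokenize` and `tfidf_keywords`, and the choice to rank terms by their summed TF-IDF score within each group are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_keywords(reviews, labels, top_k=3):
    """Return the top_k terms per label, ranked by summed TF-IDF score.

    reviews: list of review strings.
    labels:  parallel list of group names, e.g. "positive" / "negative".
    """
    docs = [tokenize(review) for review in reviews]
    n_docs = len(docs)

    # Document frequency: number of reviews that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Accumulate each term's TF-IDF score within its review group.
    scores = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term])
            scores[label][term] += tf * idf

    return {
        label: [term for term, _ in counter.most_common(top_k)]
        for label, counter in scores.items()
    }
```

Because a term that occurs in every review has an inverse document frequency of zero, generic words such as "movie" drop out of both keyword lists, while the ranking is still driven purely by word frequency and carries no topic structure.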
The third step is demand classification analysis.
Building on the topics extracted with STM,
experience value is introduced to classify user
demands, yielding a user demand classification for
short life cycle experience products, and each
experience value category is then analyzed.
4 RESULTS & DISCUSSION
This article takes movies, a typical short life
cycle experience product, as an example, and
conducts demand mining and demand analysis on
movie user reviews.
4.1 Data Collection and Preprocessing
IMDB (Internet Movie Database, IMDB.com), as the
most detailed movie database in the world, provides a
platform for global movie critics to express their
personal opinions on movies that have been released.
Users are required to rate a movie when commenting.
The rating represents the user's overall degree of
satisfaction with the movie and reflects the user's
subjective judgment of it. This article uses the
rating attached to each review as the standard for
dividing positive reviews and negative reviews.
A web crawler written in Python is used to collect
the comments from the IMDB website, together with
the movie rating corresponding to each comment, the
reviewer ID, and the review time. This article
selects the user reviews of Captain Marvel, released
in 2019, as the data source because the number of
reviews is reasonable and the difference between the
numbers of positive and negative comments is small.
After deleting invalid comments, 6144 comments
remained; the number of comments for each star
rating is shown in Figure 3. Since the number of
reviews with more than 5 stars is much higher than
the number of reviews with fewer than 5 stars, 1-6
star reviews are regarded as negative reviews and
7-10 star reviews are regarded as positive reviews
in order to balance the numbers of positive and
negative reviews. The polarity (positive or
negative) of each review is introduced as a
covariate into the STM model.
Figure 3: Star distribution of movie reviews.
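The labelling rule from Section 4.1 (1-6 stars negative, 7-10 stars positive) can be sketched as a small helper. The function name `review_polarity` and the sample ratings are hypothetical and serve only to illustrate the threshold.

```python
from collections import Counter


def review_polarity(stars):
    """Map a 1-10 star rating to the binary polarity covariate."""
    if not 1 <= stars <= 10:
        raise ValueError(f"star rating out of range: {stars}")
    # Threshold taken from the text: 7 stars and above count as positive.
    return "positive" if stars >= 7 else "negative"


# Hypothetical ratings, used only to show how the class balance is checked.
sample_ratings = [1, 6, 7, 10, 3, 8]
polarity_counts = Counter(review_polarity(s) for s in sample_ratings)
```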
4.2 Topic Extraction
The crawled and pre-processed movie reviews are
input into STM, with the positive or negative
polarity of each review as the topical prevalence
covariate. The model is trained several times with
different numbers of topics, the optimal number of
topics is determined to be 20, and the STM model is
then used for topic extraction. Table 1 shows the
extracted topics.
“Prob” lists the words with the highest occurrence
probability in a topic. However, a word with a high
probability in one topic may also appear with a high
probability in another topic, so this metric alone
discriminates poorly between topics. STM therefore
also reports the FREX (frequency-exclusivity)
statistic, which combines a word's frequency within
a topic with its exclusivity to that topic (a
weighted harmonic mean of the two ranks), so that
topic labels are not limited to the most commonly
used words. STM also reports the metric “Lift”,
which is the probability of a word appearing in the
topic divided by the probability of the word
appearing in the entire corpus. Lift highlights
words whose frequency in the topic is much higher
than their frequency in the corpus as a whole.
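To make the difference between the three metrics concrete, the sketch below computes Lift and a FREX-style score from a small topic-word probability matrix. It is a simplified illustration rather than the code of the STM software: the function names, the toy probabilities, and the weight of 0.5 are assumptions, and FREX is written here as the weighted harmonic mean of a word's within-topic ranks (empirical CDF values) for exclusivity and frequency.

```python
def ecdf_ranks(values):
    """Empirical CDF value of each entry: the fraction of entries <= it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def lift(beta, corpus_prob):
    """Topic-word probability divided by the word's corpus-wide probability."""
    return [[b / p for b, p in zip(row, corpus_prob)] for row in beta]


def frex(beta, weight=0.5):
    """Weighted harmonic mean of exclusivity rank and frequency rank.

    beta:   list of topics, each a list of word probabilities.
    weight: share given to exclusivity (0.5 is an illustrative choice).
    """
    n_words = len(beta[0])
    # Total probability of each word across all topics.
    col_sums = [sum(row[v] for row in beta) for v in range(n_words)]
    scores = []
    for row in beta:
        # Exclusivity: share of the word's total probability held by this topic.
        exclusivity = [row[v] / col_sums[v] for v in range(n_words)]
        excl_rank = ecdf_ranks(exclusivity)
        freq_rank = ecdf_ranks(row)
        scores.append([
            1.0 / (weight / e + (1.0 - weight) / f)
            for e, f in zip(excl_rank, freq_rank)
        ])
    return scores


# Toy example with a hypothetical vocabulary: movie, marvel, plot, cgi.
beta = [
    [0.50, 0.30, 0.15, 0.05],  # topic 0
    [0.50, 0.05, 0.05, 0.40],  # topic 1
]
corpus_prob = [0.5, 0.175, 0.1, 0.225]  # assumed corpus word frequencies
frex_scores = frex(beta)
lift_scores = lift(beta, corpus_prob)
```

In this toy example the first word has the highest probability in both topics, so "Prob" cannot separate them, whereas the second word, which is frequent in topic 0 and rare in topic 1, receives the highest FREX and Lift scores for topic 0.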
Based on the words ranked by the above metrics,
a total of 20 topic labels are assigned. Among
them, 20 topics related to movie features or the
process of watching movies are selected: topic1