2 A METHOD TO EVALUATE GOVERNMENT PUBLIC SERVICE PERFORMANCE USING SOCIAL MEDIA DATA
2.1 Evaluation Data Collection
For the collection of evaluation data, an appropriate social media platform should first be selected, based on how widely it is used and on its relevance to government public services. The second step is to crawl the data, which can be done in two ways. One is to write a web crawler program in Python. Most social media platforms open their own APIs, and Python's requests library can be used to call these APIs and crawl the required data. This paper takes China's Sina Weibo as an example. After determining the search term, the advanced search function of Weibo can be used to obtain important information such as URLs and cookies; requests are then sent with Python 3.8 through the requests library, and the parsed web-page data are stored on the local computer in CSV format. The microblog collection process is shown in Fig. 1. The other way is to use a commercial data-collection platform, which usually only requires entering keywords to obtain the relevant data. The disadvantage of such platforms is that they cannot be customized at will.
Figure 1: Process of Collecting Sina Weibo Data (own-drawn).
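As an illustration of this crawling step, the following is a minimal sketch, assuming a search URL and cookie copied from a logged-in Weibo advanced-search session in the browser; the endpoint, cookie placeholder, and stored fields are illustrative assumptions, not Weibo's documented API.

import csv
import requests

SEARCH_URL = "https://s.weibo.com/weibo"  # assumed advanced-search endpoint
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "PASTE_COOKIE_FROM_BROWSER",  # copied from a logged-in session
}

def crawl(keyword, pages=10):
    """Fetch raw result pages for a search term; parsing is site-specific."""
    rows = []
    for page in range(1, pages + 1):
        resp = requests.get(SEARCH_URL,
                            params={"q": keyword, "page": page},
                            headers=HEADERS, timeout=10)
        resp.raise_for_status()
        # A real crawler would parse resp.text with lxml/BeautifulSoup here;
        # the raw HTML is stored as a stand-in.
        rows.append({"page": page, "html": resp.text})
    return rows

def save_csv(rows, path="weibo.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["page", "html"])
        writer.writeheader()
        writer.writerows(rows)

save_csv(crawl("government service"))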
2.2 Screening of Evaluation Indicators
Based on Word Frequency Analysis
(1) Feature word extraction based on TF-IDF. TF-IDF is a weighting method that mainly addresses the case of words with low frequency but high importance. TF is the term frequency, that is, the number of times a word appears in the text; IDF is the inverse document frequency, a measure of the general importance of a word; TF-IDF is the TF value multiplied by the IDF value. The formulas are as follows:
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} (1)

IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} (2)

TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_i (3)

where n_{i,j} is the number of times word t_i appears in document d_j, and |D| is the total number of documents in the corpus.
Select the top 500 words by TF-IDF value; delete the words that obviously do not match the evaluation characteristics, as well as numerals, verbs, and emotionally colored words; and merge words with regional characteristics into the generic place name 'XX'.
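As a minimal sketch of this screening step, the following assumes the crawled posts have already been word-segmented (e.g., with jieba) into whitespace-separated tokens; scikit-learn's TfidfVectorizer stands in for Eqs. (1)-(3), and the corpus and variable names are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["pre-segmented post one", "pre-segmented post two"]  # placeholder corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # rows: posts, columns: words

# Rank each word by its highest TF-IDF weight over all posts, keep the top 500
scores = tfidf.max(axis=0).toarray().ravel()
vocab = vectorizer.get_feature_names_out()
top500 = sorted(zip(vocab, scores), key=lambda x: -x[1])[:500]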
(2) Word vector acquisition based on Word2vec. Building a word vector model requires training on a corpus; either the crawled data or an existing corpus can serve as the training corpus, with the word2vec model trained by calling it from Python. Use the API of the Gensim module to load Word2Vec and set the word vector dimension. The dimension represents the number of word features: the more features, the greater the discrimination between words. However, setting the dimension too high may cause errors due to insufficient CPU resources, and an overly large dimension makes the relationships between words too sparse. Thus, a large corpus is generally trained with 300-500 dimensions, while a small domain-specific corpus generally uses 200-300 dimensions.
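A minimal Gensim sketch of this step is given below, assuming sentences is a list of token lists produced from the segmented corpus; vector_size=200 follows the 200-300 guideline above for a small domain-specific corpus, and the tokens shown are placeholders.

from gensim.models import Word2Vec

# Placeholder tokenized corpus; in practice, use the segmented Weibo posts
sentences = [["government", "service", "hotline"],
             ["hotline", "response", "slow"]]

model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, workers=4)
vec = model.wv["hotline"]  # the 200-dimensional vector for one word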
(3) Construct evaluation indices based on K-Means clustering. The K-Means algorithm uses Euclidean distance as its similarity index: the smaller the Euclidean distance, the higher the similarity of two words. The idea of word clustering with the K-Means algorithm is as follows: 1) randomly select k points as cluster centers; 2) calculate the distance from each word to each cluster center; 3) assign each point to the nearest cluster center, forming k clusters; 4) recalculate the centroid of each cluster; 5) repeat the above steps until the positions of the centroids no longer change or the set number of iterations is reached. The core index of the elbow method is SSE (the sum of squared errors). The relationship between SSE and k has the shape of an elbow, and the k value corresponding to the elbow is the true cluster number of the data; the formula is as follows:
SSE = \sum_{i=1}^{k} \sum_{p \in c_i} |p - m_i|^2 (4)

Among them, SSE is the clustering error, which represents the quality of the clustering effect; c_i represents the i-th cluster, p represents a sample point in c_i, and m_i represents the centroid of c_i.
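A minimal scikit-learn sketch of this clustering step is shown below; word_vectors stands in for the Word2vec vectors trained above, KMeans.inertia_ corresponds to SSE in Eq. (4), and the range of k values and the chosen k are illustrative.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder word-vector matrix (n_words x 200); in practice use model.wv.vectors
word_vectors = np.random.rand(500, 200)

# Elbow method: compute SSE (inertia_) for a range of k and look for the bend
sse = []
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(word_vectors)
    sse.append(km.inertia_)

best_k = 6  # illustrative value read off the elbow of the SSE curve
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(word_vectors)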