eral public.
This news item is labeled incorrectly as Military
by the linear SVM classifier.
LLDA-C computes the document-topic distribution for this news item, shown in Table 3, from which we can see that Technology has the highest probability, and so LLDA-C correctly labels this news item as Technology.
Table 3: Document-topic distributions for the example news
item, where DTD stands for “document-topic distribution”.
Technology has the highest DTD of 0.379, Politics has the
second highest DTD of 0.192, and Military has the third
highest DTD of 0.104.
Category DTD Category DTD
Politics 0.192 Health 0.039
Technology 0.379 History 0.052
Military 0.104 Real estate 0.039
Sports 0.052 Automobiles 0.065
Entertainment 0.039 Games 0.039
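To make this labeling rule concrete, the following minimal Python sketch (not the authors' implementation) simply reuses the values of Table 3 and picks the category with the highest document-topic probability, which is how LLDA-C arrives at Technology for this news item.

    # Document-topic distribution from Table 3 (assumed already inferred by
    # LLDA-C; the inference step itself is not shown here).
    dtd = {
        "Politics": 0.192, "Technology": 0.379, "Military": 0.104,
        "Sports": 0.052, "Entertainment": 0.039, "Health": 0.039,
        "History": 0.052, "Real estate": 0.039, "Automobiles": 0.065,
        "Games": 0.039,
    }

    # LLDA-C labels the document with the category of highest probability.
    predicted = max(dtd, key=dtd.get)
    print(predicted)  # -> Technology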
SLLDA-C first computes the top words in each category of the training dataset (translated here from Chinese to English). Table 4 lists the top 19 words for each of the categories Politics, Technology, and Military in the training dataset; the top words for the other categories are omitted in this example.
Table 4: The top 19 words in each of the categories of Poli-
tics, Technology, and Military for SLLDA-C classification.
Politics Technology Military
development intelligent UAV
construction internet equipment
countryside network arms
issue market military
agriculture innovation troops
cadres business target
strengthen science reconnaissance
reform user aircraft
government robot political
economy technology fight
leadership apple missile
plan service task
policy computer aircraft
project online army
implement advertisement attack
innovation password achieve
further data test
management Silicon Valley antitank
conference signal engine
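As a rough illustration of this first SLLDA-C step, the sketch below builds per-category top-word lists from labeled training documents. It assumes, for simplicity, that the top words are the most frequent tokens in each category's training documents (the paper may derive them differently, e.g., from the LLDA topic-word distributions), and the names train_docs and top_words_per_category are illustrative only.

    from collections import Counter, defaultdict

    def top_words_per_category(train_docs, k=19):
        # train_docs: iterable of (category, list_of_tokens) pairs; tokenization
        # and stop-word removal are assumed to have been done already.
        counts = defaultdict(Counter)
        for category, tokens in train_docs:
            counts[category].update(tokens)
        # Keep the k most frequent words of each category as its "top words".
        return {c: [w for w, _ in counter.most_common(k)]
                for c, counter in counts.items()}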
For this example, SLLDA-C computes the number of top words in each category that this news item contains. In the news item quoted below, "plan" is a top word of Politics; "service", "user", "network", "data", "signal", "online", and "Internet" are top words of Technology; and "aircraft" is a top word of Military. The top words of the other categories are omitted for this example.
The Zhuhai Radio and Television station plans to launch a live service to its users. The television station deploys unmanned aircraft to perform real-time recording and send real-time network data back to the station. Transmission of pictures and video via cell phone signals is made easier than before, significantly increasing efficiency. Zhuhai online mobile phone users can log on to the station's web site and watch the current traffic conditions. The unmanned aircraft takes video of traffic at intersections and transmits the video through the Internet to the station's web site. The user clicks the traffic video in their browser, which allows them to easily view the surrounding traffic situation and acquire parking information. This brings a new experience to the general public.
We can see that this news item contains the largest number of top words in the category of Technology (seven). The number of top words it contains from each of the other categories is smaller than seven (in this example we list the top words of only three categories). Thus, SLLDA-C correctly labels this news item as Technology.
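The counting-and-labeling rule just described can be sketched as follows; the function and parameter names are illustrative rather than the authors' code.

    def sllda_c_label(doc_tokens, top_words):
        # top_words maps category -> list of top words (e.g., the output of
        # top_words_per_category above); doc_tokens is the tokenized news item.
        doc = set(doc_tokens)
        # Count how many of each category's top words occur in the document.
        counts = {c: len(doc & set(words)) for c, words in top_words.items()}
        # Label the document with the category having the largest count.
        return max(counts, key=counts.get)

For the example news item, the Technology count is 7, larger than the count for every other category, so the returned label is Technology.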
5 CONCLUSIONS
We conclude that both LLDA-C and SLLDA-C outperform SVM in precision, particularly when only a small training dataset is available, in which case SLLDA-C is much more efficient than SVM. We showed that LLDA-C is moderately better than SLLDA-C in precision, recall, and both Macro-F1 and Micro-F1 scores, while LLDA-C incurs higher time complexity than SVM. In terms of recall, LLDA-C is better than SVM, which is better than SLLDA-C. In terms of average Macro-F1 and Micro-F1 scores, the LLDA classifiers are better than SVM. To further explore classification properties, we introduced the concept of content complexity and showed that, among the news articles correctly classified by LLDA-C, SLLDA-C, and SVM, the number of SCC documents in each category correctly classified by either LLDA-C or SLLDA-C is larger than that by SVM. However, for the news articles incorrectly classified by LLDA-C, SLLDA-C, and SVM, this result does not hold.
For applications of news classification (Bai et al., 2015), if new categories are created, it is much better to start with LLDA-C, for it