world, domestic, showbiz, sports, weather, culture,
economy and science news topics, respectively.
Many extracted bigrams appear to reflect various
aspects of present-day society. Unfortunately, many
of them concern bad news. Many topics from the world
news were related to terrorism, the Islamic State, and
Edward Snowden. The large number of weather-related
topics reflects the recent extraordinary weather
throughout the world. Many topics from science news
were related to the STAP cells scandal. However, we
also extracted some pleasant topics, such as the
opening of the Hokuriku Shinkansen, a new Japanese
railway line, and excellent athletic performances by
Yuzuru Hanyu and Kei Nishikori.
The results of these two experiments are broadly
consistent with our own recollections and impressions
of the period. This suggests that our method of using
genres I and N can yield recently popular news topics.
Table 4: The topics extracted by LDA.
score   words
genre I
0.021   America, president, North Korea, Russia, Obama, Ukraine, ...
0.007   Islamic, Jordan, Syria, Iraq, Turkey, Japanese
0.030   Live, Debut, concert, AKB, Fun, dance, Idol
0.022   Mt. Ontake-san, East Japan, earthquake
0.009   Virus, influenza, allergy, vaccine, asthma, side effect, ...
0.009   Massan, Ellie, Whisky, ...
0.022   Drama, Kanbei, scene, Taiga, Hero, ...
genre D
0.014   Massan, Ellie, Whisky, Scotland, ...
0.160   date, party, girls, ...
0.103   policemen, PC, police, net, stalker, ...
0.023   Kanbei, Hanbei, onago, gozaru, child, ...
0.018   Idol, gege, GMT, memory, mother, cafe, ...
genre N
0.199   truck, bicycle, car, police, taxi, intersection, driver, ...
0.096   Obama, president, America, Washington, Snowden, ...
0.078   Tokyo Denryoku, atomic energy, tank, radioactive rays, ...
0.075   goal, soccer, team, league, World cup, ...
0.055   Russia, Ukraine, America, President, Europe
0.037   Virus, influenza, Ebola, vaccine, WHO, ...
0.018   Islamic, Jordan, Japanese, pilot, journalist,
4.4 Supplementary Experiment Using LDA
This research is potentially related to event or topic
detection. Latent Dirichlet Allocation (LDA) is a
well-known and popular topic model for event or topic
detection (Blei et al., 2003). It is a powerful model
for analyzing massive sets of text data. Many works on
topic detection have used LDA (Lau et al., 2012;
Fujimoto et al., 2011; Keane et al., 2015).
Here, we also applied LDA to our event detection
task. We detected topics in the text sets of genres I,
N, and D using the LDA implementation in the Mallet
language toolkit (McCallum, 2002). We set the number
of topics to 100 and left the other parameters at the
Mallet defaults. Some of the extracted words are shown
in Table 4. In the table, “score” is the Dirichlet
coefficient of each topic.
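As an illustration of how scores and word lists of this kind can be read, Mallet's topic-keys output lists, for each topic, its index, its Dirichlet parameter, and its top words, separated by tabs. The following Python sketch parses lines in that layout. The names `Topic` and `parse_topic_keys` and the two sample lines are hypothetical; the sketch assumes the standard three-column layout and is not a reproduction of the pipeline used in this experiment.

```python
from typing import List, NamedTuple


class Topic(NamedTuple):
    index: int
    score: float        # Dirichlet parameter reported for the topic
    words: List[str]    # top words, in the order given in the file


def parse_topic_keys(lines):
    """Parse lines in the assumed tab-separated layout: index, score, words."""
    topics = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        index, score, words = line.split("\t", 2)
        topics.append(Topic(int(index), float(score), words.split()))
    return topics


# Hypothetical sample lines in the assumed layout (not actual output).
sample = [
    "0\t0.014\tmassan ellie whisky scotland\n",
    "1\t0.160\tdate party girls\n",
]
topics = parse_topic_keys(sample)

# Rank topics by score, largest first, for inspection.
ranked = sorted(topics, key=lambda t: t.score, reverse=True)
```

Sorting by score in this way is only one possible way to choose which topics to inspect first.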
Because LDA extracts topics differently from our
method, the results cannot be compared directly. We
therefore looked for similar word groups between
genres I and D, and between genres I and N. Between
genres I and D, there are two similar word groups,
related to the TV series Massan and Kanbei. Between
genres I and N, there are three similar word groups,
related to the Islamic State, Ebola, and Russia and
Ukraine.
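A comparison of this kind can be approximated automatically. The Python sketch below pairs topics from two genres when they share at least a minimum number of words. The function name `similar_groups`, the `min_shared` threshold, and the abridged word lists are illustrative assumptions, not necessarily the procedure used in this experiment.

```python
def similar_groups(topics_a, topics_b, min_shared=1):
    """Return (i, j, shared_words) for topic pairs sharing >= min_shared words."""
    pairs = []
    for i, words_a in enumerate(topics_a):
        set_a = {w.lower() for w in words_a}
        for j, words_b in enumerate(topics_b):
            shared = set_a & {w.lower() for w in words_b}
            if len(shared) >= min_shared:
                pairs.append((i, j, sorted(shared)))
    return pairs


# Word lists abridged from Table 4 (illustrative subset only).
genre_i = [
    ["Massan", "Ellie", "Whisky"],
    ["Drama", "Kanbei", "scene", "Taiga", "Hero"],
    ["Live", "Debut", "concert", "AKB", "Fun", "dance", "Idol"],
]
genre_d = [
    ["Massan", "Ellie", "Whisky", "Scotland"],
    ["date", "party", "girls"],
    ["Kanbei", "Hanbei", "onago", "gozaru", "child"],
]

matches = similar_groups(genre_i, genre_d, min_shared=1)
for i, j, shared in matches:
    print(f"genre I topic {i} ~ genre D topic {j}: {shared}")
```

On this subset, the Massan groups share three words and the Kanbei groups share one. A single shared word can also pair unrelated topics, so matches found this way would still need manual checking.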
Relatively few word groups shared between two genres
were detected in this supplementary experiment using
LDA.
5 DISCUSSION
In the first two experiments, there was no overlap
between the 422 bigrams in genre D and the 911 bigrams
in genre N. Therefore, the total number of candidate
bigrams was 1,333 out of 3,507, and from these we
extracted 510 bigrams whose Pearson's r in genre I was
0.7 or higher. Finally, 395 of the 510 bigrams were
found to be related to topics from specific dramas or
news articles.
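The correlation filter can be sketched as follows in Python. The sketch assumes that Pearson's r is computed between a bigram's monthly frequency series in genre I and the same bigram's series in genre D or N. The function names, the bigram labels, and the toy six-month series are hypothetical and are not taken from the corpus.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; treat as 0
    return cov / sqrt(var_x * var_y)


def filter_bigrams(series_i, series_other, threshold=0.7):
    """Keep bigrams whose two series correlate at or above the threshold."""
    return [
        bigram
        for bigram in series_i
        if bigram in series_other
        and pearson_r(series_i[bigram], series_other[bigram]) >= threshold
    ]


# Toy monthly counts (hypothetical; six months instead of 28).
genre_i = {"bigram_a": [1, 2, 8, 9, 3, 1], "bigram_b": [5, 5, 4, 6, 5, 5]}
genre_n = {"bigram_a": [0, 3, 7, 10, 2, 1], "bigram_b": [9, 1, 6, 5, 2, 8]}

kept = filter_bigrams(genre_i, genre_n)
```

In this toy data, only the first bigram rises and falls together in both genres, so it is the only one that passes the 0.7 threshold.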
All four dramas related to the extracted bigrams
in genre D were big hits during the most recent 28
months. Many extracted bigrams from genre N were
related to topics that captured public attention
during these 28 months.
These results encourage us to use our method to
determine the kind of topic behind a word that appears
frequently in genre I, by examining how the same word
tends to appear in genres N and D. The results also
support our expectation that the CCTV corpus can serve
as rich, categorized chronicle data.