match over 70% with those of topics 1, 2, 7, 9, 11, and
12 in ITA. This suggests that the granularity of topics
extracted by ITA is not significantly different from
that of LDA. At the same time, it can be inferred that ITA mitigates the issue, seen in LDA, of redundant topics that share many common words. This allows for a more multifaceted understanding of the overall topic structure of the documents.
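As a concrete illustration of how such an overlap can be measured, the following sketch counts the shared top-ten words between every pair of LDA and ITA topics, i.e. the quantity reported in Figure 5; the topic word lists `lda_topics` and `ita_topics` are hypothetical inputs, not the study's data.

```python
def top_word_overlap(lda_topics, ita_topics):
    """Return {(lda_id, ita_id): number of common top-ten words}.

    Both arguments are assumed to map a topic id to its list of top-ten words.
    """
    return {
        (lda_id, ita_id): len(set(lda_words) & set(ita_words))
        for lda_id, lda_words in lda_topics.items()
        for ita_id, ita_words in ita_topics.items()
    }

# Pairs sharing at least seven of their top ten words correspond roughly to
# the 70% level of matching discussed above:
# strong = {k: v for k, v in top_word_overlap(lda, ita).items() if v >= 7}
```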
Figure 5: The number of matching top ten words between topics extracted by LDA and ITA.

In the comparative experiment with human classification using newspaper articles, it was found that the results of ITA and the Vector Classification based on word2vec exhibited the highest agreement rates with human classification. This indicates the potential for achieving document classification that aligns more closely with human perception through ITA and document classification based on word embeddings.
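For concreteness, a minimal sketch of this kind of embedding-based assignment is given below. It assumes each topic is represented by a set of characteristic words and each article by its constituent words, with `embed` providing word2vec vectors (e.g., a gensim `KeyedVectors` lookup); the authors' exact weighting and similarity computation may differ.

```python
import numpy as np

def mean_vector(words, embed):
    """Average the embeddings of the words found in the vocabulary."""
    vecs = [embed[w] for w in words if w in embed]
    return np.mean(vecs, axis=0)  # assumes at least one word is in-vocabulary

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_topic(article_words, topic_words, embed):
    """Assign the article to the topic whose mean word vector is most similar."""
    doc_vec = mean_vector(article_words, embed)
    scores = {tid: cosine(doc_vec, mean_vector(words, embed))
              for tid, words in topic_words.items()}
    return max(scores, key=scores.get)
```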
Moreover, the lack of a significant difference between LDA and ITA in the average number of articles assigned to each topic by humans implies that there is no substantial difference in the granularity of the extracted topics; this can also be seen from the comparison in Figure 5. The discrepancies in precision therefore appear to arise when articles are classified into topics beyond the content that LDA and ITA identify in common. The number of articles classified by humans for each topic accounted for approximately 30% of the total articles. Newspaper articles comprehensively document the events occurring each day, and this characteristic likely makes it difficult to extract large topics with a high number of assigned articles.
Additionally, the agreement rate of TF Classification was lower than that of Vector Classification. ITA is designed to minimize the common information among topics, so rare words that appear in only a limited number of documents may be assigned high importance. When the similarity between topics and documents is calculated with the TF Classification method, it is therefore high only for the few documents that contain these important yet rare words, which can lead to classification results that diverge from human perception. At the same time, the similarity to articles that do contain a topic's important words is consistently high, which could explain why the agreement rates of TF Classification were nevertheless higher than those of LDA.
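The rare-word effect described above can be illustrated with a small sketch; the weighted term-frequency score and the example weights are simplifications introduced here for illustration, not the paper's exact TF Classification formula.

```python
from collections import Counter

def tf_similarity(article_words, topic_weights):
    """Weighted term-frequency score between one article and one topic."""
    tf = Counter(article_words)
    return sum(weight * tf[word] for word, weight in topic_weights.items())

# If ITA concentrates a topic's importance on a rare word, only the few
# articles containing that word score highly, while articles a human would
# still place in the topic score close to zero.
topic = {"eclipse": 0.9, "observation": 0.1}                      # hypothetical weights
print(tf_similarity(["observation", "weather", "tokyo"], topic))  # -> 0.1
print(tf_similarity(["eclipse", "observation", "sky"], topic))    # -> 1.0
```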
While LDA estimates the degree of agreement between a document and each topic from word distributions, the proposed method calculates the agreement between topics and the constituent words of articles using word embeddings obtained through co-occurrence learning. Methods based on probabilistic generative models are known to achieve high precision in grouping documents. However, because the extracted topics share many common words and have high mutual information, a single document may correspond to multiple topics. Therefore, when evaluated by proximity to human perception, their precision may plausibly be lower.
Regarding the number of articles classified by humans, ITA showed little variation across weeks, while LDA exhibited considerable variability. A well-known drawback of LDA is the difficulty of determining optimal parameters: because the optimal settings depend on the dataset and the purpose, they must be adjusted for each application, which may require experience and trial and error. In this study, the parameters for LDA were determined based on prior research; however, the output of LDA is highly sensitive to its parameters, so changing them could alter the experimental results. The same applies to word2vec. In contrast, the results of ITA do not fluctuate significantly with user input, and the minimal variability in the number of articles classified by humans is believed to be a consequence of this characteristic.
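As an illustration of this sensitivity, the sketch below reruns gensim's LDA on the same toy corpus with different random seeds; the corpus, priors, and topic count are placeholders rather than the settings used in this study.

```python
from gensim import corpora, models

docs = [["election", "vote", "party"],
        ["party", "policy", "budget"],
        ["weather", "rain", "typhoon"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for seed in (0, 1):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                          alpha="symmetric", passes=5, random_state=seed)
    # Different seeds (or different alpha, eta, or num_topics choices) can
    # yield noticeably different topic-word distributions on the same data.
    print(lda.show_topics(num_words=3))
```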
5 CONCLUSIONS
In this paper, we classified newspaper articles into topics using ITA, a method known for the high independence of its extracted topics, together with word embeddings. We compared the classification results of our proposed method with those obtained through LDA, a widely used topic extraction