Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings

Riku Yasutomi, Seiji Yamada, Takashi Onoda

2025

Abstract

In recent years, text mining, which aims to extract useful information from textual data, has been an active area of research. This paper focuses on document classification methods that extract topics from textual data and assign documents to the extracted topics. The most representative of these methods is Latent Dirichlet Allocation (LDA). However, it has been pointed out that LDA often extracts similar topics because the topics share a large amount of information. This paper therefore proposes a document classification method based on Independent Topic Analysis (ITA), which extracts topics based on their independence, and on word embeddings, which learn word co-occurrence. This approach aims to avoid extracting similar topics and to group information in a way that is closer to human intuition. As a comparative metric, we used the agreement rate between the manual classification of documents into topics and the classification produced by each method. In the comparative experiment, document classification based on ITA and word embeddings achieved the highest agreement rate. These results suggest that the proposed method can achieve document classification closer to human perception.
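The abstract does not give a formula for the agreement rate. One plausible reading, sketched below, is the fraction of documents whose automatically assigned topic matches the manually assigned one. The function name `agreement_rate`, the toy labels, and the assumption that topic ids from both sources are already aligned are illustrative and not taken from the paper.

```python
def agreement_rate(manual_labels, predicted_labels):
    """Fraction of documents whose predicted topic equals the manual topic.

    Assumes both sequences use the same topic ids (i.e. extracted topics
    have already been matched to the manual categories).
    """
    if len(manual_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not manual_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(m == p for m, p in zip(manual_labels, predicted_labels))
    return matches / len(manual_labels)


# Hypothetical toy data: topic ids for five documents.
manual = [0, 0, 1, 2, 1]
predicted = [0, 1, 1, 2, 1]
rate = agreement_rate(manual, predicted)
print(f"agreement rate: {rate:.2f}")
```

In practice, topics extracted by an unsupervised method carry arbitrary ids, so a matching step between extracted topics and manual categories would be needed before this comparison is meaningful.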


Paper Citation


in Harvard Style

Yasutomi R., Yamada S. and Onoda T. (2025). Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 185-192. DOI: 10.5220/0013104100003890


in Bibtex Style

@conference{icaart25,
author={Riku Yasutomi and Seiji Yamada and Takashi Onoda},
title={Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={185-192},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013104100003890},
isbn={978-989-758-737-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings
SN - 978-989-758-737-5
AU - Yasutomi R.
AU - Yamada S.
AU - Onoda T.
PY - 2025
SP - 185
EP - 192
DO - 10.5220/0013104100003890
PB - SciTePress