Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings
Riku Yasutomi, Seiji Yamada, Takashi Onoda
2025
Abstract
Text mining, which aims to extract useful information from textual data, has been an active research area in recent years. This paper focuses on document classification methods that extract topics from textual data and assign documents to the extracted topics. The most representative of these methods is Latent Dirichlet Allocation (LDA). However, it has been pointed out that LDA often extracts similar topics because the topics share a large amount of information. This paper therefore proposes a document classification method based on Independent Topic Analysis (ITA), which extracts topics based on their mutual independence, and on word embeddings, which learn word co-occurrence. The approach aims to avoid extracting similar topics and to group information in a way that is closer to human intuition. As the comparison metric, we used the agreement rate between the manual classification of documents into topics and the classification produced by each method. In the comparative experiment, document classification based on ITA and word embeddings achieved the highest agreement rate. These results suggest that the proposed method can achieve document classification closer to human perception.
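The agreement rate used as the comparison metric can be illustrated with a minimal sketch. The abstract does not define the metric formally, so this assumes it is the fraction of documents whose method-assigned topic matches the manually assigned topic, and that each extracted topic has already been mapped to one of the manual topic labels. The function name `agreement_rate` and the sample labels are hypothetical, not taken from the paper.

```python
from typing import Hashable, Sequence


def agreement_rate(manual: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Fraction of documents whose predicted topic equals the manual topic.

    Assumption: both sequences are ordered by document, and predicted topics
    already use the same label names as the manual classification.
    """
    if len(manual) != len(predicted):
        raise ValueError("manual and predicted must have the same length")
    if not manual:
        raise ValueError("at least one document is required")
    matches = sum(1 for m, p in zip(manual, predicted) if m == p)
    return matches / len(manual)


# Hypothetical labels for four documents (illustrative only).
manual_labels = ["sports", "politics", "sports", "economy"]
method_labels = ["sports", "politics", "economy", "economy"]
rate = agreement_rate(manual_labels, method_labels)
```

Under these assumptions, a method that agrees with the manual classification on three of four documents scores 0.75. How the paper aligns extracted topics with manual topics is not stated in the abstract.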
Paper Citation
in Harvard Style
Yasutomi R., Yamada S. and Onoda T. (2025). Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 185-192. DOI: 10.5220/0013104100003890
in Bibtex Style
@conference{icaart25,
author={Riku Yasutomi and Seiji Yamada and Takashi Onoda},
title={Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={185-192},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013104100003890},
isbn={978-989-758-737-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Examination of Document Clustering Based on Independent Topic Analysis and Word Embeddings
SN - 978-989-758-737-5
AU - Yasutomi R.
AU - Yamada S.
AU - Onoda T.
PY - 2025
SP - 185
EP - 192
DO - 10.5220/0013104100003890
PB - SciTePress