Authors:
Tomonari Masada, Yuichiro Shibata and Kiyoshi Oguri
Affiliation:
Nagasaki University, Japan
Keyword(s):
Maximal substrings, Document clustering, Suffix array, Bayesian modeling.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Data Engineering; Data Mining; Databases and Information Systems Integration; Enterprise Information Systems; Health Information Systems; Information Systems Analysis and Specification; Knowledge Management; Ontologies and the Semantic Web; Sensor Networks; Signal Processing; Society, e-Business and e-Government; Soft Computing; Software Agents and Internet Computing; Web 2.0 and Social Networking Controls; Web Information Systems and Technologies
Abstract:
This paper provides experimental results showing how we can use maximal substrings as elementary features
in document clustering. We extract maximal substrings, i.e., substrings whose number of occurrences
decreases when even a single character is added at the head or tail, from the given document set and
represent each document as a bag of maximal substrings after reducing the variety of maximal substrings by
a simple frequency-based selection. This extraction can be done in an unsupervised manner. Our experiment
aims to compare the bag-of-maximal-substrings representation with the bag-of-words representation in document
clustering. For clustering documents, we utilize Dirichlet compound multinomials, a Bayesian version of
multinomial mixtures, and measure the results by F-score. Our experiment showed that maximal substrings
were as effective as words extracted by a dictionary-based morphological analysis for Korean documents. For
Chinese documents, maximal substrings were not so effective as words extracted by a supervised segmentation
based on conditional random fields. However, one fourth of the clustering results given by the bag-of-maximal-substrings
representation achieved F-scores better than the mean F-score given by the bag-of-words representation.
It can be said that the use of maximal substrings achieved an acceptable performance in document clustering.
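
The definition of maximal substrings given in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: the keywords indicate that the paper relies on a suffix array, whereas this sketch counts substrings by brute force, which is only practical for tiny inputs. The function names, the `max_len` cap on substring length, and the `min_count` threshold (standing in for the "simple frequency-based selection") are all assumptions introduced for illustration.

```python
from collections import Counter


def count_substrings(docs, max_len):
    """Count every occurrence (by start position) of each substring up to max_len."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                counts[doc[i:j]] += 1
    return counts


def maximal_substrings(docs, max_len=8, min_count=2):
    """Return {substring: count} for substrings whose count drops under every
    one-character extension at the head or tail.

    max_len and min_count are hypothetical parameters, not taken from the paper.
    """
    # Count one character beyond max_len so that extensions of the longest
    # candidates are also available.
    counts = count_substrings(docs, max_len + 1)

    # For each substring s, record the largest count among its one-character
    # extensions. A string t of length >= 2 extends t[1:] at the head and
    # t[:-1] at the tail.
    best_ext = Counter()
    for t, c in counts.items():
        if len(t) >= 2:
            for s in (t[1:], t[:-1]):
                if c > best_ext[s]:
                    best_ext[s] = c

    return {
        s: c
        for s, c in counts.items()
        if len(s) <= max_len and c >= min_count and best_ext[s] < c
    }


def bag_of_maximal_substrings(doc, vocab):
    """Represent one document as counts of the selected maximal substrings."""
    bag = Counter()
    for s in vocab:
        k = sum(1 for i in range(len(doc) - len(s) + 1) if doc.startswith(s, i))
        if k:
            bag[s] = k
    return bag


if __name__ == "__main__":
    docs = ["abcabc", "xabcy"]
    vocab = maximal_substrings(docs)
    print(vocab)  # {'abc': 3}
    print([bag_of_maximal_substrings(d, vocab) for d in docs])
```

In the toy input, "ab" and "bc" each occur three times, but so does their extension "abc", so neither is maximal. Every one-character extension of "abc" occurs only once, so "abc" is the single maximal substring that survives the frequency threshold.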