DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering

Tomonari Masada; Yuichiro Shibata; Kiyoshi Oguri

Topics: Data Mining; Knowledge Management; Web 2.0 and Social Networking Controls

In Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 2: ICEIS, pages 5-13, 2011, Beijing, China

Authors: Tomonari Masada; Yuichiro Shibata and Kiyoshi Oguri

Affiliation: Nagasaki University, Japan

Keyword(s): Maximal substrings, Document clustering, Suffix array, Bayesian modeling.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Biomedical Engineering ; Data Engineering ; Data Mining ; Databases and Information Systems Integration ; Enterprise Information Systems ; Health Information Systems ; Information Systems Analysis and Specification ; Knowledge Management ; Ontologies and the Semantic Web ; Sensor Networks ; Signal Processing ; Society, e-Business and e-Government ; Soft Computing ; Software Agents and Internet Computing ; Web 2.0 and Social Networking Controls ; Web Information Systems and Technologies

Abstract: This paper provides experimental results showing how maximal substrings can be used as elementary features in document clustering. A maximal substring is a substring whose number of occurrences decreases whenever a single character is added at its head or tail. We extract maximal substrings from the given document set, reduce their variety by a simple frequency-based selection, and represent each document as a bag of maximal substrings. This extraction can be done in an unsupervised manner. Our experiment compares the bag of maximal substrings representation with the bag of words representation in document clustering. For clustering documents, we use Dirichlet compound multinomials, a Bayesian version of multinomial mixtures, and evaluate the results by F-score. For Korean documents, maximal substrings were as effective as words extracted by a dictionary-based morphological analysis. For Chinese documents, maximal substrings were not as effective as words extracted by a supervised segmentation based on conditional random fields. However, one fourth of the clustering results given by the bag of maximal substrings representation achieved F-scores better than the mean F-score given by the bag of words representation. The use of maximal substrings can therefore be said to achieve an acceptable performance in document clustering.
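The definition of a maximal substring in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the keywords indicate that the paper relies on suffix arrays, whereas the code below simply counts every substring by brute force, which is only practical for short inputs. The function names and the `min_freq` and `max_len` parameters are assumptions introduced here for illustration; `min_freq` stands in for the frequency-based selection mentioned in the abstract, whose actual thresholds are not given on this page.

```python
from collections import Counter


def count_substrings(text, max_len):
    """Count every substring of length 1..max_len (overlaps included)."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            counts[text[i:j]] += 1
    return counts


def maximal_substrings(text, min_freq=2, max_len=10):
    """Return {substring: count} for substrings whose count strictly drops
    when any single character is added at the head or at the tail.

    Brute-force illustration only; min_freq and max_len are hypothetical
    parameters, not values taken from the paper.
    """
    # Count one character beyond max_len so that extensions of the
    # longest candidates can still be looked up.
    counts = count_substrings(text, max_len + 1)
    alphabet = set(text)
    result = {}
    for s, c in counts.items():
        if len(s) > max_len or c < min_freq:
            continue
        if all(
            counts.get(ch + s, 0) < c and counts.get(s + ch, 0) < c
            for ch in alphabet
        ):
            result[s] = c
    return result


def bag_of_features(doc, features):
    """Represent a document as counts of the selected substrings."""
    bag = {}
    for f in features:
        hits = sum(1 for i in range(len(doc) - len(f) + 1) if doc.startswith(f, i))
        if hits:
            bag[f] = hits
    return bag
```

For the string `"abracadabra"`, `maximal_substrings` returns `{"a": 5, "abra": 2}`. The substring `"br"` is excluded because extending it to `"abr"` leaves its count at 2, while `"abra"` is kept because every one-character extension occurs at most once.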

CC BY-NC-ND 4.0

Paper citation in several formats:

Masada, T.; Shibata, Y. and Oguri, K. (2011). DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering. In Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 2: ICEIS; ISBN 978-989-8425-53-9; ISSN 2184-4992, SciTePress, pages 5-13. DOI: 10.5220/0003403300050013

@conference{iceis11,
author={Tomonari Masada and Yuichiro Shibata and Kiyoshi Oguri},
title={DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering},
booktitle={Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2011},
pages={5--13},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003403300050013},
isbn={978-989-8425-53-9},
issn={2184-4992},
}

TY - CONF
JO - Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering
SN - 978-989-8425-53-9
IS - 2184-4992
AU - Masada, T.
AU - Shibata, Y.
AU - Oguri, K.
PY - 2011
SP - 5
EP - 13
DO - 10.5220/0003403300050013
PB - SciTePress