DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering

Tomonari Masada, Yuichiro Shibata, Kiyoshi Oguri

2011

Abstract

This paper provides experimental results showing how maximal substrings can be used as elementary features in document clustering. We extract maximal substrings, i.e., substrings whose number of occurrences decreases whenever even a single character is added at their head or tail, from the given document set, and represent each document as a bag of maximal substrings after reducing the variety of maximal substrings by a simple frequency-based selection. This extraction can be done in an unsupervised manner. Our experiment compares the bag-of-maximal-substrings representation with the bag-of-words representation in document clustering. For clustering documents, we utilize Dirichlet compound multinomials, a Bayesian version of multinomial mixtures, and measure the results by F-score. Our experiment showed that maximal substrings were as effective as words extracted by a dictionary-based morphological analysis for Korean documents. For Chinese documents, maximal substrings were not as effective as words extracted by a supervised segmentation based on conditional random fields. However, one fourth of the clustering results given by the bag-of-maximal-substrings representation achieved F-scores better than the mean F-score given by the bag-of-words representation. It can be said that the use of maximal substrings achieved an acceptable performance in document clustering.
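To make the definition concrete: a substring is maximal when its occurrences are not all preceded by one identical character and not all followed by one identical character, so any one-character extension strictly reduces the count. The sketch below is a naive, quadratic-space illustration of that definition only (the paper itself would use suffix-array techniques for efficiency); the function name `maximal_substrings` and the `min_count` frequency cutoff are illustrative assumptions, not the authors' code.

```python
from collections import Counter

def maximal_substrings(text, min_count=2):
    """Naive extraction of maximal substrings from a single string.

    A substring s is maximal when every single-character extension at
    its head or tail occurs strictly fewer times than s itself.  This
    enumerates all substrings, so it is only suitable for short demos;
    min_count mimics a simple frequency-based selection.
    """
    n = len(text)
    # Count every substring occurrence (O(n^2) substrings).
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = {}
    for s, c in counts.items():
        if c < min_count:
            continue
        # Left-maximal: no character x with count(x + s) == count(s).
        left_max = all(counts.get(x + s, 0) < c for x in alphabet)
        # Right-maximal: no character x with count(s + x) == count(s).
        right_max = all(counts.get(s + x, 0) < c for x in alphabet)
        if left_max and right_max:
            result[s] = c
    return result
```

For example, in "abab" the substring "a" is not maximal (every occurrence is followed by "b", so "ab" occurs just as often), while "ab" is; a document would then be represented by the counts of such surviving substrings.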



Paper Citation


in Harvard Style

Masada T., Shibata Y. and Oguri K. (2011). DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering. In Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8425-53-9, pages 5-13. DOI: 10.5220/0003403300050013


in Bibtex Style

@conference{iceis11,
author={Tomonari Masada and Yuichiro Shibata and Kiyoshi Oguri},
title={DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering},
booktitle={Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2011},
pages={5-13},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003403300050013},
isbn={978-989-8425-53-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document Clustering
SN - 978-989-8425-53-9
AU - Masada T.
AU - Shibata Y.
AU - Oguri K.
PY - 2011
SP - 5
EP - 13
DO - 10.5220/0003403300050013
ER -