Ministry of Education, Culture, Sports, Science and
Technology (MEXT).
DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS - An Unsupervised Feature Extraction for Document
Clustering