Mining Japanese Collocation by Statistical Indicators

Takumi Sonoda, Takao Miura


In this investigation, we discuss a computational approach to extracting collocations based on both data mining and statistical techniques. We extend n-grams consisting of independent words and take their frequencies after filtering on colligation. We then apply statistical filters to the candidates and compare these feature selection methods from statistical learning with one another. Five methods are evaluated: Term Frequency (TF), Pairwise Mutual Information (PMI), Dice Coefficient (DC), T-Score (TS), and Pairwise Log-Likelihood ratio (PLL). We found PMI, MC and TS the most effective in our experiments. Using these, we achieved 88 percent accuracy in extracting collocations.
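To make the five indicators concrete, the following minimal sketch scores a single candidate word pair from raw counts. It assumes the common textbook forms of these association measures (as in standard statistical NLP references); the abstract does not give the formulas, so the paper's "pairwise" variants of mutual information and log-likelihood may be defined differently. The function name `association_scores` and the parameters `f_xy`, `f_x`, `f_y`, and `n` are hypothetical and chosen only for illustration.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score a candidate pair (x, y) from raw counts.

    f_xy: number of times x and y occur together (assumed > 0)
    f_x, f_y: number of occurrences of x and of y
    n: total number of observed pairs in the corpus
    All names are illustrative, not taken from the paper.
    """
    # Co-occurrences expected if x and y were independent.
    expected = f_x * f_y / n

    # Term frequency: the raw co-occurrence count.
    tf = f_xy
    # Mutual information of the pair (log base 2).
    pmi = math.log2(f_xy / expected)
    # Dice coefficient: overlap relative to the two marginal counts.
    dice = 2 * f_xy / (f_x + f_y)
    # T-score: deviation from independence, scaled by sqrt of the observed count.
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Log-likelihood ratio (G^2) over the full 2x2 contingency table.
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows = [f_x, f_x, n - f_x, n - f_x]
    cols = [f_y, n - f_y, f_y, n - f_y]
    llr = 0.0
    for o, r, c in zip(observed, rows, cols):
        if o > 0:  # a zero cell contributes nothing to the sum
            llr += o * math.log(o / (r * c / n))
    llr *= 2

    return {"tf": tf, "pmi": pmi, "dice": dice, "t_score": t_score, "llr": llr}


if __name__ == "__main__":
    # Made-up counts, used only to show the call.
    for name, value in association_scores(8, 16, 32, 1024).items():
        print(f"{name}: {value:.4f}")
```

Candidates would then be ranked by each score and filtered with a threshold. Which thresholds and which combination of indicators to use is an experimental choice that this sketch does not attempt to reproduce.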



Paper Citation

in Harvard Style

Sonoda T. and Miura T. (2013). Mining Japanese Collocation by Statistical Indicators. In Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8565-59-4, pages 381-388. DOI: 10.5220/0004397503810388

in Bibtex Style

author={Takumi Sonoda and Takao Miura},
title={Mining Japanese Collocation by Statistical Indicators},
booktitle={Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS},

in EndNote Style

JO - Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Mining Japanese Collocation by Statistical Indicators
SN - 978-989-8565-59-4
AU - Sonoda T.
AU - Miura T.
PY - 2013
SP - 381
EP - 388
DO - 10.5220/0004397503810388