# Mining Japanese Collocation by Statistical Indicators

### Takumi Sonoda, Takao Miura

#### Abstract

In this investigation, we discuss a computational approach to extracting collocations based on both data mining and statistical techniques. We extend n-grams consisting of independent words and take their frequencies after filtering on colligation. We then apply statistical filters to the candidates and compare these feature selection methods from statistical learning with each other. Five methods are evaluated: term frequency (TF), Pairwise Mutual Information (PMI), Dice Coefficient (DC), T-Score (TS), and Pairwise Log-Likelihood ratio (PLL). We found PMI, MC, and TS to be the most effective in our experiments. Using these, we achieved 88 percent accuracy in extracting collocations.
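To make the five indicators concrete, the following is a minimal sketch of how such scores might be computed for a single word pair from raw counts. It uses common textbook definitions and is not taken from the paper: the function name `collocation_scores` and its parameters are hypothetical, and the log-likelihood entry uses Dunning's G² statistic over a 2x2 contingency table as a stand-in, since the abstract does not define the pairwise variant.

```python
import math


def collocation_scores(f_xy, f_x, f_y, n):
    """Score a candidate word pair (x, y) from raw counts.

    f_xy: number of times x and y co-occur (assumed > 0)
    f_x, f_y: total counts of x and of y
    n: total number of candidate pairs in the corpus
    All names here are illustrative assumptions, not the authors' notation.
    """
    if f_xy <= 0:
        raise ValueError("f_xy must be positive")

    # Expected co-occurrence count if x and y were independent.
    expected = f_x * f_y / n

    # Mutual information of the pair, in bits.
    pmi = math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

    # Dice coefficient: overlap relative to the two marginal counts.
    dice = 2 * f_xy / (f_x + f_y)

    # T-score: observed minus expected, scaled by sqrt(observed).
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Dunning's G^2 over the 2x2 contingency table (stand-in for PLL).
    observed = [
        [f_xy, f_x - f_xy],
        [f_y - f_xy, n - f_x - f_y + f_xy],
    ]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    g2 *= 2

    return {
        "tf": f_xy,
        "pmi": pmi,
        "dice": dice,
        "t_score": t_score,
        "log_likelihood": g2,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only for illustration.
    for name, value in collocation_scores(8, 16, 32, 1024).items():
        print(f"{name}: {value:.4f}")
```

In a filtering setup like the one described above, each candidate would be scored this way and kept only if the chosen indicator exceeds a threshold; the thresholds themselves are not given in the abstract.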

#### Paper Citation

#### in Harvard Style

Sonoda T. and Miura T. (2013). **Mining Japanese Collocation by Statistical Indicators**. In *Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS*, ISBN 978-989-8565-59-4, pages 381-388. DOI: 10.5220/0004397503810388

#### in Bibtex Style

@conference{iceis13,

author={Takumi Sonoda and Takao Miura},

title={Mining Japanese Collocation by Statistical Indicators},

booktitle={Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS},

year={2013},

pages={381-388},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0004397503810388},

isbn={978-989-8565-59-4},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS

TI - Mining Japanese Collocation by Statistical Indicators

SN - 978-989-8565-59-4

AU - Sonoda T.

AU - Miura T.

PY - 2013

SP - 381

EP - 388

DO - 10.5220/0004397503810388