Mining Japanese Collocation by Statistical Indicators

Takumi Sonoda; Takao Miura

doi:10.5220/0004397503810388

2013

Abstract

In this investigation, we discuss a computational approach to extracting collocations based on both data mining and statistical techniques. We extend n-grams consisting of independent words and take their frequencies after filtering on colligation. We then apply statistical filters to the candidates and compare these feature selection methods in statistical learning with each other. Five methods are evaluated: term frequency (TF), Pairwise Mutual Information (PMI), Dice Coefficient (DC), T-Score (TS) and Pairwise Log-Likelihood ratio (PLL). We found PMI, MC and TS the most effective in our experiments. Using these, we achieved 88 percent accuracy in extracting collocations.
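The abstract names the five indicators but does not define them. As a rough illustration only, the sketch below computes the common textbook forms of these scores for a single candidate word pair from raw counts. The function name `association_scores`, its parameters, and the choice of a 2x2 contingency-table log-likelihood ratio are assumptions made here for illustration; the paper's own definitions may differ.

```python
import math


def association_scores(c_xy, c_x, c_y, n):
    """Textbook association scores for a candidate pair (x, y).

    Hypothetical helper, not taken from the paper.
    c_xy: number of times x and y co-occur (must be > 0)
    c_x, c_y: total counts of x and of y
    n: total number of observed pairs
    """
    expected = c_x * c_y / n  # co-occurrence count expected under independence

    # Mutual information of the pair: log2 of observed over expected probability.
    pmi = math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

    # Dice coefficient: overlap relative to the two marginal counts.
    dice = 2 * c_xy / (c_x + c_y)

    # T-score: deviation from the expected count, scaled by sqrt(observed).
    t_score = (c_xy - expected) / math.sqrt(c_xy)

    # Log-likelihood ratio (G^2) over the 2x2 contingency table.
    rows = (c_x, n - c_x)
    cols = (c_y, n - c_y)
    observed = (
        (c_xy, c_x - c_xy),
        (c_y - c_xy, n - c_x - c_y + c_xy),
    )
    llr = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:  # a zero cell contributes nothing to the sum
                llr += o * math.log(o / e)
    llr *= 2

    return {"tf": c_xy, "pmi": pmi, "dice": dice, "t_score": t_score, "llr": llr}


if __name__ == "__main__":
    # Illustrative counts only; these are not data from the paper.
    for name, value in association_scores(8, 16, 16, 128).items():
        print(f"{name}: {value:.4f}")
```

Candidates would then be ranked or thresholded on one or more of these scores; which thresholds to use is not stated in the abstract.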

Paper Citation

in Harvard Style

Sonoda T. and Miura T. (2013). Mining Japanese Collocation by Statistical Indicators. In Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8565-59-4, pages 381-388. DOI: 10.5220/0004397503810388

in Bibtex Style

@conference{iceis13,
author={Takumi Sonoda and Takao Miura},
title={Mining Japanese Collocation by Statistical Indicators},
booktitle={Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2013},
pages={381-388},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004397503810388},
isbn={978-989-8565-59-4},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Mining Japanese Collocation by Statistical Indicators
SN - 978-989-8565-59-4
AU - Sonoda T.
AU - Miura T.
PY - 2013
SP - 381
EP - 388
DO - 10.5220/0004397503810388