6 Discussion and further work
This work presents two methods for the automatic extraction of collocations in
Greek: the "mean and variance" method and $\chi^2$ testing. For $\chi^2$ testing,
we have demonstrated that it is possible to work effectively with large corpora of
the Greek language.
Various other statistical measures could be used to calculate significance, such as
mutual information (MI), the log-likelihood (LL) ratio test and the t-test, but we
chose the chi-square statistic. The reason is that the other tests assume a parametric
distribution of the data, which is unsuitable when calculating frequencies of bigrams.
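As an illustration of the chi-square statistic applied to bigram frequencies, the following Python sketch computes Pearson's statistic from a 2x2 contingency table. The function name, the argument names and the counts are hypothetical and are not taken from the experiments reported in this work. The sketch assumes that c1 counts w1 in first position and c2 counts w2 in second position.

```python
def chi_square_bigram(c12: int, c1: int, c2: int, n: int) -> float:
    """Pearson chi-square for a bigram (w1 w2) from corpus counts.

    c12: occurrences of the bigram w1 w2
    c1:  occurrences of w1 as the first word of a bigram
    c2:  occurrences of w2 as the second word of a bigram
    n:   total number of bigrams in the corpus
    """
    # Observed 2x2 contingency table.
    o11 = c12                    # w1 followed by w2
    o12 = c1 - c12               # w1 followed by some other word
    o21 = c2 - c12               # some other word followed by w2
    o22 = n - c1 - c2 + c12      # neither w1 first nor w2 second

    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        raise ValueError("degenerate table: a row or column total is zero")
    return numerator / denominator


# Critical value of chi-square with 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE_05 = 3.841

if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    score = chi_square_bigram(c12=30, c1=400, c2=250, n=100_000)
    print(score, score > CRITICAL_VALUE_05)
```

A bigram whose score exceeds the critical value would be treated as a collocation candidate at that significance level. The threshold is a parameter of the sketch, not a setting reported here.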
The likelihood ratio seems to work better than $\chi^2$ when the data are sparse.
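To make the sparse-data remark concrete, here is a minimal Python sketch of the log-likelihood ratio statistic (often written G^2) in the form described by Dunning [5], computed over the same kind of 2x2 bigram table. The function name, the argument names and the counts are hypothetical. Cells with a zero count are taken to contribute zero, which is the usual convention.

```python
import math


def log_likelihood_ratio(c12: int, c1: int, c2: int, n: int) -> float:
    """G^2 = 2 * sum(O * ln(O / E)) over the cells of a 2x2 bigram table.

    c12: occurrences of the bigram w1 w2
    c1:  occurrences of w1 as first word; c2: occurrences of w2 as second word
    n:   total number of bigrams in the corpus
    """
    observed = [
        [c12, c1 - c12],
        [c2 - c12, n - c1 - c2 + c12],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


if __name__ == "__main__":
    # Hypothetical sparse case: both words and the bigram occur exactly once.
    print(log_likelihood_ratio(c12=1, c1=1, c2=1, n=1_000_000))
```

For a table in which both words and the bigram each occur once, Pearson's statistic equals the corpus size n, because the association is perfect, whereas G^2 grows only with the logarithm of n. This is one way to see why the likelihood ratio is less easily inflated by rare events.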
MI compares the joint probability $p(w_1, w_2)$ that two words occur together with the
independent probabilities $p(w_1)$ and $p(w_2)$ that the two words occur in the data.
The form $MI(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$ does not give us
interesting results for very low frequencies.
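The low-frequency behaviour of MI can be seen in a short Python sketch that estimates the probabilities by relative frequencies. The function name, the argument names and the counts are hypothetical and serve only as an illustration.

```python
import math


def mutual_information(c12: int, c1: int, c2: int, n: int) -> float:
    """MI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ).

    Probabilities are estimated as relative frequencies, so the ratio
    (c12 / n) / ((c1 / n) * (c2 / n)) simplifies to c12 * n / (c1 * c2).
    """
    if c12 <= 0 or c1 <= 0 or c2 <= 0:
        raise ValueError("MI is undefined for zero counts")
    return math.log2((c12 * n) / (c1 * c2))


if __name__ == "__main__":
    n = 1024  # hypothetical corpus size
    # A pair seen once, built from two words that each occur once.
    rare = mutual_information(c12=1, c1=1, c2=1, n=n)
    # A pair seen 100 times, built from two words that each occur 200 times.
    frequent = mutual_information(c12=100, c1=200, c2=200, n=n)
    print(rare, frequent)
```

With these counts the pair that occurs once receives a higher score than the pair that occurs 100 times, even though a single occurrence is very weak evidence of association.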
$\chi^2$ testing is the most commonly used test of statistical significance in
computational linguistics and can be used in many different contexts.
In the future, our study could incorporate lexical knowledge to assist in extracting
collocations and to improve the results. Such knowledge could draw on a lexical
thesaurus, synonymy, hypernymy and part-of-speech tagging available for the
Greek language. Pearsen [13] has worked in a similar way using the WordNet lexicon
[12] for the English language. Using such statistical methods, we might obtain an
effective representation of the knowledge. By combining statistical methods with a
conceptual graph knowledge representation framework, we could collect valuable
information and build rich knowledge bases. In general, combining statistical
methods with computer assessment of knowledge structure seems to be an interesting
and promising next step.
7 References
1. Benson, M. (1989). The structure of the collocational dictionary. International
Journal of Lexicography, 2:1-14.
2. Carroll, J., Minnen, G., Pearce, D., Canning, Y., Devlin, S., and Tait, J. (1999).
Simplifying text for language-impaired readers. In Proceedings of the 9th Conference of
the European Chapter of the ACL (EACL '99), Bergen, Norway, June.
3. Choueka, Y., Klein, T., and Neuwitz, E. (1983). Automatic retrieval of frequent
idiomatic and collocational expressions in a large corpus. Journal for Literary and
Linguistic Computing, 4, 34-38.
4. Church, K., and Hanks, P. (1989). Word association norms, mutual information, and
lexicography. In Proceedings of the 27th Meeting of the ACL, 76-83. Also in
Computational Linguistics, 16(1).
5. Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence.
Computational Linguistics, 19(1), 61-74.