Experimental Verification of Collocation Detection Methods

Galiya S. Ybytayeva¹ᵃ, Nina F. Khairova²ᵇ, Orken Zh. Mamyrbayev³ᶜ, Kuralay Zh. Mukhsina³ᵈ and Bagashar Zh. Zhumazhanov³ᵉ

¹ Satbayev University, 22 Satpayev Street, Almaty, 050000, Kazakhstan

² National Technical University “Kharkiv Polytechnic Institute”, 2 Kirpichov Str., Kharkiv, 61000, Ukraine

³ Institute of Information and Computational Technologies, 28 Shevchenko Str., Almaty, 050010, Kazakhstan

Keywords:

Collocation, Corpus, Corpus Linguistics, Corpora, Association Measures.

Abstract:

The article describes the results of a study on identifying set phrases (collocations) in the Kazakh language. The experiment consisted of searching for and analysing bigrams that contain frequent verbs, adjectives and nouns of the Kazakh language. Applying statistical methods to corpus material makes it possible to quantify the data obtained. The article gives an overview of the MI and t-score measures used to calculate the strength of association within phrases, including their main characteristics. The purpose is to study the combinability of these lexical units, to correlate the results obtained with various association measures on different corpora, and to compare the most popular association measures.

1 INTRODUCTION

As modern information and communication technologies penetrate all areas of science, philology included, linguistic text corpora are increasingly used to study various aspects of language. In recent years, a range of methodological studies has appeared on teaching schoolchildren and students the lexical and grammatical aspects of a foreign language using various linguistic corpora (Sysoyev, 2010; Chernyakova, 2012; Ryazanova, 2012). An analysis of these and other studies shows that the authors have reached a certain agreement on the conceptual content of the term “linguistic corpus”. It refers to an organized collection of texts selected and tagged according to a specific methodology and presented electronically.

Our research focuses on the corpus of parallel texts. In this study, we understand a corpus of parallel texts as a type of linguistic corpus consisting of a source text in one language and its translation into another language or languages.

https://orcid.org/0000-0002-4243-0928

https://orcid.org/0000-0002-9826-0286

https://orcid.org/0000-0001-7569-1721

https://orcid.org/0000-0002-8627-1949

https://orcid.org/0000-0002-5035-9076

Such a corpus makes it possible to study lexical connectivity, that is, the way words combine in context.

Recently, with the growing need for automated systems, much attention has been paid to the problem of automatically extracting word combinations from texts. Various statistical indicators exist for evaluating how strongly words attract each other. Such indicators are called association measures. They make it possible to calculate the strength of the connection between the elements of a word combination and are based on the frequency of the combination and of the individual words it contains. In this way, the stability of lexical units can be quantified, which allows them to be arranged on a conditional scale from free combinations to phraseological units. In total, there are more than 80 measures for assessing the strength of association within word combinations (Pecina, 2009).

The article is organized as follows. Section 2 is

devoted to the literature review. Section 3 provides an

overview of the statistical method, Section 4 presents

the research methodology, and the ﬁnal section dis-

cusses the research ﬁndings and suggests future plans.

Ybytayeva, G., Khairova, N., Mamyrbayev, O., Mukhsina, K. and Zhumazhanov, B.

Experimental Veriﬁcation of Collocation Detection Methods.

DOI: 10.5220/0012008900003561

In Proceedings of the 5th Workshop for Young Scientists in Computer Science and Software Engineering (CSSE@SW 2022), pages 13-18

ISBN: 978-989-758-653-8; ISSN: 2975-9471

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

2 RELATED WORKS

Although the term “collocation” has only recently come into regular use, it occupies one of the most important places in modern linguistics. In a broad sense, a collocation is a combination of two or more words that tend to co-occur. Currently, collocations play a leading role in lexicographic practice (Atkins and Rundell, 2008; Kilgarriff, 2006). Special collocation dictionaries have recently been created both abroad and in Kazakhstan (Krishnamurthy, 2006; Smagulova, 2010; Zhanuzak et al., 2011).

However, existing dictionaries of set expressions are, firstly, incomplete and, secondly, often record such expressions inconsistently. This is especially true for the Kazakh language. Therefore, work on the automatic detection of collocations in texts is clearly relevant.

Currently, we see several important application

tasks that require automated methods for extracting

collocations from large corpora of texts. In particu-

lar, these tasks include the creation of dictionaries and

other lexicographic tools, the creation of ontologies,

language learning, repair of linguistic processors, and

information retrieval.

Let us brieﬂy discuss the concept of the word

combination. There are different deﬁnitions of this

concept. In general, many deﬁnitions of collocation

are based on the phenomenon of semantic and gram-

matical interdependence of phrase elements (Iordan-

skaya and Mel’chuk, 2007).

The term “collocation” was first used in the Russian scientific literature by Akhmanova (1996) in the Dictionary of Linguistic Terms. The first work in Russian linguistics devoted to the concept of collocation on Russian-language material was the monograph by Borisova (1995).

Kozhakhmetova et al. (1988) worked on the problem of translating set phrases from the Kazakh language into a foreign language without loss of meaning or of the national-cultural aspect. The scholars published a dictionary of some 2,300 set expressions. It is effective for use in verbal translation, but we believe that it would be more effective if the set phrases were divided into semantic categories.

3 STATISTICAL METHOD

Nowadays the term “collocation” is widely used in corpus linguistics, where the concept is reinterpreted or simplified compared to traditional linguistics. This approach can be called statistical. Priority is given to the frequency of co-occurrence, so word combinations in corpus linguistics can be defined as statistically stable phrases. A statistically stable combination can be either phraseological or free. In recent years, much research and development on collocations has appeared, addressing both the theoretical aspects of a statistical approach to this notion and practical methods of phrase detection.

This has become possible owing to the emergence of large representative text corpora, which provide reliable information on the frequency of a particular combination in the language as a whole. A high co-occurrence frequency seems to indicate that a combination is stable. However, frequency alone is not sufficient to speak of the preferred combinability of particular words. Therefore, a number of statistical measures (called “association measures”) have been created to calculate the strength of the relationship between the elements of a word combination. In general, these measures take into account both the co-occurrence frequency and other parameters, primarily the frequency of each individual element in the given corpus.

However, statistics alone are not enough. It remains to be determined what other requirements such statistically stable combinations should meet.

Most corpus managers can calculate the frequency of words or word forms and the frequency of their co-occurrence. Many association measures are computed from these data.
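As a minimal sketch of how such frequencies can be obtained, the Python fragment below counts word frequencies and adjacent-bigram co-occurrence frequencies in a tokenized text. The function name `count_frequencies` and the toy token list are assumptions made for illustration only; they are not taken from the study, and a real corpus manager would also handle tokenization, lemmatization and wider co-occurrence windows.

```python
from collections import Counter


def count_frequencies(tokens):
    """Return (word counts, adjacent-bigram counts, corpus size N)."""
    unigrams = Counter(tokens)
    # Pair every token with its right-hand neighbour.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(tokens)


# Toy token list, invented for illustration (not from the study corpus).
tokens = "polısıa qyzmetkeri keldi polısıa qyzmetkeri ketti".split()
unigrams, bigrams, n_tokens = count_frequencies(tokens)

print("f(n)    =", unigrams["polısıa"])
print("f(n, c) =", bigrams[("polısıa", "qyzmetkeri")])
print("N       =", n_tokens)
```

The three printed quantities are the inputs that frequency-based association measures work with.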

These measures number in the dozens. The values of association measures can be seen as indicators of the strength of the syntagmatic relationship between the elements of a phrase. See (Evert, 2004) for a description of the most common measures. MI and t-score are used more frequently than others. Some corpus managers can calculate these measures.

The MI (mutual information) measure introduced in (Church and Hanks, 1990) compares the context-dependent (joint) frequency of two words with the frequency expected if they occurred in the text independently of each other, that is, at random:

$$
\mathrm{MI}(n, c) = \log_2 \frac{f(n, c) \cdot N}{f(n) \cdot f(c)}, \tag{1}
$$

where n is the keyword (node); c is the collocate; f(n, c) is the frequency with which the keyword n occurs together with the collocate c; f(n) and f(c) are the absolute (independent) frequencies of the keyword n and the word c in the corpus (text); N is the total number of word uses in the corpus (text). The logarithm is taken to base 2, which is consistent with the MI values reported in Table 1.

If the value of MI(n, c) exceeds a certain threshold, the word combination can be considered statistically significant. If MI(n, c) is less than zero, then n and c are said to be in complementary distribution.
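The MI computation can be sketched in a few lines of Python. The function name `mutual_information` is an assumption made for illustration, and a base-2 logarithm is assumed because it reproduces the MI values in Table 1; the input numbers are those of the first row of that table.

```python
import math


def mutual_information(f_nc, f_n, f_c, n_total):
    """MI(n, c) = log2( f(n, c) * N / (f(n) * f(c)) )."""
    return math.log2(f_nc * n_total / (f_n * f_c))


# Row 1 of Table 1: f(n, c) = 4, word frequencies 8 and 906, N = 178645.
mi_row1 = mutual_information(4, 8, 906, 178645)
print(round(mi_row1, 2))  # 6.62, the value given in Table 1
```

A positive value means the pair co-occurs more often than independence would predict; a negative value means it co-occurs less often.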

The t-score also takes into account the frequencies of the keyword and its collocate, answering the question of how non-random the association between the two words is:

$$
t\text{-score} = \frac{f(n, c) - \dfrac{f(n) \cdot f(c)}{N}}{\sqrt{f(n, c)}} \tag{2}
$$
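A corresponding sketch for the t-score is given below. It assumes the standard form of the measure, in which the expected co-occurrence frequency f(n)·f(c)/N is subtracted from the observed one and the difference is divided by the square root of f(n, c); this form reproduces the t-score values in Table 1. The function name `t_score` is illustrative.

```python
import math


def t_score(f_nc, f_n, f_c, n_total):
    """t = (f(n, c) - f(n) * f(c) / N) / sqrt(f(n, c))."""
    expected = f_n * f_c / n_total  # co-occurrences expected by chance
    return (f_nc - expected) / math.sqrt(f_nc)


# Rows 1 and 10 of Table 1.
t_row1 = t_score(4, 8, 906, 178645)
t_row10 = t_score(57, 906, 294, 178645)
print(round(t_row1, 2), round(t_row10, 2))  # 1.98 7.35, as in Table 1
```

Because the denominator grows only with the square root of the co-occurrence frequency, frequent pairs obtain high t-scores even when their MI is moderate.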

4 IDENTIFICATION OF WORD COMBINATIONS BASED ON THE STATISTICAL METHOD

The aim of the work is a comparative analysis of different association measures on a corpus of the Kazakh language. In addition, we investigate how the results (the list of word combinations produced by the same measure) depend on the text material (text type).

Our dataset includes a parallel Russian-Kazakh corpus, which has been developed over three years (Khairova et al., 2019), and an XML dictionary of synonyms with crime-related vocabulary (Khairova et al., 2021). The parallel Kazakh-Russian corpus includes texts from four news sites of the Kazakh Internet space, zakon.kz, caravan.kz, lenta.kz and nur.kz, for the period from April 2018 to June 2021.

At the moment the parallel Kazakh-Russian corpus comprises 3000 texts in Russian and 3000 in Kazakh, including two thousand texts containing aligned Kazakh-Russian sentences.

We extracted the vocabulary for our XML dictionary of synonyms manually from English, Ukrainian, Kazakh and Russian texts on criminal matters. Seven main thematic categories were identified for the terms: Movement, Traffic Accident, Injure, Offense, Arrest, Trial and PD. These categories were chosen because the information resources from which the texts were taken contained the largest amount of data on the three crime-related areas “Police”, “Transfer” and “Crime” and their aforementioned subcategories. This made it possible to keep our dictionary narrowly focused. All terms were also divided by part of speech; only nouns, verbs, adjectives and word combinations were included in the dictionary. Figure 1 shows a fragment of the dictionary, which now includes about 650 basic words (over 320 nouns, over 100 adjectives, about 170 verbs and 40 word combinations) and over 2500 synonyms. It is still under active development.

Our study was based on a corpus of news texts from “nur.kz”, “zakon.kz”, “patrul.kz”, “caravan.kz” and “inform.kz”, which includes 857 texts.

Table 1 contains data for 15 word combinations with the word “police”, sorted by MI value. In addition to the word combination itself, the columns of the table show the following characteristics: Freq Word 1 & Word 2 is the co-occurrence frequency of the two words; Freq Word 1 and Freq Word 2 are the corpus frequencies of the first and the second word of the combination, one of which is the keyword and the other its collocate; Word in Corpus is the total number of words in the corpus; MI is the MI value; T-score is the t-score value.

An analysis of the data in Table 1 (15 word combinations in total) shows that the rankings of the word combinations obtained with the different measures do not coincide.
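One possible way to quantify this disagreement, which is not part of the original experiment, is to rank the 15 combinations by each measure and compute Spearman's rank correlation between the two rankings. The sketch below uses the MI and t-score values from Table 1; the helper name `ranks` is illustrative, and the simple formula used assumes there are no tied values, which holds for this table.

```python
# (row number in Table 1, MI, t-score)
rows = [
    (1, 6.62, 1.98), (2, 6.21, 1.71), (3, 6.04, 1.39), (4, 5.89, 3.41),
    (5, 5.86, 2.20), (6, 5.68, 2.59), (7, 5.62, 0.98), (8, 5.51, 1.69),
    (9, 5.32, 7.23), (10, 5.26, 7.35), (11, 4.99, 3.75), (12, 4.61, 6.07),
    (13, 4.58, 2.71), (14, 4.24, 2.84), (15, 3.64, 2.43),
]


def ranks(values):
    """Rank 1 goes to the largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


mi_ranks = ranks([r[1] for r in rows])
t_ranks = ranks([r[2] for r in rows])

# Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(rows)
d_squared = sum((a - b) ** 2 for a, b in zip(mi_ranks, t_ranks))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))

print("top by t-score: row", rows[t_ranks.index(1)][0])
print(f"Spearman rho: {rho:.2f}")
```

For the values in Table 1 the correlation comes out at roughly -0.49, so the two rankings not only differ but tend to run in opposite directions on this small sample.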

It should also be noted that the measures respond differently to the frequencies of the words that make up a combination and to their co-occurrence frequency. Thus, MI is considered to be sensitive to low-frequency words, while t-score is useful for finding high-frequency word combinations.
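This difference can be illustrated on the values of Table 1. The sketch below splits the 15 combinations into rare and frequent pairs and compares the average MI and t-score of the two groups. The cut-off of 5 co-occurrences is an arbitrary assumption chosen only for this illustration.

```python
from statistics import mean

# (co-occurrence frequency, MI, t-score) for the 15 rows of Table 1
rows = [
    (4, 6.62, 1.98), (3, 6.21, 1.71), (2, 6.04, 1.39), (12, 5.89, 3.41),
    (5, 5.86, 2.20), (7, 5.68, 2.59), (1, 5.62, 0.98), (3, 5.51, 1.69),
    (55, 5.32, 7.23), (57, 5.26, 7.35), (15, 4.99, 3.75), (40, 4.61, 6.07),
    (8, 4.58, 2.71), (9, 4.24, 2.84), (7, 3.64, 2.43),
]

THRESHOLD = 5  # arbitrary boundary between rare and frequent pairs (assumption)

rare = [r for r in rows if r[0] <= THRESHOLD]
frequent = [r for r in rows if r[0] > THRESHOLD]

for label, group in (("rare", rare), ("frequent", frequent)):
    avg_mi = mean(r[1] for r in group)
    avg_t = mean(r[2] for r in group)
    print(f"{label:8s} pairs={len(group):2d} mean MI={avg_mi:.2f} mean t={avg_t:.2f}")
```

On these data the rare pairs have the higher average MI and the lower average t-score, which matches the behaviour described above.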

We compared the word combinations generated automatically with the different association measures against data from different dictionaries. The material comprised collocations for 2 nouns without homonyms (from sozdik.kz) and 1 adjective (from the Kazakh-Russian, Russian-Kazakh Terminological Dictionary. Jurisprudence).

We call the word combinations attested in these dictionaries “correct”. Figures 2 and 3 show the MI and t-score values on the ordinate axis and the bigram ranks on the abscissa axis. Black colour marks the “correct” word combinations: those from the dictionary “sozdik.kz” (ranks 4 and 10) and an additional phrase found in the “Kazakh-Russian, Russian-Kazakh Terminological Dictionary. Jurisprudence” (rank 7).

The same tendency is observed for all the word combinations obtained: the smaller the value of the measure, the greater the probability that a word combination is not registered as a set phrase in dictionaries of the Kazakh language. Thus, we can say that the combinability data given in dictionaries correspond to the data obtained with association measures.
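A simple way to express such a comparison numerically is precision at k, that is, the share of dictionary-attested combinations among the top k candidates of a ranking. The sketch below is only an illustration: it assumes that the attested ranks 4, 7 and 10 mentioned above refer to the MI ordering of Table 1, and the function name `precision_at_k` is hypothetical.

```python
def precision_at_k(ranked_ids, attested, k):
    """Share of dictionary-attested items among the first k ranked items."""
    top = ranked_ids[:k]
    return sum(1 for item in top if item in attested) / k


# Table 1 is sorted by MI, so row numbers 1..15 are the MI ranking.
mi_ranking = list(range(1, 16))
# Rows assumed to be attested in the dictionaries (see the assumption above).
attested = {4, 7, 10}

for k in (5, 10, 15):
    print(k, round(precision_at_k(mi_ranking, attested, k), 2))
```

With so few attested items the figures are only indicative; a larger reference dictionary would be needed for a reliable estimate.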

As a result of the experiment, it also seems important to identify phrases that are not registered in any of the dictionaries. The analysis of such word combinations shows that the bigrams located at the top of the list (sorted in descending order of one of the measures) turn out to be stable, so they can be included in the list of set phrases.

As mentioned above, in addition to statistical criteria, methods based on linguistic models should also be applied. This idea has been adopted and implemented in the famous Sketch Engine (Kilgarriff et al., 2004). It yields typical word combinations for a given keyword, taking into account, on the one hand, the syntax that restricts the combinability of words in a given language and, on the other hand, possible regularities related to semantics and linguistic origin.

Figure 1: The fragment of the multilingual synonyms dictionary.

Figure 2: Values of the MI measure for collocations with the word “Police”.

Figure 3: Values of the t-score measure for collocations with the word “Police”.

It turns out that there are few “correct” collocations, but this is because the dictionary material we relied on is too small, so it needs to be expanded. We can say that a new dictionary is needed, which should contain a variety of set expressions.

The results of searching for and identifying word combinations of this type are useful for lexicographers, who select examples for dictionaries, and for linguists who study particular aspects of vocabulary and syntax.

5 CONCLUSION

When the phrases obtained with statistical methods are compared with dictionaries, the same tendency is observed: the lower the value of the measure, the less likely the phrase is to be recorded in dictionaries of the Kazakh language, and vice versa. Most of the phrases recorded in dictionaries are at the top of the list ranked by one of the association measures. Thus, the data on stable combinability given in dictionaries agree with the data obtained with association measures; in other words, statistical association measures reflect real semantic-syntagmatic relations.

Table 1: Values of associative measures for the word “Polisia”.

№   Collocation           Word 1      Word 2        Freq Word 1 & Word 2  Freq Word 1  Freq Word 2  Word in Corpus  MI    T-score
1   Patrýldik polısıa     patrýldik   polısıa       4                     8            906          178645          6.62  1.98
2   Qarjy polısıasy       qarjy       polısıasy     3                     8            906          178645          6.21  1.71
3   Polısıa jasaǵy        polısıa     jasaǵy        2                     906          6            178645          6.04  1.39
4   Polısıa bólimi        polısıa     bólimi        12                    906          40           178645          5.89  3.41
5   Polısıaǵa júginý      polısıaǵa   júginý        5                     906          17           178645          5.86  2.20
6   Polısıa shaqyrý       polısıa     shaqyrý       7                     906          27           178645          5.68  2.59
7   Askerı polısıa        askerı      polısıa       1                     4            906          178645          5.62  0.98
8   Turǵylyqty polısıa    turǵylyqty  polısıa       3                     13           906          178645          5.51  1.69
9   Polısıa qyzmetkeri    polısıa     qyzmetkeri    55                    906          272          178645          5.32  7.23
10  Polısıa basqarmasy    polısıa     basqarmasy    57                    906          294          178645          5.26  7.35
11  Polısıa qyzmeti       polısıa     qyzmeti       15                    906          93           178645          4.99  3.75
12  Polısıa departamenti  polısıa     departamenti  40                    906          323          178645          4.61  6.07
13  Polısıaǵa habarlasý   polısıaǵa   habarlasý     8                     906          66           178645          4.58  2.71
14  Polısıa basshysy      polısıa     basshysy      9                     906          94           178645          4.24  2.84
15  Polısıa kóligi        polısıa     kóligi        7                     906          111          178645          3.64  2.43

A comparative analysis of different association measures carried out on the set of all data obtained for different word classes shows the following.

The MI measure gives the best result on average. It makes it possible to identify “correct” phraseological collocations, collocations in which proper names act as collocates, and low-frequency special terms. The disadvantages of the t-score are primarily related to the fact that it favours high-frequency collocations, in particular those with auxiliary words. Therefore, in order to “remove” the most frequent words for t-score, it is necessary to set up a list of stop words whose combinations would otherwise always be at the top of the table: auxiliary words, pronouns and conjunctions. However, this also applies to other measures.
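Such a stop-word filter can be sketched as follows. The stop list, the candidate bigrams and their counts are invented for illustration and are not taken from the study; the function name `drop_stop_word_bigrams` is likewise hypothetical.

```python
# Hypothetical stop list of Kazakh function words (illustrative only).
STOP_WORDS = {"jáne", "men", "úshin", "bul"}


def drop_stop_word_bigrams(bigram_counts, stop_words):
    """Keep only bigrams in which neither word is a stop word."""
    return {
        pair: freq
        for pair, freq in bigram_counts.items()
        if pair[0] not in stop_words and pair[1] not in stop_words
    }


# Invented candidate counts, used only to show the effect of the filter.
candidates = {
    ("polısıa", "qyzmetkeri"): 55,
    ("jáne", "polısıa"): 120,
    ("bul", "polısıa"): 30,
}

filtered = drop_stop_word_bigrams(candidates, STOP_WORDS)
print(filtered)
```

Applying the filter before ranking keeps function-word pairs from occupying the top positions of a frequency-sensitive measure such as t-score.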

Whether statistical measures should be taken into

account when searching for a lemma or a phrase

remains an open question. The structural syntac-

tic formulas and semantic constraints underlying the

phrases also need to be taken into account.

In the future it is planned to test the effectiveness

of the method on a large corpus.

ACKNOWLEDGEMENTS

This work was carried out with the ﬁnancial support

of the Committee of Science of the Ministry of Ed-

ucation and Science of the Republic of Kazakhstan

(No. AR09259309).

REFERENCES

Akhmanova, O. S. (1996). Slovar’ lingvisticheskikh termi-

nov. Editorial URSS, Moscow.

Atkins, B. T. S. and Rundell, M. (2008). The Oxford Guide

to Practical Lexicography. Oxford University Press.

Borisova, Y. G. (1995). Kollokatsii. Chto eto takoye i kak

ikh izuchat’. Filologija, Moscow, 2 edition.

Chernyakova, T. A. (2012). Metodika formirovaniya lek-

sicheskikh navykov studentov na osnove lingvistich-

eskogo korpusa. The thesis for the degree of candidate

of pedagogical sciences.

Church, K. W. and Hanks, P. (1990). Word Associ-

ation Norms, Mutual Information, and Lexicogra-

phy. Computational Linguistics, 16(1):22–29. https:

//aclanthology.org/J90-1003.

Evert, S. (2004). The Statistics of Word Co-occurences: Word Pairs and Collocations. PhD thesis, Institut für Maschinelle Sprachverarbeitung (IMS), Universität Stuttgart, Stuttgart.

Iordanskaya, L. N. and Mel’chuk, I. A. (2007). Smysl i so-

chetayemost’ v slovare. Yazyki slavyanskikh kul’tur,

Moscow.

Khairova, N., Kolesnyk, A., Mamyrbayev, O., and

Mukhsina, K. (2019). The Aligned Kazakh-Russian

Parallel Corpus Focused on the Criminal Theme.

In Lytvyn, V., Sharonova, N., Hamon, T., Chered-

nichenko, O., Grabar, N., Kowalska-Styczen, A., and

Vysotska, V., editors, Proceedings of the 3rd In-

ternational Conference on Computational Linguis-

tics and Intelligent Systems (COLINS-2019). Volume

I: Main Conference, Kharkiv, Ukraine, April 18-19,

2019, volume 2362 of CEUR Workshop Proceedings,

pages 116–125. CEUR-WS.org. http://ceur-ws.org/

Vol-2362/paper11.pdf.


Khairova, N., Kolesnyk, A., Mamyrbayev, O., Ybytayeva,

G., and Lytvynenko, Y. (2021). Automatic multilin-

gual ontology generation based on texts focused on

criminal topic. In Sharonova, N., Lytvyn, V., Chered-

nichenko, O., Kupriianov, Y., Kanishcheva, O., Ha-

mon, T., Grabar, N., Vysotska, V., Kowalska-Styczen,

A., and Jonek-Kowalska, I., editors, Proceedings of

the 5th International Conference on Computational

Linguistics and Intelligent Systems (COLINS 2021).

Volume I: Main Conference, Lviv, Ukraine, April 22-

23, 2021, volume 2870 of CEUR Workshop Proceed-

ings, pages 108–117. CEUR-WS.org. http://ceur-ws.

org/Vol-2870/paper11.pdf.

Kilgarriff, A. (2006). Collocationality (and how to mea-

sure it). In Corino, E., Marello, C., and Onesti,

C., editors, Proceedings of the 12th EURALEX In-

ternational Congress, pages 997–1004, Torino, Italy.

Edizioni dell’Orso. https://euralex.org/publications/

collocationality-and-how-to-measure-it/.

Kilgarriff, A., Rychlý, P., Smrz, P., and Tugwell, D. (2004). The Sketch Engine. In Williams, G. and Vessier, S., editors, Proceedings of the 11th EURALEX International Congress, pages 105–115, Lorient, France. Université de Bretagne-Sud, Faculté des lettres et des sciences humaines. https://euralex.org/publications/the-sketch-engine/.

Kozhakhmetova, K. K., Zhaysakova, R. E., and

Kozhakhmetova, S. O. (1988). Kazakh-Russian

phraseological dictionary. Mektep, Almaty.

Krishnamurthy, R. (2006). Collocations, pages 596–600.

Elsevier, Netherlands, 2nd edition. https://doi.org/10.

1016/B0-08-044854-2/00414-4.

Pecina, P. (2009). Lexical Association Measures: Colloca-

tion Extraction. Studies in Computational and The-

oretical Linguistics. Institute of Formal and Applied

Linguistics, Prague. https://ufal.mff.cuni.cz/books/

preview/pecina

preview.pdf.

Ryazanova, Y. A. (2012). Metodika formirovaniya gram-

maticheskikh navykov rechi studentov na osnove

lingvisticheskogo korpusa. The thesis for the degree

of candidate of pedagogical sciences.

Smagulova, G. S. (2010). Magynalas frazeologizmder

sozdigi. Yeltanym baspasy, Almaty.

Sysoyev, P. V. (2010). Lingvisticheskiy korpus v metodike

obucheniya inostrannym yazykam. Yazyk i kul’tura,

1(9):99–111.

Zhanuzak, T., Omarbekov, S., and Zhunisbek, A. (2011).

Kazak adebietinin sozdigi. On bes tomdyk. Almaty.
