Table 3. Computed semantic similarities for some Russian words. For each word, an English
translation is given in square brackets.
festivalq [festival]:
koncert [concert] (0.62), vystavka [exhibition] (0.58), spektaklq
[performance, show] (0.57), teatr [theater] (0.52), balet [ballet] (0.51),
konkurs [competition] (0.49), forum [forum] (0.49), opera [opera] (0.49),
gastrolq [(performance) tours] (0.49), kamernyj [chamber (adj.)] (0.48)
dolzhnostq [post]:
post [post] (0.59), zvanie [rank] (0.49), naznachatq [to appoint] (0.46),
otstavka [resignation] (0.46), naznachenie [appointment] (0.45), oklad
[salary] (0.45), uvolqnjatq [to fire] (0.42)
dom [house]:
kvartira [flat] (0.59), ulica [street] (0.54), zdanie [building] (0.54), dvor
[garden] (0.53), gorod [town] (0.48), komnata [room] (0.45), domik
[small house] (0.43), dacha [summer cottage] (0.39), semqja [family]
(0.38), derevnja [village] (0.38)
domashnij [domestic]:
domashnie [domestic] (0.55), koshka [cat] (0.32), zhivotnoe [animal]
(0.31), kompqjuter [computer] (0.31), privychnyj [accustomed] (0.30),
ujut [coziness] (0.29), kuxnja [kitchen] (0.29), xozjajstvo [facilities]
(0.28), piwa [food] (0.27), kurica [hen] (0.27)
begatq [to run]:
xoditq [to walk] (0.53), prygatq [to jump] (0.53), gonjatq [to race] (0.52),
metatqsja [to race in panic] (0.50), nositqsja [to rush] (0.49), broditq [to
wander] (0.45), gonjatqsja [to chase] (0.40), pobezhalyj [the color of
burnt steel] (0.40), polzatq [to creep] (0.40), chistitq [to clean] (0.40)
davno [long ago]:
pora [period] (0.55), nedavno [recently] (0.46), davnym-davno [very long
ago] (0.44), uzkij [narrow] (0.43), sej [sow! (imperative)] (0.42), kogda-to
[once upon a time] (0.39), navsegda [forever] (0.38), nikogda [never]
(0.38), rano [early] (0.35), teperq [now, nowadays] (0.34)
7 Summary and Prospects
We have presented a statistical method for the corpus-based automatic computation of
related words and evaluated it on the TOEFL synonym test. Its performance
on this task compares favorably with other purely corpus-based approaches and suggests
that sophisticated, language-dependent syntactic processing is not essential.
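As an illustration of how a synonym test of this kind can be scored, the following minimal sketch picks, for each stem word, the alternative with the highest similarity and reports the fraction of items answered correctly. The names answer_item, accuracy and toy_similarity are hypothetical, and the items and scores are invented for the example; they are not taken from the TOEFL test or from our results.

```python
from typing import Callable, Sequence, Tuple

# An item is (stem word, candidate answers, correct answer).
Item = Tuple[str, Sequence[str], str]
Similarity = Callable[[str, str], float]


def answer_item(stem: str, choices: Sequence[str], similarity: Similarity) -> str:
    """Return the candidate that is most similar to the stem word."""
    return max(choices, key=lambda choice: similarity(stem, choice))


def accuracy(items: Sequence[Item], similarity: Similarity) -> float:
    """Fraction of items for which the most similar candidate is the correct one."""
    correct = sum(
        1 for stem, choices, gold in items
        if answer_item(stem, choices, similarity) == gold
    )
    return correct / len(items)


# Invented similarity scores, standing in for corpus-derived values.
TOY_SCORES = {
    ("big", "large"): 0.60, ("big", "small"): 0.45,
    ("big", "green"): 0.10, ("big", "quick"): 0.15,
    ("begin", "start"): 0.45, ("begin", "finish"): 0.50,
    ("begin", "sleep"): 0.05, ("begin", "paint"): 0.08,
}


def toy_similarity(a: str, b: str) -> float:
    """Look up an invented score; unknown pairs count as unrelated."""
    return TOY_SCORES.get((a, b), 0.0)


ITEMS = [
    ("big", ["large", "small", "green", "quick"], "large"),
    ("begin", ["start", "finish", "sleep", "paint"], "start"),
]

score = accuracy(ITEMS, toy_similarity)
```

In this toy setting the second item is answered incorrectly because the antonym receives the higher invented score, a typical failure mode when relatedness is used as a proxy for synonymy.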
The automatically generated sample thesauri of related words for English, German
and Russian, each comprising on the order of 50,000 entries, are freely available from
the author. Unlike other thesauri, at the current stage they do not distinguish
between different kinds of relationships between words. They do, however, have one
advantage over manually created thesauri: for a given word, they do not list just a few
related words but rank all words of a large vocabulary according to their similarity
to that word. Since the distinctions obtained seem meaningful even at the higher ranks,
this feature is indispensable for some kinds of machine processing, e.g. for
word sense disambiguation and induction.
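The idea of a thesaurus entry as a complete ranking can be sketched as follows. This is a minimal sketch, assuming that each word is represented by a numeric vector and that cosine similarity is the measure; both are assumptions made for illustration and need not match the measure used in our method. The function names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_vocabulary(word: str, vectors: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank every other vocabulary word by its similarity to the given word."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Most similar words first; the full list is kept, not just the top few.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Invented toy vectors; real ones would be derived from corpus statistics.
vectors = {
    "dom": [1.0, 0.9, 0.1],
    "kvartira": [0.9, 1.0, 0.2],
    "ulica": [0.7, 0.3, 0.5],
    "begatq": [0.0, 0.1, 1.0],
}

ranking = rank_vocabulary("dom", vectors)
```

Because the whole vocabulary is ranked, a downstream component can choose its own cut-off or compare the ranks of arbitrary word pairs, rather than relying on a fixed short list of neighbours.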
Future work that we envisage includes applying our method to corpora of other
languages, adding multi-word units to the vocabulary, and finding solutions to the
problem of word ambiguity, which has not been dealt with here.