categorization of context so that it can be used to
disambiguate word senses. In the first step, we
therefore determine the category to which the
ambiguous word belongs. We then construct the
corresponding context discriminator and, since each
word sense belongs to one category, the correct sense
can be predicted (Ide and Veronis, 1998; Makki and
Homayounpour, 2008). The thesaurus is used to
determine the conceptual categories.
In this step, we extract, for each sense, the
sentences that contain the ambiguous word and then
disambiguate using the words that are collocated
with one of its senses. Because the homograph is
itself ambiguous, however, sentences cannot be
extracted through the ambiguous word alone. We
therefore use the thesaurus to find synonyms for
each sense of the ambiguous word and extract, for
each conceptual category, the sentences of the input
corpus that contain those synonyms. In practice,
some words in the thesaurus are themselves
ambiguous, and many cannot replace the ambiguous
word in a sentence; therefore, only a very limited
number of thesaurus words were used in this method,
and, to increase accuracy, some sentences were
extracted from the corpus in a supervised manner.
More precisely, the unit of extraction is the
sentence in which the ambiguous word or its synonym
occurs: a fixed window around the ambiguous word
may extend beyond the sentence, and two contiguous
sentences do not necessarily share a conceptual
connection. Our aim is to extract the words that are
collocated with the ambiguous word in the sentence,
together with their probabilities of occurrence.
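The sentence-level extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy English sentences, and the `synonyms_by_category` mapping (standing in for the thesaurus lookup) are all assumptions introduced here.

```python
from collections import Counter, defaultdict

def collect_collocates(sentences, synonyms_by_category):
    """Count, per conceptual category, the words that share a sentence
    with one of that category's cue words.

    sentences: list of token lists, one list per sentence.
    synonyms_by_category: dict mapping a category name to a set of cue
        words (hypothetical stand-in for the thesaurus synonyms).
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for category, cues in synonyms_by_category.items():
            if cues.intersection(tokens):
                # Only words of the same sentence count as collocates;
                # no window ever crosses a sentence boundary.
                counts[category].update(t for t in tokens if t not in cues)
    return counts

def occurrence_probabilities(counter):
    """Relative frequency of each collocate within one category."""
    total = sum(counter.values())
    return {word: n / total for word, n in counter.items()}

# Toy data, invented for illustration only.
sentences = [
    ["the", "lion", "hunts", "at", "night"],
    ["milk", "and", "cheese", "are", "dairy"],
    ["the", "lion", "sleeps"],
]
synonyms_by_category = {"animal": {"lion"}, "dairy": {"milk"}}

counts = collect_collocates(sentences, synonyms_by_category)
probs = occurrence_probabilities(counts["animal"])
```

Using the sentence as the unit keeps the collocate set small but avoids mixing in words from a neighbouring, possibly unrelated, sentence.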
Another difference between this method and the
previous one is the use of texts whose words have
been stemmed. The reason is that, in the collected
texts, the same word may occur in different
morphological forms, and without stemming a
separate probability would be calculated for each
form. The input sentence is treated in the same way:
whichever form a word appears in, it is replaced by
its stem. After identifying the categories, the
discriminators are constructed, similarly to (Ide
and Veronis, 1998), in the following steps:
1. For the conceptual categories, determine the
representatives of the contexts.
2. In the contexts, calculate the weights of the
words.
3. For new contexts, use the weights calculated
in step 2 to predict the sense of the ambiguous
word.
In the proposed algorithm, the above steps are
developed as follows:
2.1 Determine the Representatives of the Contexts
In the supervised mode, the sentences containing
the ambiguous word were extracted from the
Hamshahri Corpus and placed in separate conceptual
categories according to the meaning of the
ambiguous word. In the unsupervised mode, the
sentences containing a synonym of the ambiguous
word were extracted for each conceptual category.
Thus, for each conceptual category, several
sentences were obtained in a classified form.
2.2 Calculate the Weights of the Words
For every word (w) in the collected sentences, the
probability Pr(TCat|w) is calculated for each
category (TCat) of the ambiguous word using the
law of conditional probability. As in the similar
method proposed in (Makki and Homayounpour, 2008;
Yarowsky, 1992), the salient words with larger
probabilities (Formula 1) are selected, and the
logarithms of these probabilities are used as weights.
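A minimal sketch of this weighting step is given below. It assumes that the law of conditional probability is applied in its usual Bayes form, Pr(TCat|w) = Pr(w|TCat) * Pr(TCat) / Pr(w), and that all three terms are estimated from token counts. The function names, the toy counts, and the salience threshold of 0.5 are assumptions made for illustration; the text does not state a selection criterion.

```python
import math
from collections import Counter

def posterior_by_word(counts_by_category):
    """Estimate Pr(TCat|w) for every (category, word) pair.

    counts_by_category: dict mapping category -> Counter of word counts
        taken from that category's collected sentences.
    """
    total = sum(sum(c.values()) for c in counts_by_category.values())
    word_totals = Counter()
    for c in counts_by_category.values():
        word_totals.update(c)
    posterior = {}
    for category, c in counts_by_category.items():
        n_cat = sum(c.values())
        prior = n_cat / total                    # Pr(TCat), assumed from token share
        for word, n in c.items():
            likelihood = n / n_cat               # Pr(w|TCat)
            evidence = word_totals[word] / total # Pr(w)
            posterior[(category, word)] = likelihood * prior / evidence
    return posterior

def salient_weights(posterior, threshold=0.5):
    """Keep pairs above the (assumed) threshold; weight = log probability."""
    return {key: math.log(p) for key, p in posterior.items() if p > threshold}

# Toy counts, invented for illustration only.
counts_by_category = {
    "animal": Counter({"hunts": 3, "the": 2}),
    "dairy": Counter({"cheese": 4, "the": 1}),
}
posterior = posterior_by_word(counts_by_category)
weights = salient_weights(posterior)
```

In this toy data, a function word such as "the" is spread over both categories and receives a lower probability than a word that occurs in only one category.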
In (Ide and Veronis, 1998), this probability and
its logarithm are computed for all of the words in
the collected texts. The present article does the
same, but because Pr(w) is equal for all conceptual
categories, only Pr(w|TCat) is considered in the
proposed method, which makes the system faster.
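Why Pr(w) can be dropped follows from a short derivation. The sketch below assumes that the predicted category is the one maximizing the score of Section 2.3 (Formula 2); the symbol C for the set of words in the test context is introduced here and does not appear in the original.

```latex
\begin{align*}
\arg\max_{TCat} \sum_{w \in C} \log \frac{\Pr(w \mid TCat)\,\Pr(TCat)}{\Pr(w)}
 &= \arg\max_{TCat} \Bigl[ \sum_{w \in C} \log \Pr(w \mid TCat)
    + |C| \log \Pr(TCat) - \sum_{w \in C} \log \Pr(w) \Bigr] \\
 &= \arg\max_{TCat} \Bigl[ \sum_{w \in C} \log \Pr(w \mid TCat)
    + |C| \log \Pr(TCat) \Bigr]
\end{align*}
```

The last sum does not depend on TCat, so removing it leaves the ranking of the categories unchanged. Keeping only Pr(w|TCat), that is, dropping Pr(TCat) as well, preserves the ranking only if the categories are equally probable.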
2.3 Predicting Sense of Ambiguous Words
For each new ambiguous word, the system assigns a
score to every category. The score is calculated
with Formula 2, which sums the category weights of
the words in its context.
Score(TCat) = ∑ log( (Pr(w|TCat) * Pr(TCat)) / Pr(w) )    (2)
For each ambiguous word, the category with the
highest score is selected as its proper sense. For
words of the test context that have not appeared in
any of the collected contexts of a given category,
the score is calculated by Formula 3.
Pr(w|TCat) = log( 1 / N(TCat) )    (3)
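The prediction step can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: N(TCat) is not defined in the text and is taken here to be the number of word tokens collected for the category, the fallback value is added directly to the score as the word's weight, and all names and probabilities are invented.

```python
import math

def score(context, category, word_prob, prior, word_marginal, n_cat):
    """Sum the per-word log weights of Formula 2 over the context."""
    total = 0.0
    for word in context:
        p = word_prob[category].get(word)
        if p is None:
            # Word unseen for this category: use log(1 / N(TCat)),
            # following Formula 3 (N(TCat) interpreted as an assumption).
            total += math.log(1.0 / n_cat[category])
        else:
            total += math.log(p * prior[category] / word_marginal[word])
    return total

def predict(context, word_prob, prior, word_marginal, n_cat):
    """Return the category with the highest score."""
    return max(
        word_prob,
        key=lambda cat: score(context, cat, word_prob, prior, word_marginal, n_cat),
    )

# Toy probabilities, invented for illustration only.
word_prob = {
    "animal": {"hunts": 0.6, "the": 0.4},
    "dairy": {"cheese": 0.8, "the": 0.2},
}
prior = {"animal": 0.5, "dairy": 0.5}
word_marginal = {"hunts": 0.3, "the": 0.3, "cheese": 0.4}
n_cat = {"animal": 5, "dairy": 5}

context = ["the", "hunts", "quickly"]
best = predict(context, word_prob, prior, word_marginal, n_cat)
```

The fallback gives an unseen word a finite penalty, so a single unknown word lowers a category's score without excluding the category outright.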
Word Sense Disambiguation of Persian Homographs