Automatic Alignment of Persian and English Lexical
Resources: A Structural-Linguistic Approach
Rahim Dehkharghani and Mehrnoush Shamsfard
Natural Language Processing Laboratory, Shahid Beheshti University, Tehran, Iran
Abstract. Cross-lingual mapping of linguistic resources such as corpora,
ontologies, lexicons and thesauri is very important for developing cross-lingual
(CL) applications such as machine translation, CL information retrieval and
question answering. Developing mapping techniques for lexical ontologies of
different languages is not only important for inter-lingual tasks but can also be
applied to build lexical ontologies for a new language based on existing ones.
In this paper we propose a two-phase approach for mapping a Persian lexical
resource to Princeton's WordNet. In the first phase, Persian words are mapped
to WordNet synsets using heuristically improved linguistic approaches. In the
second phase, the previous mappings are evaluated (accepted or rejected)
according to the structural similarities of WordNet and the Persian thesaurus.
Although we applied it to Persian, our proposed approach, the SBU
methodology, is language independent. As there is no lexical ontology for
Persian, our approach also helps in building one for this language.
1 Introduction
WordNet is a rich computational linguistic resource for Natural Language Processing
(NLP) used in machine translation, internet searches, document classification,
information retrieval, question answering and many web applications. Existing
lexical resources are useful for constructing a WordNet for a given language;
English WordNet (Princeton's WordNet) has been employed for the automatic
creation of WordNets for many other languages. On the other hand, cross-lingual
mapping between the WordNets of different languages enables many cross-lingual
applications like the ones mentioned above.
In this paper, we propose a new approach for mapping the senses of Persian words
to WordNet synsets. This approach includes two phases. In the first phase, words
of the source language (Persian) are mapped to WordNet synsets. The associations
of the first phase are not fully reliable; therefore, we need help from other
sources. In the second phase, the extracted mappings are accepted or rejected
according to the hierarchies of English WordNet and the Persian thesaurus.
This paper is organized as follows: in Section 2, previous related work is
described; Section 3 introduces our suggested approach; Section 4 presents some
experimental results; finally, in Section 5, conclusions and future work are discussed.
2 Related Works
Daudé and colleagues [1] presented a new and robust approach for mapping
multilingual hierarchies. They applied a constraint satisfaction algorithm
(relaxation labeling) to select the best match for a node of one hierarchy among
all the candidate nodes on the other side, taking advantage of the hypernymy and
hyponymy relations in the hierarchies. The following year, the same group [2]
applied their approach to mapping the nominal part of WordNet 1.5 to WordNet 1.6
with high precision.
Lee and colleagues [3] presented the automatic construction of a Korean WordNet
from existing lexical resources in 2000. Six automatic WSD (Word Sense
Disambiguation) techniques were used for mapping Korean words collected from a
bilingual MRD (Machine Readable Dictionary) to English WordNet synsets. These
techniques use the synonymy relations, IS-A relations and glosses of synsets in
WordNet. Machine Learning methods were used to combine the six techniques.
Mihaltz and Proszeky [4] presented the results of creating the nominal database
of the Hungarian WordNet. They presented 9 different automatic methods developed
for mapping Hungarian nouns to WordNet 1.6 synsets. Those methods exploit
bilingual and monolingual dictionaries and Princeton's WordNet. Synonymy and
hypernymy relations and co-occurring words were used for this mapping.
Ramanand and colleagues [5] presented observations on structural properties of
the WordNets of three languages: English, Hindi and Marathi. They reported their
work on mapping English, Hindi and Marathi synsets. The translations of Hindi
words and the equivalent synsets of each translation in English WordNet were
obtained. Then the similarity of two synsets in the Hindi and English hierarchies
is computed by a formula considering the levels of the Hindi and English synsets.
They took advantage of synonymy and hypernymy relations.
Rodriguez and colleagues [6] focused on the semi-automatic extension of Arabic
WordNet (AWN) using lexical and morphological rules and applying Bayesian
inference. The AWN project was finished in 2008, but this group constructed a
small part of it semi-automatically. In this research, a novel approach to
extending AWN is presented whereby a Bayesian Network is automatically built from
the graph and then used as an inference mechanism for scoring the set of
candidate associations between Arabic verbs and WordNet synsets.
Farreres [7-10] proposed a two-phase methodology for bilingual mapping of
ontologies (a Spanish thesaurus to English WordNet as a case study). This work is
among the most comprehensive in the area of bilingual ontology mapping. Its
result was the automatic construction of the nominal part of the Spanish WordNet,
and it was also used in the semi-automatic construction of Arabic WordNet [6].
Therefore, we chose it as a base and made some improvements on it. The two
methodologies (Farreres' and SBU) are compared with each other in sub-section 3.3.
Farreres' methodology is structured as a sequence of two processes. The aim of
the first phase, which is based on earlier work from 1997 [11], is mapping
Spanish words (SWs) to WordNet synsets (WNSs) using 17 similarity methods. To
compose these similarity methods, Farreres used the Logistic Regression model,
obtaining a formula whose input is an association between an SW and a WNS and
whose output is the correctness probability of that association. The second phase
takes advantage of the Spanish thesaurus and WordNet hierarchies to accept or
reject the associations produced in the first phase. The mapping in the second
phase is sense-to-synset instead of word-to-synset.
The current work is the next generation of the work introduced in [12]. In our
previous work, we mapped Persian words to English synsets in a different way,
using a heuristically improved translation- and dictionary-based method.
3 Our Suggested Approach
Our goal is to find the most appropriate English synset(s) to be mapped to
Persian thesaurus nodes. The suggested approach is language independent: it can
be applied to any pair of languages, and we used the Persian-English pair as a
case study.
This approach takes advantage of some existing resources in the source (Persian)
and target (English) languages. The essential resources are bilingual
Persian-English and English-Persian dictionaries, a monolingual Persian
dictionary, a Persian thesaurus and English WordNet. We used the Aryanpour
dictionary [13] (including 252,864 entries) as the Persian-English and
English-Persian dictionary, the Sokhan dictionary [14] (including about 116,000
entries) as the Persian-Persian dictionary, and WordNet 2.1.
The schema of the two phases of the approach is shown in Fig. 1.
Fig. 1. Processes of first and second phases.
In this figure and the rest of the paper, PW stands for Persian Word, EW for
English Word, PS for Persian Sense, WNS for WordNet Synset, WtS for Word-to-
Synset association and StS for Sense-to-Synset association. As shown in Fig. 1, the
first phase includes finding English translations of each PW using a bilingual
dictionary and also equivalent synsets of each translation. In the second phase,
Persian senses, instead of Persian words, should be mapped to WordNet synsets.
3.1 First Phase: Mapping Persian Words (PW) to WordNet Synsets (WNS)
Translations of each PW, should be found in a bilingual dictionary. Also for each
English translation (EW) of PW, its synsets in WordNet should be obtained. There are
many candidate synsets (WNSs) for each PW in WordNet the majority of which is not
57
appropriate for PW. The bilingual dictionary provides 4.93 English translations on
average for each Persian noun. These English translations correspond to 3.21 WordNet
synsets on average.
So we should specify truth probability of associations between PW
and WNSs using some similarity methods.
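As an illustration only, the candidate-generation step might look as follows in
Python, with NLTK's WordNet interface standing in for WordNet 2.1 and a
hypothetical stub in place of the bilingual dictionary:

    from nltk.corpus import wordnet as wn

    # Hypothetical bilingual lookup: romanized Persian word -> translations.
    bilingual = {"avaz": ["song", "tune"]}

    def candidate_synsets(pw):
        """Collect every WordNet synset of every English translation of pw."""
        candidates = set()
        for ew in bilingual.get(pw, []):
            candidates.update(wn.synsets(ew, pos=wn.NOUN))
        return candidates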
Similarity Methods. Similarity methods are divided into three main groups
regarding the kind of knowledge sources involved in the process: Classification
methods, Structural methods and Conceptual Distance methods.
Classification methods include two subgroups, namely, monosemous and polysemous.
Monosemous Group. English words in this group have only one synset in WordNet.
The four monosemous methods are described below:
Mono1 (1:1): A Persian word has only one English translation, and that English
word has the Persian word as its unique translation.
Mono2 (1:N, N>1): A Persian word has more than one English translation, and each
of these English words has the Persian word as its unique translation.
Mono3 (N:1, N>1): Several Persian words have the same translation EW, and the
English word EW has several translations to Persian.
Mono4 (M:N, M,N>1): Several Persian words have several English translations each,
and the English words also have several translations to Persian; note that at
least two of the Persian words share several English translations.
Polysemous Group. English words in this group have several synsets in WordNet.
The polysemous methods mirror the monosemous ones, so we do not expand them here
to avoid repetition.
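To make these cardinality patterns concrete, the toy Python sketch below
classifies a word's translation links; the two lookup tables are invented
placeholders, not the Aryanpour data:

    # Toy lookup tables (Persian -> English, English -> Persian).
    fa_to_en = {"pw1": ["ew1"], "pw2": ["ew2", "ew3"], "pw3": ["ew3"]}
    en_to_fa = {"ew1": ["pw1"], "ew2": ["pw2"], "ew3": ["pw2", "pw3"]}

    def cardinality(pw):
        """Classify the translation links of pw as 1:1, 1:N, N:1 or M:N."""
        ews = fa_to_en.get(pw, [])
        unique_back = all(en_to_fa.get(ew) == [pw] for ew in ews)
        if len(ews) == 1:
            return "Mono1 (1:1)" if unique_back else "Mono3 (N:1)"
        if unique_back:
            return "Mono2 (1:N)"
        # Simplified: the full Mono4 case additionally requires at least
        # two Persian words sharing several English translations.
        return "Mono4 (M:N)"

    print(cardinality("pw1"), cardinality("pw3"))  # Mono1 (1:1) Mono3 (N:1)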
Structural Methods are based on the comparison of the taxonomic relations between
WordNet synsets. The four structural methods are as follows:
Intersection Method: If the English translations share at least one common synset
in WordNet, the probability of associating the Persian word with the common
synsets increases.
Brother Method: If some synsets of the English translations are brothers (they
have a common father), the probability of associating the Persian word with the
brother synsets increases.
Ancestor Method: If some synsets are ancestors of another synset, the probability
of associating the Persian word with the hyponym synset increases.
Child Method: If some synsets are descendants of another synset, the probability
of associating the Persian word with the hypernym synset increases.
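A rough NLTK-based sketch of the Intersection and Brother checks (nouns only, for
simplicity; the synset lists come from WordNet, not from the bilingual resources
used in the paper):

    from itertools import combinations
    from nltk.corpus import wordnet as wn

    def intersection_synsets(ews):
        """Synsets shared by at least two of the English translations."""
        shared = set()
        for a, b in combinations(ews, 2):
            shared |= set(wn.synsets(a, wn.NOUN)) & set(wn.synsets(b, wn.NOUN))
        return shared

    def are_brothers(s, t):
        """True when two distinct synsets share an immediate hypernym."""
        return s != t and bool(set(s.hypernyms()) & set(t.hypernyms()))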
Conceptual Distance Methods are based on the semantic closeness of synsets in
WordNet. There are many formulas for computing the conceptual distance (CD)
between two concepts (words or synsets); for example, it is defined in [15] as
the length of the shortest path between the two concepts in a hierarchy [10]. We
used equation 1 [3, 16] for computing the semantic similarity between two
concepts:

    sim(s, t) = 2 · depth(LCA(s, t)) / (depth(s) + depth(t))    (1)
s and t are synsets; sim(s, t) is the semantic similarity of s and t; depth(x) is
the depth of synset x with respect to the root of the WordNet hierarchy (the node
"entity" for nouns); and finally LCA(s, t) is the Least Common Ancestor of s and
t, i.e., the ancestor of s and t which is deepest in the WordNet hierarchy.
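For illustration, equation 1 can be computed with NLTK's WordNet interface, as in
the minimal sketch below; NLTK's built-in wup_similarity implements essentially
the same measure with slightly different depth conventions:

    from nltk.corpus import wordnet as wn

    def sim(s, t):
        """Equation 1: 2 * depth(LCA(s, t)) / (depth(s) + depth(t))."""
        lcas = s.lowest_common_hypernyms(t)
        denom = s.min_depth() + t.min_depth()
        if not lcas or denom == 0:
            return 0.0
        lca = max(lcas, key=lambda h: h.min_depth())
        return 2.0 * lca.min_depth() / denom

    print(sim(wn.synset('dog.n.01'), wn.synset('cat.n.01')))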
Two implications of equation 1 are that (a) deep synsets have higher semantic
similarity to each other than shallow ones, and (b) a shorter path between s and
t yields a higher semantic similarity. The CD methods are divided into four:
CD1 Method: This method uses the words which have a related-to relation with the
PW. If some synsets of the PW are semantically closer to some synsets of the
related-to words, the probability of associating the Persian word with its closer
synsets increases. The related-to words are extracted from the Fararooy thesaurus
[17], which provides only "related-to" relations between Persian words and
contains about 7500 groups of related words. For example, the word آموزگار
(amoozgar, instructor; here and below, the first item in parentheses is the
pronunciation of the Persian term and the second its English translation) is
related to استاد (ostad, master).
CD2 Method: This method uses the genus word(s) of the PW; in fact, the genus is
one of the hypernyms of the PW. If some synsets of the PW are semantically closer
to some synsets of the genus words, the probability of associating the Persian
word with its closer synsets increases. For example, the Sokhan dictionary
defines the Persian word آواز (avaz, song) as صدایی که... (sedayi ke …, the sound
that …). So the term صدا (seda, sound) is the genus of آواز (avaz, song), and
آواز is a kind of صدا.
CD3 Method: This method is based on the semantic similarity among the candidate
synsets of the PW. If some synsets of the PW are semantically closer to all the
other candidate synsets, the probability of associating the Persian word with
those closer synsets increases.
CD4 Method: This method uses the semantic label(s) of the Persian word, which
indicate the domain of the PW. If some synsets of the PW are semantically closer
to some synsets of the semantic label(s), the probability of associating the
Persian word with its closer synsets increases. For example, the Sokhan
dictionary defines the Persian word اردک (ordak, duck) as (جانوری) پرنده‌ای که...
((janevari) parandeyi ke …, (animal) a bird that …); the term جانوری (janevari,
animal) is then the semantic label of اردک (ordak, duck).
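For illustration only, the genus (CD2) and semantic-label (CD4) extraction can be
sketched roughly as below; the assumed gloss shape '(label) genus that …' mirrors
the duck example, and real Sokhan glosses would need proper Persian parsing:

    import re

    STOP = {"a", "an", "the"}

    def parse_gloss(gloss):
        """Split a gloss into (semantic_label, genus).

        Assumes an optional parenthesized domain label followed by a
        definition whose first non-article word approximates the genus.
        """
        label, body = None, gloss
        m = re.match(r"\((?P<label>[^)]+)\)\s*(?P<body>.*)", gloss)
        if m:
            label, body = m.group("label"), m.group("body")
        words = [w for w in body.split() if w.lower() not in STOP]
        return label, (words[0] if words else None)

    print(parse_gloss("(animal) a bird that swims"))  # ('animal', 'bird')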
Presentation of Similarities. We used the vector (PW-EW-WNS, m_1, m_2, …, m_16,
AccOrRej) to represent the associations between PWs and WNSs according to the
English translations (EWs) of the PW. Each m_i (the value of the i-th method) can
be assigned a value between 0 and 5 to indicate the intensity of that method in
the association: the value 0 indicates that the method is not applicable to the
association, and the values 1 to 5 indicate the intensity of its application.
AccOrRej indicates whether the association is accepted or rejected.
Composition of Methods. Now some questions come to mind: Are all of the methods
useful? Should they be independent? How important is each of them? How can we
specify their coefficients for computing the final similarity?
The coefficient of each method in the final probability-computation equation
should be specified. The input of our methodology is an association between a PW
and a synset, represented by its vector of 16 values, and the output is the
correctness probability of that association. To achieve this goal, we took
advantage of the Logistic Regression model [18]. Equation 2 is the formula
computing the correctness probability of an association (P(ok)) using Logistic
Regression:

    P(ok) = 1 / (1 + e^−(β_0 + Σ_{i=1..16} β_i · m_i))    (2)
β_i is the coefficient of the i-th method, while β_0 is a constant. The higher
the value of β_i, the higher the impact of m_i on the probability computation;
m_i is the value of the i-th method in the association.
We used SPSS as a statistical tool for Logistic Regression. For the training
step, we first applied our methodology to 150 Persian words. Having computed the
vector of each association, about 2500 associations between Persian words and
WordNet synsets were created. For regressing these associations, it was necessary
to enter only some of them, together with their correctness probabilities
obtained by human evaluation, into SPSS. Of course, the more associations given
to SPSS, the more accurate the computed coefficients. SPSS estimates the
coefficients according to the correctness probabilities of the given
associations. For this purpose we classified the associations into groups having
the same vector, which yielded about 120 groups. Groups having fewer than 5
vectors were eliminated because their effect on the regression was very low. We
then accepted or rejected each association of each group. For example, the vector
0000000104400111 was accepted in 40 cases and rejected in 10 cases, so its
correctness probability by human evaluation is 40/50 = 80%. After computing this
probability for each vector, we entered them into SPSS.
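The same kind of coefficient estimation can be reproduced outside SPSS. The
sketch below fits equation 2 with scikit-learn on made-up data; the vectors and
labels are random placeholders standing in for the human-evaluated associations,
and the grouped-probability scheme described above is simplified to
per-association binary labels:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder data: one row of 16 method intensities (0-5) per
    # association, labelled 1 (accepted) or 0 (rejected) by a human.
    X = np.random.randint(0, 6, size=(2500, 16))
    y = np.random.randint(0, 2, size=2500)

    model = LogisticRegression(max_iter=1000).fit(X, y)
    beta0, betas = model.intercept_[0], model.coef_[0]  # equation 2 coefficients

    p_ok = model.predict_proba(X[:1])[0, 1]  # P(ok) for a new association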
3.2 Second Phase: Taxonomy Matching
After the first phase ended, candidate synsets with different probabilities had
been obtained for each Persian word. In this phase, Persian senses, instead of
Persian words, should be mapped to WordNet synsets by exploiting taxonomy
matching. To do the taxonomy matching we need hypernymy and hyponymy relations
between senses. We used about 200 Persian senses from the nominal part of FarsNet
(an ongoing project to develop a Persian WordNet) as the test data.
We used a sense-level taxonomy instead of a word-level one. The sense level is
more suitable for this mapping for two reasons: 1) there are usually several
parents for a word in a word-level taxonomy (one parent per sense) [9]; and 2)
using the sense-level taxonomy, we can specify on the basis of which sense of the
PW it is associated with a WNS. Therefore, the mapping in this phase is more
complicated than in the first phase, because the sense of the PW in an
association should be specified.
The candidate synsets of a sense PS are the union of the candidate synsets of all
the PWs which are members of PS; synsets shared by several PWs are counted only
once.
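In other words, the per-sense candidates are simply a set union over the member
words. A minimal sketch, reusing the hypothetical candidate_synsets function from
the first-phase example:

    def sense_candidates(ps_members):
        """Union of the candidate synsets of all Persian words in a sense.

        Using a Python set keeps synsets shared by several words only once.
        """
        cands = set()
        for pw in ps_members:
            cands |= candidate_synsets(pw)
        return cands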
As Fig. 1 showed, the aim of the second phase is converting some WtSs into StSs.
A WtS is a weighted association between a Persian word and a WordNet synset [8],
having a probability value as its weight, while an StS is a Sense-to-Synset
association. As mentioned before, a PW may be repeated in one or more PSs
depending on the number of its meanings (senses). A PW is called monosemous if it
has only one sense, and polysemous if it has several senses. Obviously, the
mapping process is easier for monosemous PWs.
There are one or more candidate synsets (usually more than one) for each PS. The
problem lies in finding the most appropriate WNS(s) for each PS. To evaluate the
StSs, we took advantage of the results obtained in the first phase and of the
hierarchical similarities of the Persian and English branches in the second
phase. The parameters considered in the evaluation of an StS are nconn, level_p,
level_e, gap and prob (the correctness probability of the WtS). level_p is the
level of the Persian sense with respect to the base sense; level_e is the level
of the WordNet synset with respect to the base synset; nconn is the total number
of associations between the Persian senses (including the base sense and its
hypernyms) and the WordNet synsets (including the base synset and its hypernyms).
A gap in a Persian branch is a sub-branch without associations which separates
two sub-branches with associations.
Suggested Algorithm. We present a formula to compute the correctness probability
of each StS (equations 3 and 4):

    FinalProb = ( Σ_{l_p=0..5} Σ_{l_e=0..10} [ Co_p·(6 − l_p) + Co_e·(11 − l_e) ] · prob(l_p, l_e) − GapValue ) / 417    (3)

where

    GapValue = Σ_{g ∈ Gaps} Co_g · (6 − l_g)    (4)
l_p and l_e are the levels of the PS and the WNS respectively, and prob(l_p, l_e)
is the correctness probability (P(ok)), obtained in the first phase, of the
association between the sense at level l_p and the synset at level l_e (zero when
there is no such association). Co_p and Co_e are the coefficients of the Persian
and English branches, indicating the importance (influence) of each branch in the
value of FinalProb. Since gaps reduce the correctness probability of an StS, we
lower the value of FinalProb for each gap according to its level l_g; GapValue is
the summation of the gaps' scores (equation 4). Because we need a value between 0
and 1, FinalProb is divided by 417, the number of all possible associations
between six Persian senses (one base sense and five hypernyms) and 11 synsets
(one base synset and ten hypernyms). The English branch is more complete and
enriched than the Persian one, and considering more than 5 and 10 hypernyms in
the Persian and English branches respectively does not noticeably increase the
precision of the mapping; therefore, we used five and ten hypernyms in the
Persian and English branches. The values of Co_p, Co_e, Co_g and the threshold
were computed by conducting many experiments: we evaluated the associations with
many values of these parameters and obtained 2.1, 0.8, 0.08 and 0.15 for Co_p,
Co_e, Co_g and the threshold as the most appropriate values.
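For concreteness, the scoring can be sketched as below; this follows the
reconstruction of equations 3 and 4 given above (the exact weighting in the
original system may differ), with the reported parameter values plugged in:

    CO_P, CO_E, CO_G, THRESHOLD = 2.1, 0.8, 0.08, 0.15

    def final_prob(connections, gaps):
        """Equations 3-4: score one StS from its branch connections and gaps.

        connections: (l_p, l_e, prob) triples, one per WtS between the
        Persian branch (levels 0-5) and the English branch (levels 0-10).
        gaps: the levels of the gaps found in the Persian branch.
        """
        total = sum((CO_P * (6 - lp) + CO_E * (11 - le)) * prob
                    for lp, le, prob in connections)
        gap_value = sum(CO_G * (6 - lg) for lg in gaps)
        return (total - gap_value) / 417.0

    accept = final_prob([(0, 0, 0.8), (1, 2, 0.6)], gaps=[]) > THRESHOLD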
The system should decide which associations can be accepted and which should be
rejected. In the second phase, given a PW, the steps of the SBU methodology are
as below:
- Specify the PSs of PW.
- Specify the total number of WtSs from PSs to WNSs.
- If PW is monosemous:
  o If PW has only one WtS, it is accepted as the StS for the PS which includes
    the PW.
  o If PW has several WtSs, the ones which have a FinalProb value higher than the
    threshold (0.147) are accepted and the rest are rejected.
- If PW is polysemous, it is repeated in several PSs. For each WtS_i of PW in
  PS_i, there are equivalent WtSs in other senses (e.g., PS_j) of PW, which share
  WNSs with WtS_i. Then, only the WtS which has the highest FinalProb is accepted
  as an StS. Of course, the FinalProb of the accepted WtS has to be higher than
  the threshold; otherwise it is rejected.
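These steps translate into a short selection routine. The sketch below is an
illustration only: it assumes the hypothetical final_prob scoring above and
represents each PW as a mapping from its senses to weighted candidate synsets:

    THRESHOLD2 = 0.147  # second-phase acceptance threshold

    def select_sts(pw_senses):
        """pw_senses: {ps_id: [(synset, final_prob), ...]} for one PW."""
        accepted = []
        if len(pw_senses) == 1:                    # monosemous PW
            (ps, wts), = pw_senses.items()
            if len(wts) == 1:
                accepted.append((ps, wts[0][0]))
            else:
                accepted += [(ps, s) for s, p in wts if p > THRESHOLD2]
        else:                                      # polysemous PW
            for ps, wts in pw_senses.items():
                for synset, p in wts:
                    # Among equivalent WtSs (same synset in other senses),
                    # keep only the strongest one, and only above threshold.
                    best = max(q for other in pw_senses.values()
                               for s2, q in other if s2 == synset)
                    if p == best and p > THRESHOLD2:
                        accepted.append((ps, synset))
        return accepted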
3.3 Comparison of SBU and Farreres' Methodologies
The differences of the first phase in the two methodologies are:
- We merged two of Farreres' methods, Father (immediate hypernym) and Distant
  (non-immediate hypernym), into the Ancestor method.
- We applied the Child method in a different way from the Father and Distant
  methods, while in Farreres' methodology they are not separated. Separating the
  Ancestor and Child methods steers associations toward hypernym synsets with
  general meanings or hyponym synsets with specific meanings. This steering is
  achieved in the training step: the mapping system learns which hypernym or
  hyponym associations are more important than others.
- We utilized the words having a "related-to" relation with the PW instead of the
  co-occurrence relation, because in most cases the "related-to" words were
  closer to the PW than co-occurring words. We used the Fararooy thesaurus for
  extracting the "related-to" words of a PW.
- We took advantage of the CD3 method only for synsets that do not have Brother,
  Ancestor or Child relations with other synsets. Having these relations creates
  dependency between methods, while the methods must be independent of each
  other. Since the statistical Logistic Regression method is used for estimating
  the coefficient (importance) of each method, this dependency prevents exact
  estimation of the coefficients. This limitation was not considered in Farreres'
  methodology.
- In Farreres' methodology there is a Hybrid similarity group consisting of the
  Variant and Field methods. The Variant method is the inverse case of the
  Intersection method in the Structural group: Intersection starts from the PW to
  arrive at the WNS, while Variant starts from the WNS to arrive at the PW. The
  dependency of these two methods is another drawback of Farreres' methodology;
  therefore, we eliminated the Variant method in our approach. We also eliminated
  the Field method, which is the same as the CD1 method but uses the semantic
  label(s) of the SW, which indicate the domain of the SW.
- We used the numbers 0 to 5 instead of 0 and 1 to represent the intensity of
  each method. In other words, the two values 0 and 1 were replaced by six values
  from 0 to 5. For example, in the Intersection method, if two, three or four
  English translations of the PW share a synset, the values 1, 2 or 3 are
  assigned to m_9 respectively, while in Farreres' methodology there is no
  difference between these cases and the value of m_9 is 1 in all of them.
In the second phase, Farreres proposed a bootstrapping with the following steps,
in which for each association PRB denotes the pair of related branches:
- A disconnected PRB is a good indicator of the incorrectness of the originating
  StS.
- A connected PRB is, on the other side, an indicator that the base StS would be
  correct, although in many cases it is not enough evidence; some factors (nconn,
  level, gaps) can be extracted to filter those connected PRBs. It means that the
  PRBs having the lowest level, the highest nconn and no gap would be accepted.
- Other PRBs are left pending to the next iteration because there is not enough
  evidence for accepting or rejecting them.
The differences of the two methodologies in the second phase are as follows:
- Farreres' methodology needs the total number of nodes of a thesaurus for the
  mapping, because an StS can be accepted based on pre-accepted StSs. Our
  methodology does not need it, only the PRB which is to be evaluated. One of the
  advantages of our methodology is that it can be used for languages which do not
  have a complete thesaurus, because the suggested approach evaluates an StS
  independently of the other ones. Of course, pre-accepted StSs help to map more
  exactly, but the lack of them does not limit the methodology.
- The associations of the hypernym senses of the Spanish and English branches in
  Farreres' methodology are treated as 0 or 1, not as a probability value. In the
  SBU methodology they are treated as a value between 0 and 1, so more precise
  values of the final correctness probability of an StS can be computed.
- The parameter level in Farreres' methodology is the level of the first sense
  connected from the base sense to the WordNet branch, whereas we studied all
  levels of associations between the two branches, on both the Persian (level_p)
  and the WordNet (level_e) side.
- In Farreres' methodology only the number of ancestors in the Spanish branch
  with some association to the WordNet branch is considered as nconn, but in the
  SBU methodology the total number of associations between PSs and WNSs is
  studied as nconn, which covers Farreres' definition of nconn too.
- Some specific techniques used in the SBU methodology could not be considered in
  Farreres' methodology. For example, when there are several associations from a
  PS_i to a WNS_j (their levels are higher than 0), several probabilities exist
  between PS_i and WNS_j. What is done in the SBU methodology in these cases is
  to use the arithmetic mean (average) of the probabilities. Since in Farreres'
  methodology there is no probability on hypernym associations, no decision is
  needed for this case.
4 Evaluation
In the first phase, the coefficients shown in Table 1 were obtained for the two
methodologies.
According to those coefficients, different precision and recall values were
obtained depending on the threshold value. The highest precision and recall
values for the SBU methodology are 0.62 and 0.7, and for the Farreres-based
methodology 0.61 and 0.57 respectively, considering 0.38 as the threshold.
Table 1. Comparison of similarity methods and their coefficients in the two
methodologies.

                                  Farreres-based            SBU
Category              β_i        method      coeff.        method      coeff.
Classification        β_0        (constant)  -2.291        (constant)  -3.505
                      β_1        Mono1        0            Mono1        0
                      β_2        Mono2        0            Mono2        0
                      β_3        Mono3        0.3          Mono3        1.551
                      β_4        Mono4       -0.301        Mono4        0
                      β_5        Poly1       22.037        Poly1        0
                      β_6        Poly2        0            Poly2        0
                      β_7        Poly3       -0.30         Poly3        0.685
                      β_8        Poly4       -0.86         Poly4        0
Structural            β_9        Inters.      1.628        Inters.      1.364
                      β_10       Brother      0.503        Brother      0.639
                      β_11       Father       0.973        Ancestor     0.311
                      β_12       Distant      0.302        Child        0.974
Conceptual Distance   β_13       CD1          0.137        CD1          0.673
                      β_14       CD2          1.054        CD2          0.408
                      β_15       CD3          0.403        CD3         -2.140
                      β_16       -            -            CD4          0.177
Hybrid                (β_16)     Variant      0            -            -
                      β_17       Field       -0.531        -            -
In the second phase, an algorithm and a formula were presented. According to the
values obtained for Co_p, Co_e, Co_g and the threshold, our highest precision and
recall values were 0.69 and 0.73 respectively. Since the second phase of
Farreres' methodology needs a complete thesaurus to do the mapping, it is
impossible to implement it for languages like Persian. Therefore, we compared the
evaluation of our methodology with an adaptation of Farreres' methodology
according to the parameters used by the methodologies under discussion.
With the comparisons above, our suggested formula (equation 3) transforms into
equation 5 in the Farreres-based methodology. Note that the GapValue of the
Farreres-based methodology is computed by equation 4. In his methodology the
definitions of level and nconn are different from those in ours, the coefficient
of level_e is ignored and some specific points are not considered:

    FinalProb = ( Σ_{connections} Co_level · (6 − level) · prob − GapValue ) / 15    (5)

After conducting many experiments, the values 0.60, 0.64 and 0.43 were obtained
for the parameters Co_level, Co_g and the threshold in the Farreres-based
setting. According to the values of these three parameters, the highest precision
and recall values in the Farreres-based methodology, obtained by our suggested
algorithm and equation 5, were 0.63 and 0.62, while the highest values of
precision and recall in our methodology were 0.69 and 0.73 respectively (Table 2).
Table 2. Comparison of Final precision and Recall of two methodologies.
RecallPrecision Threshold
0.730.690.21SBU
0.620.630.43 SBU- Farreres based
5 Conclusion and Future Works
In this paper we proposed a methodology for mapping bilingual lexical ontologies
based on Farreres' methodology. The source ontology is a thesaurus in some
language and the destination one is English WordNet. The suggested methodology is
language independent; however, we used a Persian thesaurus as the source
ontology. This methodology is composed of two phases: a) mapping the words of the
source language to WordNet synsets and b) acceptance or rejection of the
associations between source thesaurus senses and WordNet synsets using the
associations of the first phase. The recall values we obtained during the first
phase of our methodology were higher than those obtained in Farreres', while
precision was almost the same.
In the first phase, we took advantage of 16 similarity methods indicating how
similar a Persian word is to each of its candidate synsets. We obtained the
coefficients (importance) of the methods used in computing the correctness
probability of each association by the Logistic Regression model. In this phase,
we obtained a formula whose input is an association between a Persian word and a
WordNet synset and whose output is the correctness probability of this
association. In the second phase, we took advantage of the associations between
hypernym senses and hypernym synsets of a given base association. Actually, the
similarities of concepts between the two branches of the thesaurus and WordNet
hierarchies helped us decide which synset(s) are the most appropriate
candidate(s) for a given sense of the source language. In the end, the values
0.69 and 0.73 were obtained for precision and recall respectively.
For future work, we will investigate whether the second phase can be done by
applying a Bayesian Network, which gives a formula to compute the final
correctness probability of an StS (sense-to-synset association) according to its
hypernym associations.
References
1. Daudé, J., Padró, L., Rigau, G.: Mapping multilingual hierarchies using
relaxation labeling. In Proceedings of the Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC'99),
Maryland (1999).
2. Daudé, J., Padró, L., Rigau, G.: Mapping WordNets using structural
information. In Proceedings of the 38th Annual Meeting of the Association for
Computational Linguistics, Hong Kong, China (2000).
3. Lee, C., Lee, G., Jung Yun, S.: Automatic WordNet mapping using word sense
disambiguation. In Proceedings of the Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 2000),
Hong Kong (2000).
4. Mihaltz, M., Prószéky, G.: Results and Evaluation of Hungarian Nominal WordNet
v1.0. In Proceedings of the Second International WordNet Conference (GWC 2004),
Brno, Czech Republic (2004).
5. Lebart, L.: Traitement Statistique des Données. DUNOD, Paris (1990).
6. Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Antonia
Martí, M.: Arabic WordNet: Semi-automatic Extensions using Bayesian Inference. In
Proceedings of the 6th Conference on Language Resources and Evaluation (LREC
2008), Marrakech, Morocco (2008).
7. Farreres, J., Gibert, K., Rodríguez, H.: Semiautomatic creation of taxonomies.
In Proceedings of the COLING 2002 Workshop "SemaNet'02: Building and Using
Semantic Networks", Taipei (2002).
8. Farreres, J., Gibert, K., Rodríguez, H.: Towards binding Spanish senses to
WordNet senses through taxonomy alignment. In Proceedings of the Second
International WordNet Conference, pages 259-264, Brno, Masaryk University (2003).
9. Farreres, J., Rodríguez, H.: Selecting the correct synset for a Spanish sense.
In Proceedings of the LREC International Conference, Lisbon, Portugal (2004).
10. Farreres, J.: Automatic Construction of Wide-Coverage Domain-Independent
Lexico-Conceptual Ontologies. PhD Thesis, Polytechnic University of Catalonia,
Barcelona (2005).
11. Atserias, J., Climent, S., Farreres, X., Rigau, G., Rodriguez, H.: Combining
multiple methods for the automatic construction of multilingual WordNets. In
Proceedings of the International Conference on Recent Advances in Natural
Language Processing (RANLP), Tzigov Chark, Bulgaria (1997).
12. Shamsfard, M.: Towards Semi-Automatic Construction of a Lexical Ontology for
Persian. In Proceedings of the 6th Conference on Language Resources and
Evaluation (LREC 2008), Marrakech, Morocco (2008).
13. Assi, M., Aryanpour, M.: Aryanpour English-Persian and Persian-English
dictionary. http://www.aryanpour.com
14. Anvari, H.: Persian-Persian Sokhan dictionary. Sokhan Pub., Tehran, Iran
(2002).
15. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application
of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics,
19(1):17-30 (1989).
16. Simpson, T., Dao, T.: WordNet-based semantic similarity measurement.
http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx (2007).
17. Fararooy, J.: Fararooy Persian Corpus. http://www.persianthesaurus.com
18. Ramanand, J., Ukey, A., Kiran Singh, B., Bhattacharyya, P.: Mapping and
Structural Analysis of Multi-lingual WordNets. IEEE Data Engineering Bulletin,
30(1):30-43 (2007).