Towards the Enrichment of Arabic WordNet with Big Corpora
Georges Lebboss¹, Gilles Bernard¹, Noureddine Aliane¹ and Mohammad Hajjar²
¹LIASD, Paris 8 University, Paris, France
²Lebanese University, IUT, Saida, Lebanon
Keywords: Semantic Relations, Semantic Arabic Resources, Arabic WordNet, Synsets, Arabic Corpus, Data Preprocessing, Word Vectors, Word Classification, Self Organizing Maps.
Abstract: This paper presents a method aiming to enrich Arabic WordNet with semantic clusters extracted from a large general corpus. As the Arabic language is poor in open digital linguistic resources, we built such a corpus (more than 7.5 billion words) with ad-hoc tools. We then applied GraPaVec, a new method for word vectorization using automatically generated frequency patterns, as well as the state-of-the-art Word2Vec and Glove methods. Word vectors were fed to a Self Organizing Map neural network model; the resulting clusterings were then compared with existing Arabic WordNet synsets (sets of synonymous words) for evaluation. The evaluation yields an F-score of 82.1% for GraPaVec, 55.1% for Word2Vec's Skipgram, 52.2% for CBOW and 56.6% for Glove, which at least shows the value of the context that GraPaVec takes into account. We end by discussing parameters and possible biases.
1 INTRODUCTION
The Arabic language is poor in open digital linguis-
tic resources, especially semantic ones. Work in the
field of automatic semantic analysis is not as devel-
oped as for European languages; improving such resources is an important goal for research on the Arabic language and semantics. Among these resources
we choose Arabic WordNet (Black et al., 2006; Ro-
driguez et al., 2008; Regragui et al., 2016), an open
semantic database where lexical items are organized
in synsets (sets of synonymous words), linked by
semantic relationships, based on WordNet (Miller,
1995), now version 2.1 (Miller et al., 2005). Arabic
WordNet (hereafter AWN) is still poor in words and
synsets and needs to be enriched.
The end-to-end system presented here (figure 1)
generates semantic word clusters computed from a
large general corpus (Lebboss, 2016). Existing methods (subsection 2.1) are based on dictionaries (either digitized paper ones or database dictionaries such as Wiktionary), on translation and aligned multilingual corpora, on WordNets and ontologies, on morphological parsing, or on combinations of those resources. None of them is based on a large general corpus.
Available Arabic corpora are small, and researchers
working on Arabic corpora usually have had to de-
vise their own. Our first step was to build the biggest possible open corpus (section 3), keeping in mind that it should be dynamically computed (so as to expand as much as possible as resources grow) and that the building tool should be freely available for researchers. The corpus built contains more than 7.5 billion words; it is by far the largest ever made for the Arabic language.
Figure 1: Global view of our system.
Arabic corpora usually are preprocessed by or-
thographic normalization and lemmatization. Ara-
bic lemmatization has been thoroughly analyzed by
Al Hajjar in his PhD thesis (Al Hajjar, 2010); we
chose the lemmatizer that, according to his evaluation, yielded the best results (Khoja et al., 2001).
The main issue is word vectorization. Methods
considered here are based upon distributional proper-
ties of words (subsection 2.2). The main character-
istic of GraPaVec as opposed to the state-of-the-art
methods is that the context taken into account is the
surrounding pattern of high frequency words rather
than a window of neighbouring lexical items or a skipgram of lexical items (section 4). In other words, we
keep what others throw away and throw away what
others keep.
Vectors are fed to a clustering algorithm. We have
chosen here the neural network model Self Organiz-
ing Maps (Kohonen, 1995), because of two advan-
tages: minimization of misclassification errors (mis-
classified items go to adjacent clusters) and easy vi-
sualization of the results. Those results are then eval-
uated by comparison with AWN synsets (section 6).
2 RELATED WORK
2.1 Building Semantic Resources
In 2008, three methods were proposed by the AWN
team.
One (Rodriguez et al., 2008) builds a bilingual
lexicon of <English word, Arabic word, POS> tu-
ples from several publicly available translation re-
sources. It merges in one set the base concepts
of EuroWordNet (1024 synsets) and Balkanet (8516
synsets). Keeping only the tuples whose English
word was included in the merged set, they produced
<English word, Arabic word, Concept> tuples. Ara-
bic words linked to the same concept were candidates to enter a synset in AWN. These candidates were to be validated by lexicographers; however, as of today only 64.5% have been processed, of which 74.2% were rejected as incorrect.
They obtained better results with another method
(Rodriguez et al., 2008), where they generated new
Arabic forms by morphological derivation from the words in AWN synsets, controlled their existence with databases such as the non-free GigaWord Arabic corpus, the Logos multilingual translation portal, or the New Mexico State University Arabic-English lexicon, and
used their translation to link them to WordNet synsets
and then back to AWN, to be validated by lexicog-
raphers. A similar method was proposed later (Al-
Barhamtoshy and Al-Jideebi, 2009); words were mor-
phologically hand-parsed by linguists, then translated
and associated to synsets with equivalence relations
between the synsets made explicit in the Inter-Lingual
Index deep structure (Vossen, 2004).
The third method (Alkhalifa and Rodriguez, 2008)
extracted named entities from Arabic Wikipedia,
linked them to named entities from the corresponding
English Wikipedia page, linked those to named enti-
ties from WordNet, and then back to synsets of AWN.
Though the result was much better (membership was correct up to 93.3%), the coverage was limited.
A different approach (Abouenour et al., 2008)
exported the entire set of data embedded in AWN
into a database integrated with Amine AWN ontol-
ogy, tapped by a Java module based on Amine Plat-
form APIs. This module used the mapping between
English synsets in WordNet and Suggested Upper
Merged Ontology (Niles and Pease, 2003) concepts to
build the Amine AWN type hierarchy. Then, it added
Arabic synonyms based on the links between Word-
Net synsets and AWN synsets.
Later the same team (Abouenour et al., 2010;
Abouenour et al., 2013) used YAGO (Yet Another
Great Ontology) from the Max Planck Institute, translating its named entities into Arabic with Google Translate, then added them to AWN according to two types
of mappings (direct mapping through WordNet, map-
ping through YAGO relations to AWN synsets).
Abdulhay's PhD thesis (Abdulhay, 2012) extracted semantic categories from a multilingual aligned corpus with English and two languages from EuroWordNet. If all words except the Arabic one were members of synsets linked by the Inter-Lingual Index, then the Arabic word should also be in a linked synset in AWN. Results were correct up to 84%.
Another team worked on iSPEDAL, a monolingual dictionary database built by digitizing monolingual paper dictionaries (Al Hajjar, 2010). Two methods
have been proposed (Hajjar et al., 2013) for enrich-
ing iSPEDAL. One used semi-structured information
from plain dictionaries to deduce links (synonymy,
antonymy). The other used translation by available
resources to and from a foreign language to com-
pute synonymy of Arabic words by correlating their
translations. A somewhat similar approach (Abde-
lali and Tlili-Guiassa, 2013) extracts synonymy and
antonymy relationships from Arabic Wiktionary.
Arabase platform (Raafat et al., 2013) aims to
integrate every available Arabic semantic resource,
from King Abdulaziz City for Science and Technol-
ogy database, to Arabic StopWords Sourceforge re-
source and AWN. It has, according to the authors, “a
good potential to interface with WordNet”. Arabase
computes, by hand-made rules, semantic properties of vocalized words (short vowels are not written in Arabic words in normal use and in the majority of documents) and forms a sort of virtual WordNet.
As one can see, research on Arabic semantic categories has extensively used foreign resources;
very little has been done on extracting semantic infor-
mation from Arabic data alone, and nothing based on
an Arabic general corpus.
2.2 Word Vectorization
Structural linguistics (Harris, 1954) postulated that
words with similar distributions have similar cate-
gories and meanings. Since (Salton et al., 1975)
first proposed it, projecting words in vector space has
been the first step in many models of word clustering,
where semantic properties could be linked to similar-
ities in the distribution of the word vectors.
In such models a distribution is defined by a vector
of the contexts of a word. A context is defined by two
elements: units and distance to the word. Units can be
words, phrases or ngrams, and more recently skipgrams (ngrams with "holes"). The distance can be a step function, as in the bag-of-words model (only words in the same document are nearby contexts), or a function
of the number of units separating word and context.
On the resulting matrix {words × contexts}
(where each component is the frequency of a word
in a context), mathematical models have been applied in order to reduce it to a set of clusters, from the simplest, tf-idf, to much more complex ones, such as latent semantic analysis, latent Dirichlet allocation, neural networks (various models), linear or bilinear regression models, and so on. Clustering models display quite
a big variety of reduction methods, which contrasts
with the poverty of context variety. Usually units are
lexical items or ngrams of lexical items and sets are
either documents or fixed-length windows.
Some examples: in Hyperspace Analogue to Lan-
guage (Lund et al., 1995), context is a fixed-length
window of lexical items around the word. In Web-
SOM (Honkela et al., 1997), a model with two lay-
ers of Self Organizing Map neural network, context
is a fixed-length window of lexical items around the
word. Among the rare exceptions to lexical-item contexts was a SOM classifier applied to a context of a fixed-length window of grammatical categories to the left of the word (Bernard, 1997).
Word2vec (Mikolov et al., 2013) is a set of two
unsupervised neural models, with log-linear classi-
fiers, where words and context vectors are trained si-
multaneously; in CBOW (Continuous Bag Of Words)
context is defined by a fixed-length window of lexical
items around the word, in Skipgram lexical items are
grouped in skipgrams, that is, ngrams with holes.
Glove (Global Vectors for Word Representation)
(Pennington et al., 2014) is a global log-bilinear re-
gression model designed by the NLP team at Stanford
University. Paraphrasing the authors, this model com-
bines the advantages of the global matrix factorization
with windowing local context methods. Context is a
fixed-length window of lexical items centered on the
word.
Both models represent the state of the art. One should note that, though they use fixed-length windows, they use a continuous function for distance, thus introducing a non-bag-of-words approach (even in CBOW) which had rarely been used before.
3 CORPUS
To extract semantic categories from general corpora,
a large corpus is needed. We merged all the available Arabic corpora and found the result to be small, even after adding Arabic Wikipedia and Wiktionary. This static corpus was the starting point of our large corpus, but the bulk comes from the Alshamela library online resource (http://shamela.ws) and, above all, from crawling and converting web sites (more than 120, mostly news web sites) and their documents.
Open-source web crawlers such as HTTrack failed to fit our purpose: no queuing, difficult resumption of downloads, thousands of blank pages in the result, and no easy way to convert documents on the fly. We created our own Arabic Corpus Builder, which crawls queues of sites, merges the result in plain text format with the outcome of the previous corpus, and imports it into a database. It also converts the usual encodings of Arabic characters (Microsoft and MacOS) into Unicode on the fly. It can be found at https://sites.google.com/site/georgeslebboss.
Our corpus is mostly dynamic, as the results of crawling vary over time. In its present state, it contains about 85,000 documents and is described in Table 1:
Table 1: General corpus.
Source                             Word number      Unique words
Static corpus                      207 878 809      3 589 374
Arabic Wikipedia + Arwiktionary      6 242 131      2 376 805
Alshamela library                1 862 000 347      4 007 846
Corpus Builder                   5 543 097 123      5 987 391
Total                            7 619 218 410      6 894 986
Arabic writing conventions, especially concerning vowels, entail that every word can have several written forms (not counting errors). Orthographic normalization is usual in Arabic language processing systems, even though it introduces ambiguities.
As the Arabic language has a rather complex morphology, especially in derivation, and as writing conventions do not separate morphemes inside accented words, lemmatization of Arabic text, though difficult, is much more useful than it is in European languages (with the exception of German).
In our case, orthographic normalization and lemmatization are left to the choice of the experimenter. We used the best lemmatizer according to the results of Al Hajjar (Al Hajjar, 2010), the Khoja lemmatizer (see section 1).
4 GRAPAVEC
For our own method of word vectorization, we explored the idea of semantic clustering built on grammatical context found in (Bernard, 1997), but with important modifications, mainly due to our aim to develop a method as independent from specific languages as possible. So the stopword list and the stopword categories used in that paper were out of the picture. The left window (or any fixed-length window for that matter) was also too restrictive, as we did not want to make any assumption about the order of parts of speech or the type of syntax rules. Instead, we wanted to empirically discover recurrent patterns of very general words.
So the context we take into account is composed
of (ordered) patterns of such words in the vicinity of a
given word, inside sets that are delimited by punctua-
tion markers. We called our method Grammatical Pat-
tern Vector, or GraPaVec, though the relation of this
algorithm to grammar is indirect (see subsection 4.2).
GraPaVec has four steps:
1. Trie preparation
2. Pattern element selection
3. Pattern building
4. Word vectorization
Figure 2: General view of GraPaVec.
4.1 Trie Preparation
We begin by importing every word in the corpus into a prefix Trie. A Trie is a structure that can represent a very large number of words in a format that is both economical and fast to explore; it is more efficient here than hash codes or binary trees. Its maximal depth is given by the longest word in the corpus and its maximal breadth by the number of possible characters at any point; as shown by (Harris, 1968), in language the number of successors is constrained, so the tree quickly shrinks.
Each path from the root to a leaf of the Trie represents a word (see figure 3). Each node contains a unicode character. A node is marked as a leaf the first time a word ends there, and its occurrence count is incremented each time a word ends there again. Thus a node is a simple structure with a unicode character, a field indicating the number of occurrences (if not zero, the node is a leaf), and pointers towards its sons and towards its brothers. A leaf can have sons, as words can be part of other words.
Figure 3: A Trie.
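As an illustration only (not the authors' implementation), here is a minimal Python sketch of the node structure just described, with a first-son / next-brother representation and an occurrence count marking leaves:

```python
class TrieNode:
    """Trie node: a character, an occurrence count (non-zero marks a leaf,
    i.e. a word ending here), a first son and a next brother."""
    def __init__(self, char=""):
        self.char = char            # unicode character stored in the node
        self.occurrences = 0        # number of words ending here (0 = not a leaf)
        self.son = None             # pointer towards its sons (first child)
        self.brother = None         # pointer towards its brothers (next sibling)

def insert(root, word):
    """Insert a word, creating nodes as needed and counting its occurrences."""
    node = root
    for char in word:
        prev, cur = None, node.son
        while cur is not None and cur.char != char:   # scan the brothers
            prev, cur = cur, cur.brother
        if cur is None:                               # character not found: create it
            cur = TrieNode(char)
            if prev is None:
                node.son = cur
            else:
                prev.brother = cur
        node = cur
    node.occurrences += 1                             # the word ends on this node

# usage: a leaf can have sons, since a word can be a prefix of another
root = TrieNode()
for w in ["kitab", "kitaba", "kalima", "kitab"]:
    insert(root, w)
```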
4.2 Pattern Element Selection
This is the most important step and the one where a human eye is necessary (for now). If the corpus is big enough, the most frequent words are markers with a grammatical or very abstract function (with no independent meaning or referent, the syncategoremata of Aristotle); we tested this on English as well as on Arabic. The user, who just needs to know the language, has to set the frequency threshold that separates markers from "ordinary" (lower-frequency) words. This is done by looking for the most frequent lexical item appearing in the list displayed by our system and establishing its frequency as the threshold.
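A minimal sketch of this selection step, assuming the corpus has already been tokenized and the threshold has been chosen by the user as described above (names are illustrative):

```python
from collections import Counter

def select_markers(tokens, threshold):
    """Return the high-frequency words ('markers') whose corpus frequency
    reaches the threshold chosen by the user."""
    freq = Counter(tokens)
    return {word for word, count in freq.items() if count >= threshold}

# markers = select_markers(corpus_tokens, threshold=3500)
```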
With the subcorpus used for evaluation (section 6), a threshold of 3,500 selected 155 markers. With the whole corpus, a threshold of 9,000 selected 196 markers. The whole corpus is about 980 times bigger than the evaluation corpus and its number of unique words is about 17.5 times bigger, but the number of markers is only 1.26 times bigger, and the threshold only about 2.6 times bigger. Thus there does not seem to be a clear relationship between the size of the corpus and this threshold.
Looking more closely, with the whole corpus, 53 markers were added and 12 were lost. Half of these 12 were combinations of markers that had correctly been classified; with better lemmatization, those would be eliminated, leaving an error margin of 3% relative to the number of markers detected. On the whole, the bigger the corpus, the more homogeneous the marker distribution is, and the neater its identification with grammatical words.
We compared these 196 words with the hand-made Arabic Stopwords Sourceforge resource: half of them (97) were not included in that resource (77 words for the 155-marker list). Most of them should have been included as stopwords; others were combinations of stopwords. More generally, though Arabic Stopwords includes 449 words, it seems rather incomplete, and could easily be enriched by our method.
4.3 Pattern Building
A pattern is a sequence of markers interleaved with sequences of ordinary words. For instance:
the red book of Peter
the car of George
the heart of London
As "the" and "of" are higher frequency words, these phrases are instances of the same pattern: "the * of *". The star (joker) represents a sequence of ordinary words. Patterns are built according to the following principles (m represents a marker, x an ordinary word, p a punctuation mark):
A pattern does not contain p.
A pattern is a sequence of m and *.
A pattern contains at least one *.
* is a string of x with n as maximum length.
* contains at least one x.
The maximum length n is called JokerLength. Let us
take the following sequence, representing an extract
from the corpus:
xmmxxmxxxpmxmmxxxxmpmmxxxm
Our objective is to generate all possible patterns com-
patible with this sequence. These patterns will be rep-
resented by sequences of m and *, as in <*mm*m>.
Supposing that JokerLength = 3, we first obtain the
following patterns:
*mm*m* (followed by p)
m*mm* (followed by more than 3 x)
*m (followed by p)
mm*m (end-of-file considered as p)
From each of these patterns all potential patterns in-
cluded are deduced. For instance, <*mm*m*> con-
tains the following sub-patterns:
*m
*mm
*mm*
*mm*m
*mm*m*
Then we skip the first element and do the same with
<mm*m*> and its subpatterns, and recurse until the
pattern is finished. Of course, in real patterns, m is replaced by true markers; thus pattern <*m*m> is in reality a set of patterns differing by the identity of both m's.
In the actual implementation, patterns read from the corpus are stored in a prefix Trie similar to the one used for words. Every star is a node that permits a back reference from the ordinary word in the database to the positions it can occupy in the pattern Trie.
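To make the enumeration concrete, here is a rough sketch (not the authors' Trie-based implementation) that reproduces the maximal patterns of the example above and enumerates their sub-patterns; it treats a run of more than JokerLength ordinary words as a break that leaves a joker on each side, as in the example:

```python
import re

def maximal_patterns(sequence, joker_length):
    """Split an m/x/p sequence into maximal patterns: runs of ordinary
    words (x) up to joker_length become a joker '*'; punctuation (p) cuts
    a pattern; a longer run of x also cuts it, leaving a joker on each side."""
    patterns = []
    for segment in sequence.split("p"):               # a pattern never contains p
        current = ""
        for run in re.findall("m|x+", segment):       # runs of markers / ordinary words
            if run == "m":
                current += "m"
            elif len(run) <= joker_length:
                current += "*"                        # short run of x -> one joker
            else:                                     # run too long: close and reopen
                patterns.append(current + "*")
                current = "*"
        if "*" in current:                            # keep only patterns with a joker
            patterns.append(current)
    return patterns

def sub_patterns(pattern):
    """All sub-patterns of a maximal pattern: every contiguous slice
    that still contains at least one joker."""
    return {pattern[i:j]
            for i in range(len(pattern))
            for j in range(i + 1, len(pattern) + 1)
            if "*" in pattern[i:j]}

sequence = "xmmxxmxxxpmxmmxxxxmpmmxxxm"
for pat in maximal_patterns(sequence, joker_length=3):
    print(pat, sorted(sub_patterns(pat), key=len))
# -> *mm*m*, m*mm*, *m and mm*m, each with its sub-patterns
```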
4.4 Word Vectorization
As the preceding process builds all possible patterns
in the vicinity of a word, most of them will not be
relevant and will not be repeated. We need a fre-
quency threshold to eliminate spurious patterns that
won’t discriminate words.
We compute for each word the number of times it
occurs in every selected pattern. This process yields
a (sparse) matrix {words × patterns}. We then elim-
inate from this matrix all patterns whose frequency is
less than the pattern threshold selected.
Thus word vectorization depends on three parameters: the marker threshold, JokerLength, and the pattern threshold.
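A minimal sketch of this step, assuming a list of (word, pattern) occurrence pairs has already been extracted from the corpus; the names and the input format are illustrative assumptions:

```python
from collections import Counter, defaultdict

def vectorize(occurrences, pattern_threshold):
    """Build the sparse {words x patterns} matrix from (word, pattern)
    occurrence pairs, dropping patterns below the pattern threshold."""
    pattern_freq = Counter(pattern for _, pattern in occurrences)
    kept = {p for p, f in pattern_freq.items() if f >= pattern_threshold}
    vectors = defaultdict(Counter)                 # word -> {pattern: count}
    for word, pattern in occurrences:
        if pattern in kept:
            vectors[word][pattern] += 1
    return vectors, sorted(kept)                   # sparse vectors + pattern dimensions

# vectors, dims = vectorize(word_pattern_pairs, pattern_threshold=300)
```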
5 SELF ORGANIZING MAP
Self Organizing Map is an unsupervised neural net-
work model designed by Kohonen (Kohonen, 1995).
In its standard version, it projects the space of input
data on a two-dimension map. It implicitly does a
dual clustering: on one hand, the data is clustered into
neurons, and on the other hand the clusters themselves
are grouped by similarity in the map. Its operation is
in “best matching unit” mode: all neurons compete
for each input vector and the best matching neuron is
the winner.
X being the input vector, j an index over the n neurons in the map, and W_j the memory vector of neuron j, the winner j* is determined by equation 1, where d(x, y) is a distance measure:

d(X, W_{j^*}) = \min_{j \in \{1, \dots, n\}} d(X, W_j)   (1)
The distance can be Euclidean (the usual choice), Manhattan, or some other. It can be replaced with a similarity measure such as cosine (normalized dot product, eq. 2), if min is replaced by max in eq. 1. With sparse vectors, cosine similarity drastically reduces computation time.
CosSim(X, W_j) = \frac{\sum_{i=1}^{n} X_i W_{i,j}}{\|X\| \, \|W_j\|}   (2)
Every neuron has a weight vector W_j of the dimension of the input vector, initialized randomly and possibly pre-tuned to the set of possible values. In the learning phase the winner and every neuron in its neighbourhood learn the input vector, according to eq. 3, where N_σ(j, j*) is the neighbourhood function of radius σ; the bracketed superscript indicates the epoch.
W_j^{(t+1)} = W_j^{(t)} + \alpha^{(t)} \, N_\sigma^{(t)}(j, j^*) \, (X^{(t)} - W_j^{(t)})   (3)
The learning rate α decreases over time following equation 4, where α^{(0)} is its initial value:

\alpha^{(t)} = \alpha^{(0)} \left(1 - \frac{t}{t_{max}}\right)   (4)
Learning in the neighbourhood of the winner decreases in space, following here the Gaussian of equation 5, which yields better results than the Mexican hat or other variants; M(j, j*) is the Manhattan distance between map indexes:

N_\sigma^{(t)}(j, j^*) = \exp\left(-\frac{M(j, j^*)}{2 \sigma^{(t)2}}\right)   (5)
σ obeys equation 6, where σ^{(0)} is the initial radius value, typically the radius of the map, and σ^{(t_max)} is its final value, typically 1:

\sigma^{(t)} = \sigma^{(0)} \left(\frac{\sigma^{(t_{max})}}{\sigma^{(0)}}\right)^{t / t_{max}}   (6)
Our implementation offers the choice of Euclidean distance, Manhattan distance, or cosine similarity; different neighbourhood topologies (square or hexagonal); and memory initialization either at the center of the learning-set values or at random.
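A compact sketch of the training loop described by equations 1 to 6, using cosine similarity and a Gaussian neighbourhood over the Manhattan distance between map indexes (NumPy; illustrative only, with a square grid rather than the hexagonal topology actually chosen):

```python
import numpy as np

def train_som(X, rows, cols, epochs, alpha0=0.5, sigma0=None):
    """Train a rows x cols SOM on data X (n_samples x dim) with cosine similarity."""
    n_neurons, dim = rows * cols, X.shape[1]
    W = np.random.rand(n_neurons, dim)                       # memory vectors, random init
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    sigma0 = sigma0 if sigma0 is not None else max(rows, cols) / 2.0   # initial radius
    sigma_final = 1.0                                         # final radius (eq. 6)
    for t in range(epochs):
        alpha = alpha0 * (1.0 - t / epochs)                   # learning rate decay (eq. 4)
        sigma = sigma0 * (sigma_final / sigma0) ** (t / epochs)  # radius decay (eq. 6)
        for x in X:
            # winner: best matching unit by cosine similarity (eqs. 1-2, max instead of min)
            sims = W @ x / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + 1e-12)
            winner = int(np.argmax(sims))
            # Gaussian neighbourhood over Manhattan distance between map indexes (eq. 5)
            manhattan = np.abs(grid - grid[winner]).sum(axis=1)
            neigh = np.exp(-manhattan / (2.0 * sigma ** 2))
            # move every neuron towards the input, weighted by its neighbourhood (eq. 3)
            W += alpha * neigh[:, None] * (x - W)
    return W

# usage sketch: cluster word vectors on a 35 x 26 map
# W = train_som(word_vectors, rows=35, cols=26, epochs=50)
```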
6 EVALUATION
As our final objective was to produce new synsets, we wanted to check whether existing AWN synsets were correctly retrieved, that is, whether the words
of a synset were all clustered together. At first we
thoroughly assessed the quality of the 11,269 existing
AWN 2.1 synsets. This study yielded the following
issues:
A) 4,712 synsets are singletons.
B) 1,110 are subsets of others.
C) A non-negligible number of synsets are false.
Type (A) synsets would have artificially increased
the recall value of any method (they would always
be in the same cluster). As synsets of type (B)
do not form a complete partition of their supersets,
some words would not have been taken into account
and the number of singletons would have increased.
After eliminating these synsets, we were left with
5,807 synsets. It is easy to see why type (C) synsets
were not to be used, but much less easy to elimi-
nate them, as it has to be done by hand. For our experiments, we controlled and chose 900 synsets grouping 2,107 words. Those synsets can be found at https://sites.google.com/site/georgeslebboss.
The evaluation corpus consists of the documents of our large corpus that contain at least one of the words of these synsets. We ended up with an evaluation corpus of 7,787,525 words and 395,014 unique words.
Quality of evaluation will increase as AWN itself in-
creases in quality.
In order to compute the F-score (harmonic mean of recall and precision) of the four methods tested here, we ran them on the evaluation corpus, inserted the results in the database, then clustered the vectors obtained from each method with the SOM model. The resulting clusters are compared to our synsets: we count the number of synsets of which at least two words are clustered together. Let C be this number, T the total number of synsets, and S the number of SOM output cells; recall is computed as C/T and precision as C/S.
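In code, the measure reduces to the following sketch (C, T and S as defined above; the example figures are illustrative, not the paper's):

```python
def f_score(C, T, S):
    """F-score from C (synsets with at least two words clustered together),
    T (total number of synsets) and S (number of SOM output cells)."""
    recall = C / T
    precision = C / S
    return 2 * precision * recall / (precision + recall)

# illustrative figures only:
# print(f_score(C=700, T=900, S=910))
```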
We tuned parameters to their best values sepa-
rately for each method with many tests.
Best values for common parameters
1. orthographic normalization: true
2. lemmatization: true
3. SOM topology: hexagonal
4. SOM measure: cosine similarity
5. SOM mapsize: 35 x 26
Best values for GraPaVec parameters
1. marker threshold: 3500 in evaluation corpus,
9000 in the large one
2. JokerLength: 4
3. pattern threshold: 300
These parameter values yield the following vector dimensions: 1,571 in the evaluation corpus, 1,869 in the large one.
Best values for Word2Vec (Skipgram and CBOW)
and Glove parameters
Table 2: Word2Vec and Glove Parameter Values.
Skipgram CBOW GloVe
Vector dim. 300 300 50
Window size 10 7 15
Sample 1e-3 5 N/A
Hier. softmax 0 0 N/A
Negative samp. 10 5 N/A
Iterations 10 5 15
Min count 5 5 5
Learning rate 0.025 0.05 N/A
X Max N/A N/A 10
Normalization and lemmatization were the parameters that, along with the SOM map size, most influenced the results. Normalizing increases the F-score of any method by 12%; lemmatizing increases it by 17%; doing both increases it by 28%. Table 3 shows F-scores with the raw, normalized, lemmatized, and both (normalized and lemmatized) corpus.
Table 3: Effect of preprocessing on F-scores.
Corpus GraPaVec Skipgram CBOW GloVe
raw 54.1 27.2 24.3 28.6
norm. 65.5 38.5 35.6 40.0
lemm. 70.7 43.7 40.8 45.1
both 82.1 55.1 52.2 56.6
When the SOM map size is 10% bigger than the number of synsets to be retrieved, it downgrades the result by more than 20% for all three methods; this is also the case when the map size is smaller, which is expected, but the effect is less drastic. Table 4 shows some results (all with lemmatization and normalization); we have also indicated the results with the chosen value (35 x 26), corresponding to the number of synsets.
Table 4: Effect of map sizes on F-scores.
Mapsize   GraPaVec  Skipgram  GloVe
34 x 24   80.72     54.82     56.17
Chosen    82.1      55.1      56.6
35 x 27   79.78     52.47     53.98
35 x 28   77.23     51.81     52.23
35 x 30   61.54     41.44     42.05
The final evaluation using the best parameters for each method yields an F-score of 82.1% for GraPaVec, 55.1% for Word2Vec's Skipgram, 52.2% for CBOW and 56.6% for Glove.
7 CONCLUSION AND
DISCUSSION
To be completely open, we did not expect such a gap between the results of GraPaVec and the other methods. We looked for biases in our procedure. The only parameters that are common to all methods are those of the SOM (topology, distance and map size). The map size mostly depends on the number of clusters to be found, and we had to be close to the number of synsets. The choice of hexagonal topology, recommended by Kohonen himself on general grounds, gave the best results for all methods. Cosine similarity has no reason to favor one method. That leaves the choice of SOM itself, but there is no clear reason why it should be biased in favor of GraPaVec.
Could sparsity be a factor? The results below show vector sparsity in the evaluation corpus; while GraPaVec vectors are indeed six times sparser than Word2Vec's, these are in turn six times sparser than GloVe's, with no noticeable effect on the results:
GraPaVec : 0.27 %
Skipgram : 1.65 %
CBOW : 1.65 %
GloVe : 9.89 %
One bias is clear, though: GraPaVec is twice as time-consuming as Skipgram, the most time-consuming of the other methods, as shown by the results below, in minutes (on a Core i5 laptop with 8 GB of RAM); pattern construction consumes 4/5 of GraPaVec's time:
Method big corpus evaluation corpus
GraPaVec 483 125
Skipgram 242 63
Glove 66 16
CBOW 48 12
But this does not explain the F-score gap.
Perhaps the solution is to look at what Levy et al.
(Levy et al., 2015) call hyperparameters. Here the
type of context could have played the major role. It
would be interesting to adapt Word2Vec and Glove to apply them to such contexts. Another element could
have played some role: the corpus itself and the lan-
guage under study. As Goldberg (Goldberg, 2014)
puts it,
It is well known that the choice of corpora
and contexts can have a much stronger effect
on the final accuracy than the details of the
machine-learning algorithm being used [...]
Either way we achieved here two aims: building
Arabic word clusters on the basis of Arabic corpora,
a first step in enriching AWN, and showing that pat-
terns of higher frequency words, mostly grammatical
words, thrown away as “empty words” by most meth-
ods, are operative in semantic lexical clustering at
least in Arabic. More work on the contexts is needed
here.
There are still a number of questions to be addressed. Is it possible to automate the selection of the marker threshold? What impact would moving this threshold up or down have on the results? Reducing the computational cost of GraPaVec is a must in order to be able to do more extensive tests, and is one of our first objectives for now.
In the near future we also aim to produce synsets based on our work, to try our hand at other languages in order to see whether those results are language-specific, and to use a dynamic growing neural model that can find the number of categories by itself.
REFERENCES
Abdelali, B. and Tlili-Guiassa, Y. (2013). Extraction des relations sémantiques à partir du Wiktionnaire arabe. Revue RIST, 20(2):47–56.
Abdulhay, A. (2012). Constitution d'une ressource sémantique arabe à partir d'un corpus multilingue aligné. PhD thesis, Université de Grenoble.
Abouenour, L., Bouzoubaa, K., and Rosso, P. (2008). Im-
proving Q/A using Arabic WordNet. In Proc. of the
2008 International Arab Conference on Information
Technology (ACIT’2008), Tunisia.
Abouenour, L., Bouzoubaa, K., and Rosso, P. (2010). Using
the Yago ontology as a resource for the enrichment
of named entities in Arabic WordNet. In Proceed-
ings of The 7th International Conference on Language
Resources and Evaluation (LREC 2010) Workshop on
Language Resources and Human Language Technol-
ogy for Semitic Languages, pages 27–31.
Abouenour, L., Bouzoubaa, K., and Rosso, P. (2013). On
the evaluation and improvement of Arabic WordNet
coverage and usability. Lang Resources & Evaluation,
47(3):891–917.
Al-Barhamtoshy, H. M. and Al-Jideebi, W. H. (2009). De-
signing and implementing Arabic WordNet semantic-
based. In the 9th Conference on Language Engineer-
ing, pages 23–24.
Al Hajjar, A. E. S. (2010). Extraction et gestion de l'information à partir des documents arabes. PhD thesis, Paris 8 University.
Alkhalifa, M. and Rodriguez, H. (2008). Automatically ex-
tending named entities coverage of Arabic WordNet
using Wikipedia. International Journal on Informa-
tion and Communication Technologies, 1(1):1–17.
Bernard, G. (1997). Experiments on distributional catego-
rization of lexical items with Self Organizing Maps.
In International Workshop on Self Organizing Maps
WSOM’97, pages 304–309.
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., and Fellbaum, C. (2006). Introducing the Arabic WordNet project. In Sojka, Choi, Fellbaum, and Vossen, editors, Proceedings of the Third International WordNet Conference, pages 295–300.
Goldberg, Y. (2014). On the importance of comparing ap-
ples to apples: a case study using the GloVe model.
Google docs.
Hajjar, M., Al Hajjar, A. E. S., Abdel Nabi, Z., and Lebboss,
G. (2013). Semantic enrichment of the iSPEDAL cor-
pus. In 3rd World Conference on Innovation and Com-
puter Science (INSODE).
Harris, Z. S. (1954). Distributional structure. Word, 10(2-
3):146–162.
Harris, Z. S. (1968). Mathematical structures of language.
John Wiley & Sons.
Honkela, T., Kaski, T., Lagus, K., and Kohonen, T. (1997).
WEBSOM–Self-Organizing Maps of document col-
lections. In Proceedings of WSOM’97, Workshop on
Self-Organizing Maps, Espoo, Finland, pages 310–
315. Helsinki University of Technology.
Khoja, S., Garside, R., and Knowles, G. (2001). An Ara-
bic tagset for the morphosyntactic tagging of Arabic.
A Rainbow of Corpora: Corpus Linguistics and the
Languages of the World, 13:341–350.
Kohonen, T. (1995). Self-Organizing Maps. Springer,
Berlin.
Lebboss, G. (2016). Contribution à l'analyse sémantique des textes arabes. PhD thesis, University Paris 8.
Levy, O., Goldberg, Y., and Dagan, I. (2015). Improv-
ing distributional similarity with lessons learned from
word embeddings. Transactions of the Association for
Computational Linguistics, 3:211–225.
Lund, K., Burgess, C., and Atchley, R. A. (1995). Semantic
and associative priming in high-dimensional semantic
space. In Proceedings of the 17th annual conference
of the Cognitive Science Society, volume 17, pages
660–665.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vec-
tor space. In Proceedings of the International Con-
ference on Learning Representation, Workshop Track,
page 1301.
Miller, G. A. (1995). Wordnet: A lexical database for En-
glish. Communications of the ACM, 38(11):39–41.
Miller, G. A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield,
P., Langone, H., and Haskell, B. (2005). WordNet 2.1.
Cognitive Science Laboratory, Princeton University.
Niles, I. and Pease, A. (2003). Linking lexicons and on-
tologies: Mapping WordNet to the Suggested Upper
Merged Ontology. In Proceedings of the International
Conference on Information and Knowledge Engineer-
ing (IKE 03), Las Vegas, Nevada, volume 2, pages
412–416, Las Vegas, Nevada, USA.
Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.
Raafat, H., Zahran, M., and Rashwan, M. (2013). Arabase: A database combining different Arabic resources with lexical and semantic information. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pages 233–240. Scitepress.
Regragui, Y., Abouenour, L., Krieche, F., Bouzoubaa, K.,
and Rosso, P. (2016). Arabic WordNet: New con-
tent and new applications. In Proceedings of the
Eighth Global WordNet Conference, pages 330–338,
Bucharest, Romania.
Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Martí, M. A., Black, W., Elkateb, S., Kirk, J., Pease, A., Vossen, P., and Fellbaum, C. (2008). Arabic WordNet: Current state and future extensions. In Proceedings of The Fourth Global WordNet Conference, Szeged, Hungary, pages 387–405.
Salton, G., Wong, A., and Yang, C.-S. (1975). A vector
space model for automatic indexing. Communications
of the ACM, 18(11):613–620.
Vossen, P. (2004). EuroWordNet: a multilingual database
of autonomous and language-specific WordNets con-
nected via an Inter-Lingual Index. International Jour-
nal of Lexicography, 17(2):161–173.