Method      Big corpus   Evaluation corpus
GraPaVec    483          125
Skipgram    242          63
GloVe       66           16
CBOW        48           12
But this does not explain the F-score gap.
Perhaps the explanation lies in what Levy et al.
(2015) call hyperparameters. Here the type of
context could have played the major role. It would
be interesting to adapt Word2Vec and GloVe so that
they can be applied to such contexts. Another element
could also have played some role: the corpus itself and
the language under study. As Goldberg (2014)
puts it,
It is well known that the choice of corpora
and contexts can have a much stronger effect
on the final accuracy than the details of the
machine-learning algorithm being used [...]
Either way, we achieved two aims here: building
Arabic word clusters on the basis of Arabic corpora,
a first step in enriching AWN, and showing that
patterns of higher-frequency words, mostly grammatical
words, thrown away as “empty words” by most
methods, are operative in semantic lexical clustering,
at least in Arabic. More work on the contexts is needed
here.
There are still a number of questions to be
addressed. Is it possible to set the marker selection
threshold automatically? What impact would moving
this threshold down or up have on the results? Reducing
the computational cost of GraPaVec is a must in order
to run more extensive tests, and it is one of our
first objectives for now.
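The threshold question above can be explored with a simple sweep. The following is only a minimal sketch under stated assumptions, not the GraPaVec procedure: it assumes that markers are the words whose relative frequency reaches a cutoff, and the function names `select_markers` and `sweep_thresholds`, as well as the toy English token list, are hypothetical illustrations.

```python
from collections import Counter


def select_markers(tokens, threshold):
    """Return the words whose relative frequency is >= threshold.

    Assumption: markers are chosen by a relative-frequency cutoff;
    the paper's actual selection criterion may differ.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w for w, c in counts.items() if c / total >= threshold}


def sweep_thresholds(tokens, thresholds):
    """Map each candidate threshold to the number of markers it selects."""
    return {t: len(select_markers(tokens, t)) for t in thresholds}


# Toy stand-in for a corpus (hypothetical data, for illustration only).
tokens = "the cat sat on the mat and the dog sat on the rug".split()
sizes = sweep_thresholds(tokens, [0.05, 0.10, 0.20])
for threshold, n_markers in sizes.items():
    print(f"threshold={threshold:.2f} -> {n_markers} markers")
```

Plotting the marker count (or the downstream F-score) against the threshold in this way would be one possible starting point for choosing the cutoff automatically.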
In the near future we also aim to produce synsets
based on our work; to try our hand at other languages,
in order to see whether these results are language-specific;
and to use a dynamically growing neural model that can
find the number of categories by itself.
REFERENCES
Abdelali, B. and Tlili-Guiassa, Y. (2013). Extraction des
relations sémantiques à partir du Wiktionnaire arabe.
Revue RIST, 20(2):47–56.
Abdulhay, A. (2012). Constitution d’une ressource
sémantique arabe à partir d’un corpus multilingue
aligné. PhD thesis, Université de Grenoble.
Abouenour, L., Bouzoubaa, K., and Rosso, P. (2008).
Improving Q/A using Arabic WordNet. In Proc. of the
2008 International Arab Conference on Information
Technology (ACIT’2008), Tunisia.
Abouenour, L., Bouzoubaa, K., and Rosso, P. (2010). Using
the Yago ontology as a resource for the enrichment
of named entities in Arabic WordNet. In Proceedings
of The 7th International Conference on Language
Resources and Evaluation (LREC 2010) Workshop on
Language Resources and Human Language Technology
for Semitic Languages, pages 27–31.
Abouenour, L., Bouzoubaa, K., and Rosso, P. (2013). On
the evaluation and improvement of Arabic WordNet
coverage and usability. Lang Resources & Evaluation,
47(3):891–917.
Al-Barhamtoshy, H. M. and Al-Jideebi, W. H. (2009).
Designing and implementing Arabic WordNet semantic-based.
In the 9th Conference on Language Engineering,
pages 23–24.
Al Hajjar, A. E. S. (2010). Extraction et gestion de
l’information à partir des documents arabes. PhD thesis,
Paris 8 University.
Alkhalifa, M. and Rodriguez, H. (2008). Automatically
extending named entities coverage of Arabic WordNet
using Wikipedia. International Journal on Information
and Communication Technologies, 1(1):1–17.
Bernard, G. (1997). Experiments on distributional
categorization of lexical items with Self Organizing Maps.
In International Workshop on Self Organizing Maps
WSOM’97, pages 304–309.
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M.,
Vossen, P., Pease, A., and Fellbaum, C. (2006).
Introducing the Arabic WordNet project. In Sojka, Choi, F.
and Vossen, editors, Proceedings of the Third
International WordNet Conference, pages 295–300.
Goldberg, Y. (2014). On the importance of comparing
apples to apples: a case study using the GloVe model.
Google docs.
Hajjar, M., Al Hajjar, A. E. S., Abdel Nabi, Z., and Lebboss,
G. (2013). Semantic enrichment of the iSPEDAL corpus.
In 3rd World Conference on Innovation and Computer
Science (INSODE).
Harris, Z. S. (1954). Distributional structure. Word,
10(2-3):146–162.
Harris, Z. S. (1968). Mathematical structures of language.
John Wiley & Sons.
Honkela, T., Kaski, T., Lagus, K., and Kohonen, T. (1997).
WEBSOM–Self-Organizing Maps of document collections.
In Proceedings of WSOM’97, Workshop on
Self-Organizing Maps, Espoo, Finland, pages 310–315.
Helsinki University of Technology.
Khoja, S., Garside, R., and Knowles, G. (2001). An Arabic
tagset for the morphosyntactic tagging of Arabic.
A Rainbow of Corpora: Corpus Linguistics and the
Languages of the World, 13:341–350.
Kohonen, T. (1995). Self-Organizing Maps. Springer,
Berlin.
Lebboss, G. (2016). Contribution à l’analyse sémantique
des textes arabes. PhD thesis, University Paris 8.
Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving
distributional similarity with lessons learned from
word embeddings. Transactions of the Association for
Computational Linguistics, 3:211–225.