Authors:
Georges Lebboss (1); Gilles Bernard (1); Noureddine Aliane (1) and Mohammad Hajjar (2)
Affiliations:
(1) LIASD and Paris 8 University, France; (2) Lebanese University and IUT, Lebanon
Keyword(s):
Semantic Relations, Semantic Arabic Resources, Arabic WordNet, Synsets, Arabic Corpus, Data Preprocessing, Word Vectors, Word Classification, Self Organizing Maps.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Computational Intelligence; Health Engineering and Technology Applications; Human-Computer Interaction; Learning Paradigms and Algorithms; Methodologies and Methods; Neural Networks; Neurocomputing; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Self-Organization and Emergence; Sensor Networks; Signal Processing; Soft Computing; Theory and Methods
Abstract:
This paper presents a method for enriching Arabic WordNet with semantic clusters extracted from a large general corpus. As the Arabic language is poor in open digital linguistic resources, we built such a corpus (more than 7.5 billion words) with ad hoc tools. We then applied GraPaVec, a new method for word vectorization using automatically generated frequency patterns, as well as the state-of-the-art Word2Vec and GloVe methods. Word vectors were fed to a Self Organizing Map neural network model; the resulting clusterings were then evaluated against the existing synsets (sets of synonymous words) of Arabic WordNet. The evaluation yields an F-score of 82.1% for GraPaVec, 55.1% for Word2Vec's Skip-gram, 52.2% for CBOW and 56.6% for GloVe, which at least shows the value of the context that GraPaVec takes into account. We conclude by discussing parameters and possible biases.
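The abstract does not spell out how clusters are scored against synsets, so the following is only a minimal sketch of one common option, a pairwise F-score: two words count as a predicted pair when they fall in the same cluster, and as a gold pair when they share a synset. The function names (`same_group_pairs`, `pairwise_f_score`) and the placeholder words are hypothetical and are not taken from the paper; the authors' actual protocol may differ.

```python
from itertools import combinations


def same_group_pairs(groups):
    """Return the set of unordered word pairs that share at least one group."""
    pairs = set()
    for group in groups:
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_score(clusters, synsets):
    """Pairwise precision, recall and F-score of clusters against gold synsets.

    Assumption: words absent from every synset are ignored, since the gold
    resource says nothing about them.
    """
    vocabulary = {word for synset in synsets for word in synset}
    restricted = [[w for w in cluster if w in vocabulary] for cluster in clusters]

    predicted = same_group_pairs(restricted)
    gold = same_group_pairs(synsets)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up toy data: placeholder tokens, not real Arabic WordNet entries.
synsets = [["w1", "w2"], ["w3", "w4", "w5"]]
clusters = [["w1", "w2", "w3"], ["w4", "w5"]]

precision, recall, f_score = pairwise_f_score(clusters, synsets)
print(precision, recall, f_score)
```

In this toy case the clustering proposes four pairs, two of which are also among the four gold pairs, so precision, recall and F-score all come out at one half. With real data, `clusters` would hold the words mapped to each Self Organizing Map unit and `synsets` the Arabic WordNet synsets.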