Authors: Adelle Abdallah 1; Hussein Awdeh 1; Youssef Zaki 1; Gilles Bernard 1 and Mohammad Hajjar 2
Affiliations:
1 LIASD Lab, Paris 8 University, 2 rue de la Liberté 93526 Saint-Denis, Cedex, France
2 Faculty of Technology, Lebanese University, Hisbeh Street, Saida, Lebanon
Keyword(s):
Arabic Language, Arabic Natural Language Process, Validation Information Retrieval, Silver Standard Corpus.
Abstract:
Many methods have been applied to the automatic construction or expansion of lexical semantic resources. Most follow the distributional hypothesis applied to the lexical context of words, eliminating grammatical context (stopwords). This paper shows that grammatical context can yield information about the semantic properties of words, provided the corpus is large enough. To this end, we present an unsupervised pattern-based model that builds semantic word categories from large corpora and is devised for resource-poor languages. We divide the vocabulary into high-frequency and lower-frequency items, and explore the patterns formed by high-frequency items in the neighborhood of lower-frequency words. Word categories are then created by clustering. This is done on a very large Arabic corpus and, for comparison, on a large English corpus; results are evaluated with direct and indirect evaluation methods. We compare the results with state-of-the-art lexical models for performance and for computation time.
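
The pipeline outlined in the abstract (frequency split, patterns of high-frequency items around lower-frequency words, then clustering) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the `n_high` and `window` parameters, the `*` placeholder for non-high-frequency neighbors, and the greedy cosine-threshold grouping, which merely stands in for whatever clustering method the paper actually uses.

```python
import math
from collections import Counter, defaultdict


def pattern_features(sentences, n_high=3, window=1):
    """Count, for each lower-frequency word, the patterns formed by
    high-frequency items in its neighborhood (hypothetical helper)."""
    freq = Counter(tok for sent in sentences for tok in sent)
    # Assumption: the n_high most frequent tokens play the role of
    # grammatical (stopword-like) items.
    high = {w for w, _ in freq.most_common(n_high)}
    features = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok in high:
                continue
            # Keep high-frequency neighbors, mask the others with "*".
            left = tuple(w if w in high else "*"
                         for w in sent[max(0, i - window):i])
            right = tuple(w if w in high else "*"
                          for w in sent[i + 1:i + 1 + window])
            features[tok][(left, right)] += 1
    return high, features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(features, threshold=0.5):
    """Greedy grouping: attach each word to the first cluster whose
    first member is similar enough (a stand-in, not the paper's method)."""
    clusters = []
    for word in sorted(features):
        for members in clusters:
            if cosine(features[word], features[members[0]]) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    toy = [s.split() for s in (
        "the cat is on the mat",
        "the dog is on the rug",
        "the cat is on the bed",
        "the dog is on the mat",
    )]
    high, feats = pattern_features(toy)
    print(sorted(high))
    print(cluster(feats))
```

On a toy corpus like this, words that share the same grammatical frame (for example, appearing between "the" and "is") end up in the same group; a real corpus would need a far larger frequency threshold and a proper clustering algorithm.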