4 CONCLUSIONS
This paper attempts to provide Arabic NLP with new probabilistic
measures that support the generation of the most probable word
patterns and roots at the lexical level. Based on a statistical
analysis of a morphologically pre-processed corpus containing
50,544,830 Arabic word-forms, root and pattern predictive values have
been estimated and assigned to 6,860 Arabic roots in association with
650 patterns. These Root and Pattern Predictive Values might be
interpreted as uncertain binary relations. The measures are widely
applicable: they can support morphological analysis, word-sense
disambiguation, and non-word detection and correction at the lexical
level, and the syntactic cases of the pattern can also be considered.
The values have already been used successfully in a hybrid approach
to detecting and correcting Arabic non-words.
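As a rough illustration of how such values could be derived from corpus counts, the following minimal sketch estimates two conditional relative frequencies from (root, pattern) pairs. It rests on assumptions: the exact definitions of the Root and Pattern Predictive Values are not restated in this section, so the sketch simply treats them as the relative frequency of a root given a pattern and of a pattern given a root. The function name, the toy data and the transliterated root and pattern labels are hypothetical.

```python
from collections import Counter


def estimate_predictive_values(pairs):
    """Estimate conditional relative frequencies from (root, pattern) pairs.

    ASSUMPTION: the predictive values are approximated here as
    count(root, pattern) / count(pattern) and
    count(root, pattern) / count(root); the paper's own definitions
    may differ.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    root_counts = Counter(root for root, _ in pairs)
    pattern_counts = Counter(pattern for _, pattern in pairs)

    # Relative frequency of a root among all word-forms with this pattern.
    root_given_pattern = {
        (root, pattern): count / pattern_counts[pattern]
        for (root, pattern), count in pair_counts.items()
    }
    # Relative frequency of a pattern among all word-forms with this root.
    pattern_given_root = {
        (root, pattern): count / root_counts[root]
        for (root, pattern), count in pair_counts.items()
    }
    return root_given_pattern, pattern_given_root


# Hypothetical toy data: each word-form reduced to a (root, pattern) pair.
toy_pairs = [
    ("ktb", "fa3il"),
    ("ktb", "fa3il"),
    ("ktb", "maf3ul"),
    ("drs", "fa3il"),
]
root_pv, pattern_pv = estimate_predictive_values(toy_pairs)
```

With real data the pairs would come from the morphologically pre-processed corpus, and smoothing for unseen root-pattern combinations would probably be needed.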
One interesting aspect of these measures is that the root-pattern
phenomenon of Arabic is considered directly within a statistical
model, which might yield more natural results than the pure,
general-purpose N-gram analysis used by various researchers working
on Arabic.
On the other hand, although this model considers isolated
morpho-syntactic pattern forms and their expected roots, it still
needs to be integrated into a discourse-representational statistical
language model in order to support more context-dependent
applications such as deep semantic analysis. Presenting a
comprehensive model based on the introduced measures is beyond the
scope of this paper. The author is pursuing this objective,
considering further aspects that can benefit from the presented
measures and investigating additional measures that explore their
semantic potential statistically.
REFERENCES
Beesley K. B., 2001. Finite-State Morphological Analysis
and Generation of Arabic at Xerox Research: Status
and Plans 2001. In ACL/EACL01, Conference of the
European Chapter, Workshop: Arabic Language
Processing: Status and Prospect. France, Morgan
Kaufmann Publisher 2001.
Dichy Joseph and Farghaly A., 2003. Roots vs. Stems plus
Grammar-Lexis Specifications: on what basis should a
multilingual lexical database centred on Arabic be
built? In MT SUMMIT IX, Workshop on Machine
Translation for Semitic Languages: Issues and
Approaches. New Orleans, USA 2003, AMTA.
Ditters E., 2001. A Formal Grammar for the Description
of Sentences Structures in Modern Standard Arabic.
In ACL/EACL01, Conference of the European
Chapter, Workshop: Arabic Language Processing:
Status and Prospect. France, Morgan Kaufmann
Publisher 2001.
Fischer W., 1972. Grammatik des Klassischen Arabisch.
Otto Harrassowitz, Wiesbaden.
Haddad B., 2007. Semantic Representation of Arabic: A
logical Approach towards Compositionality and
Generalized Arabic Quantifiers. In International
Journal of computer processing of oriental languages,
IJCPOL 20(1) 2007. World Scientific Publishing.
Haddad Bassam and Yaseen M., 2007. Detection and
Correction of Non-Words in Arabic: A Hybrid
Approach. In International Journal of Computer
Processing of Oriental Languages, IJCPOL, Vol. 20,
Number 4, December 2007. World Scientific
Publishing.
Haddad B., 2002. Representing the Ignorance about the
Uncertainty in Associative Medical Relationships: An
IBFR Approach. In The 6th World MultiConference
on Systemics. IIIS, SCI 2002, Florida, USA 2002.
Shaalan K., 2005. GramChek: a grammar checker for
Arabic. In Software Practice and Experience. John
Wiley & sons Ltd., UK, 35(7):643-665, June 2005.
Shafer Charles and Yarowsky David, 2003. A Two-Level
Syntax-Based Approach to Arabic-English Statistical
Machine Translation. In MT SUMMIT IX, Workshop
on Machine Translation for Semitic Languages: Issues
and Approaches. New Orleans, USA 2003, AMTA.
Yaseen M., Atiyya M., Bendahman C., Maegaard B., Choukri K.,
Paulsson N., Haamid S., Fersøe H., Krauwer S., Rashwan M.,
Haddad B., Mukbel C., Mouradi A., Ali A., Shahin M.,
Ragheb A., 2006. Building Annotated Written and Spoken
Arabic Corpora Resources in NEMLAR Project. In The Fifth
International Conference on Language Resources and
Evaluation. LREC-2006, Genoa-Italy.
APPENDIX A
Transcription of Arabic letters based on DIN and (Fischer, 1972).
Long vowels are represented through the letters (ا, ā), (ى, ī) and
(و, ū), while short vowels are represented as follows: (fatḥa, —َ , a),
(kasrah, —ِ , i) and (ḍammah, —ُ , u).
Letter    Transcription    Name
ﺀ         ‘                hamza
ا         Ā                ’alif
ب         B                bā’
REPRESENTATION OF ARABIC WORDS - An Approach Towards Probabilistic Root-Pattern Relationships