Table 1: The letter < ف / "f" > in its isolated, beginning, middle and end forms.
Arabic is a Semitic language. Its grammatical system is based on a
root-and-pattern structure, and Arabic is considered a root-based
language with no more than 10,000 roots and 900 patterns (Hayder et
al., 2005). The root is the bare verb form; it commonly has three or
four letters and rarely five. A pattern can be thought of as a
template adhering to well-known rules.
Arabic words are divided into nouns, verbs and particles. Nouns and
verbs are derived from roots by applying templates to the roots to
generate stems and then adding prefixes and suffixes (Darwish, 2002).
Table 2 lists some templates (patterns) used to generate stems from
roots. The examples given below are based on the root < drs >.
Table 2: Some templates to generate stems from the root < drs >.
C indicates a consonant, A a vowel.
Template    Stem
CCC         < drs > / Study
CACC        < dArs > / Student
mCCwC       < mdrws > / Studied
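The template mechanism of Table 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the cited work: the function name `apply_template` is hypothetical, and it assumes a transliteration in which every root consonant is a single character and every `C` in the template marks a consonant slot.

```python
def apply_template(root: str, template: str) -> str:
    """Fill each 'C' slot of the template with the next root consonant.

    Hypothetical helper: assumes one transliterated character per
    root consonant and that 'C' appears only as a slot marker.
    """
    if template.count("C") != len(root):
        raise ValueError("template slots and root length do not match")
    consonants = iter(root)
    # Every non-'C' symbol (e.g. 'A', 'm', 'w') is copied unchanged.
    return "".join(next(consonants) if symbol == "C" else symbol
                   for symbol in template)


# Stems of Table 2, generated from the root "drs".
stems = [apply_template("drs", t) for t in ("CCC", "CACC", "mCCwC")]
```

With the root `drs`, the three templates of Table 2 yield `drs`, `dArs` and `mdrws`. Prefixes and suffixes would then be attached to these stems in a separate step.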
Many Arabic prefixes and suffixes correspond to entire words in other
languages. Table 3 presents the different components of a single
Arabic word that corresponds to the phrase "and she repeats it".
Table 3: An example of an Arabic word.
French      Arabic      English
et                      And
répéter                 Repeat
elle                    She
la                      It
Arabic has two grammatical genders, masculine and feminine. It
differs from most Indo-European languages in that it has three
numbers instead of the usual two (singular and plural): the third is
the dual, which is used when referring to exactly two entities.
3 THE FRENCH LANGUAGE
French is a descendant of the Latin language of the
Roman Empire, as are languages such as Portuguese,
Spanish, Italian, Catalan and Romanian.
The French language is written with a modern
variant of the Latin alphabet of 26 letters. French
word order is Subject Verb Object, except when the
object is a pronoun, in which case the word order is
Subject Object Verb.
French is today spoken around the world by 72
to 160 million people as a native language, and by
about 280 to 500 million people as a second or third
language (Wikipedia, 2008).
French is mostly a second language in Africa. In the Maghreb
(Mauritania, Algeria, Morocco and Tunisia), it is an administrative
language and is commonly used, though it has no official status.
In Algeria, French is still the most widely studied foreign
language, and it is commonly spoken and used in media and commerce.
4 N-GRAM MODELS
The goal of a language model is to determine the probability
$P(w_1^n)$ of a word sequence $w_1^n = w_1 \ldots w_n$. This
probability is decomposed as follows:
$$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) \qquad (1)$$
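As a small worked instance of this chain-rule decomposition (an illustration added here, using a generic three-word sequence rather than data from the study), each word is conditioned on all the words that precede it:

```latex
P(w_1^3) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)
```

The history grows with the position of the word, which is why the full decomposition cannot be estimated reliably from a finite corpus.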
The most widely used language models are n-gram models (Stanley and
Goodman, 1998). In n-gram language models, we condition the
probability of a word $w_i$ on the identity of the last $(n-1)$ words
$w_{i-n+1}^{i-1}$:
$$P(w_i \mid w_1^{i-1}) = P(w_i \mid w_{i-n+1}^{i-1}) \qquad (2)$$
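The conditional probabilities of an n-gram model can be estimated from counts. The sketch below uses plain maximum-likelihood estimation without smoothing, which is an assumption made for illustration and not necessarily the estimator used in the experiments; the function name `train_ngram`, the `<s>` padding symbol and the toy sentence are likewise hypothetical.

```python
from collections import Counter


def train_ngram(tokens, n):
    """Return unsmoothed MLE estimates P(w_i | previous n-1 words).

    Keys are n-gram tuples (history + word); values are relative
    frequencies count(history, word) / count(history).
    """
    # Pad with a start symbol so the first words have a full history.
    padded = ["<s>"] * (n - 1) + list(tokens)
    ngram_counts = Counter()
    history_counts = Counter()
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        ngram_counts[history + (padded[i],)] += 1
        history_counts[history] += 1
    return {ngram: count / history_counts[ngram[:-1]]
            for ngram, count in ngram_counts.items()}


# Toy bigram model (n = 2) on a six-word English sentence.
bigram = train_ngram("the cat sat on the mat".split(), 2)
```

In this toy corpus the word "the" is followed once by "cat" and once by "mat", so `bigram[("the", "cat")]` is 0.5. Any bigram that never occurs in the training data receives no probability at all, which is the motivation for smoothing in practice.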
The choice of $n$ is based on a trade-off between detail and
reliability, and depends on the quantity of training data available
(Stanley and Goodman, 1998).
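The reliability side of this trade-off can be made concrete by counting how many n-grams are observed only once as the order grows. The following sketch is illustrative only: the helper `ngram_stats` and the ten-word toy text are assumptions and are unrelated to the corpora used in the experiments.

```python
from collections import Counter


def ngram_stats(tokens, n):
    """Return (number of distinct n-grams, number seen exactly once)."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons


tokens = "the cat sat on the mat and the cat ran".split()
for n in (1, 2, 3):
    distinct, singletons = ngram_stats(tokens, n)
    print(f"n={n}: {distinct} distinct n-grams, {singletons} seen once")
```

On this toy text the share of n-grams seen only once rises with the order, reaching all of them for trigrams. A higher order captures more context, but each probability is then estimated from fewer observations.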
5 DATA DESCRIPTION
Currently, the availability of Arabic corpora is somewhat limited,
owing to the relatively recent interest in Arabic applications.
For our experiments, the corpora used for Arabic are
extracted from the CAC corpus compiled by Latifa
COMPARATIVE STUDY OF ARABIC AND FRENCH STATISTICAL LANGUAGE MODELS