
 
Table 1: The letter < ف / "f" > in its various forms.
Arabic is a Semitic language. Its grammatical
system is based on a root-and-pattern structure, and
it is considered a root-based language with no more
than 10,000 roots and 900 patterns (Hayder et al.,
2005). The root is the bare verb form. It commonly
has three or four letters and rarely five. A pattern
can be thought of as a template adhering to
well-known rules.
Arabic words are divided into nouns, verbs and 
particles. Nouns and verbs are derived from roots by 
applying templates to the roots to generate stems and 
then introducing prefixes and suffixes (Darwish, 
2002). Table 2 lists some templates (patterns) used to
generate stems from roots. The examples given
below are based on the root < drs >.
Table 2: Some templates to generate stems from the
root < drs >. C indicates a consonant, A a vowel.
Template    Stem
CCC         < drs > / Study
CACC        < dArs > / Student
mCCwC       < mdrws > / Studied
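The template mechanism of Table 2 can be illustrated with a minimal Python sketch. It assumes the transliterated notation used in the table, where each C in a template is a slot for the next root consonant and every other character is copied as is. The function name apply_template is hypothetical and is not taken from the paper.

```python
def apply_template(root: str, template: str) -> str:
    """Fill each 'C' slot of the template with the next consonant of the root.

    Assumption: templates and roots are given in the transliteration of
    Table 2, and every character other than 'C' is copied unchanged.
    """
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


# Reproduce the stems of Table 2 from the root < drs >.
for template in ("CCC", "CACC", "mCCwC"):
    print(template, apply_template("drs", template))
```

The sketch covers stem generation only. Prefixes and suffixes would be attached in a later step.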
Many prefixes and suffixes correspond to entire
words in other languages. Table 3 presents the
different components of a single Arabic word that
corresponds to the phrase "and she repeats it".
 
Table 3: An example of an Arabic word. 
French     Arabic     English
et                    And
répéter               Repeat
elle                  She
la                    It
Arabic has two grammatical genders, masculine and
feminine. It differs from Indo-European languages in
that it has three numbers instead of the usual two
(singular and plural). The third is the dual, which is
used when referring to exactly two people or things.
3 THE FRENCH LANGUAGE 
French is a descendant of the Latin language of the 
Roman Empire, as are languages such as Portuguese, 
Spanish, Italian, Catalan and Romanian.  
The French language is written with a modern 
variant of the Latin alphabet of 26 letters. French 
word order is Subject Verb Object, except when the 
object is a pronoun, in which case the word order is 
Subject Object Verb. 
French is today spoken around the world by 72 
to 160 million people as a native language, and by 
about 280 to 500 million people as a second or third 
language (Wikipedia, 2008).  
French is mostly a second language in Africa. In
the Maghreb, it is an administrative language and is
commonly used, though not on an official basis, in
Mauritania, Algeria, Morocco and Tunisia.
In Algeria, French is still the most widely 
studied foreign language, widely spoken and also 
widely used in media and commerce.  
4  N-GRAM MODELS 
The goal of a language model is to determine the
probability $P(w_1^n)$ of a word sequence $w_1^n$. This
probability is decomposed as follows:
$$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) \qquad (1)$$
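As a small worked illustration of this decomposition (not taken from the original paper), the chain rule applied to a three-word sequence gives one factor per word, each conditioned on all the words before it:

```latex
P(w_1^3) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2)
```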
The most widely used language models are n-gram
models (Stanley and Goodman, 1998). In n-gram
language models, we condition the probability of a
word $w_i$ on the identity of the last $(n-1)$ words
$w_{i-n+1}^{i-1}$:
$$P(w_i \mid w_1^{i-1}) = P(w_i \mid w_{i-n+1}^{i-1}) \qquad (2)$$
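The n-gram approximation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the estimation procedure used in the paper: probabilities are unsmoothed relative frequencies, sentences are padded with hypothetical <s> and </s> boundary markers, and the function names and the toy corpus are invented for the example.

```python
from collections import Counter


def train_ngram(sentences, n):
    """Count n-grams and their (n-1)-word contexts over tokenized sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        # Assumption: pad with boundary markers so every word has a context.
        tokens = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def ngram_prob(word, history, ngram_counts, context_counts, n):
    """Relative-frequency estimate of P(word | last n-1 words of history)."""
    padded = ["<s>"] * (n - 1) + list(history)
    # Keep only the last n-1 words; earlier history is ignored.
    context = tuple(padded[len(padded) - (n - 1):])
    total = context_counts[context]
    return ngram_counts[context + (word,)] / total if total else 0.0


# Toy corpus (hypothetical), bigram model (n = 2).
corpus = [["she", "repeats", "it"], ["she", "repeats", "the", "lesson"]]
ngram_counts, context_counts = train_ngram(corpus, n=2)
print(ngram_prob("it", ["she", "repeats"], ngram_counts, context_counts, n=2))
```

With n = 2 only the word "repeats" is used as context, even though the full history is "she repeats". An unseen n-gram receives probability zero here, which is why practical models add smoothing.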
The choice of $n$ is based on a trade-off between
detail and reliability, and depends on the quantity
of training data available (Stanley and Goodman,
1998).
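One way to see this trade-off is to measure how many n-grams of unseen text were already observed in the training data as n grows. The sketch below is only an illustration on a hypothetical toy corpus. The helper names ngrams and coverage are assumptions and do not come from the paper.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) contained in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(train_tokens, test_tokens, n):
    """Fraction of test n-grams that also occur in the training tokens."""
    seen = set(ngrams(train_tokens, n))
    test_ngrams = ngrams(test_tokens, n)
    return sum(g in seen for g in test_ngrams) / len(test_ngrams)


# Hypothetical toy data: same vocabulary, different word order.
train_tokens = ("the student studies the lesson and "
                "the student repeats the lesson").split()
test_tokens = ("the student repeats the lesson and "
               "the student studies").split()

for n in range(1, 5):
    print(n, coverage(train_tokens, test_tokens, n))
```

Longer n-grams capture more context but are observed less often, so their estimates rest on fewer counts unless more training data is available.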
5 DATA DESCRIPTION 
Currently, the availability of Arabic corpora is
somewhat limited. This is due to the relatively recent
interest in Arabic applications.
For our experiments, the corpora used for Arabic are 
extracted from the CAC corpus compiled by Latifa 
[Table 1 column headings: Isolated, Beginning, Middle, End]
COMPARATIVE STUDY OF ARABIC AND FRENCH STATISTICAL LANGUAGE MODELS