for applying the lexicon/stemming algorithm to the
corpus, showing Arabic morphology and describing
other work that has contributed to this strategy. Section
two describes the approach that was followed
to solve the problem and the affixes used to enhance
the performance of the stemmer. Section three
explains the algorithm used to apply the new
approach. Section four presents a statistical analysis
of the new method. Finally, section five describes
the planned future development and uses of the
approach and presents some conclusions.
1.1 Motivation
Natural Language Processing (NLP) is the use of
computer technologies for the creation, archiving,
processing and retrieval of machine-processed
language data and is a common research topic
involving computer science and linguistics
(Maynard et al., 2002). Research on the NLP of
Arabic is very limited (AbdelRaouf et al., 2010).
For instance, the Arabic language lacks a robust
corpus. The creation of a well-established
Arabic corpus encourages Arabic language research
and enhances the development of Arabic OCR
applications.
This paper presents a new approach that extends
and develops the one reported in (AbdelRaouf
et al., 2008, AbdelRaouf et al., 2010). An Arabic
corpus of 6 million Arabic words containing
282,593 unique words was constructed. In order to
check the performance and accuracy of this corpus, a
testing dataset of 69,158 words was also created.
Upon searching, 89.8% of the testing dataset was
found to exist in the corpus. We considered this
accuracy too low. To improve it, the system was
enhanced with a lexicon/stemming algorithm: a
combination of stemming and lexicon lookup that
provides a list of alternatives for the missing
words.
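The coverage figure above can be reproduced in spirit with a few lines of code. The following is a minimal sketch, assuming that coverage is measured over every token of the testing dataset (the paper does not state whether tokens or unique words were counted); the function name `coverage` and the tiny word lists are illustrative only and are not taken from the corpus.

```python
def coverage(test_words, corpus_vocab):
    """Return the fraction of test tokens that occur in the corpus vocabulary."""
    if not test_words:
        return 0.0
    found = sum(1 for word in test_words if word in corpus_vocab)
    return found / len(test_words)


# Illustrative data (assumption): a four-word "corpus" and five test tokens.
corpus_vocab = {"كتاب", "كتب", "قلم", "مدرسة"}
test_words = ["كتاب", "قلم", "الكتاب", "مدرسة", "كتاب"]

rate = coverage(test_words, corpus_vocab)
print(f"{rate:.1%} of the test tokens were found in the corpus")
```

In this toy example the only missing token is the definite form of a word that is itself in the vocabulary, which is the kind of miss a lexicon/stemming step is meant to recover.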
We designed our stemmer to avoid two common
errors. The first error occurs when the stemmer fails
to find the relevant words (words derived from the
same root word) and hence fails to increase the
corpus accuracy. The second error occurs when the
stemmer uses many affixes to create a very long list
of alternative words, and hence detects irrelevant
words (words not related in meaning to the original
word). A long list of alternatives also makes the stemmer slower.
Our stemmer increases the accuracy of the
corpus and simultaneously improves the reporting of
relevant words.
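The trade-off between the two errors can be illustrated with a small affix-stripping routine. This is a sketch under stated assumptions, not the stemmer described in this paper: the affix lists `PREFIXES` and `SUFFIXES`, the minimum stem length and the function names are all hypothetical. A short affix list keeps the candidate list small, and the lexicon lookup discards candidates that are not real words.

```python
# Hypothetical affix lists (assumption): the definite article and the
# conjunction "wa" as prefixes, three common noun endings as suffixes.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ة")


def strip_affixes(word, prefixes=PREFIXES, suffixes=SUFFIXES, min_len=3):
    """Return the word plus every form obtained by removing at most one
    prefix and at most one suffix, keeping at least min_len letters."""
    candidates = {word}
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            candidates.add(word[len(prefix):])
    for candidate in list(candidates):
        for suffix in suffixes:
            if candidate.endswith(suffix) and len(candidate) - len(suffix) >= min_len:
                candidates.add(candidate[:-len(suffix)])
    return candidates


def alternatives(word, lexicon):
    """Return the stripped forms of a missing word that exist in the lexicon."""
    return sorted(c for c in strip_affixes(word) if c in lexicon and c != word)


# Illustrative lexicon (assumption).
lexicon = {"كتاب", "مكتبة", "معلم"}
book_alternatives = alternatives("الكتاب", lexicon)
teacher_alternatives = alternatives("المعلمون", lexicon)
print(book_alternatives, teacher_alternatives)
```

Adding more affixes to the lists makes `strip_affixes` return more candidates per word, which raises the chance of matching an unrelated lexicon entry and increases the number of lookups.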
1.2 Arabic Language Morphology
The Arabic language depends mainly on the root of
a word. The root word can produce either a verb or a
noun, for instance “” - a root word – can be a
noun as in “ ” or a verb as in “ ”.
Stemmers, in general, tend to extract the root of
the word by removing affixes. English stemmers
remove only suffixes, whereas Arabic stemmers
mainly remove prefixes and suffixes; some of them
also remove infixes.
Lexica, on the other hand, create a list of
alternative words that can be produced from a given root
(Al-Shalabi and Evens, 1998, Jomma et al., 2006).
Arabic words change according to the following
variables (Al-Shalabi and Evens, 1998, Al-Shalabi
and Kanaan, 2004):
Gender: Male or female, as in ( ).
Tense (verbs only): Past, present or future, as
in ( ).
Number: Singular, dual or plural, as in (
).
Person: First, second or third, as in (
).
Imperative verb: as in ( ).
Definiteness: Definite or indefinite, as in (
).
The Arabic language, in addition to verbs and nouns,
contains prepositions, adverbs, pronouns and so on.
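To make the variables listed above concrete, the sketch below generates a few surface forms of one noun by concatenating affixes, which is the generative direction a lexicon works in. It is an illustration only, under these assumptions: the stem is a regular masculine noun that takes sound plurals, the dual is given in its nominative form, and the function name `inflect` is hypothetical. Broken plurals, verbs and case endings are not handled.

```python
def inflect(stem):
    """Return some gender, number and definiteness forms of a regular
    masculine noun stem (assumption: sound plurals, nominative dual)."""
    forms = {
        "singular masculine": stem,
        "singular feminine": stem + "ة",
        "dual": stem + "ان",
        "plural masculine": stem + "ون",
        "plural feminine": stem + "ات",
    }
    # Definiteness: prefix the definite article to every indefinite form.
    definite = {f"definite {name}": "ال" + form for name, form in forms.items()}
    return {**forms, **definite}


# Illustrative stem (assumption): "teacher".
forms = inflect("معلم")
for name, form in forms.items():
    print(f"{name}: {form}")
```

Even this restricted case yields ten forms from a single stem, which suggests why a corpus built from running text misses many valid words.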
1.3 Related Work
The Arabic language is rich and has a large variety
of grammar rules. Research in Arabic linguistics is
varied and can be categorized into four main types.
1.3.1 Manually Constructed Dictionaries
A custom Arabic retrieval system is built from
a list of roots and creates lists of alternative
words derived from those roots. This method is
limited by the number of roots collected
(Al-Kharashi and Evens, 1994).
1.3.2 Morphological Analysis
This is an important topic in natural language
processing. It is mainly concerned with root and
stem identification and relates more to the
grammar of the word and its position (Al-Shalabi
and Evens, 1998, Al-Shalabi and Kanaan, 2004,
Jomma et al., 2006).
ICPRAM 2013 - International Conference on Pattern Recognition Applications and Methods