Arabic Corpus Enhancement using a New

Lexicon/Stemming Algorithm

Ashraf AbdelRaouf

, Colin A. Higgins

, Tony Pridmore

and Mahmoud I. Khalil

School of Computer Science, The University of Nottingham, Nottingham, U.K.

Faculty of Engineering, Ain Shams University, Cairo, Egypt

Keywords: Arabic Corpus, Optical Character Recognition, Data Retrieval, Morphological Analysis, Lexicon, Stemming

Algorithm.

Abstract: Optical Character Recognition (OCR) is an important technology and has many advantages in storing

information for both old and new documents. The Arabic language lacks both the variety of OCR systems

and the depth of research relative to Roman scripts. An authoritative corpus is beneficial in the design and

construction of any OCR system. Lexicon and stemming tools are essential in enhancing corpus retrieval

and performance in an OCR context. A new lexicon/stemming algorithm is presented based on the Viterbi

path method which uses a light stemmer approach. Lexicon and stemming lookup is combined to obtain a

list of alternatives for uncertain words. This list removes affixes (prefixes or suffices) if there are any;

otherwise affixes are added. Finally, every word in the list of alternatives is verified by searching the

original corpus. The lexicon/stemming algorithm also assures the continuous updating of the contents of the

corpus presented by (AbdelRaouf et al., 2010), which copes with the innovative needs of Arabic OCR

research.

1 INTRODUCTION

A corpus is a structured collection of text covering a

large number of words from different domains of a

given language. The first modern corpus was the

Brown Corpus. It was collected and compiled in

1967 (Kuera and Francis, 1967) and contains

almost 1 million English words from different

disciplines. Currently, the three best known English

corpora are: The Corpus of Contemporary American

English, containing over 410 million words, created

in 1990; the British National Corpus, containing 100

million words, created in 1980 (Corpus, 2007); and

the Time Archive, created in 1923 and based on

Time magazine. The Time Archive contains more

than 275,000 articles with around 100 million words

(Time, 2008). All of these continue to be updated.

Stemming is the process of determining the

morphological root of a word as well as a tool for

data retrieval from a corpus to minimize

mismatching errors (Larkey et al., 2002). This is

acheived by removing affixes (prefixes, infixes or

suffixes) from the word. The word may have many

forms while retaining the same meaning, for

example, in English “works, working, worker and

worked” are all derived from the root word "work".

All languages contain nouns and verbs, for example

in English the nouns “book, books” are derived from

the root word “book” but one is singular and the

other plural. Also verbs like “play, played, playing”

have the same root word “play” but in different

tenses. A sophisticated lexicon need not make all

these alternative forms of words explicit, but may be

optimized to store only the root words and methods

to obtain and report the derived words.

The Arabic language has similar rules to English.

Arabic nouns like “    ” are derived from

the root word “” but the first is singular, the

second is for a pair and the third is plural. Verbs like

“   ” are also derived from the root word

“” but in different tenses.

This paper presents an automated way to

increase the number of words in an Arabic corpus. It

consists of applying a light stemming algorithm to

remove all affixes from a word, and then growing

the word with affixes to obtain multiple alternative

words. These are then either checked against the

original corpus or submitted, for review, to a user.

The paper is arranged as follows: Section one is an

introduction to the paper describing the motivation

435

AbdelRaouf A., A. Higgins C., Pridmore T. and I. Khalil M. (2013).

Arabic Corpus Enhancement using a New Lexicon/Stemming Algorithm.

In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 435-440

DOI: 10.5220/0004260704350440

 SciTePress

for applying the lexicon/stemming algorithm to the

corpus, showing Arabic morphology and describing

other work that has contributed this strategy. Section

two describes the approach that had been followed

to solve the problem and the affixes used to enhance

the performance of the stemmer. Section three

explains the algorithm used to apply the new

approach. Section four presents a statistical analysis

of the new method. Finally, section five describes

the planned future development and uses of the

approach and presents some conclusions.

1.1 Motivation

Natural Language Processing (NLP) is the use of

computer technologies for the creation, archiving,

processing and retrieval of machine processed

language data and is a common research topic

involving computer science and linguistics

(Maynard et al., 2002). Research in the NLP of

Arabic are very limited (AbdelRaouf et al., 2010).

So, for instance, the Arabic language lacks a robust

Arabic corpus. The creation of a well-established

Arabic corpus encourages Arabic language research

and enhances the development of Arabic OCR

applications.

This paper presents a new approach which

extends and develops that reported in (AbdelRaouf

et al., 2008, AbdelRaouf et al., 2010). An Arabic

corpus of 6 million Arabic words containing

282,593 unique words was constructed. In order to

check the performance and accuracy of this corpus, a

testing dataset of 69,158 words was also created.

Upon searching, 89.8% of the testing dataset was

found to exist in the corpus. We considered this

accuracy very low. To improve this the system was

enhanced using a lexicon/stemming algorithm. A

combination of stemming and lexicon lookup was

used to provide a list of alternatives for the missing

words.

We designed our stemmer to avoid two common

errors. The first error occurs when the stemmer fails

to find the relevant words (words derived from the

same root word) and hence fails to increase the

corpus accuracy. The second error occurs when the

stemmer uses many affixes to create a very long list

of alternative words, and hence detects irrelevant

words (words not related in meaning to the original

word). This also makes it slower.

Our stemmer increases the accuracy of the

corpus and simultaneously improves the reporting of

relevant words.

1.2 Arabic Language Morphology

The Arabic language depends mainly on the root of

a word. The root word can produce either a verb or a

noun, for instance “” - a root word – can be a

noun as in “   ” or a verb as in “   ”.

Stemmers, in general, tend to extract the root of

the word by removing affixes. English stemmers

remove only suffixes whereas Arabic stemmers

mainly remove prefixes and suffixes, some of them

also remove infixes.

Lexica on the other hand create a list of

alternative words that can be produced by that root

(Al-Shalabi and Evens, 1998, Jomma et al., 2006).

Arabic words change according to the following

variables: (Al-Shalabi and Evens, 1998, Al-Shalabi

and Kanaan, 2004)

 Gender: Male or female, as in (   ).

 Tense (verbs only): Past, present or future, as

in (   ).

 Number: Singular, pair or plural, as in (  

  ).

 Person: First, second or third, as in (  

 ).

 Imperative verb: as in (   ).

 Definiteness: Definite or indefinite, as in ( 



The Arabic language, in addition to verbs and nouns,

contains prepositions, adverbs, pronouns and so on.

1.3 Related Work

The Arabic language is rich and has a large variety

of grammar rules. Research in Arabic linguistics is

varied and can be categorized into four main types.

1.3.1 Manually Constructed Dictionaries

A custom Arabic retrieval system is built depending

on a list of roots and creates lists of alternative

words depending on those roots. This method is

limited by the number of roots collected (Al-

Kharashi and Evens, 1994).

1.3.2 Morphological Analysis

This is an important topic in natural language

processing. It is mainly concerned with roots and

stemming identification and is related more to the

grammar of the word and its positioning (Al-Shalabi

and Evens, 1998, Al-Shalabi and Kanaan, 2004,

Jomma et al., 2006).

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

436

1.3.3 Statistical Stemmers

These do not depend on the language involved but

on the similarity of rules among different languages.

A stemmer generator has been developed using a

parallel corpus which is a collection of sentence

pairs with the same meaning in different languages

(Rogati et al., 2003).

1.3.4 Light Stemmers

These depend on affix removal. An affix is defined

here as one or more letters added to the root word.

This type of stemmer needs less knowledge of

Arabic grammar (Aljlayl and Frieder, 2002, Larkey

et al., 2002).

2 APPROACH

Lexicon and stemming tools are used in enhancing

corpus retrieval and performance in an OCR context.

This section presents a lexicon/stemming algorithm

based on the use of a light stemmer. Lexicon and

stemming lookup is combined to obtain a list of

alternatives for uncertain words. This list removes

affixes (prefixes or suffixes) if there are any, if not,

adds affixes to the uncertain word. Finally, it tries to

verify every word in the list of alternatives by

searching the original corpus. A tool is added to

continually improve the corpus by adding new

words and justifying them using the

lexicon/stemming algorithm.

2.1 Problem Definition

The approach aims to improve the viability of the

Multi-Modal Arabic Corpus (MMAC) (AbdelRaouf,

Higgins et al. 2010). This approach helps to

facilitate an easy and acceptable way to continuously

update MMAC with new words. The approach adds

new words using the different linguistic information

available for these words. For every new word, the

approach gives the following information.

 The existence of the new word in MMAC.

This means that the word can be added to the

corpus without any problems.

 The existence of an alternative word in

MMAC. This is less accurate but it strongly

suggests the word might be correct.

 Finally, the non-existence of neither the word

nor its alternatives in MMAC. As a result, this

word must be carefully checked manually.

The approach offers a tool that allows adding a

list of new words to MMAC corpus and regenerates

MMAC content files automatically.

2.2 Stemmer used

Of the two main types of stemmer the light stemmer

was chosen for the following reasons.

 Light stemmers give more accurate results

(Aljlayl and Frieder, 2002).

 Morphological stemmers depend on regular

language rules. This may be applicable to the

English language because of the relatively

small number of irregular rules, but is not

applicable to the Arabic language because of

the large number of irregular rules.

 Morphological stemmers sometimes depend

on diacritics to extract the root. Diacritics are

not included in this research.

 A light stemmer does not need good

knowledge of the language grammar rules.

2.3 The Proposed Light Stemmer

Affixes

In the Arabic language, the root may be prefixed

(e.g.  ), infixed (e.g.   ) or suffixed

(e.g.   ) by one or more letters. These letters

have no meaning if written separately. Most of the

light stemmers remove prefixes and suffixes only.

The affixes used are shown in Table 1 (Larkey,

Ballesteros et al., 2002). Shalabi (Al-Shalabi and

Kanaan, 2004) created a lexicon that included

affixes using morphological rules which are not used

in other light stemmers but are used here. These

affixes where chosen as they proved most useful

during testing, however, it was clear that more

affixes were needed to improve performance. Table

1 lists the different types of light stemmers. It also

shows all affixes used to enable more words to be

detected by the corpus.

Table 1: Affixes of different light stemming types.

Stemmer Prefixes Suffixes

Shalabi

        

Light 1

    

None

Light 2

     

None

Light 3

"  

Light 8

"       

  

Light 10

      "

Proposed

Stemmer

     

  

       

      

  

ArabicCorpusEnhancementusingaNewLexicon/StemmingAlgorithm

437

3 THE PROPOSED LEXICON/

STEMMING ALGORITHM

The proposed lexicon/stemmer algorithm creates a

list of alternative words instead of the missing word

from the corpus. It searches the corpus for the words

from the alternative list using a binary search.

3.1 The Alternative List

Generating an alternative list of words is the most

important part of the approach. These alternatives

are a list of all the words that might be derived from

the original word. This part of the algorithm deals

with the words from the testing dataset (see section

1.1) which are missing from the corpus. It applies

the following steps (see Figure 1):

 It checks the first and last letters of a word and

searches for the prefixes and suffixes letters

shown in Table 1.

 It creates a two-dimensional array of

characters of size 25  10. 25 characters are

used because this is more than the maximum

Arabic word length and 10 characters because

this is the maximum number of affixes that

can be used with a word.

 It creates a one-dimensional array of 25

characters. This array keeps the path to be

followed to generate the alternative word from

the other two-dimensional array.

 The two dimensional array is used to generate

all possible alternatives, by adding prefixes or

suffixes to the word (see Table 1).

 In the case of affix with value ‘0’, it runs the

search without this character. For example in

Figure 1, the first character is ‘’. The

algorithm will add all the paths once with ‘’

to the alternative list and once more without it.

This means going through all the paths once

with the affix and once more without it.

 It starts getting all the possible paths from the

two-dimensional array using a Viterbi path

algorithm and stops in any column whenever

the terminating character ('\0') is found.

 It adds each word created in each possible

path to the alternative words list of the missed

word.

Figure 1 shows the algorithm used to generate

the alternative words list of a sample root word “”.

The three letters in the middle are the root word. The

letters on the right side show all the possible

prefixes. The letters on the left side show all the

possible suffixes. The counter check array keeps the

location of the path to be followed to generate the

alternative words list.

\0 1 0 0 0 0 0 0 0 0

Counter check array

        

0 0 \0 \0 \0 0 0 0 0 1

 

  

\0 2

 

\0 \0



 



 

\0 5





\0 8

Suffixes Root Prefixes

10 9 8 7 6 5 4 3 2 1 0

Figure 1: The structure of the proposed algorithm.

3.2 Implementing the Proposed

Approach

A program was implemented to apply the proposed

approach to the corpus data files. A script MS DOS

batch script was developed to generate the required

overall application. The script uses the proposed

algorithm program alongside a second program to

generate the new corpus data files.

The input is a list of the new words required to

be added to the corpus. The program generates two

lists of words, a list of unique words found in the

corpus and a list of unique words missed from the

corpus.

The list of unique words missed from the corpus

is divided into two: a trusted and an untrusted words

list. The trusted words list includes the words that

have an alternative word in the corpus whereas the

untrusted words list includes the words that have not

got alternative words in the corpus.

The user is required to edit the two lists to

manually check the words that are going to be added

to the corpus. After verifying the list of words the

application loads the new list of words into the

corpus. The application regenerates all the data files

of the corpus with the new lists of words.

4 STATISTICAL ANALYSIS OF

THE PROPOSED APPROACH

This section presents tests of the proposed approach

from two aspects. First, tests for the accuracy of the

approach and whether they enhance the accuracy of

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

438

the corpus are given. Secondly, tests on the effect of

the approach to increasing the total number of

tokens in the corpus are presented. The tests were

applied using the MMAC corpus presented by

(AbdelRaouf et al., 2010).

4.1 Accuracy of the Approach

The system was tested against the MMAC corpus

data using the testing dataset mentioned in section

1.1. The purpose of this test is to check the accuracy

of the approach. It also checks whether the generated

list of alternative words includes relevant words or

not.

The total number of unique words in the corpus

is 282,593. The total number of unique words in the

testing dataset is 17,766. The total number of words

from the testing dataset found in the corpus is

15,967 with an accuracy of 89.8%. Hence the total

number of words from the testing dataset missed

from the corpus is 1,799 with an error rate of 10.2%.

The total number of words found in the corpus

after applying the approach is 17,431 with accuracy

of 98.1%. The total number of words still missing

from the corpus is 335 with an error of 1.89%.

The total number of words found in the corpus

using the approach only is 1,464 words. These

words are either relevant to the missing word which

wasn’t found in the corpus (1,387 words with an

accuracy of 94.7%), or irrelevant to the missing

word (77 words with an error of 5.3%).

The previous statistics indicate that the proposed

algorithm reaches a high level of accuracy in finding

the words (98.1%) with a very minor missing words

error factor of 1.89% and also a negligible error in

finding irrelevant words of 0.4%.

4.2 Corpus Enhancement

The system was also tested by adding new words to

MMAC and seeing the effect of these additions upon

the performance of the corpus, with the aim of

including more new words on a regular basis.

The MMAC testing dataset was used as a list of

new words to be added to the corpus. Section 4.1

shows the number of words that are added to the

corpus using the testing dataset. The numbers

presented are for both trusted and un-trusted lists.

(AbdelRaouf, Higgins et al. 2010) included new

Arabic tokens: Piece of Arabic Word (PAW) and

Naked Piece of Arabic Word (NPAW) PAW without

dots. MMAC makes innovative use of this new

concept of connected segments of Arabic words

(PAWs) with and without diacritics marks.

Table 2 shows the number of words that are

added to the corpus using the trusted and untrusted

list of the testing dataset. Table 3 shows the number

of words that are added to the corpus using the

trusted list of the testing dataset. Table 4 shows the

number of words that are added to the corpus using

the untrusted list of the testing dataset.

Tables 2, 3 and 4 show that the proposed

approach is working well with the MMAC corpus.

They also show that the approach can easily increase

the total number of valid words in the MMAC

corpus. It is also apparent that the process of adding

new lists of words to the corpus is very easy, thus

the complicated procedures that were followed to

generate the corpus are no longer needed during

updates.

Table 2: Number and percentage of tokens using trusted and un-trusted lists.

Description

Before lexicon /

stemming

After lexicon /

stemming

Percentage

increase

Total number of Words 6,000,000 6,069,158 1.15%

Number of Unique words 282,593 284,392 0.64%

Number of Unique Naked words 211,072 212,422 0.64%

Number of Unique PAWs 66,725 67,010 0.43%

Number of Unique Naked PAWs 32,804 32,925 0.37%

Table 3: Total number and percentage of tokens using trusted list.

Description

Before lexicon /

stemming

After lexicon /

stemming

Percentage

increase

Total number of Words 6,000,000 6,069,158 1.15%

Number of Unique words 282,593 284,057 0.52%

Number of Unique Naked words 211,072 212,142 0.51%

Number of Unique PAWs 66,725 66,923 0.30%

Number of Unique Naked PAWs 32,804 32,894 0.27%

ArabicCorpusEnhancementusingaNewLexicon/StemmingAlgorithm

439

Table 4: Total number and percentage of tokens using un-trusted list.

Description

Before lexicon /

stemming

After lexicon /

stemming

Percentage

increase

Total number of Words 6,000,000 6,069,158 1.15%

Number of Unique words 282,593 282,928 0.12%

Number of Unique Naked words 211,072 211,352 0.13%

Number of Unique PAWs 66,725 66,813 0.13%

Number of Unique Naked PAWs 32,804 32,839 0.11%

5 CONCLUSIONS

AND FURTHER WORK

The proposed algorithm can be used with any Arabic

corpus, not only MMAC. The approach can easily

increase the size of the corpus with high accuracy.

The existence of this approach encourages

researchers and developers to enhance Arabic

corpora with new lists of words.

Future work will improve the proposed algorithm

to first find the most likely word to the missing one

and secondly to recommend the relative frequency

of letter pairs bigrams and trigrams frequencies.

Finally word pairs and tri-gram frequencies will be

investigated. The approach will include infix

removal in addition to prefixes and suffices. It will

be modified to use intelligent language techniques to

simply define the rules of affix addition or removal.

REFERENCES

AbdelRaouf, A., C. Higgins and M. Khalil (2008). A

Database for Arabic printed character recognition. The

International Conference on Image Analysis and

Recognition-ICIAR2008, Póvoa de Varzim, Portugal,

Springer Lecture Notes in Computer Science (LNCS)

series.

AbdelRaouf, A., C. Higgins, T. Pridmore and M. Khalil

(2010). "Building a Multi-Modal Arabic Corpus

(MMAC)." The International Journal of Document

Analysis and Recognition (IJDAR) 13(4): 285-302.

Al-Kharashi, I. A. and M. W. Evens (1994). "Comparing

words, stems, and roots as index terms in an Arabic

Information Retrieval System." Journal of the

American Society for Information Science 45(8): 548 -

560.

Al-Shalabi, R. and M. Evens (1998). A Computational

Morphology System for Arabic. Workshop on

Computational Approaches to Semitic Languages

COLING-ACL98, Montreal.

Al-Shalabi, R. and G. Kanaan (2004). "Constructing An

Automatic Lexicon for Arabic Language."

International Journal of Computing & Information

Sciences 2(2): 114-128.

Aljlayl, M. and O. Frieder (2002). On Arabic search:

improving the retrieval effectiveness via a light

stemming approach. The Eleventh International

Conference on Information and knowledge

Management, McLean, Virginia, USA, Conference on

Information and Knowledge Management archive.

Corpus, T. B. N. (2007). The British National Corpus

(XML Edition).

Jomma, H. D., M. A. Ismail and M. I. El-Adawy (2006).

An Efficient Arabic Morphology Analysis Algorithm.

The Sixth Conference on Language Engineering,

Cairo, Egypt, The Egyptian Society of Language

Engineering.

Kuera, H. and W. N. Francis (1967). "Computational

Analysis of Present-Day American English."

International Journal of American Linguistics 35(1):

71-75.

Larkey, L. S., L. Ballesteros and M. E. Connell (2002).

Improving stemming for Arabic information retrieval:

Light stemming and co-occurrence analysis. 25th

International Conference on Research and

Development in Information Retrieval (SIGIR).

Maynard, D., V. Tablan, H. Cunningham, C. Ursu, H.

Saggion, K. Bontcheva and Y. Wilks (2002).

"Architectural Elements of Language Engineering

Robustness." Journal of Natural Language

Engineering – Special Issue on Robust Methods in

Analysis of Natural Language Data: 1-20.

Rogati, M., S. McCarley and Y. Yang (2003).

Unsupervised learning of Arabic stemming using a

parallel corpus. The 41st Annual Meeting of the

Association for Computational Linguistics (ACL),

Sapporo, Japan.

Time. (2008). "Time Archive 1923 to present." from

http://www.time.com/time/archive/.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

440