process used to annotate, structure and validate our
corpus.
In Section 2, we present some of the research
efforts related to corpus building. The process of
data collection, annotation, validation and
structuring is presented in Section 3. Statistics about
the corpus are given in Section 4 with an attempt to
show its usefulness. The conclusion is given in
Section 5.
2 RELATED WORK
Over the last decade, various corpora have been
built, but most of them are used for commercial
purposes or are not sufficiently large to represent the
Arabic language.
Raw text corpora consist of a collection of texts
with no added information such as tagging, parsing,
etc. Such corpora can be divided into 1)
monolingual corpora, 2) parallel corpora, and 3)
dialectal corpora. The European Language
Resources Association (ELRA) (ELRA, 2008)
provides more than 83 Arabic corpora in several
categories (monolingual, multilingual, speech,
annotated, etc.) such as An-Nahar Corpus (An-
Nahar Corpus, 2014), an Arabic corpus collected
from the Lebanese newspaper An-Nahar in the
period between 1995 and 2000 and stored in HTML
files. This corpus contains 45,000 articles, totaling
24 million words, for each year. The Al Hayat corpus
(Al Hayat corpus, 2014)
is another written corpus
collected from the Al-Hayat newspaper. It was
developed at the University of Essex and covers
articles from 1998. The Al Hayat corpus contains
more than 18 million distinct tokens and 42,591
articles distributed across 7 domains (all punctuation
and special characters having been removed).
Unfortunately, the
corpora available on ELRA are not free.
Rafalovitch and Dale (2009) present a free
parallel corpus available online (Parallel Corpus,
2014)
that contains a collection of 2100 United
Nations General Assembly Resolution documents
with their parallel translations in the six UN official
languages (Arabic, Chinese, English, French,
Russian, and Spanish). The corpus contains about
3M tokens per language. Al-Sulaiti (2004), from the
University of Leeds, developed a free corpus of
contemporary Arabic (Contemporary corpus, 2014) in
which the articles are categorized into different
topics. The corpus contains written and spoken data
of 1 million words. Graff and Walker (2003), from
the University of Pennsylvania LDC, developed
Arabic Gigaword, a written corpus built from texts
taken from Agence France Press, Al Hayat
Newspaper, Al Nahar Newspaper and Xinhua News
Agency. The corpus is approximately 1.1 GB in
compressed form and contains 391,619
tokens. Arabic Gigaword is available from the
Linguistic Data Consortium, but it is not free.
Alrabiah et al. (2013) built KSUCCA (the King Saud
University Corpus of Classical Arabic), which
contains over 50 million words of classical
Arabic. The corpus was developed as part of
PhD work on building a distributional lexical
semantic model for classical Arabic and
investigating its applications to the Holy Quran.
The KSUCCA corpus can be used in several Arabic
linguistic and computational linguistic studies.
Almeman and Lee (2013) built an Arabic multi-
dialect (Gulf, Levantine, Egyptian and North
African) text corpus from web resources. The corpus
contains 48M tokens.
Annotated corpora include POS-tagged corpora,
parsed corpora, semantically annotated corpora, etc.
LDC (LDC, 2014)
and ELRA provide a set of
Arabic annotated corpora and parallel annotated
corpora, which are unfortunately not free. Khoja
(Khoja, 2001), from Lancaster University, built an
annotated corpus that contains manually-tagged
Arabic newspaper texts. The first collection includes
50,000 tagged words using general tags (noun, verb,
particle, number). The second contains 1,700 tagged
words with more detailed tags (tense, gender,
number, etc.). American and Qatari Modeling of
Arabic (AQMAR) Wikipedia Dependency Corpus
(AQMAR, 2014) is a hand-annotated corpus. The
POS tagging and dependency parse information
were collected from Arabic Wikipedia articles,
consisting of 1,262 sentences and 36,202
tokens. The corpus was developed as part of the
AQMAR project.
3 DATA PREPARATION
The development of the TALAA corpus
was divided into two main steps, presented below:
1) data collection and 2) data pre-processing.
3.1 Data Collection
The methodology used to build and structure the
Arabic corpus consisted of developing an automatic
system, a robot, to collect Arabic newspaper articles
from different websites (see Table 1). Figure 1
presents the process used to extract and organize the
data from the websites.
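The extraction step of such a robot can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the class name, the choice of `<h1>`/`<p>` tags, and the sample page are assumptions for demonstration, and a real crawler would additionally fetch each page over HTTP and adapt the parsing to each newspaper site's layout.

```python
# Minimal sketch (hypothetical) of the article-extraction step of a
# collection robot: pull a title and body paragraphs out of one
# downloaded article page using only the standard library.
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects text found inside <h1> (title) and <p> (body) elements."""

    def __init__(self):
        super().__init__()
        self._tag = None          # tag we are currently inside, if relevant
        self.title = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h1":
            self.title.append(text)
        elif self._tag == "p":
            self.paragraphs.append(text)


def extract_article(html):
    """Return (title, body) extracted from one article page."""
    parser = ArticleExtractor()
    parser.feed(html)
    return " ".join(parser.title), "\n".join(parser.paragraphs)


# Example on a toy Arabic article page.
page = ("<html><body><h1>عنوان المقال</h1>"
        "<p>النص الأول.</p><p>النص الثاني.</p></body></html>")
title, body = extract_article(page)
print(title)  # عنوان المقال
```

Each extracted (title, body) pair would then be stored for the pre-processing step described next.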
Building TALAA, a Free General and Categorized Arabic Corpus