Zerrouki. It was manually rewritten and
vocalized by volunteers, and covers about 75
million vocalized words (Zerrouki, 2017).
Al Khaleej Corpus: collected from the online
newspaper “Akhbar El Khaleej” by M. Abbas.
It contains more than five hundred articles,
distributed into 3 categories (international and
local news, economy, and sports), and covers
about 3 million words (Abbas, 2005).
King Abdulaziz City for Science and
Technology Arabic Corpus: collected from a
variety of publishing media by Al-Thubaity
et al. It contains more than 869,800 files,
distributed into several categories
(manuscripts, newspapers, books, magazines,
scientific periodicals, etc.), and covers more
than 700 million words, 7,464,396 of which
are unique (Al-Thubaity, 2015).
Contemporary Arabic Corpus: collected
between 1990 and 2004 from newspapers,
emails and websites by Al-Sulaiti and Atwell.
It is tagged in XML and covers
more than 842,684 words (Al-Sulaiti, 2005).
Kalimat Corpus: collected from the Arabic
newspaper Alwatan by El-Haj and Koulali,
then summarized into 2,057 multi-document
system summaries, NER-annotated, POS-tagged,
and fully morphologically analyzed. It contains
more than 20,291 articles, distributed into six
categories (culture, economy, international
news, local news, religion, and sports), and
covers about 18,167,183 words (El
Haj, 2013).
SACS Corpus: collected from the proceedings
of the Saudi Arabian National Computer
Science Conference by Abu Salem. It covers
46,968 words tagged with title, authors,
sources and abstract (Abu Salem).
The International Corpus of Arabic: collected
from electronic books, academic research
papers, and newspaper website articles by
Alansary. It contains 70,022 articles,
distributed into eleven categories (strategic,
national and social sciences, sports, religion,
literature, bibliography and others), and covers
more than 80 million words, 1,272,766 of
which are unique (Alansary, 2014).
Al-Raya Corpus: collected from the articles of
Al-Raya newspaper by Hasnah. It contains
about 187 articles and 219,978 words, of
which more than 30,096 are unique (Hasnah,
1996).
Arabic Modern Standard Corpus: collected
from newspaper articles from different Arab
countries by Abdelali. It covers 102,134
articles with about 113 million words
(Abdelali, 2005).
University of Jordan Arabic Corpus: collected
from 15 Arabic newspapers and other
resources from 19 Arab countries by
researchers from the University of Jordan. It is
tagged in XML and contains 61,037 articles
with 7,522,941 words, of which more than
707,385 are unique (Hammo, 2013).
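Several of the descriptions above report both a total word count and a unique word count. As a rough illustration of how these figures can be compared, the following minimal sketch computes the ratio of unique words to total words for three of the corpora. The figures are copied from the text with their "more than" and "about" qualifiers dropped, and the names `CORPORA` and `unique_word_ratio` are hypothetical helpers introduced only for this example, so the results are approximations rather than published statistics.

```python
# Approximate figures taken from the corpus descriptions above:
# name -> (total words, unique words). Qualifiers such as "more than"
# are ignored, so the resulting ratios are only indicative.
CORPORA = {
    "KACST Arabic Corpus": (700_000_000, 7_464_396),
    "International Corpus of Arabic": (80_000_000, 1_272_766),
    "Al-Raya Corpus": (219_978, 30_096),
}


def unique_word_ratio(total_words: int, unique_words: int) -> float:
    """Return unique_words / total_words (hypothetical helper)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return unique_words / total_words


def all_ratios(corpora: dict) -> dict:
    """Map each corpus name to its unique-to-total word ratio."""
    return {
        name: unique_word_ratio(total, unique)
        for name, (total, unique) in corpora.items()
    }


if __name__ == "__main__":
    # Print the corpora from the highest ratio to the lowest.
    ranked = sorted(all_ratios(CORPORA).items(), key=lambda kv: kv[1], reverse=True)
    for name, ratio in ranked:
        print(f"{name}: {ratio:.4f}")
```

Such a ratio depends strongly on corpus size, since small corpora tend to show higher values, so it should not be read as a size-independent measure of lexical richness.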
3.2 Commercially Available Arabic
Corpora
The five monolingual text and annotated corpora
listed below are commercially available Arabic
corpora that cover the news domain.
LDC Corpus (Arabic Newswire): collected
from the articles of the Agence France-Presse
newswire published between 1994 and 2000
by Graff and Walker at the University of
Pennsylvania’s LDC. It covers more than 76
million words, 666,094 of which are unique,
distributed into 383,872 files (Graff, 2001).
An-Nahar Newspaper Text Corpus: collected
from the An-Nahar newspaper from 1995 to 2000 and
stored as HyperText Markup Language
(HTML) files. It covers about 45 hundred
articles and 24 million words (ELRA, 2001).
Al-Hayat Arabic Corpus: collected from the
Al-Hayat Arabic newspaper. It contains 42,591
articles, distributed into several categories
(General, Car, Computer, News, Economics,
Science and Sport), and covers
18,639,264 unique words
(University Essex, 2001).
Nemlar Corpus: collected from 13 different
categories (political news, Islamic text,
phrases of common words, broadcast news,
business, Arabic literature, general news,
interviews, scientific press, sports press,
dictionary entry explanations, and legal
domain text) by the Nemlar project. It is provided
in four versions: raw, fully vowelized, with
Arabic lexical analysis, and with Arabic
POS tags, and covers more than 500,000 words
(ALP team, 2003).
Arabic Gigaword Corpus: collected from four
distinct Arabic newswire sources (Agence
France-Presse, Al-Hayat, An-Nahar, and Xinhua
News Agency) by Graff. It is encoded in UTF-8 and
marked up in SGML, and covers about 1,256,719
articles with 391,619 words (Graff,
2003).