Authors:
Essma Selab and Ahmed Guessoum
Affiliation:
Université des Sciences et de la Technologie Houari Boumediene (USTHB), Algeria
Keyword(s):
Corpora, Arabic Natural Language Processing, Corpus Metrics.
Abstract:
Arabic natural language processing (ANLP) has gained increasing interest over the last decade. However,
the development of ANLP tools depends on the availability of large corpora. Unfortunately, the scientific
community lacks large and varied Arabic corpora, especially freely accessible ones. As the Internet
continues its exponential growth, Arabic Internet content has followed the trend, yielding large amounts
of textual data available through different Arabic websites. This
paper describes the TALAA corpus, a voluminous general Arabic corpus, built from daily Arabic
newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens
contained in 57,827 different articles. Part of the TALAA corpus has been tagged to build an
annotated Arabic corpus of about 7,000 tokens, using a POS tagger with a set of 58 detailed tags. The
annotated corpus was manually checked by two human experts. The methodology used to construct TALAA
is presented and various metrics are applied to it, showing the usefulness of the corpus. The corpus can be
made available to the scientific community upon authorisation.