Authors: Essma Selab and Ahmed Guessoum

Affiliation: Université des Sciences et de la Technologie Houari Boumediene (USTHB), Algeria

Keyword(s): Corpora, Arabic Natural Language Processing, Corpus Metrics.

Abstract: Arabic natural language processing (ANLP) has gained increasing interest over the last decade. However, the development of ANLP tools depends on the availability of large corpora. Unfortunately, the scientific community lacks large and varied Arabic corpora, especially freely accessible ones. As the Internet continues its exponential growth, Arabic Internet content has followed the same trend, yielding large amounts of textual data available through various Arabic websites. This paper describes the TALAA corpus, a voluminous general Arabic corpus built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words, with 15,891,729 tokens contained in 57,827 different articles. A part of the TALAA corpus has been tagged to construct an annotated Arabic corpus of about 7000 tokens, using a POS tagger with a set of 58 detailed tags. The annotated corpus was manually checked by two human experts. The methodology used to construct TALAA is presented, and various metrics are applied to it, showing the usefulness of the corpus. The corpus can be made available to the scientific community upon authorisation.

CC BY-NC-ND 4.0


Paper citation in several formats:
Selab, E. and Guessoum, A. (2015). Building TALAA, a Free General and Categorized Arabic Corpus. In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2015) - Volume 2: PUaNLP; ISBN 978-989-758-073-4; ISSN 2184-433X, SciTePress, pages 284-291. DOI: 10.5220/0005352102840291

@conference{puanlp15,
author={Essma Selab and Ahmed Guessoum},
title={Building TALAA, a Free General and Categorized Arabic Corpus},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2015) - Volume 2: PUaNLP},
year={2015},
pages={284-291},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005352102840291},
isbn={978-989-758-073-4},
issn={2184-433X},
}

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2015) - Volume 2: PUaNLP
TI - Building TALAA, a Free General and Categorized Arabic Corpus
SN - 978-989-758-073-4
IS - 2184-433X
AU - Selab, E.
AU - Guessoum, A.
PY - 2015
SP - 284
EP - 291
DO - 10.5220/0005352102840291
PB - SciTePress