Building TALAA, a Free General and Categorized Arabic Corpus

Essma Selab; Ahmed Guessoum

In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: PUaNLP, 284-291, 2015 , Lisbon, Portugal

Authors: Essma Selab and Ahmed Guessoum

Affiliation: Université des Sciences et de la Technologie Houari Boumediene (USTHB), Algeria

Keyword(s): Corpora, Arabic Natural Language Processing, Corpus Metrics.

Abstract: Arabic natural language processing (ANLP) has gained increasing interest over the last decade. However, the development of ANLP tools depends on the availability of large corpora. It turns out unfortunately that the scientific community has a deficit in large and varied Arabic corpora, especially ones that are freely accessible. With the Internet continuing its exponential growth, Arabic Internet content has also been following the trend, yielding large amounts of textual data available through different Arabic websites. This paper describes the TALAA corpus, a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles. A part of the TALAA corpus has been tagged to construct an annotated Arabic corpus of about 7000 tokens, the POS-tagger used containing a set of 58 detailed tags. The annotated corpus was manually checked by two human experts. The methodology used to construct TALAA is presented and various metrics are applied to it, showing the usefulness of the corpus. The corpus can be made available to the scientific community upon authorisation.
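
The abstract reports token and article counts and mentions corpus metrics without naming them. As an illustration only, the sketch below computes a few common corpus statistics (token count, vocabulary size, type-token ratio) over a list of article texts. The function name `corpus_metrics`, the regular-expression tokenizer, and the choice of metrics are assumptions made for this example; they are not taken from the paper and are unlikely to reproduce its figures.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of Arabic letters (U+0621 to U+064A).
# Diacritics and digits are not matched, so they act as separators here.
# The tokenizer actually used for TALAA is not described in the abstract.
ARABIC_TOKEN = re.compile(r"[\u0621-\u064A]+")

def corpus_metrics(articles):
    """Return basic size statistics for a list of article strings."""
    counts = Counter()
    for text in articles:
        counts.update(ARABIC_TOKEN.findall(text))
    tokens = sum(counts.values())   # running words
    types = len(counts)             # distinct word forms
    return {
        "articles": len(articles),
        "tokens": tokens,
        "types": types,
        "type_token_ratio": types / tokens if tokens else 0.0,
    }

if __name__ == "__main__":
    sample = ["كتاب جديد", "كتاب قديم"]  # two toy "articles"
    print(corpus_metrics(sample))
```

A type-token ratio computed this way falls as a corpus grows, so it is only comparable between corpora of similar size.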

CC BY-NC-ND 4.0

Paper citation in several formats:

Selab, E. and Guessoum, A. (2015). Building TALAA, a Free General and Categorized Arabic Corpus. In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2015) - Volume 2: PUaNLP; ISBN 978-989-758-073-4; ISSN 2184-433X, SciTePress, pages 284-291. DOI: 10.5220/0005352102840291

@conference{puanlp15,
author={Essma Selab and Ahmed Guessoum},
title={Building TALAA, a Free General and Categorized Arabic Corpus},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2015) - Volume 2: PUaNLP},
year={2015},
pages={284-291},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005352102840291},
isbn={978-989-758-073-4},
issn={2184-433X},
}

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2015) - Volume 2: PUaNLP
TI - Building TALAA, a Free General and Categorized Arabic Corpus
SN - 978-989-758-073-4
IS - 2184-433X
AU - Selab, E.
AU - Guessoum, A.
PY - 2015
SP - 284
EP - 291
DO - 10.5220/0005352102840291
PB - SciTePress