mutually exclusive: a given text may belong to two or more classes at the same time, or to none of them. This implies that we are not performing simple multiclass classification but rather multi-label classification.
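To make the distinction concrete, the minimal scikit-learn sketch below (with hypothetical label names, not those of our dataset) encodes each text as a binary indicator vector over the labels, so that a text can activate several labels at once or none at all:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical annotations: each text may carry several abuse labels, one, or none.
labels_per_text = [
    {"hate", "obscene"},   # two classes at the same time
    {"offensive"},         # a single class
    set(),                 # none of the classes
]

mlb = MultiLabelBinarizer(classes=["hate", "obscene", "offensive"])
y = mlb.fit_transform(labels_per_text)
print(y)  # [[1 1 0]
          #  [0 0 1]
          #  [0 0 0]]
```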
As mentioned in prior work (Abid and Zribi, 2020), deep learning architectures have shown a strong learning capacity that has led to their increasingly widespread use in English and Arabic text classification problems, and they continue to prove highly capable of taking up various NLP challenges. On this basis, we decided to investigate and compare four well-known models for classifying multi-labeled abusive texts in the Arabic language: CNN, BiLSTM, BiGRU, and BERT.
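In the multi-label setting, what mainly changes for such models is the output layer: each label receives its own sigmoid unit trained with a binary cross-entropy loss, instead of a softmax over mutually exclusive classes. The following is only a rough Keras sketch of this idea for the BiLSTM case, with illustrative vocabulary, sequence-length, and label-count values rather than our actual configuration:

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, NUM_LABELS = 30_000, 100, 3  # illustrative values only

bilstm = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    # One independent sigmoid per label: a text can activate several labels or none.
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])
bilstm.compile(optimizer="adam",
               loss="binary_crossentropy",  # per-label binary loss
               metrics=["binary_accuracy"])
```

The CNN and BiGRU variants differ only in the encoder, and BERT is commonly fine-tuned with the same kind of sigmoid head.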
Moreover, our study does not only exploit Modern Standard Arabic data but also covers dialectal Arabic in the variety of forms used in everyday life and on social platforms. It is also noteworthy that the dataset we created for our experiments is multi-platform in nature, as it was retrieved from two different social networks, Twitter and YouTube.
This paper is organized as follows. In Section 2, we review existing research on the detection of Arabic abusive texts in social networks. Section 3 explains the steps we took to prepare our multi-label dataset. Our baseline model is then detailed, together with the implemented deep learning architectures, in Section 4. Finally, in Section 5, we highlight and discuss the experimental results and outline directions for future research.
2 RELATED WORKS
In this section, we review previous studies on abusive language detection in Arabic social media. Works are grouped by the granularity of their classification. We first introduce multiclass classification contributions, which are relatively scarce, especially for the Arabic language. Then, related binary and ternary classification contributions are presented.
Multiclass Classification: Recent work by (Al-Hassan and Al-Dossari, 2021) aimed to classify Arabic tweets into five distinct classes: none, religious, racial, sexism, or general hate. As the classes are mutually exclusive, the authors defined the general hate class as “Any general type of hate which is not mentioned in the previous classes. Whether it contains: general hatred, obscene, offensive and abusive words that are not related to religion, race or sex”. In the same work, the evaluation of various deep learning models showed that adding a CNN layer to the LSTM enhances the overall detection performance, with 72% precision, 75% recall, and a 73% F1 score. (Duwairi et al., 2021) also took the existence of hate speech subtypes into consideration and created ArHS, a multiclass Arabic hate speech dataset. They followed a lexicon-based approach using the Twitter4J API for crawling and relied on crowdsourcing for annotation. The final version of ArHS consists of 9,833 tweets classified into Misogyny, Racism, Religious Discrimination, Abusive, and Normal. To conduct their experiments, (Duwairi et al., 2021) additionally exploited two publicly available datasets, re-annotating them to fit the multiclass structure of ArHS. Binary, ternary, and multiclass classification were carried out on both ArHS and the combined dataset. The CNN-LSTM and BiLSTM-CNN architectures achieved the best multiclass accuracy, with 73% and 65% on ArHS and the combined dataset, respectively.
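Both studies above converge on hybrid convolutional-recurrent encoders (CNN-LSTM, BiLSTM-CNN). As a rough, non-authoritative sketch of the general idea, rather than the cited authors' exact architectures, a 1D convolution first extracts local n-gram features which a BiLSTM then reads as a sequence, ending in a softmax over the mutually exclusive classes (all sizes below are illustrative):

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 30_000, 100, 5  # illustrative values only

cnn_bilstm = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    # Convolution captures local n-gram patterns ...
    tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    # ... which the BiLSTM then models sequentially.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    # Multiclass setting: exactly one of the mutually exclusive classes.
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
cnn_bilstm.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```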
Two-class and Three-class Classification: (Abu Farha and Magdy, 2020) is the “SMASH” team submission to the OSACT4 (Open-Source Arabic Corpora and Corpora Processing Tools) shared tasks on offensive language detection (Subtask A) and hate speech detection (Subtask B) in the Arabic language. The provided dataset contains 10,000 tweets split into training, development, and testing sets. It is extremely imbalanced: only 19% of the tweets are tagged as offensive and only 5% as hate speech. The authors carried out various experiments covering a variety of approaches, including deep learning, transfer learning, and multitask learning. Results showed that the multitask learning models achieved the best performance, with a macro F1 score of 0.904 for Subtask A and 0.737 for Subtask B.
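As a brief illustration of multitask learning in this shared-task setting (a generic sketch, not the SMASH team's actual system; layer sizes and the BiGRU encoder are assumptions), a single encoder is shared between two task-specific heads, one per subtask:

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN = 30_000, 100  # illustrative values only

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
shared = tf.keras.layers.Embedding(VOCAB_SIZE, 128)(tokens)
shared = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64))(shared)

# Two task-specific heads trained jointly over the shared encoder.
offensive = tf.keras.layers.Dense(1, activation="sigmoid", name="subtask_a")(shared)
hate = tf.keras.layers.Dense(1, activation="sigmoid", name="subtask_b")(shared)

multitask = tf.keras.Model(inputs=tokens, outputs=[offensive, hate])
multitask.compile(optimizer="adam",
                  loss={"subtask_a": "binary_crossentropy",
                        "subtask_b": "binary_crossentropy"})
```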
The same dataset was used in (Saeed et al., 2020) and (Hassan et al., 2020). (Saeed et al., 2020) named
their own approach “ESOTP” (Ensembled Stacking
classifier over Optimized Thresholded Predictions
of multiple deep models). It is a classification
pipeline where they trained NN, BLSTM, BGRU,
and BLSTM+CNN 550 times. The predictions were
used as a new training set for an ensemble of a Naïve
Bayes classifier, a Logistic Regression model, a Sup-
port Vector Machine, a Nearest Neighbours classifier,
and a Random Forest. ESOTP achieved an F1 score of 87.37% for Subtask A (ranked 6/35) and 79.85% for Subtask B (ranked 5/30).
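To give a flavour of the second-level idea (not the authors' exact pipeline, which also involves 550 training runs and threshold optimization), the sketch below feeds placeholder deep-model predictions to a soft-voting ensemble of the five classical classifiers listed above; the scikit-learn estimators and the random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder stand-ins for the deep models' predicted probabilities
# (one column per deep model) and the gold binary labels.
rng = np.random.default_rng(0)
deep_preds = rng.random((1000, 4))
y = rng.integers(0, 2, 1000)

# Second-level ensemble trained on the first-level predictions.
ensemble = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier())],
    voting="soft",
)
ensemble.fit(deep_preds, y)
print(ensemble.predict(deep_preds[:5]))
```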
As for (Hassan et al., 2020), a system
combination of Support Vector Machines (SVMs)
and Deep Neural Networks (DNNs) achieved the best
results for offensive language detection and ranked
1st in the official results with an F1 score of 90.51%, while SVMs were more effective than DNNs for hate