Table 1: CH-index for the Arabic WSI model.

Dataset    #Clusters   CH (K-Means)   CH (HAC-Ward)   CH (HAC-Single)   CH (HAC-Complete)   CH (HAC-Average)
OSAC            5         238.83          192.35            7.29             11.563              11.61
OSAC           10         207.43          174.75            6.51             50.55                6.74
OSAC           50          85.03           78.32            2.78             58.14               52.08
SemEval        15         183.42           82.56            6.45             82.56               16.76
SemEval        34         105.49           61.39            7.68             61.39               12.67
SemEval        50          81.72           75.81            7.45             51.65               13.12
where n_k and c_k are the number of points and the centroid of the k-th cluster, respectively, c is the global centroid of the dataset, and N is the total number of data points.
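For reference, the CH-index defined above can be computed directly from a data matrix and its cluster labels; scikit-learn exposes it as calinski_harabasz_score. The following is a minimal sketch, in which the random matrix stands in for real sentence embeddings:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# X stands in for the (N, d) matrix of sentence embeddings.
X = np.random.rand(5000, 768)

# Cluster the points and score the partition; a higher CH-index is better.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print(round(calinski_harabasz_score(X, labels), 2))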
4 EXPERIMENTAL RESULTS AND EVALUATION
4.1 Dataset
We validate our approach on two Arabic datasets: the Open Source Arabic Corpus (OSAC) (Saad and Ashour, 2010) and the SemEval Arabic task^1.
• OSAC.
It is a corpus constructed from many websites and split into three primary corpora. After the elimination of stop words, the BBC-Arabic corpus contains 1,860,786 (1.8M) words and 106,733 unique words, whereas the CNN-Arabic corpus contains 2,241,348 (2.2M) words and 144,460 unique words. The OSAC corpus itself, gathered from several sources (Saad and Ashour, 2010), contains roughly 18,183,511 (18M) words and 449,600 unique words after stop-word removal and is divided into 10 categories (these counts can be recomputed with the sketch given at the end of this subsection). This corpus is used as a baseline in (Djaidri et al., 2018).
• SemEval.
We use the SemEval-2017 Arabic dataset, which is split into three subtasks: Message Polarity Classification (Subtask A), Topic-Based Message Polarity Classification (Subtasks B-C), and Tweet Quantification (Subtasks D-E). It contains 2,278 tweets for training, 585 for validation, and 1,518 for testing. We investigate only the training data, which contains 34 classes. This corpus is used as a baseline in (Pinto et al., 2007).
^1 https://www.dropbox.com/s/i9tkaajuq1qbgjq/2017 Arabic train final.zip?
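The corpus statistics above (total and unique word counts after stop-word removal) can be recomputed with a few lines; the sketch below assumes a plain-text corpus file and an Arabic stop-word list, whose file names are placeholders:

# Minimal sketch: total and unique words after stop-word removal.
# "corpus.txt" and "arabic_stopwords.txt" are placeholder file names.
with open("arabic_stopwords.txt", encoding="utf-8") as f:
    stopwords = set(f.read().split())
with open("corpus.txt", encoding="utf-8") as f:
    tokens = [w for w in f.read().split() if w not in stopwords]
print("total words:", len(tokens), "unique words:", len(set(tokens)))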
4.2 Experimental Results
We run the following experiments to study the different aspects of the proposed approach. We automatically clean and cluster the datasets as follows:
• Experiment 1: the number of clusters equals the number of classes existing in the dataset (10 for OSAC and 34 for Arabic SemEval).
• Experiment 2: the number of clusters is greater than the number of classes existing in the dataset.
• Experiment 3: the number of clusters is less than the number of classes existing in the dataset.
For the HAC clustering algorithm, we adopt four linkage criteria: Ward, single, complete, and average linkage. For all experiments, we use the same number of samples (5,000), which later allows a credible comparison between the obtained results.
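As an illustration of how these configurations are compared, the sketch below clusters a matrix of sampled embeddings with K-Means and with HAC under the four linkage criteria, then scores each partition with the CH-index; the random matrix and the cluster count are placeholders:

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(5000, 768)   # placeholder for the 5,000 sampled embeddings
n_clusters = 10                 # 10 for OSAC, 34 for Arabic SemEval

partitions = {"K-Means": KMeans(n_clusters=n_clusters, n_init=10,
                                random_state=0).fit_predict(X)}
for link in ("ward", "single", "complete", "average"):
    hac = AgglomerativeClustering(n_clusters=n_clusters, linkage=link)
    partitions[f"HAC-{link}"] = hac.fit_predict(X)

for name, labels in partitions.items():
    print(name, round(calinski_harabasz_score(X, labels), 2))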
Our main experimental results using the CH-index metric are shown in Table 1. The CH-index with K-Means outperforms the CH-index with HAC, and Ward linkage outperforms the other linkage types. For the SemEval data, when the number of clusters is 34, all the sentences containing the same word or the same context as this word are placed in the same cluster; for example, the sentences related to the word "اندرويد" (Android) are all placed in cluster 4.
Because Ward linkage performs better than the others, we choose to plot some cluster points. We also tested the HAC algorithm with no predefined number of clusters. Figure 5 plots the embedding data clustered with agglomerative clustering using 3 clusters and with no predefined number of clusters.
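A plot of this kind can be obtained by projecting the embeddings to two dimensions and colouring the points by their cluster labels; the sketch below covers both the fixed 3-cluster case and the case with no predefined number of clusters (the random data and the distance threshold are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA

X = np.random.rand(500, 768)                 # placeholder for the embeddings
X2d = PCA(n_components=2).fit_transform(X)   # 2D projection for plotting

fixed = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
free = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                               linkage="ward").fit_predict(X)  # threshold must be tuned

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, labels, title in zip(axes, (fixed, free),
                             ("3 clusters", "no predefined number of clusters")):
    ax.scatter(X2d[:, 0], X2d[:, 1], c=labels, s=8, cmap="tab20")
    ax.set_title(title)
plt.show()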
The dendrograms (with complete (1), Ward (2), single (3), and average (4) linkage) are presented in Figure 6 for the SemEval embedding data and in Figure 7 for OSAC.
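Dendrograms of this kind can be drawn with SciPy's hierarchical-clustering utilities; the sketch below loops over the four linkage criteria, again with a random matrix standing in for the real embeddings:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(200, 768)   # placeholder for the embedding matrix

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, method in zip(axes.ravel(), ("complete", "ward", "single", "average")):
    Z = linkage(X, method=method)          # hierarchical merge tree
    dendrogram(Z, ax=ax, no_labels=True)   # leaf labels omitted for readability
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()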
Table 1 shows that the CH-index on K-Means for OSAC is higher than on SemEval, so we can note that our system gives better results in word sense clustering (OSAC dataset) than in sentence clustering (SemEval dataset).