Performance comparisons between k-Nearest
Neighbors (k-NN), Rocchio, and NB as classifiers
were conducted. The results of this experiment show
that the WIDF scheme yields the best performance
when used with the k-NN classifier, while TFIDF
performs best when used with Rocchio. Among the
three classifiers, the NB classifier is reported to be
the best performer, yielding a Macro F1 score of
84.53%. Mesleh (2007) reported a BOW-based
Arabic TC system that uses Chi-square for feature
selection. The results show that, with features
reduced using Chi-square, an SVM classifier yields
better classification performance than a k-NN or an
NB classifier, achieving a Macro F1 score of 88.11%
when evaluated on an in-house compiled Arabic text
dataset.
The BOW model suffers from two main
limitations: (1) it breaks terms into their constituent
words, e.g., it breaks ‘text classification’ into the
words ‘text’ and ‘classification’; as a result, the order
of the words is lost in the model and the meaning of
multi-word terms can change; (2) it treats synonymous
words as independent features, e.g., ‘classification’
and ‘categorization’ are considered two independent
words with no semantic association. As a result,
documents that discuss similar topics but use
synonymous words may be considered unrelated.
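These two limitations can be illustrated concretely. The following is a minimal sketch, assuming scikit-learn’s CountVectorizer as a stand-in BOW implementation (the cited studies used their own tooling):

    from sklearn.feature_extraction.text import CountVectorizer

    # Two documents on the same topic, using synonymous terms.
    docs = [
        "text classification of news articles",
        "categorization of news text",
    ]

    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(docs)

    # The multi-word term 'text classification' is split into 'text' and
    # 'classification', and 'classification' / 'categorization' become two
    # unrelated columns, so the documents share fewer features than a
    # reader would expect.
    print(vectorizer.get_feature_names_out())
    print(bow.toarray())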
Researchers have attempted to address the above
issues in English TC by representing text as
concepts rather than words, using an approach
known as Bag-of-Concepts (BOC). A concept is a
unit of knowledge with a unique meaning (ISO,
2009). To build a BOC model, semantic knowledge
bases such as WordNet (http://wordnet.princeton.edu/),
the Open Directory Project (ODP, http://dmoz.org),
and Wikipedia (http://www.wikipedia.org) are used
to identify the
concepts appearing within a document. By using
concepts in text representation, the semantics of and
associations between the words appearing in a
document are preserved. For example, Hotho et
al. (2003) used the English WordNet as a knowledge
base to represent English text. For each term in a
document, WordNet returns an ordered list of
synonyms, and the first-ranked synonym is used
as the concept for the term. In that study, three
approaches were proposed for using concepts as
features for text representation: (1) using only
concepts to represent documents; (2) Adding
Concepts (AC) as complementary features to the
BOW model; (3) Replacing Term words with
Concepts (RTC) in the BOW model. The study shows
that AC yields better classification performance
than the other two approaches. The results also show
that representing documents with concepts alone is
insufficient, as WordNet does not cover all
specialized domain vocabularies. Furthermore,
WordNet is limited in that it is a manually constructed
dictionary and is therefore laborious to maintain.
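As an illustration of the AC approach described above, the following minimal sketch assumes NLTK’s WordNet interface and uses the name of a token’s first-ranked synset as its concept; it is a simplification for illustration, not the implementation of Hotho et al.:

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

    def add_concepts(tokens):
        """Adding Concepts (AC): keep the original words and append the
        first-ranked WordNet sense of each token as an extra feature."""
        features = list(tokens)
        for token in tokens:
            synsets = wn.synsets(token)          # senses ordered by WordNet
            if synsets:
                features.append(synsets[0].name())   # e.g. 'text.n.01'
        return features

    # Synonymous words can end up sharing a concept feature whenever
    # WordNet ranks a common synset first for both of them.
    print(add_concepts(["text", "classification"]))
    print(add_concepts(["text", "categorization"]))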
To deal with this problem, other researchers have
tried replacing WordNet with other knowledge bases
derived from the Internet, such as ODP and
Wikipedia, e.g., see (Gabrilovich and Markovitch,
2005; Gabrilovich and Markovitch, 2006). In these
studies, ODP categories and Wikipedia articles are
used as concepts for text representation. For each
document, a text fragment (such as a word, sentence,
paragraph, or the whole document) is mapped to the
most relevant ODP categories or Wikipedia articles,
and the mapped concepts are added to the document
using the AC approach. Using these knowledge bases
for BOC modelling improved performance on
English TC compared to the BOW model.
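The mapping step can be approximated with standard retrieval machinery. The sketch below is an illustrative assumption, not the procedure of the cited studies: it scores a text fragment against a toy set of Wikipedia article texts using TF-IDF cosine similarity and appends the top-ranked article titles as concepts in the AC fashion:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical miniature concept space: Wikipedia article titles and texts.
    article_titles = ["Text classification", "Machine learning", "Football"]
    article_texts = [
        "text classification assigns documents to predefined categories",
        "machine learning algorithms build models from training data",
        "football is a team sport played between two sides with a ball",
    ]

    vectorizer = TfidfVectorizer()
    article_vectors = vectorizer.fit_transform(article_texts)

    def map_to_concepts(fragment, top_k=2):
        """Return the titles of the top_k most similar articles as concepts."""
        frag_vector = vectorizer.transform([fragment])
        scores = cosine_similarity(frag_vector, article_vectors).ravel()
        ranked = scores.argsort()[::-1][:top_k]
        return [article_titles[i] for i in ranked if scores[i] > 0]

    fragment = "categorization of news documents"
    # AC style: mapped concepts are appended to the fragment's own words.
    features = fragment.split() + map_to_concepts(fragment)
    print(features)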
For Arabic TC, Elberrichi and Abidi (2012) used
the Arabic WordNet (Black et al., 2006) to identify
concepts appearing within the documents. A
comparison between different text representation
models such as BOW, N-grams and BOC was
conducted using an Arabic text dataset collected by
Mesleh (2007). An RTC variant of BOC, used in
conjunction with Chi-square feature selection and a
k-NN classifier, was reported to achieve better
performance than the other representations.
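For reference, a pipeline of the kind evaluated in that comparison can be assembled from standard components. The following sketch assumes scikit-learn and a toy stand-in corpus; it is not the authors’ implementation:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.neighbors import KNeighborsClassifier

    # Toy corpus standing in for RTC-transformed documents; in practice the
    # texts would be the concept tokens produced from Arabic WordNet.
    train_texts = [
        "match goal team player",
        "market price bank economy",
        "team player score match",
        "economy bank inflation price",
    ]
    train_labels = ["sports", "economy", "sports", "economy"]

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),              # vectorize concept tokens
        ("chi2", SelectKBest(chi2, k=5)),          # Chi-square feature selection
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])

    pipeline.fit(train_texts, train_labels)
    print(pipeline.predict(["player scores a goal in the match"]))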
In previous work, we developed a number of
new approaches that combine the BOW and BOC
models and applied them to English TC
(Alahmadi et al., 2013) and Arabic TC (Alahmadi et
al., 2014). In (Alahmadi et al., 2014), an NB
classification algorithm was shown to perform better
with the BOC model than with the BOW model. The
current work focuses on this point. The remainder of
the paper is
focuses on this point. The remainder of the paper is
organized as follows: Section 2 describes why
Wikipedia is a suitable knowledge base for BOC
modelling of Arabic text. Section 3 outlines the pre-
processing phase and describes the BOC model in
detail. The experimental set-up and results are
discussed in Section 4. The paper concludes in
Section 5.
2 WIKIPEDIA
Wikipedia is the largest electronic knowledge