As mentioned previously, such problems call for
rigorous scientific tools and a very careful
interpretation of the results. Since authors possess
specific stylistic features that make them
distinguishable, we conducted experiments of
authorship discrimination between the Quran and
some of the Prophet’s statements, in order to see
whether the Quran was written by the Prophet
Muhammad or not. For this purpose, 7 types of
features are extracted and 2 different clustering
methods based on Visual Analytics are employed.
2 STYLOMETRIC FEATURES
Several linguistic features have been proposed in the
field of authorship attribution. We can cite four
main types:
Vocabulary based Features: In general, the
typical words that an author habitually uses can
reveal his or her identity. The problem with such
features is that they can be faked easily. A more
reliable method would take into account a large
fraction of the words in the document (Juola, 2006),
for example through the average sentence length.
Syntax based Features: One reason that
function words perform well is that they are
topic-independent (Juola, 2006). A person’s
preferred syntactic constructions can be cues to his
or her authorship. One simple way to capture this is
to tag the relevant documents for part of speech or
other syntactic constructions (Stamatatos, 2001)
using a tagger.
Orthographic based features: One weakness of
vocabulary-based approaches is that they do not take
advantage of morphologically related words. A
person who writes of “work” is also likely to write
of “working”, “worker”, etc. (Juola, 2006).
Character based features: Some researchers
(Peng, 2003) have proposed to analyze documents as
sequences of characters. This type of feature can
replace several other high-level linguistic features.
Furthermore, several experiments showed that
character n-grams are quite reliable in authorship
attribution (Stamatatos, 2009).
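As an illustration of character based features, the following minimal sketch computes a relative-frequency profile of character n-grams. The function name, the choice n = 3 and the sample string are assumptions made for illustration only; they are not taken from the cited works.

```python
from collections import Counter

def char_ngram_profile(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    # Slide a window of n characters over the text.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical sample text (not from the corpus studied here).
profile = char_ngram_profile("the work of the worker", n=3)
```

Morphologically related words such as "work" and "worker" share the trigrams "wor" and "ork", which is one reason such profiles can stand in for some higher-level features.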
In our investigation, a mixture of different
features is proposed: Author Related Pronouns
(ARP), Father Based Surname (FBS),
Discriminative Words (DisW), COST value, Word
Length Frequency (WLF), Coordination
Conjunction (CC) and Starting Coordination
Conjunction (SCC). All these features are original
and some of them are used for the first time in
stylometry. The seven features are collected from
the two religious books and normalized by the
maximum, so that the different numerical values
range approximately between 0 and 1.
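The normalization step can be sketched as follows. This is a minimal example assuming that each feature is divided by its own maximum absolute value over all text segments; the text does not state whether the maximum is taken per feature or globally, and the numerical values below are hypothetical.

```python
def normalize_by_max(rows):
    """Divide each feature (column) by its maximum absolute value."""
    n_features = len(rows[0])
    # Assumption: one maximum per feature, taken over all segments.
    maxima = [max(abs(row[j]) for row in rows) for j in range(n_features)]
    return [
        [row[j] / maxima[j] if maxima[j] else 0.0 for j in range(n_features)]
        for row in rows
    ]

# Hypothetical raw values: 3 text segments x 2 features.
raw = [[4.0, 10.0], [2.0, 40.0], [1.0, 20.0]]
scaled = normalize_by_max(raw)
```

With non-negative features, every scaled value lies between 0 and 1, which keeps features with large raw magnitudes from dominating the distances used in clustering.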
3 VISUAL ANALYTICS BASED CLUSTERING METHODS
In pattern recognition, cluster analysis or clustering
is the task of grouping a set of objects in such a way
that objects in the same group (i.e. cluster) are more
similar to each other than to those in other groups
(Wi2, 2014) (Norusis, 2008). This task is commonly
used in data mining, statistical data analysis,
machine learning and information retrieval.
On the other hand, Visual Analytics (wi3, 2014)
(Ellis, 2010), which is a combination of several
fields (i.e. computer science, information
visualization and graphic design), is often used in
cluster analysis to make the analyst’s judgment
easier to form and more objective.
That is, the combination of those two research
fields can lead to a strong and efficient analysis tool
for handling some classification tasks that could be
extremely difficult to perform with conventional
analytic tools.
Furthermore, a great advantage of clustering over
conventional classification tools is that several
clustering techniques are unsupervised.
Consequently, it appears that the association of
visual analytics with cluster analysis may be
interesting for solving some stylometric problems
for which no training data or prior information is
available to perform a supervised classification.
It is therefore well motivated to apply them to our
main task of authorship discrimination
(i.e. Quran vs Hadith).
As for the clustering methods, in the present
investigation we have used two different methods
separately and tried to observe and comment on the
resulting clusters thoroughly. The employed
methods are: Hierarchical Clustering and Fuzzy
C-means Clustering.
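The two clustering families named above can be sketched in plain Python as follows. This is only an illustrative sketch: the linkage criterion, the distance measure, the fuzzifier m and the initialization are not specified in the text, so average linkage, Euclidean distance, m = 2 and explicitly supplied initial centers are assumptions, and the five two-dimensional points are hypothetical feature vectors.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

def fcm_memberships(points, centers, m):
    """Membership of every point in every cluster; each row sums to 1."""
    rows = []
    for p in points:
        # Clamp distances to avoid division by zero when a point sits on a center.
        dists = [max(euclidean(p, c), 1e-12) for c in centers]
        inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows

def fuzzy_c_means(points, init_centers, m=2.0, n_iter=50):
    """Fuzzy C-means with a fixed number of iterations (no convergence test)."""
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(n_iter):
        u = fcm_memberships(points, centers, m)
        for k in range(len(centers)):
            weights = [row[k] ** m for row in u]
            wsum = sum(weights)
            # New center: membership-weighted mean of the points.
            centers[k] = [sum(w * p[j] for w, p in zip(weights, points)) / wsum
                          for j in range(dim)]
    return centers, fcm_memberships(points, centers, m)

# Hypothetical normalized feature vectors for five text segments.
points = [[0.10, 0.20], [0.15, 0.25], [0.12, 0.18], [0.90, 0.80], [0.85, 0.95]]
hard_clusters = agglomerative(points, n_clusters=2)
centers, memberships = fuzzy_c_means(points, init_centers=[points[0], points[3]])
```

Hierarchical clustering yields a hard partition, whereas Fuzzy C-means assigns each segment a degree of membership in every cluster, so ambiguous segments remain visible to the analyst.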
3.1 Dataset Description
The two books have been segmented into 25 text
segments (14 for the Quran and 11 for the
Hadith). Furthermore, there is no intersection
between them and there is no prior information on
the general configuration of the resulting
clusters.
IVAPP 2015 - International Conference on Information Visualization Theory and Applications