presents the corpus we used in this research. Section
5 presents the results we obtained and Section 6
concludes the paper with directions for future work.
2 RELATED WORK
Analyzing users' posts and their behavior on
social networks has been the subject of numerous
research works. Interest in this field of research has
increased in recent years, as several users have misused
the anonymity available on social media networks
to spread threats, hate messages or false news.
By analyzing the content produced by users, or
the activities they perform, author-profiling
researchers have been able to determine several user
characteristics such as age, gender, mother tongue
and level of education.
The shared task on Author Profiling at PAN 2013
focused on digital text forensics. Specifically, the
purpose was to determine the age and gender of the
authors of a large number of unidentified texts. In this
context, (Argamon et al., 2005) determined
experimentally that content features performed well
for age and gender profiling.
In addition, (Peersman et al., 2011) worked
on segments of blogs from the British National Corpus.
They used features such as punctuation, average
word length, part of speech, sentence length, and word
factor analysis to predict gender with an accuracy of
80%.
The detection of the author's profile consists of
analyzing the way in which linguistic
characteristics vary according to the profile of the
author. (Koppel et al., 2014) used an SVM model
trained on English, Spanish and Dutch Twitter data
to achieve 80% accuracy for gender prediction on
unknown Twitter text. The work relies on
punctuation, n-gram counts, sentence and word
length, vocabulary richness, function words,
out-of-vocabulary words, emoticons and part-of-speech tags.
In another study (Alwajeeh et al., 2014), the
authors worked on blog segments using features such
as speech analysis, punctuation, average word length,
sentences, and word factors. They achieved a gender
prediction rate of 72.2% (Tang et al., 2010).
In addition, (Estival et al., 2007) worked on
Arabic emails and reported being able to predict
gender with a precision of 72.10%. They calculated
64 features describing psycholinguistic word
categories (e.g. family, anger, death, wealth).
Each feature is the number of words in the text that
belong to the corresponding category, divided by the
total number of words in the text.
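
The category features described above are simple ratios, and a minimal sketch may make the computation concrete. The lexicon below is hypothetical: its four categories and word lists are placeholders chosen for illustration, not the 64 categories used in the cited work, and the regular-expression tokenizer is likewise an assumption.

```python
import re

# Hypothetical mini-lexicon: category names and word lists are
# illustrative placeholders, not the lexicon of the cited work.
LEXICON = {
    "family": {"mother", "father", "sister", "brother"},
    "anger": {"hate", "angry", "furious"},
    "death": {"death", "die", "funeral"},
    "wealth": {"money", "rich", "salary"},
}


def category_ratios(text, lexicon=LEXICON):
    """Return, per category, (words in category) / (all words in text)."""
    # Assumed tokenization: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens)
    if total == 0:
        # Empty text: define every ratio as 0 to avoid dividing by zero.
        return {name: 0.0 for name in lexicon}
    return {
        name: sum(tok in words for tok in tokens) / total
        for name, words in lexicon.items()
    }


ratios = category_ratios("My mother and father hate money")
```

Each text is thus mapped to one value per category, and the resulting vector can be passed to any standard classifier.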
In (Mikros, 2012), the researchers
worked on the automatic classification of blogs and
emails. They obtained a precision of 81.5% of
documents correctly classified for the gender dimension
and 72% for the age dimension.
The work in (Juola, 2012) investigated the
attribution and detection of the author's gender using
Greek blogs. He chose this type of social medium
because people can express their opinions on blogs.
Juola focused on two types of text content features.
The first type includes classic stylometric features,
which depend on vocabulary richness, word length,
and word frequency. The second type of features
depends on character bigrams and word n-grams.
The results of the experiments showed an
accuracy of gender identification of 82.6% with SVM
(Maharjan et al., 2014).
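
As an illustration of the two feature types just described, the following sketch computes a few of them for a single text. It is not the implementation of the cited work: the type-token ratio is assumed here as the measure of vocabulary richness, the tokenizer is a simple regular expression, and the function name and the parameter `n` are hypothetical.

```python
import re
from collections import Counter


def stylometric_features(text, n=2):
    """Compute simple stylometric and n-gram features for one text."""
    # Assumed tokenization: lowercase runs of word characters
    # (Unicode-aware, so Greek or Arabic text is handled as well).
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {
            "type_token_ratio": 0.0,
            "avg_word_length": 0.0,
            "word_freq": Counter(),
            "char_bigrams": Counter(),
            "word_ngrams": Counter(),
        }
    joined = " ".join(tokens)
    return {
        # First type: classic stylometric features.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "word_freq": Counter(tokens),
        # Second type: character bigrams and word n-grams.
        "char_bigrams": Counter(
            joined[i:i + 2] for i in range(len(joined) - 1)
        ),
        "word_ngrams": Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        ),
    }


features = stylometric_features("the cat saw the dog")
```

In practice the counters would be converted to fixed-length vectors over a shared vocabulary before being given to a classifier such as an SVM.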
The work in (Koppel et al., 2003) presented an
application that detects various demographic
characteristics such as name, age, gender and level of
education. The authors used two e-mail corpora, one
in Arabic and one in English. They used a
questionnaire to check and examine the users' profiles,
including age, gender, and level of education. The
authors used several machine-learning classifiers in
their experiments, such as SVM, KNN and decision
trees. For gender detection, the best accuracy was
achieved by SVM (Argamon et al., 2009).
Current author identification techniques go
beyond stylometric analysis, which opens the way to
profiling, attribution, and identification of authors. In
addition, they explore data and use elements of digital
documents such as graphics, emoticons, colors and layouts.
In this context, we cite a 2015 study in which
the play "Double Falsehood" was identified as the work
of William Shakespeare. The researchers
relied on color and graphic information for the
identification, on the premise that each author or artist
has their own style (Mechti et al., 2010).
The Arabic language is one of the most widely
used languages, with hundreds of millions of
native speakers. Furthermore, it is used by more than
1.5 billion Muslims to practice their religion and
spiritual ceremonies.
Authorship attribution is another field of related
work, concerned with the description and
identification of the true author of an anonymous text.
In the literature, it is defined as
a text categorization or text analysis and classification
problem. It has various potential
applications in fields such as literature, program code
authorship, digital content forensics, law
enforcement and crime prevention. In the context of
authorship attribution, stylometry has been used to