Arabic Twitter User Profiling: Application to Cyber-security
Rahma Basti¹, Salma Jamoussi², Anis Charfi³ and Abdelmajid Ben Hamadou⁴
¹ Multimedia InfoRmation Systems and Advanced Computing Laboratory (MIRACL), University of Sfax, Tunis, Tunisia
² Higher Institute of Computer Science and Multimedia of Sfax, 1173 Sfax 3038, Tunisia
³ Carnegie Mellon University Qatar, Qatar
⁴ Digital Research Center of Sfax (DRCS), Tunisia
Keywords: Author Profiling, Arabic Text Processing, Age and Gender Prediction, Dangerous Profiles, Stylometric
Features.
Abstract: In recent years, we have witnessed a rapid growth of social media networking and micro-blogging sites such as
Twitter. On these sites, users provide a variety of data such as their personal data, interests, and opinions.
However, the shared data is not always true. Often, social media users hide behind a fake profile and may
use it to spread rumors or threaten others. To address this, different methods and techniques have been proposed
for user profiling. In this article, we use machine learning for user profiling in order to predict the age and
gender of a user and to assess whether a profile is dangerous, using the user's tweets and profile features.
Our approach uses several stylistic features, namely character-based, word-based and syntax-based features.
Moreover, the topics of interest of a user are included in the profiling task. We obtained the best accuracy
levels with SVM: 73.49% for age, 83.7% for gender, and 88.7% for dangerous
profile detection.
1 INTRODUCTION
Social media networks allow users to share
information and opinions and to communicate with each
other. Often, social media users choose not to reveal
their real identity, such as name, age, and gender, in
order to express their ideas freely without risking any
retaliation. Other users hide their real identity
for dishonest and dangerous purposes such as
threatening other social media users or spreading
rumors and lies. Therefore, it has become very
important to provide effective means for identity
tracing in cyberspace (Argamon et al., 2003).
Twitter is one of the most popular social media
networks in the world and it has a large number of
users who post huge amounts of data in different
languages. Posts cover a wide variety of topics such
as politics, sports, and technology.
The volume and variety of Twitter data as well as
the availability of APIs have attracted several
researchers to use it, including those who focus on
user profiling (Feldman and Sanger, 2006).
¹ https://developer.twitter.com/
² http://www.cs.cmu.edu/~ark/TweetNLP/
³ https://emojipedia.org/
⁴ https://dev.twitter.com/streaming/overview
Research in psychology (Frank and Witten, 1998)
has revealed that the words used by an individual can
reflect his or her mental and physical health. With
the advances in technology and computing,
stylometry (Georgios, 2014) has been used to
determine traits of the user’s profile and personality
based on what they write. Several stylometric features
have been proposed to date, including features based
on words, characters and punctuation.
In this article, we aim at profiling Twitter users,
i.e., determining characteristics such as age and
gender based on their tweets. This research is
applicable in several fields, such as forensics and
marketing.
The remainder of this paper is organized as
follows. Section 2 reports on existing work on user
profiling. Section 3 presents our approach and
describes the features that could be considered as
significant indicators of age and gender. Section 4
presents the corpus we used in this research. Section
5 presents the results we obtained and Section 6
concludes the paper with directions for future work.
Basti, R., Jamoussi, S., Charfi, A. and Ben Hamadou, A.
Arabic Twitter User Profiling: Application to Cyber-security.
DOI: 10.5220/0008167401100117
In Proceedings of the 15th International Conference on Web Information Systems and Technologies (WEBIST 2019), pages 110-117
ISBN: 978-989-758-386-5
Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
2 RELATED WORK
Analyzing users' posts and their behavior on
social networks has been the subject of numerous
research works. Interest in this field of research has
increased in recent years as several users have misused the
anonymity they can have on social media networks
to spread threats, hate messages or false news.
By analyzing the content produced by users, or
the activities they perform, author-profiling
researchers were able to determine several user
characteristics such as age, gender, mother tongue
and level of education.
The shared task on Author Profiling at PAN 2013
focused on digital text forensics. Specifically, the
purpose was to determine the age and gender of the
authors of a large number of unidentified texts. In this
context, (Argamon et al., 2005) determined
experimentally that content features performed well
for age and gender profiling.
In addition, (Peersman et al., 2011) worked
on segments of blogs of the British National Corpus.
They used features such as punctuation, average
word length, part of speech, sentence length, and word
factor analysis to predict gender with an accuracy of
80%.
The detection of the author's profile consists of
analyzing the way in which linguistic
characteristics vary according to the profile of the
author. (Koppel et al., 2014) used an SVM model
trained on English, Spanish and Dutch Twitter data
from unknown Twitter text to achieve 80% accuracy
for gender prediction. That work relied on
punctuation, n-gram counts, sentence and word
length, vocabulary richness, function words,
out-of-vocabulary words, emoticons and part-of-speech.
In another study (Alwajeeh et al., 2014) the
authors worked on blog segments using features such
as speech analysis, punctuation, average word length,
sentences, and word factors. They achieved a gender
prediction rate of 72.2% (Tang et al., 2010).
In addition, (Estival et al., 2007) worked on
Arabic emails and reported being able to predict
gender with a precision of 72.10%. They calculated
64 features describing psycholinguistic word
categories (e.g. family, anger, death, wealth,
etc.). Each feature is the number of words
detected in the corresponding category divided by the total
number of words in the text.
In (Mikros, 2012), the researchers
worked on the automatic classification of blogs and
emails. They obtained a precision of 81.5% of
correctly classified documents for the gender dimension
and 72% for the age dimension.
The work in (Juola, 2012) investigated authorship
attribution and the detection of the author's gender using
Greek blogs. He chose this type of social media
because people can express their opinions on blogs.
Juola focused on two types of text content features.
The first type includes classic stylometric features,
which depend on vocabulary richness, word length,
and word frequency. The second type of features
depends on character bi-grams and word
n-grams. The results of the experiments showed an
accuracy of gender identification of 82.6% with SVM
(Maharjan et al., 2014).
The work in (Koppel et al., 2003) presented an
application that detects various demographic
characteristics such as name, age, gender, and level of
education. The authors used two corpora of e-mails,
in Arabic and in English. They used a
questionnaire to check and examine the users' profiles,
including age, gender, and level of education. The
authors used many machine-learning classifiers in
their experiments, such as SVM, KNN and decision
trees. For gender detection, the best accuracy was
achieved by SVM (Argamon et al., 2009).
Current author identification techniques go
beyond stylometric analysis, which opens the way to
profiling, attribution, and identification of authors. In
addition, they explore data from digital documents
such as graphics, emoticons, colors, and layouts. In this
context, we cite a work from 2015 in which
the play "Double Falsehood" was identified as the work
of William Shakespeare; the researchers
relied on color and graphics information for the
identification, because each author or artist has his
own style (Mechti et al., 2010).
The Arabic language is one of the most widely
used languages, with hundreds of millions of
native speakers. Furthermore, it is used by more than
1.5 billion Muslims to practice their religion and
spiritual ceremonies.
Authorship attribution is another field of related
work that is concerned with the description and
identification of the true author of an anonymous text.
In the literature, authorship description is defined as
a text categorization or text analysis and classification
problem. Authorship attribution has various potential
applications in fields such as literature, program code
authorship attribution, digital content forensics, law
enforcement, crime prevention, etc. In the context of
authorship attribution, stylometry has been used to
determine the authenticity of a document. It is
considered as the study of how people can be judged
according to their writing style. Therefore,
stylometry can be used not only to identify a writing
style but also to help identify the author's gender and
age.
3 PROPOSED METHOD
The goal of this work is to analyze the profiles of
anonymous authors and predict the author’s age,
gender, and whether the profile is a dangerous
profile. Our approach is purely statistical, i.e., it
accepts as input any profile written in Arabic and
calculates feature frequencies to identify age-based
differences (between young people and adults),
gender-based differences (between men and women),
and profile risks (i.e., whether the profile is dangerous
or not). We divided our work on author profiling into
two parts. The first part focused on extracting
relevant information from a user profile, such as the
number of friends, number of followers, number of
retweets, etc. The second part focused on extracting
information from the user’s tweets, based primarily on
stylistic information (lexical, structural, syntactic)
and semantic information.
3.1 Profiles Specific Features
User profiling makes it possible to determine user
characteristics such as age and gender. To retrieve
the profile data and the users’ tweets automatically,
we used the Twitter API ¹. Next, we explain which
data was retrieved through the API (Peersman et al.,
2011). The relevant Twitter terms for our work are
the following:
ReTweet: each user may republish on his or her profile a
tweet that was written by another user.
Followers: users that receive and follow status
updates of a given user.
Friends: users that are followed by a given user.
Favorite accounts: a way to tag a tweet as a
favorite in order to find it easily later.
Time of publications: for each author, the publication
time of each tweet was retrieved. The day
is divided into four parts (from midnight to 6
am, from 6 am to noon, from noon to 6 pm and finally
from 6 pm to midnight). Based on that, we can determine
the user’s favorite time to share tweets.
We retrieved the available profile data through the
API and examined the network formed
by users and their contacts with other users,
e.g. by examining for a given user the number of
followers, number of friends, number of favorite
accounts, number of retweets and preferred time of
publication. Then, we represented the user’s profile
as a normalized numeric vector whose entries
correspond to the profile features.
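The profile representation described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names are hypothetical, and min-max scaling is an assumption, since the text only states that the vector is normalized.

```python
from datetime import datetime

# Four time-of-day slots from the text (start hour inclusive, end hour exclusive).
TIME_SLOTS = [(0, 6), (6, 12), (12, 18), (18, 24)]


def time_slot(hour):
    """Return the index (0-3) of the slot that contains the given hour (0-23)."""
    for index, (start, end) in enumerate(TIME_SLOTS):
        if start <= hour < end:
            return index
    raise ValueError(f"hour out of range: {hour}")


def preferred_slot(timestamps):
    """Return the slot in which the user published the most tweets."""
    counts = [0] * len(TIME_SLOTS)
    for ts in timestamps:
        counts[time_slot(ts.hour)] += 1
    return counts.index(max(counts))


def min_max_normalize(rows):
    """Rescale each column of equal-length rows to [0, 1] (assumed scheme)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical profiles: [friends, followers, retweets, favorites, preferred slot].
tweet_times = [datetime(2019, 1, 1, 1), datetime(2019, 1, 1, 2), datetime(2019, 1, 1, 13)]
profiles = [
    [120, 300, 15, 40, preferred_slot(tweet_times)],
    [800, 90, 60, 5, 3],
]
vectors = min_max_normalize(profiles)
```

Scaling each column separately keeps large counts, such as followers, from dominating the small ones, such as the time slot index.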
For each profile collected in our corpus, an expert
in sociology helped us to identify and annotate the age,
the gender and whether the user has a dangerous profile.
3.2 Tweets Specific Features
Which features allow predicting age, gender, and
dangerous profiles is an open research question that
several authors have addressed in fields such as human
psychology. Stylometry (i.e., the study of stylistic
features) shows that individuals can be classified and
identified by their writing styles. The writing style of
a person is defined by the selection of special
characters, the terms used, and the composition of
sentences.
Studies in the literature (Guimaraes et al., 2017) show
that there is no one-size-fits-all feature set that is
optimized and applicable to all people and to all
domains. In fact, thousands of stylometric features
have been proposed. Even though authors can
consciously modify their own style, there will always
be an unconscious use of certain stylistic features
(Sara et al., 2014). For our work, we use the following
features: (1) stylometric features (lexical, syntactic
and structural) and (2) semantic and emotional
features. Figure 1 shows the general process of our
work for the extraction of features from tweets. The
major steps of our method are as follows:
1. Pre-processing and Text Analysis: The process
starts with data cleaning, which aims at
a cleaner representation of the tweets. For this
purpose, we removed noisy data such as prefixes,
suffixes, and URLs. We also transformed plural
words to singular and applied lemmatization.
2. Calculating Features Vector: A feature vector is
computed based on the profile data and the users’
tweets. The extracted features are divided into two
groups: training set and testing set. The training set
is used to develop a classification model whereas the
testing set is used to validate the developed model.
3. Classification: We train an SVM classifier using
our training data from Step 2 to discriminate between
various age groups, genders and dangerous profiles.
More details on this step will be given in the next
subsections.
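As an illustration of the cleaning in Step 1, the sketch below removes URLs from a tweet and collapses the remaining whitespace. Removing user mentions and Arabic diacritics is an assumption added here for illustration; the affix stripping and lemmatization mentioned above would need an Arabic morphological tool and are not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")                   # assumption: mentions are treated as noise
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")     # Arabic short-vowel marks (fathatan to sukun)
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Return the tweet without URLs, mentions and diacritics, with single spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

Features that depend on raw characters, such as punctuation counts, would have to be computed in a way that is consistent with whatever cleaning is applied.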
Figure 1: Architecture of our author profiling method.
3.2.1 Lexical Features
A tweet can be seen as a series of characters and word
tokens grouped into sentences. A token can be a word,
a punctuation mark or a number. Some studies on
authorship attribution (Tan et al., 2009) were based
on simple models such as sentence length and word
length. The benefit of these features is that they can
be used on any corpus in any language with no
additional conditions except the availability of a
tokenizer. Lexical features can be used to learn about
the typical use of words and characters by a certain
individual.
In the following, we describe how we used lexical
features in our work by considering different
variations of the characters that a tweet contains.
First, we calculated the total number of characters,
including Arabic letters written in Latin script, digits (0-9), etc.
We also counted the other character-level measures analyzed
in this part, namely the total number of special characters and
white spaces (Rangel et al., 2013).
Second, we analyzed words by applying 12
statistical measures, including the total number of
words, the average number of words, the total number
of different words in a tweet, the total number of short
words with up to three characters, the total number of long
words with six or more characters, and the total number
of words with flooded characters (e.g. Heeeelloooo).
All these features were expressed as ratios to the
total number of words (M in Table 1).
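A few of these word-based measures can be sketched as below. This is a simplified illustration under stated assumptions: whitespace tokenization stands in for a real tokenizer, a word counts as "flooded" when it repeats a character three or more times in a row, and the length thresholds follow the text (up to three characters for short words, six or more for long words).

```python
import re

# A character repeated three or more times in a row, e.g. "Heeeelloooo" (assumed threshold).
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def word_features(tweet):
    """Return word-based lexical features, as ratios to the number of words M."""
    words = tweet.split()          # assumption: whitespace tokenization
    total = len(words)
    if total == 0:
        return {"total_words": 0, "avg_word_length": 0.0, "short_ratio": 0.0,
                "long_ratio": 0.0, "elongated_ratio": 0.0, "distinct_ratio": 0.0}
    return {
        "total_words": total,
        "avg_word_length": sum(len(w) for w in words) / total,
        "short_ratio": sum(len(w) <= 3 for w in words) / total,
        "long_ratio": sum(len(w) >= 6 for w in words) / total,
        "elongated_ratio": sum(bool(ELONGATION_RE.search(w)) for w in words) / total,
        "distinct_ratio": len(set(words)) / total,
    }


features = word_features("Heeeelloooo my dear friend my")
```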
Vocabulary richness functions quantify the
variety of the vocabulary of a tweet. Examples of
such measures include the number of hapax legomena
(words occurring once) and the number of hapax
dislegomena (words occurring twice). Various
functions were proposed to achieve stability over text
length, including Yule’s K measure, Simpson’s D
measure, Sichel’s S measure, and Honore’s R
measure (Tschuggnall et al., 2017).
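The sketch below computes the hapax counts and three of these measures using their commonly cited definitions, which are assumed here because the text does not give the formulas. With N tokens, V distinct words and V_i words occurring exactly i times: Yule's K = 10^4 (Σ i² V_i − N) / N², Simpson's D = Σ V_i · i(i − 1) / (N(N − 1)), and Honore's R = 100 · ln N / (1 − V_1 / V).

```python
import math
from collections import Counter


def vocabulary_richness(tokens):
    """Return hapax counts and Yule's K, Simpson's D and Honore's R for a token list."""
    n = len(tokens)                       # N: total number of tokens
    freq = Counter(tokens)                # word -> number of occurrences
    v = len(freq)                         # V: vocabulary size
    spectrum = Counter(freq.values())     # i -> V_i, number of words occurring i times
    v1 = spectrum.get(1, 0)               # hapax legomena
    v2 = spectrum.get(2, 0)               # hapax dislegomena
    yule_k = (1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
              if n else 0.0)
    simpson_d = (sum(vi * i * (i - 1) for i, vi in spectrum.items()) / (n * (n - 1))
                 if n > 1 else 0.0)
    # Honore's R is undefined when every word occurs once (V_1 == V); report infinity.
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v and v1 < v else float("inf")
    return {"hapax_legomena": v1, "hapax_dislegomena": v2,
            "yule_k": yule_k, "simpson_d": simpson_d, "honore_r": honore_r}


richness = vocabulary_richness("a b b c c c".split())
```

Tweets are short, so these measures can be unstable on a single tweet; computing them over all the tweets of a profile is one possible choice.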
3.2.2 Syntactic Features
Researchers discovered the effectiveness of syntactic
elements in identifying an author (Hsieh et al., 2018).
Syntactic features define the patterns used to form
sentences (Potthast et al., 2016).
In informal writing, it is common to use multiple
question or exclamation marks to better express a
feeling or a mood. Syntactic characteristics therefore
help define the writing style of a writer. In this regard, we
counted the total number of quotation marks, periods,
semicolons, question marks, exclamation marks,
multiple exclamation or question marks (???, !!!), and
ellipses to determine the frequency with which
authors use punctuation in their tweets.
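A possible way to count these punctuation marks is shown below. The patterns are illustrative assumptions: a run of two or more marks counts as "multiple", the Arabic question mark (U+061F) and Arabic semicolon (U+061B) are counted together with their Latin forms, and each count is divided by the number of whitespace-separated words.

```python
import re

PUNCTUATION_PATTERNS = {
    "quote": re.compile(r'"'),
    "period": re.compile(r"\."),            # also counts the dots inside "..."
    "semicolon": re.compile(r"[;\u061B]"),
    "question": re.compile(r"[?\u061F]"),
    "exclamation": re.compile(r"!"),
    "multi_question": re.compile(r"[?\u061F]{2,}"),
    "multi_exclamation": re.compile(r"!{2,}"),
    "ellipsis": re.compile(r"\.{3,}|\u2026"),
}


def punctuation_features(tweet):
    """Return the frequency of each punctuation pattern divided by the word count."""
    total_words = len(tweet.split()) or 1   # avoid division by zero on empty tweets
    return {name: len(pattern.findall(tweet)) / total_words
            for name, pattern in PUNCTUATION_PATTERNS.items()}


punct = punctuation_features('Really??? Yes!!! ok... "fine";')
```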
One of the most important aspects of text
classification is the level at which the text is represented. N-gram
based approaches are often used for text classification
and they can be implemented at different levels, such
as the word level. Furthermore, some researchers
have used word n-grams to address authorship
attribution. N-grams are tokens created from a
contiguous sequence of n items. The unique and
different n-grams constitute the most important
feature for stylistic purposes. This explains why
word n-grams have been used as input features for
automatic methods of detecting and classifying
authors, with promising results.
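The word n-gram features can be illustrated as follows. The interpretation is an assumption: "different" n-grams are taken to be the distinct n-gram types in the text, and "unique" n-grams those that occur exactly once. Both counts are divided by the number of words.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the list of contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, n=2):
    """Return the ratios of distinct and once-occurring n-grams to the word count."""
    counts = Counter(word_ngrams(tokens, n))
    total_words = len(tokens) or 1          # avoid division by zero
    return {
        "different_ratio": len(counts) / total_words,
        "unique_ratio": sum(1 for c in counts.values() if c == 1) / total_words,
    }


# Bigrams of "a b a b c": (a, b) twice, (b, a) once, (b, c) once.
bigram_features = ngram_features("a b a b c".split(), n=2)
```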
Another related approach is part-of-speech tagging
(POS tagging), which labels tokens according to
their function in the context. Basic POS ² tags cover
the functional words in a sentence (e.g., verbs,
prepositions, and pronouns) (Garciarena et al., 2015).
Authors use function words unconsciously and consistently
regardless of the topic, so their use is unlikely
to be deliberately disguised.
After we calculated the frequency of use of each
grammatical category, we calculated the ratio
between the number of verbs and pronouns and the total
number of words. Moreover, we calculated the ratio
between the number of dots and quotes and the
total number of words. We also calculated the ratio
between the number of unique and different n-grams and the total
number of words. The same applies to hashtags,
which were represented as the ratio between the
number of hashtags and the total number of words
(Pennebaker et al., 2003). Table 1 lists the set of
features used.
Table 1: Set of features used.

| Feature group | Sub-group | Features |
|---|---|---|
| Lexical Features | Word-based Features | Total number of words (M); Average word length; Number of long words (more than 6 characters) / M; Number of short words (1-3 characters) / M; Number of elongated words / M; Different words frequency / M; Hapax legomena (words occurring once); Hapax dislegomena; Yule's K measure; Simpson's D measure; Honore's R measure; Entropy measure; Entropy lines measure |
| Lexical Features | Character-based Features | Number of Arabic characters by Latin / M; Number of letters / M; Number of digit characters / M; Number of white spaces / M; Number of special characters (22 features) / M |
| Syntactic Features | Punctuation, n-grams and hashtags | Frequency of punctuation marks / M {“'”, “.”, “:”, “;”, “?”, “!”, “"”, “???”, “!!!”, “…”}; Number of different n-grams; Number of unique n-grams; Number of hashtags / M |
| Syntactic Features | POS tagging | Number of adjectives / M; Number of adverbs / M; Number of abbreviations / M; Number of conjunctions / M; Number of gender-specific words / M; Number of interjections / M; Number of names / M; Number of particles / M; Number of prepositions / M; Number of pronouns / M; Number of proper names / M; Number of verbs / M |
| Structural Features | | Average phrases; Average number of words per sentence; Average word length; Average sentence length; Number of lines / M; Number of blank lines / M; Number of paragraphs / M |
| Semantics and Emotional Features | Emotion | Number of angry words / M; Number of disappointed words / M; Number of disgusted words / M; Number of gleeful words / M; Number of happy words / M; Number of romantic words / M; Number of sad words / M; Number of satisfied words / M; Number of surprised words / M |
| Semantics and Emotional Features | Topic | Number of sports words / M; Number of political words / M; Number of military and weapons words / M; Number of family and friends words / M; Number of economic, money, work and social words / M; Number of death and religion words / M; Number of body and sexual words / M |
| Profiles Features | | Number of friends; Number of followers; Number of retweets; Number of favorite accounts; Time of publication |
3.2.3 Structural Features
Structural features (or structure-based features) are
about the organization and format of a text.
They assess the overall impression of the
document's writing style. These features can be defined
at the paragraph level, at the message level or according to
the technical construction of the document.
Rendering a large number of features does not
necessarily produce excellent results, as some
features give very little information. Nevertheless, in
the authorship verification of computer-mediated
online information such as tweets and blogs the
structural features seem to be encouraging (Modak et
al., 2014).
In this context, the structural features describe the
process an author follows to create a tweet. People
have different habits when creating a publication.
This is even more important in the context of online
texts, which have limited content.
Frequently, structural features include the length
of the paragraphs of a tweet, the separators used between
sentences such as blank lines, the length of a sentence, word
length, the number of words per sentence, and the layout of the
whole document.
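Some of these structural measures can be sketched as follows. The segmentation rules are assumptions made for illustration: lines are separated by newline characters, and a sentence ends at a run of periods, exclamation marks or question marks (including the Arabic question mark).

```python
import re

SENTENCE_END_RE = re.compile(r"[.!?\u061F]+")   # assumed sentence delimiters


def structural_features(tweet):
    """Return line-based and sentence-based structural features of a tweet."""
    lines = tweet.split("\n")
    blank_lines = sum(1 for line in lines if not line.strip())
    sentences = [s.strip() for s in SENTENCE_END_RE.split(tweet) if s.strip()]
    total_words = len(tweet.split()) or 1       # avoid division by zero
    n_sentences = len(sentences)
    return {
        "lines_ratio": len(lines) / total_words,
        "blank_lines_ratio": blank_lines / total_words,
        "avg_words_per_sentence": (sum(len(s.split()) for s in sentences) / n_sentences
                                   if n_sentences else 0.0),
        "avg_sentence_length": (sum(len(s) for s in sentences) / n_sentences
                                if n_sentences else 0.0),
    }


structure = structural_features("Hello world.\n\nHow are you today?")
```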
3.2.4 Semantical and Emotional Features
Sentiment analysis has become one of the most active
areas of research in natural language processing. It is
also widely used in the domains of data mining
and document extraction. Sentiment analysis aims to
predict or extract the polarity of people's opinions in
specific fields, and it is regarded as a challenging
task. Most current approaches in the
literature distinguish two main types of attributes that
can be used to predict the author’s profile: the stylistic
attributes and the content-based attributes of their tweets.
The basic types of features that can be used for
content-based author profiling are the emotional
and semantic features. We looked for the similarities
that can group a set of terms in the same class. Since
the Arabic vocabulary is very large,
we manually grouped the terms belonging to the same
class of attributes. We identified nine classes of
emotions and emoticons ³, namely: surprise,
satisfaction, joy, romance, sadness, anger, pleasure,
disgust and disappointment. The use of emoticons
aims at making text messages more expressive.
For the semantic features, we manually collected
seven dictionaries of the following domains: sports,
politics, army and arms, family and friends, economic
and social, death and religion, body and sexuality.
In total, we collected more than 8,000 terms explicitly
conveying 16 classes (sports, political, angry…).
Moreover, we obtained the root of each word and
then calculated the ratio of words used in the
tweets for each class. For example, a user may
talk about sport in 80% of his or her tweets and about
politics in 20% of them.
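The dictionary-based ratios can be illustrated with the sketch below. The two dictionaries are toy placeholders with a couple of Arabic words each; they are not taken from the dictionaries built in this work, and the root extraction step mentioned above is omitted.

```python
# Toy dictionaries for illustration only (hypothetical entries).
TOPIC_DICTIONARIES = {
    "sports": {"كرة", "مباراة"},          # "ball", "match"
    "politics": {"حكومة", "انتخابات"},    # "government", "elections"
}


def class_ratios(tokens, dictionaries):
    """Return, for each class, the share of tokens found in its dictionary."""
    total = len(tokens) or 1                # avoid division by zero
    return {name: sum(1 for token in tokens if token in terms) / total
            for name, terms in dictionaries.items()}


# Two sports terms, one politics term and one out-of-dictionary token.
example_tokens = sorted(TOPIC_DICTIONARIES["sports"]) + [
    sorted(TOPIC_DICTIONARIES["politics"])[0], "unknown"]
ratios = class_ratios(example_tokens, TOPIC_DICTIONARIES)
```

The same function applies to the emotion classes if emotion dictionaries are passed instead.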
4 DATASET
In each authorship classification problem, there exists
a collection of authors, a set of user profiles of
known authors (training dataset), and a collection of
user profiles of unknown authorship (test dataset).
For each user profile, we retrieved the first 200 tweets
that were written in Arabic, each with a maximum
of 140 words. In total, we
collected about 32032 tweets from 422 users.
We used cross-validation with the 422 profiles for
training. Then, we used another 232 user profiles as
a test set in order to evaluate the final model. The data
was balanced by gender and by dangerous vs.
non-dangerous profiles. However, it was not balanced with respect
to age. The distribution of the number of user
profiles per dataset is shown in Table 2.
For each profile, we calculate a numerical vector
whose elements represent all the features extracted from
the profile and the respective tweets, which help us
discriminate between the relevant classes.
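For readers unfamiliar with the evaluation protocol, the sketch below shows how k-fold cross-validation partitions a set of profile vectors. It is a generic illustration: the value k = 10 comes from the caption of Table 3, while the shuffling and the fixed seed are assumptions, and the function name is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumption: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]   # k folds of nearly equal size
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Each of the 422 training profiles is used for validation exactly once.
splits = list(k_fold_indices(422, k=10))
```

A classifier is trained on each training part and scored on the held-out fold, and the reported accuracy is the average over the k folds.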
Table 2: Data size for each classification task.

| Classes | Attributes | Number of profiles for learning | Number of profiles for the test |
|---|---|---|---|
| Age | Adult | 136 | 88 |
| Age | Young | 147 | 84 |
| Gender | Women | 89 | 52 |
| Gender | Men | 89 | 52 |
| Terrorist | Terrorist | 31 | 11 |
| Terrorist | Not Terrorist | 31 | 11 |
5 RESULTS
The class selection step was important and
had a great influence on the results. We computed
the ratios of all the characteristics used and then
normalized all the features. In total, we made use of
143 attributes.
In this research, we applied classifiers to select the
relevant attributes and to predict the age, gender and
dangerousness of a Twitter
profile. As shown in Table 3, the SVM classifier
outperforms the other two classifiers, the multilayer
perceptron and random forest. SVM provides the
highest accuracy with 73.49% for age, 83.70% for
gender and 88.70% for dangerous profiles.
Table 3: Classification accuracy using 10-fold cross-validation with the training partition.

| Classes | SVM | Multilayer Perceptron |
|---|---|---|
| Age | 73.49 % | 70.31 % |
| Gender | 83.70 % | 76.40 % |
| Terrorist | 88.70 % | 82.25 % |
Based on our results, the style features that prove
most useful for age discrimination are the use of
joyful and happy emotions in the writing of young
people, whereas adults use more 'angry' emotions, long
words and prepositions, with a high entropy measurement
(the more the words of a Tweet are varied the higher
the entropy is).
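The entropy measure mentioned here can be illustrated with the Shannon entropy of the word distribution, H = −Σ p(w) log2 p(w). This formula is an assumption, since the exact definition used is not given in the text.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Return the Shannon entropy (in bits) of the word distribution of a token list."""
    n = len(tokens)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())


# A tweet that repeats one word has zero entropy; four distinct words give 2 bits.
repeated = word_entropy(["a", "a", "a", "a"])
varied = word_entropy(["a", "b", "c", "d"])
```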
The features that prove to be most useful for
gender discrimination are military and weapons
terms and a high measure of tweet diversity with the
use of multiple question marks (markers of male
writers). In contrast, the markers of female writers
are a large number of followed accounts, in addition
to the use of the first-person pronoun.
The most discriminating style features indicate
that dangerous profiles tend to write their tweets
after midnight, with more emotions of satisfaction.
They also tend to use two different grams and to
write their publications with a large number of
semicolons. Moreover, they usually have a large number
of friends. Concerning non-dangerous profiles, we
can notice that their posts are enriched with adverbs
and adjectives. Moreover, they tend to use
more sports terms with long, unique words. In
addition, non-dangerous accounts often use in their
posts syntactic characters such as double quotes,
multiple question marks, and semicolons.
Table 4 shows the results obtained with the test
data set for the three classes. This dataset contains 232
new Twitter user profiles that were not seen
before. As shown in that table, the Random Forest
classifier outperforms the other classifiers with an accuracy of 80.81%
for age and 75.00% for gender. On the
other hand, the SVM classifier gave the best accuracy,
81.81%, for detecting dangerous profiles.
Table 4: Supplied test set classification accuracy.

| Classes | SVM | Multilayer perceptron | Random Forest |
|---|---|---|---|
| Age | 69.18 % | 72.03 % | 80.81 % |
| Gender | 73.07 % | 74.03 % | 75.00 % |
| Terrorist | 81.81 % | 68.18 % | 54.54 % |
6 CONCLUSIONS
In this paper, we tackled the problem of automatically
determining the age and gender of users on the
Twitter social network, as well as the detection of terrorist
profiles, focusing mainly on Arabic profiles, which have
not received the research attention they deserve.
We used our own corpus of user profiles. We started
by extracting tweets from these profiles and proposed
a set of characteristics that allow us to predict the
three classes considered (age, gender,
danger). We considered three families of
characteristics: the stylistic family (syntactic, lexical
and structural characteristics), the semantic family,
for which we manually collected several dictionaries to
characterize different themes, and the family
of information about the profiles themselves. Finally,
we have shown how the right combination of
stylometric characteristics and machine learning
methods allows an automated system to effectively
determine the desired aspects of an anonymous
author.
The results show that the stylometric characteristics
were efficient and accurate, consistent with what
is reported in the literature. More
precisely, the best performance obtained on our
dataset was 73.49% for age detection,
obtained using the SVM classifier. The best
performance for gender detection was
83.70%, also obtained using the SVM classifier.
Finally, the best performance for detecting
terrorists was 88.70%, again obtained using
the SVM classifier.
After analyzing the experimental results, we
found that SVM seems to be the best classifier
among those tested for the identification of the three
classes of profiles adopted.
In the future, our goal is to explore new
multilingual author profiling techniques by
adopting more sophisticated features, such as those
based on the user's geographic location.
Similarly, we are considering increasing the size of
the dictionaries used to predict the emotions and
the different themes considered.
ACKNOWLEDGEMENTS
This publication was made possible by NPRP 9-175-
1-033 from the Qatar National Research Fund (a
member of Qatar Foundation). The findings achieved
herein are solely the responsibility of the authors.
REFERENCES
S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni,
“Gender, Genre, and Writing Style in Formal Written
Texts,” J. Lang. Soc. Psychol. December 2003.
R. Feldman, J. Sanger. The text-mining handbook:
advanced approaches in analyzing unstructured data.
Cambridge University Press; 2006.
E. Frank, L. Witten. Generating accurate rule sets without
global optimization. In: Proceedings of the fifteenth
international conference on machine learning, 1998.
K. Georgios. Anonymity and closely related terms in the
cyberspace: an analysis by example, 2014.
S. Argamon, S. Dhaule, M. Koppel, J. Pennebaker, Lexical
Predictors of Personality Type. In Proceedings of
Classification Society of North America, St. Louis MI,
June 2005.
C. Peersman, W. Daelemans, L. Van Vaerenbergh. Predicting Age and
Gender in Online Social Networks. SMUC’11 (2011).
M. Koppel, J. Schler, K. Zigdon. Determining an author's
native language by mining a text for errors. In
Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in
data mining (pp. 625-628). ACM (2005, August).
A. Alwajeeh, M. Al-Ayyoub, I. Hmeidi “On authorship
authentication of arabic articles,” in the fifth
International Conference on Information and
Communication Systems (ICICS 2014), 2014.
J. Tang, L. Yao, D. Zhang, J. Zhang “A Combination
Approach to Web User Profiling” in ACM Transactions
on Knowledge Discovery from Data, 2010.
D. Estival, T. Gaustad, S. Pham, W. Radford, “TAT: an
author profiling tool with application to Arabic emails”.
In Proceedings of the Australasian Language
Technology Workshop, (pp. 21-30), 2007.
G. Mikros, “Authorship Attribution and Gender
Identification in Greek Blogs”, Methods and
Applications of Quantitative Linguistics, (pp. 21-32),
2012.
P. Juola. Large-scale experiments in authorship attribution.
Eng Stud (pp.27681), 2012.
S. Maharjan, P. Shrestha, T. Solorio. A Simple Approach to
Author Profiling in MapReduce. England. CLEF 2014.
M. Koppel, S. Argamon, A. Shimoni, Automatically
categorizing written texts by author, gender, Literary
and Linguistic Computing (pp 401-411), 2003.
S. Argamon, M. Koppel, J. Pennebaker, J. Schler.
Automatically profiling the author of an anonymous
text. Communications of the ACM, (pp. 119-123), 2009.
S. Mechti, M.Jaoua, R. Faiz, H.Bouhamed, L. Hadrich.
Author profiling: Age prediction based on advanced
Bayesian networks. University of Massachusetts
Amherst, USA, 2010.
C. Peersman, W. Daelemans, L. Van Vaerenbergh. Predicting age and
gender in online social networks. In Proceedings of the
3rd international workshop on Search and mining
user-generated contents, SMUC 11, pages 38-44, New York,
NY, USA, ACM, 2011.
R. Guimaraes, R. Rosa, D. Gaetano, D. Rodriguez. Age
groups’ classification in social network using deep
learning. IEEE Access, 2017.
M. Sara, K. Ismail. “Authorship analysis studies: A survey”.
International Journal of Computer Applications, 2014.
G. Tan, C. Gaudin, A. Kot. Automatic writer identification
framework for online handwritten documents using
character prototypes. SMUC, ACM, New York, NY,
USA, December 2009.
F. Rangel, P. Rosso, M. Potthast, B. Stein. Overview of the
Author Profiling Task at PAN 2013. In: Forner P.,
Navigli R., Tufis D. (Eds.), Notebook
Papers of CLEF 2013 LABs and Workshops, CLEF
2013, Valencia, Spain, 2013.
M. Tschuggnall , E. Stamatatos, B. Verhoeven, W.
Daelemans, G. Specht, B. Stein, M. Potthast, Overview
of the author-profiling task at PAN 2017: Style breach
detection and author clustering. In Working Notes of
CLEF 2017 - Conference and Labs of the Evaluation
Forum, Dublin, 2017.
F. Hsieh, R. Dias, I. Paraboni. Author Profiling from
Facebook Corpora. IEEE Latin America (to appear),
2018.
M. Potthast, F. Schremmer, M. Hagen, B. Stein. Overview
of the Author Obfuscation Task at PAN 2018: A New
Approach to Measuring Safety. International
Conference of the CLEF Initiative (CLEF 16), Berlin
Heidelberg New York. Springer, 2016.
M. Garciarena, M. Villegas, D. Funez, L. Cagnina, M.
Errecalde, G. Ram, E. Villatoro. Profile-based
Approach for Age and Gender Identification. Notebook
for PAN at CLEF 2016. Knowledge-Based Systems 89,
(pp. 134-147), 2015.
J. Pennebaker, W. Mehl. Psychological aspects of
natural language use: Our words, our selves. Annual
Review of Psychology, (pp. 547-577), 2003.
S. Modak, A. Mondal. “A Comparative study of Classifiers
Performance for Gender Classification”, IJIRCCE, (pp
4214-4222), 2014.