Arabic Twitter User Profiling: Application to Cyber-security

Rahma Basti, Salma Jamoussi, Anis Charfi and Abdelmajid Ben Hamadou

Multimedia InfoRmation Systems and Advanced Computing Laboratory (MIRACL), University of Sfax, Tunis, Tunisia

Higher Institute of Computer Science and Multimedia of Sfax, 1173 Sfax 3038, Tunisia

Carnegie Mellon University Qatar, Qatar

Digital Research Center of Sfax (DRCS), Tunisia

Keywords: Author Profiling, Arabic Text Processing, Age and Gender Prediction, Dangerous Profiles, Stylometric Features.

Abstract: In recent years, we witnessed a rapid growth of social media networking and micro-blogging sites such as

Twitter. In these sites, users provide a variety of data such as their personal data, interests, and opinions.

However, this shared data is not always true. Often, social media users hide behind a fake profile and may

use it to spread rumors or threaten others. To address that, different methods and techniques were proposed

for user profiling. In this article, we use machine learning for user profiling in order to predict the age and

gender of a user’s profile and we assess whether it is a dangerous profile using the users’ tweets and features.

Our approach uses several stylistic features: character-based, word-based and syntax-based.

Moreover, the topics of interest of a user are included in the profiling task. We obtained the best accuracy

levels with SVM and these were respectively 73.49% for age, 83.7% for gender, and 88.7% for the dangerous

profile detection.

1 INTRODUCTION

Social media networks allow users to share

information, opinions and communicate with each

other. Often, social media users choose not to reveal

their real identity such as name, age, and gender in

order to express their ideas freely, without risking any

retaliation. Some other users hide their real identity

for dishonest and dangerous purposes such as

threatening other social media users or spreading

rumors and lies. Therefore, it has become very

important to provide effective means for identity

tracing in the cyberspace (Argamon et al., 2003).

Twitter is one of the most popular social media

networks in the world and it has a large number of

users who post huge amounts of data in different

languages. Posts cover a wide variety of topics such

as politics, sports, and technology.

The volume and variety of Twitter data as well as

the availability of APIs have attracted several researchers to use it, including those who focus on

user profiling (Feldman and Sanger, 2006).

¹ https://developer.twitter.com/

² http://www.cs.cmu.edu/~ark/TweetNLP/

³ https://emojipedia.org/

⁴ https://dev.twitter.com/streaming/overview

Research in psychology (Frank and Witten, 1998)

has revealed that the words used by an individual can

project his or her mental and physical health. With

the advances in technology and computing,

stylometry (Georgios, 2014) has been used to

determine traits of the user’s profile and personality

based on what they write. Several stylometric features

have been proposed to date, including features based

on words, characters and punctuation.

In this article, we aim at profiling Twitter users,

i.e., determining characteristics such as age and

gender based on their tweets. This research is

applicable in several fields, such as forensics and

marketing.

The remainder of this paper is organized as

follows. Section 2 reports on existing work on user

profiling. Section 3 presents our approach and

describes the features that could be considered as

significant indicators of age and gender. Section 4 presents the corpus we used in this research. Section 5 presents the results we obtained and Section 6 concludes the paper with directions for future work.

Basti, R., Jamoussi, S., Charfi, A. and Ben Hamadou, A.

Arabic Twitter User Profiling: Application to Cyber-security.

DOI: 10.5220/0008167401100117

In Proceedings of the 15th International Conference on Web Information Systems and Technologies (WEBIST 2019), pages 110-117

ISBN: 978-989-758-386-5

2 RELATED WORK

Analyzing users’ posts and their behavior on

social networks has been the subject of numerous

research works. Interest in this field of research has

increased in recent years as several users misused the

anonymity they could have on social media networks

to spread threats, hate messages or false news.

By analyzing the contents produced by users, or

the activities they perform, author-profiling

researchers were able to determine several users’

characteristics such as age, gender, mother tongue

and level of education.

The shared task on Author Profiling at PAN 2013

focused on digital text forensics. Specifically, the

purpose was to determine the age and gender of the

authors of a large number of unidentified texts. In this

context, (Argamon et al., 2005) determined

experimentally that content features performed well

for age and gender profiling.

In addition to that, (Peersman et al., 2011) worked

on segments of blogs of the British National Corpus.

They used features such as punctuation, average

words, part of speech, sentence length, and word

factor analysis to predict gender at an accuracy of

80%.

Author profile detection consists of analyzing how linguistic characteristics vary according to the profile of the author. (Koppel et al., 2014) used an SVM model trained on English, Spanish and Dutch Twitter data to achieve 80% accuracy for gender prediction on previously unseen Twitter text. Their work focuses on punctuation, n-gram counts, sentence and word length, vocabulary richness, function words, out-of-vocabulary words, emoticons and part-of-speech.

In another study (Alwajeeh et al., 2014) the

authors worked on blog segments using features such

as speech analysis, punctuation, average word length,

sentences, and word factors. They achieved a gender

prediction rate of 72.2% (Tang et al., 2010).

In addition to that, (Estival et al., 2007) worked on

Arabic emails and reported being able to predict the

gender with a precision of 72.10%. They calculated 64 features describing psycholinguistic word categories (e.g. family, anger, death, wealth, etc.). Each feature is the number of words detected in the corresponding category divided by the total number of words in the text.

In (Mikros, 2012), the researchers worked on the automatic classification of blogs and emails. They correctly classified 81.5% of the documents by gender and 72% by age.

The work in (Juola, 2012) investigated authorship attribution and detection of the author's gender using Greek blogs. He chose this type of social media because people can express their opinions on blogs. Juola focused on two types of text content features. The first type includes classic stylometric features, which depend on vocabulary richness, word length, and word frequency. The second type depends on character bi-grams and word n-grams. The experiments showed a gender identification accuracy of 82.6% with SVM (Maharjan et al., 2014).

The work in (Koppel et al., 2003) presented an

application that detects various demographic

characteristics such as name, age, gender, level of

education. The authors used two corpora of e-mail for

the Arabic and English languages. They used a

questionnaire to check and examine the users’ profiles

including age, gender, and level of education. The

authors used many machine-learning classifiers in

their experiments such as SVM, KNN and decision

trees. For gender detection, the best accuracy was

achieved by SVM (Argamon et al., 2009).

Current author identification techniques go beyond stylometric analysis, which opens the way to profiling, attribution, and identification of authors. In addition, they explore data from digital documents such as graphics, emoticons, colors and layouts. In this context, we cite a work from 2015 in which the play "Double Falsehood" was identified as the work of William Shakespeare. The researchers relied on color and graphics information for the identification, because each author or artist has his or her own style (Mechti et al., 2010).

The Arabic language is one of the most widely

adopted languages with hundreds of millions of

native speakers. Furthermore, it is used by more than

1.5 billion Muslims to practice their religion and

spiritual ceremonies.

Authorship attribution is another field of related

work that is concerned with the description and

identification of the true author of an anonymous text.

In the literature, authorship attribution is defined as a text categorization or text analysis and classification problem. Authorship attribution has various potential applications in fields such as literature, program code authorship, digital content forensics, law enforcement and crime prevention. In the context of authorship attribution, stylometry has been used to determine the authenticity of a document. It is considered as the study of how people can be judged according to their writing style. Therefore, stylometry can be used not only to identify a writing style but also to help identify the author's gender and age.

3 PROPOSED METHOD

The goal of this work is to analyze the profiles of anonymous authors and predict the author’s age and gender, and whether the profile is dangerous. Our approach is purely statistical, i.e., it accepts as input any profile written in Arabic and computes feature frequencies to identify age-based differences (between young people and adults), gender-based differences (between men and women), and profile risks (i.e., whether the profile is dangerous or not). We divided our work on author profiling into two parts. The first part focused on extracting relevant information from a user profile such as the number of friends, number of followers, number of retweets, etc. The second part focused on extracting information from the user’s tweets based primarily on stylistic information (lexical, structural, syntactic) and semantic information.

3.1 Profiles Specific Features

User profiling makes it possible to determine users’ characteristics such as age and gender. To automatically retrieve the profile data and the users’ tweets, we used the Twitter API ¹. Next, we explain which data was retrieved through the API (Peersman et al., 2011). The relevant Twitter terms for our work are the following:

ReTweet: each user may republish on his or her profile a tweet that was written by another user.

Followers: users that receive and follow the status updates of a given user.

Friends: users that are followed by a given user.

Favorite accounts: a way to tag a tweet as a preference in order to find it easily later.

Time of publications: for each author, the publication time of each tweet was retrieved. The day is divided into four slots (from midnight to 6 am, from 6 am to noon, from noon to 6 pm and finally from 6 pm to midnight). Based on that, we can determine the user’s favorite time to share tweets.

We retrieved the available profile data through the API and examined the network formed by users and their contacts with other users, e.g. by examining for a given user the number of followers, number of friends, number of favorite accounts, number of retweets and preferred time of publication. Then, we represented the user’s profile as a normalized numeric vector whose components correspond to the profile features.
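The profile vector described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which normalization was applied, so min-max scaling is assumed here, and the function names and the two sample profiles are hypothetical.

```python
from datetime import datetime


def time_slot(hour):
    """Map an hour of day (0-23) to one of the four six-hour slots (0-3)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return hour // 6


def preferred_slot(timestamps):
    """Return the slot in which most tweets were posted (ties: earliest slot)."""
    counts = [0, 0, 0, 0]
    for ts in timestamps:
        counts[time_slot(ts.hour)] += 1
    return counts.index(max(counts))


def min_max_normalize(rows):
    """Scale every column of a list of equal-length rows to [0, 1].

    Min-max scaling is an assumption; the paper only says 'normalized'.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical profiles: followers, friends, favourites, retweets, preferred slot.
profiles = [
    [120, 80, 15, 40,
     preferred_slot([datetime(2019, 1, 1, 23), datetime(2019, 1, 2, 22)])],
    [20, 300, 5, 10, preferred_slot([datetime(2019, 1, 1, 2)])],
]
vectors = min_max_normalize(profiles)
```

With only two profiles every column collapses to 0 or 1; on a real corpus the scaled values spread over the whole interval.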

For each profile collected in our corpus, an expert in sociology helped us to identify and annotate the age, gender and whether the user has a dangerous profile.

3.2 Tweets Specific Features

Which features allow the prediction of age, gender, and dangerous profiles is an open research question that several authors have addressed in fields such as human psychology. Stylometry (i.e., the study of stylistic features) shows that individuals can be classified and identified by their writing styles. The writing style of a person is defined by the selection of special characters, the terms used, and the composition of sentences.

Studies in the literature (Guimaraes et al., 2017) show that there is no one-size-fits-all feature set that is optimized and applicable to all people and to all domains. In fact, thousands of stylometric features have been proposed. Even though authors can consciously modify their own style, there will always be an unconscious use of certain stylistic features (Sara et al., 2014). For our work, we use the following features: (1) stylometric features (lexical, syntactic and structural) and (2) semantic and emotional features. Figure 1 shows the general process of our work for the extraction of features from tweets. The major steps of our method are as follows:

1. Pre-processing and Text Analysis: The process

starts with data cleaning; the aim of this step is to lead

to a cleaner representation of the tweets. For this

purpose we have removed noisy data such as prefixes,

suffixes, and URLs. We also transformed plural

words to singular, and we applied lemmatization.

2. Calculating Features Vector: A feature vector is

computed based on the profile data and the users’

tweets. The extracted features are divided into two

groups: training set and testing set. The training set

is used to develop a classification model whereas the

testing set is used to validate the developed model.

3. Classification: We train an SVM classifier using

our training data from Step 2 to discriminate between

various age groups, genders and dangerous profiles.

More details on this step will be given in the next

subsections.
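The cleaning part of Step 1 can be illustrated with a minimal sketch. It only covers URL removal and whitespace normalization. Affix stripping, plural-to-singular conversion and lemmatization need an Arabic NLP tool and are deliberately left out. The regular expression and the function name are assumptions, not the authors' implementation.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www. up to the next space.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Remove URLs and collapse runs of whitespace into single spaces."""
    without_urls = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", without_urls).strip()


cleaned = clean_tweet("hello   http://x.y/z world")
```

The same function works unchanged on Arabic text, because it never inspects the letters themselves.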


Figure 1: Architecture of our author profiling method.

3.2.1 Lexical Features

A tweet can be seen as a series of characters and word

tokens grouped into sentences. A token can be a word,

a punctuation mark or a number. Some studies on

authorship attribution (Tan et al., 2009) were based

on simple models such as sentence length and word

length. The benefit of these features is that they can

be used on any corpus in any language with no

additional conditions except the availability of a

tokenizer. Lexical features can be used to learn about

the typical use of words and characters by a certain

individual.

In the following, we discuss how we used lexical features in our work by considering different variations of the characters a tweet contains. First, we calculated the total number of characters, including Arabic letters written in Latin characters, digits (0-9), etc. In this part, we also counted the total number of special characters and white spaces (Rangel et al., 2013).

Second, we analyzed words by applying 12 statistical measures, including the total number of words, the average number of words, the number of different words in a tweet, the number of short words (up to three characters), the number of long words (six or more characters), and the number of words with flooded characters (e.g. Heeeelloooo). All of these features were expressed as ratios with respect to the total number of words (M in Table 1).
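A few of these word-based ratios can be sketched as below. The sketch assumes whitespace tokenization (punctuation stays attached to words), takes the thresholds from Table 1 (short: 1 to 3 characters, long: more than 6), and treats a word as flooded when one character repeats three or more times in a row. All names are illustrative.

```python
import re

# A character repeated three or more times in a row, e.g. "eeee" in "Heeeelloooo".
FLOOD_RE = re.compile(r"(.)\1{2,}")


def word_features(tweet):
    """Return word-based lexical features, each divided by the word count M."""
    words = tweet.split()
    m = len(words)
    if m == 0:
        return {"total_words": 0, "avg_word_length": 0.0,
                "different_words_ratio": 0.0, "short_words_ratio": 0.0,
                "long_words_ratio": 0.0, "flooded_words_ratio": 0.0}
    return {
        "total_words": m,
        "avg_word_length": sum(len(w) for w in words) / m,
        "different_words_ratio": len(set(words)) / m,
        "short_words_ratio": sum(1 for w in words if len(w) <= 3) / m,
        "long_words_ratio": sum(1 for w in words if len(w) > 6) / m,
        "flooded_words_ratio": sum(1 for w in words if FLOOD_RE.search(w)) / m,
    }


features = word_features("Heeeelloooo my dear friends")
```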

Vocabulary richness functions quantify the variety of the vocabulary of a tweet. Examples of such measures include the number of hapax legomena (words occurring once) and the number of hapax dislegomena (words occurring twice). Various functions were proposed to achieve stability over text length, including Yule’s K measure, Simpson’s D measure, Sichel’s S measure, and Honoré’s R measure (Tschuggnall et al., 2017).
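The measures named above can be computed from the frequency spectrum of a text. The sketch below uses the standard textbook definitions, with N the number of tokens, V the number of distinct words and V_i the number of words occurring exactly i times: K = 10^4 (Σ i² V_i − N) / N², D = Σ V_i i(i−1) / (N(N−1)), S = V_2 / V and R = 100 log N / (1 − V_1 / V). The paper does not state which variants it used, so these formulas are an assumption.

```python
import math
from collections import Counter


def vocabulary_richness(tokens):
    """Compute hapax counts and four classic richness measures for a token list."""
    n = len(tokens)                      # N: number of tokens
    if n == 0:
        return {"hapax_legomena": 0, "hapax_dislegomena": 0, "yule_k": 0.0,
                "simpson_d": 0.0, "sichel_s": 0.0, "honore_r": 0.0}
    freqs = Counter(tokens)
    v = len(freqs)                       # V: number of distinct words
    spectrum = Counter(freqs.values())   # V_i: words occurring exactly i times
    v1 = spectrum.get(1, 0)
    v2 = spectrum.get(2, 0)
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    simpson_d = (sum(vi * i * (i - 1) for i, vi in spectrum.items())
                 / (n * (n - 1))) if n > 1 else 0.0
    # Honore's R is undefined when every word is a hapax (V_1 == V).
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"hapax_legomena": v1, "hapax_dislegomena": v2, "yule_k": yule_k,
            "simpson_d": simpson_d, "sichel_s": v2 / v, "honore_r": honore_r}


richness = vocabulary_richness("a b b c c c".split())
```

Tweets are very short, so these measures are noisy on a single tweet. Computing them over all tweets of a profile is one possible way to stabilize them.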

3.2.2 Syntactic Features

Researchers discovered the effectiveness of syntactic

elements in identifying an author (Hsieh et al., 2018).

Syntactic features define the patterns used to form

sentences (Potthast et al., 2016).

In informal writing, it is common to use multiple question or exclamation marks to better express a feeling or a mood. Therefore, syntactic characteristics help define the writing style of a writer. In this regard, we counted the total number of quotation marks, periods, semicolons, question marks, exclamation marks, multiple exclamation or question marks (???, !!!), and ellipses to determine the frequency with which authors use punctuation in their tweets.
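The punctuation counts listed above can be sketched with regular expressions. Two points are assumptions rather than details from the paper: the Arabic question mark (؟) and Arabic semicolon (؛) are counted together with their Latin counterparts, and a single mark is only counted when it is not part of a longer run such as "???".

```python
import re

PUNCT_PATTERNS = {
    "quotation_marks": re.compile(r'["“”]'),
    "periods": re.compile(r"(?<!\.)\.(?!\.)"),            # isolated dots only
    "semicolons": re.compile(r"[;؛]"),
    "question_marks": re.compile(r"(?<![?؟!])[?؟](?![?؟!])"),
    "exclamation_marks": re.compile(r"(?<![?؟!])!(?![?؟!])"),
    "multiple_marks": re.compile(r"[?؟!]{2,}"),           # e.g. "???" or "!!!"
    "ellipses": re.compile(r"\.{3,}|…"),
}


def punctuation_counts(text):
    """Count each punctuation pattern in the text."""
    return {name: len(p.findall(text)) for name, p in PUNCT_PATTERNS.items()}


def punctuation_ratios(text):
    """Divide each count by the number of words M, as in Table 1."""
    m = len(text.split())
    counts = punctuation_counts(text)
    return {name: (c / m if m else 0.0) for name, c in counts.items()}


sample = 'Really??? No way!!! He said "ok"; fine... yes. Why?'
counts = punctuation_counts(sample)
ratios = punctuation_ratios(sample)
```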

One of the most important aspects of text classification is the level of analysis that is used. N-gram based approaches are often used for text classification and they can be implemented on different levels, such as the word level. Furthermore, some researchers have used word n-grams to address authorship attribution. N-grams are tokens created from a contiguous sequence of n items. The unique and different n-grams constitute the most important feature for stylistic purposes. This explains why word n-grams were used as input features for automatic methods of detecting and classifying authors, with promising results.
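Word n-gram extraction and the two n-gram features of Table 1 can be sketched as follows. The paper does not define "different" and "unique" n-grams. By analogy with hapax legomena, this sketch assumes that "different" means distinct n-grams and "unique" means n-grams occurring exactly once. The function names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the list of contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, n=2):
    """Ratios of distinct and once-occurring n-grams to the word count M."""
    counts = Counter(word_ngrams(tokens, n))
    m = len(tokens)
    different = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "different_ngrams_ratio": different / m if m else 0.0,
        "unique_ngrams_ratio": unique / m if m else 0.0,
    }


bigram_features = ngram_features("a b a b c".split(), n=2)
```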

Another related approach is part-of-speech tagging (POS tagging), which labels tokens according to their function in the context. Basic POS ² tags cover the functional words in a sentence (e.g., verbs, prepositions, and pronouns) (Garciarena et al., 2015). Authors use function words unconsciously and consistently regardless of the topic, so their use is unlikely to be deliberately disguised.

After we calculated the frequency of use of each grammatical category, we calculated the ratio of the number of verbs and pronouns to the total number of words. Moreover, we calculated the ratio of the frequency of dots and quotes to the total number of words. We also calculated the ratio of the unique and different n-grams to the total number of words. The same applies to hashtags, which were represented as the ratio of the number of hashtags to the total number of words (Pennebaker et al., 2003). Table 1 lists the set of features used.


Table 1: Set of used features.

Lexical Features
    Word-based Features
        Total number of words (M)
        Average word length
        Number of long words (more than 6 characters)/M
        Number of short words (1-3 characters)/M
        Number of elongated words/M
        Different words frequency/M
        Hapax legomena (unique words)
        Hapax dislegomena
        Yule's K measure
        Simpson's D measure
        Honoré's R measure
        Entropy measure
        Entropy lines measure
    Character-based Features
        Number of Arabic characters by Latin/M
        Number of letters/M
        Number of digit characters/M
        Number of white spaces/M
        Number of special characters (22 features)/M

Syntactic Features
        Frequency of punctuation marks/M {“'”, “.”, “:”, “;”, “?”, “!”, “"”, “???”, “!!!”, “…”}
        Number of different n-grams
        Number of unique n-grams
        Number of hashtags/M
    POS tagging
        Number of adjectives/M
        Number of adverbs/M
        Number of abbreviations/M
        Number of conjunctions/M
        Number of gender-specific words/M
        Number of interjections/M
        Number of nouns/M
        Number of particles/M
        Number of prepositions/M
        Number of pronouns/M
        Number of proper nouns/M
        Number of verbs/M

Structural Features
        Average phrases
        Average number of words per sentence
        Average word length
        Average sentence length
        Number of lines/M
        Number of blank lines/M
        Number of paragraphs/M

Semantics and Emotional Features
    Emotion
        Number of angry words/M
        Number of disappointed words/M
        Number of disgusted words/M
        Number of gleeful words/M
        Number of happy words/M
        Number of romantic words/M
        Number of sad words/M
        Number of satisfied words/M
        Number of surprised words/M
    Topic
        Number of sports words/M
        Number of political words/M
        Number of military and weapons words/M
        Number of family and friends words/M
        Number of economic, money, work and social words/M
        Number of death and religion words/M
        Number of body and sexual words/M

Profiles Features
        Number of friends
        Number of followers
        Number of retweets
        Number of favourite accounts
        Time of publication

3.2.3 Structural Features

Structural features (or structure-based features) are about the organization and format of a text. They assess the overall impression of the document's writing style. These features can be defined at the paragraph level, the message level or according to the technical construction of the document.

Using a large number of features does not necessarily produce excellent results, as some features give very little information. Nevertheless, in the authorship verification of computer-mediated online information such as tweets and blogs, structural features seem to be encouraging (Modak et al., 2014).

In this context, the structural features describe the process an author follows to create a tweet. People have different habits when creating a publication. This is even more important in the context of online texts, which have limited content.


Frequently, structural features include the length of the paragraphs of a tweet, separators used between sentences such as blank lines, the length of a sentence, word length, the number of words per sentence, and the layout of the whole document.
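Some of these structural features can be sketched as below. The sentence splitter is a simplifying assumption: any run of '.', '!', '?', '؟' or '…' ends a sentence. The function name and sample text are illustrative only.

```python
import re

# Assumed sentence boundary: one or more terminal punctuation marks.
SENTENCE_END_RE = re.compile(r"[.!?؟…]+")


def structural_features(text):
    """Line, blank-line and sentence statistics for one text."""
    lines = text.split("\n")
    blank_lines = sum(1 for line in lines if not line.strip())
    sentences = [s.split() for s in SENTENCE_END_RE.split(text) if s.strip()]
    m = len(text.split())  # M: total number of words
    return {
        "lines_ratio": len(lines) / m if m else 0.0,
        "blank_lines_ratio": blank_lines / m if m else 0.0,
        "sentence_count": len(sentences),
        "avg_words_per_sentence": (
            sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0
        ),
    }


structure = structural_features("Hello world. How are you?\n\nFine!")
```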

3.2.4 Semantical and Emotional Features

Sentiment analysis has become one of the most active areas of research in natural language processing. It is also widely used in the domains of data mining and document extraction. Sentiment analysis aims to predict or extract the polarity of people's opinions in specific fields, which is regarded as a challenging task. Most current approaches in the literature distinguish two main types of attributes that can be used to predict the author’s profile: the style and the content of their tweets.

The basic types of features that can be used for content-based authorship profiling are the emotional and semantic features. We looked for the similarities that can group a set of terms in the same class. Because the vocabulary of Arabic text is very large, we manually grouped the terms belonging to the same class of attributes. We identified nine classes of emotions and emoticons ³, namely: surprise, satisfaction, joy, romance, sadness, anger, pleasure, disgust and disappointment. The use of emoticons aims at making text messages more expressive.

For the semantic features, we manually collected seven dictionaries for the following domains: sports, politics, army and arms, family and friends, economic and social, death and religion, body and sexuality.

In total, we collected more than 8,000 terms covering 16 classes (sports, political, angry…). Moreover, we obtained the root of each word and then calculated the ratio of words in the tweets that belong to each class. For example, a user may talk about sports in 80% of his or her tweets and about politics in 20% of them.
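The dictionary-based ratios can be sketched as a simple lookup. The two tiny English lexicons below are placeholders for the manually built Arabic dictionaries, which are not published in the paper, and the sketch matches raw tokens instead of word roots.

```python
def class_ratios(tokens, lexicons):
    """For each class, return the share of tokens found in its dictionary."""
    m = len(tokens)
    return {
        name: (sum(1 for t in tokens if t in terms) / m if m else 0.0)
        for name, terms in lexicons.items()
    }


# Hypothetical placeholder lexicons (the real ones hold 8,000+ Arabic terms).
lexicons = {
    "sports": {"goal", "match"},
    "politics": {"election"},
}
topic_ratios = class_ratios(["goal", "match", "goal", "election", "today"], lexicons)
```

Storing each dictionary as a set keeps the membership test constant-time, which matters once the lexicons hold thousands of terms.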

4 DATASET

In each authorship classification problem, there exists a collection of authors, a set of user profiles of known authors (training dataset), and a collection of user profiles of unknown authorship (test dataset). For each user profile, we retrieved the first 200 tweets that were written in Arabic, each with up to 140 words. In total, we collected 32,032 tweets from 422 users.

We used cross-validation with the 422 profiles for training. Then, we used another 232 user profiles as a test set in order to evaluate the final model. The data was balanced by gender and by dangerous vs. non-dangerous profiles. However, it was not balanced with respect to age. The distribution of the number of user profiles per dataset is shown in Table 2.

For each profile we calculate a numerical vector whose elements represent all the features extracted from the profile and the respective tweets, which help us discriminate the relevant classes.

Table 2: Data size for each classification task.

Classes

Attributes

Number of profiles

for learning

Number of profiles

for the test

Age

Adult

136

Young

147

Gender

Women

Man

Terrorist

Not Terrorist

5 RESULTS

The class selection step was important and had a great influence on the results. We computed the ratio of all characteristics used and then normalized all features. In total, we used 143 attributes.

In this research, we applied classifiers to select the relevant attributes and to predict the age, gender and dangerousness of a Twitter profile. As shown in Table 3, the SVM ⁴ classifier outperforms the other two classifiers, multilayer perceptron and random forest. SVM provides the highest accuracy with 73.49% for age, 83.70% for gender and 88.70% for dangerous profiles.
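The evaluation protocol behind these figures, k-fold cross-validation, can be sketched in plain Python. The classifier is passed in as a function, and the majority-class baseline shown here is only a stand-in, not one of the classifiers evaluated in the paper (SVM, multilayer perceptron, random forest). All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit_predict, k=10, seed=0):
    """Average accuracy over k folds; each fold is held out exactly once."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        if not test_idx:
            continue  # can happen when there are fewer samples than folds
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct += sum(p == y[i] for p, i in zip(predictions, test_idx))
    return correct / len(X)


def majority_baseline(X_train, y_train, X_test):
    """Stand-in classifier: always predict the most frequent training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * len(X_test)


# Toy data: 15 'adult' and 5 'young' profiles with a dummy feature.
X = [[0.0]] * 20
y = ["adult"] * 15 + ["young"] * 5
baseline_accuracy = cross_val_accuracy(X, y, majority_baseline, k=10)
```

Comparing a trained classifier against such a baseline shows how much of the reported accuracy comes from the features rather than from class imbalance, which is relevant here because the age classes are not balanced.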

Table 3: Classification accuracy using 10-fold cross-validation with the training partition.

Classes      SVM        Multilayer Perceptron   Random Forest
Age          73.49 %    70.31 %                 66.07 %
Gender       83.70 %    76.40 %                 74.71 %
Terrorist    88.70 %    82.25 %                 80.64 %

Based on our results, the style features that prove most useful for age discrimination are the use of joyful and happy emotions in the writing of young people, whereas adults use more 'angry' emotions, long words and prepositions, with a high entropy measurement (the more varied the words of a tweet are, the higher the entropy is).
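The relationship between word variety and entropy can be made concrete with a short sketch. The paper does not give its entropy formula, so Shannon entropy over the word distribution of a tweet, H = −Σ p(w) log₂ p(w), is assumed here.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Shannon entropy (in bits) of the word distribution of a token list."""
    n = len(tokens)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(tokens).values())


repetitive = word_entropy(["a", "a", "a", "a"])
varied = word_entropy(["a", "b", "c", "d"])
```

A tweet that repeats one word has entropy 0, while four distinct words give 2 bits, the maximum for four tokens.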

The features that prove to be most useful for gender discrimination are military and weapons terms and a high measure of tweet diversity with the use of multiple question marks (markers of male writers). In contrast, the markers of female writers are a large number of followed accounts, in addition to the use of the first-person pronoun.

The most discriminating style features indicate that dangerous profiles tend to write their tweets after midnight with more emotions of satisfaction. They also tend to use different bi-grams and to write their publications with a large number of semicolons. Moreover, they usually have a large number of friends. Concerning non-dangerous profiles, we can notice that their posts are enriched with adverbs and adjectives. Moreover, they tend to use more sports terms with long, unique words. In addition, non-dangerous accounts often use in their posts syntactic characters such as double quotes, multiple question marks, and semicolons.

Table 4 shows the results obtained with the test data set for the three classes. This dataset contains 232 new Twitter user profiles that were not seen before. As shown in that table, the Random Forest classifier outperforms the other classifiers with an accuracy of 80.81% for age and 75.00% for gender. On the other hand, the SVM classifier gave the best accuracy of 81.81% for detecting dangerous profiles.

Table 4: Supplied test set classification accuracy.

Classes      SVM        Multilayer Perceptron   Random Forest
Age          69.18 %    72.03 %                 80.81 %
Gender       73.07 %    74.03 %                 75.00 %
Terrorist    81.81 %    68.18 %                 54.54 %

6 CONCLUSIONS

In this paper, we tackled the problem of automatically determining the age and gender of users on the Twitter social network and of detecting terrorist profiles, focusing mainly on Arabic profiles, which have not received the research attention they deserve.

We used our own corpus of user profiles. We started by extracting tweets from these profiles and proposed a set of characteristics that allow us to predict the three classes considered, namely age, gender and danger. We considered three families of characteristics: the stylistic family (syntactic, lexical and structural characteristics), the semantic family, for which we manually collected several dictionaries to characterize the different themes, and the family of information about the profiles themselves. Finally, we have shown how the right combination of stylometric characteristics and machine learning methods allows an automated system to effectively determine the desired aspects of an anonymous author.

The results show that the stylometric characteristics were efficient and accurate, in line with what is accepted in the literature. To be more precise, the best performance obtained on our dataset with 10-fold cross-validation was 73.49% for age detection, using the SVM classifier. The best performance in terms of gender detection was 83.70% and was also obtained using the SVM classifier. Finally, the best performance for detecting terrorists was 88.70%, again obtained using the SVM classifier.

After analyzing the cross-validation results, we found that SVM seems to be the best classifier among those tested for the identification of the three classes of profiles adopted.

In the future, our goal is to explore new

multilingual author profile detection techniques by

adopting more sophisticated features such as those

based on the user's geographic location, for example.

Similarly, we are considering increasing the size of the dictionaries used to predict the emotions and the different themes considered.

ACKNOWLEDGEMENTS

This publication was made possible by NPRP 9-175-

1-033 from the Qatar National Research Fund (a

member of Qatar Foundation). The findings achieved

herein are solely the responsibility of the authors.

REFERENCES

S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni,

“Gender, Genre, and Writing Style in Formal Written

Texts,” J. Lang. Soc. Psychol. December 2003.

R. Feldman, J. Sanger. The text-mining handbook:

advanced approaches in analyzing unstructured data.

Cambridge University Press; 2006.

E. Frank, L. Witten. Generating accurate rule sets without

global optimization. In: Proceedings of the fifteenth

international conference on machine learning, 1998.

K. Georgios. Anonymity and closely related terms in the

cyberspace: an analysis by example, 2014.

S. Argamon, S. Dhaule, M. Koppel, J. Pennebaker, Lexical

Predictors of Personality Type. In Proceedings of


Classification Society of North America, St. Louis MI,

June 2005.

C. Peersman, W. Daelemans, L. Van Vaerenbergh. Predicting Age and

Gender in Online Social Networks. SMUC’11 (2011).

M. Koppel, J. Schler, K. Zigdon. Determining an author's

native language by mining a text for errors. In

Proceedings of the eleventh ACM SIGKDD

international conference on Knowledge discovery in

data mining (pp. 625-628). ACM (2005, August).

A. Alwajeeh, M. Al-Ayyoub, I. Hmeidi “On authorship

authentication of arabic articles,” in the fifth

International Conference on Information and

Communication Systems (ICICS 2014), 2014.

J. Tang, L. Yao, D. Zhang, J. Zhang “A Combination

Approach to Web User Profiling” in ACM Transactions

on Knowledge Discovery from Data, 2010.

D. Estival, T. Gaustad, S. Pham, W. Radford, “TAT: an

author profiling tool with application to Arabic emails”.

In Proceedings of the Australasian Language

Technology Workshop, (pp. 21-30), 2007.

G. Mikros, “Authorship Attribution and Gender

Identification in Greek Blogs”, Methods and

Applications of Quantitative Linguistics, (pp. 21–32),

2012.

P. Juola. Large-scale experiments in authorship attribution.

Eng Stud (pp.276–81), 2012.

S. Maharjan, P. Shrestha, T. Solorio. A Simple Approach to

Author Profiling in MapReduce. England. CLEF 2014.

M. Koppel, S. Argamon, A. Shimoni, Automatically

categorizing written texts by author, gender, Literary

and Linguistic Computing (pp 401-411), 2003.

S. Argamon, M. Koppel, J. Pennebaker, J. Schler.

Automatically profiling the author of an anonymous

text. Communications of the ACM, (pp. 119-123), 2009.

S. Mechti, M. Jaoua, R. Faiz, H. Bouhamed, L. Hadrich.

Author profiling: Age prediction based on advanced

Bayesian networks. University of Massachusetts

Amherst, USA, 2010.

C. Peersman, W. Daelemans, L. Van Vaerenbergh. Predicting age and

gender in online social networks. In Proceedings of the

3rd international workshop on Search and mining user-

generated contents, SMUC 11, pages 38-44, New York,

NY, USA, ACM.2011.

R. Guimaraes, R. Rosa, D. Gaetano, D. Rodriguez. Age

groups’ classification in social network using deep

learning. IEEE Access, 2017.

M. Sara, K. Ismail “Authorship analysis studies: A survey”.

International Journal of Computer Applications, 2014.

G. Tan, C. Gaudin, A. Kot. Automatic writer identification

framework for online handwritten documents using

character prototypes. SMUC, ACM, New York, NY,

USA, December 2009.

F. Rangel, P. Rosso, M. Potthast, B. Stein. Overview of the

Author Profiling Task at PAN 2013. In: Forner P.,

Navigli R., Tufis D.(Eds.), Notebook

Papers of CLEF 2013 LABs and Workshops, CLEF

2013, Valencia, Spain, 2013.

M. Tschuggnall , E. Stamatatos, B. Verhoeven, W.

Daelemans, G. Specht, B. Stein, M. Potthast, Overview

of the author-profiling task at PAN 2017: Style breach

detection and author clustering. In Working Notes of

CLEF 2017 - Conference and Labs of the Evaluation

Forum, Dublin, 2017.

F. Hsieh, R. Dias, I. Paraboni. Author Profiling from

Facebook Corpora. IEEE Latin America (to appear),

2018.

M. Potthast, F. Schremmer, M. Hagen, B. Stein. Overview

of the Author Obfuscation Task at PAN 2018: A New

Approach to Measuring Safety. International

Conference of the CLEF Initiative (CLEF 16), Berlin

Heidelberg New York. Springer, 2016.

M. Garciarena, M. Villegas, D. Funez, L. Cagnina, M.

Errecalde, G. Ram, E. Villatoro. Profile-based

Approach for Age and Gender Identification Notebook

for PAN at CLEF 2016. Knowledge-Based Systems 89,

(pp.134–147), 2015.

J. Pennebaker, W. Mehl, Psychological aspects of

natural language use: Our words, our selves. Annual

Review of Psychology. (pp. 547–577), 2003.

S. Modak, A. Mondal. “A Comparative study of Classifiers

Performance for Gender Classification”, IJIRCCE, (pp

4214-4222), 2014.
