focuses on assessing each message individually: each message can be classified in real time, before it is disclosed. Moreover, by collecting a series of messages, analytics can be built that provide insights into the day, time, location, and content of privacy leaks, as well as into the likelihood that a user (or group of users) will disclose sensitive information.
In (Liu and Terzi, 2010), the authors propose a framework to associate users with a privacy risk score. The score is determined by analyzing the user's messages, and can be used either to alert the user when a posted message exceeds a set privacy risk threshold, or to let the user know where s/he stands compared to the rest of the community. The approach focuses mostly on devising a mathematical function, combining many factors, that calculates the user's privacy risk score. However, the approach does not account for the societal factor, which is the first major difference with respect to our approach. The second major difference is that it concentrates on evaluating the user's score after the privacy leak has occurred, similarly to (Islam et al., 2014), rather than on assessing and predicting the sensitivity of the generic piece of text the user is about to post, which is what our approach does.
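As a minimal sketch, a scoring function in the spirit of (Liu and Terzi, 2010) grows with both the sensitivity of a disclosed item and the visibility of that disclosure; the item names, weights, and helper below are illustrative assumptions, not the authors' actual model.

```python
# Illustrative privacy risk score in the spirit of Liu and Terzi (2010):
# risk grows with the sensitivity of each disclosed item and with how
# visible the disclosure is. Items, weights, and levels are assumptions.

# sensitivity of each disclosed item, in [0, 1]
SENSITIVITY = {"location": 0.8, "health": 0.9, "hobby": 0.2}

# visibility of a disclosure, in [0, 1]
VISIBILITY = {"private": 0.0, "friends": 0.5, "public": 1.0}

def privacy_risk_score(disclosures):
    """disclosures: list of (item, audience) pairs for one user."""
    return sum(SENSITIVITY[item] * VISIBILITY[audience]
               for item, audience in disclosures)

# Example: a public location disclosure plus a friends-only health one.
score = privacy_risk_score([("location", "public"), ("health", "friends")])
print(round(score, 2))  # 0.8 + 0.45 = 1.25
```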
In (Mao et al., 2011), the authors analyze information leaks associated with a number of fixed categories of interest: vacation, drug use, and health condition. The first category concerns users disclosing plans and/or locations of where they will (or will not) be and when. The second concerns people posting messages while under the influence. The third looks into social posts disclosing medical conditions, personal or not. The authors focus on Twitter posts, associating tweets to the aforementioned categories by relying on a set of keywords representative of each category (see the sketch below). The limitation of this approach, and the difference with respect to ours, lies in the small number of information disclosure categories the authors consider, and in the arbitrary, fixed, and subjective set of keywords associated with each category. Our approach, on the other hand, relies on crowd wisdom to determine what should be considered sensitive information, rather than on a fixed set of keywords. As a consequence, we look at a larger spectrum of sensitive information disclosure possibilities, and at a more democratic way of classifying messages, one which is better aligned with society's perception of privacy.
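As a rough illustration of the keyword-matching strategy of (Mao et al., 2011), the sketch below tags a tweet with every category whose keyword list it matches; the category names come from the paper, while the specific keywords are hypothetical.

```python
import re

# Hypothetical keyword lists; (Mao et al., 2011) rely on curated
# sets for these same three categories.
CATEGORY_KEYWORDS = {
    "vacation": {"vacation", "flight", "airport", "hotel"},
    "drug": {"drunk", "wasted", "hungover"},
    "health": {"diagnosed", "surgery", "flu"},
}

def match_categories(tweet):
    """Return every category with at least one keyword in the tweet."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    return {cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws}

print(match_categories("Off to the airport, two weeks of vacation!"))
# -> {'vacation'}
```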
In (Sleeper et al., 2013; Wang et al., 2011), the authors study which messages users of Twitter and Facebook tend to regret. They surveyed a number of users from both platforms to classify regretted posts into categories, and analyzed the effort and time users spend in making amends for their posts, when possible. Both works analyze the aftermath of information disclosure, and focus on educating users on the use of social media and on the implications of underestimating information sharing. Differently, our work focuses more on providing users with insights about privacy leakage, as well as on supporting users with actionable mechanisms that can prevent these (regret) situations from happening altogether.
Hummingbird (Cristofaro et al., 2012) is a Twitter-like service providing users with a high degree of control over their privacy. The service offers fine-grained privacy controls, including the ability to define access control lists (ACLs) for each tweet, and protection against server-side identification of user behavior. With this approach, users are limited to a specific service, and have to proactively address the privacy issue by taking actions, such as defining ACLs, before using the service itself. With our approach, users are free to use any service, do not have to configure any tool, and can assess the amount of information leakage in a message before sharing it.
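The following is only a toy illustration of the per-tweet ACL property; Hummingbird enforces it cryptographically on the server side, without ever seeing tweet plaintexts, so the plain-dictionary check here is a deliberate simplification.

```python
# Toy per-tweet ACL check. Hummingbird provides this property
# cryptographically; here it is simulated with a plain set lookup.
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    acl: set  # usernames allowed to read this tweet

def visible_to(tweet, reader):
    return reader in tweet.acl

t = Tweet("off to the doctor", acl={"alice", "bob"})
print(visible_to(t, "alice"), visible_to(t, "eve"))  # True False
```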
The work in (Kongsgård et al., 2016) focuses on sensitive information leakage detection for corporate documents. The authors employ machine learning techniques to automatically classify a document as sensitive vs. non-sensitive. A curated training set of documents has to be provided to create the classification model: an administrator has to craft, select, and annotate which documents should be considered private vs. not private. This solution can provide a great degree of customization, which is ideal for corporate needs. On the other hand, it is impractical at the large scale our approach focuses on, where an administrator cannot possibly prepare the dataset(s) manually.
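The sketch below shows the general shape of such a supervised pipeline, a binary bag-of-words classifier trained on an administrator-annotated corpus; the model choice and the tiny labeled corpus are our assumptions, not the exact setup of (Kongsgård et al., 2016).

```python
# Minimal sketch of a supervised sensitive/non-sensitive document
# classifier. The hand-labeled corpus and model choice are
# illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "quarterly revenue projections, internal only",
    "employee salary and social security records",
    "cafeteria menu for next week",
    "public press release draft",
]
labels = [1, 1, 0, 0]  # 1 = sensitive, 0 = non-sensitive

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

print(model.predict(["list of employee salaries"]))  # likely [1]
```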
3 DATA SELECTION
Selecting the right data is a crucial task, both for creating the training dataset for our classification model and for validating our approach. Due to the lack of available privacy-related datasets, we had to devise a mechanism to collect and create our own privacy dataset. Our data source is Twitter, where we have been careful to select only tweets that users have marked as fully public (no restrictions). We have used Twitter as our data source due to its openness and public data policy. We have collected millions of tweets from the live sample Twitter stream, over multiple periods of time during Fall 2017 and Spring 2018, as sketched below.
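A minimal sketch of this collection step, using the tweepy library against the Twitter streaming API available at the time and keeping only tweets from unprotected accounts, could look as follows; credentials and the persistence helper are placeholders.

```python
# Sketch of sampling public tweets from the live Twitter stream with
# tweepy 3.x (the API available in 2017-2018). Credentials are
# placeholders; error handling and persistence are omitted.
import tweepy

class PublicTweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only tweets from accounts that are not protected,
        # i.e., tweets the user has marked as fully public.
        if not status.user.protected:
            save_tweet(status)  # hypothetical persistence helper

def save_tweet(status):
    print(status.id_str, status.text[:80])

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, PublicTweetListener())
stream.sample(languages=["en"])  # live random sample of public tweets
```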
Part of this dataset, after annotation, has been used to train the machine learning model and to run privacy leak analysis on historical data. Note that our application also uses the