focuses on assessing each message individually: each message can be classified in real time, before it is disclosed. Moreover, by collecting a series of messages, analytics can be built that provide insights into the day, time, location, and content of privacy leaks, as well as into the likelihood that a user (or group of users) will disclose sensitive information.
In (Liu and Terzi, 2010), the authors propose a framework to associate users with a privacy risk score. The score is determined by analyzing the user's messages, and can be used either to alert the user when a posted message exceeds a set privacy risk threshold, or to let the user know where s/he stands compared to the rest of the community. The approach focuses mostly on devising a mathematical function, combining many factors, that calculates the user's privacy risk score. However, the approach does not account for the societal factor, which is the first major difference with respect to our approach. The second major difference is that it concentrates on evaluating the user's score after the privacy leak has occurred, similarly to (Islam et al., 2014), rather than on assessing and predicting the sensitivity of the generic piece of text the user is about to post, which is what our approach does.
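As a minimal sketch, a scoring function in the spirit of (Liu and Terzi, 2010) grows with both the sensitivity of a disclosed item and the visibility of that disclosure; the item names, weights, and helper below are illustrative assumptions, not the authors' actual model.

```python
# Illustrative privacy risk score in the spirit of Liu and Terzi (2010):
# risk grows with the sensitivity of each disclosed item and with how
# visible the disclosure is. Items, weights, and levels are assumptions.

# sensitivity of each disclosed item, in [0, 1]
SENSITIVITY = {"location": 0.8, "health": 0.9, "hobby": 0.2}

# visibility of a disclosure, in [0, 1]
VISIBILITY = {"private": 0.0, "friends": 0.5, "public": 1.0}

def privacy_risk_score(disclosures):
    """disclosures: list of (item, audience) pairs for one user."""
    return sum(SENSITIVITY[item] * VISIBILITY[audience]
               for item, audience in disclosures)

# Example: a public location disclosure plus a friends-only health one.
score = privacy_risk_score([("location", "public"), ("health", "friends")])
print(round(score, 2))  # 0.8 + 0.45 = 1.25
```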
In (Mao et al., 2011), the authors analyze information leaks associated with a number of fixed categories of interest: vacation, drug use, and health condition. The first category concerns users disclosing plans and/or locations of where they will (or will not) be and when. The second concerns people posting messages while under the influence. The third looks into social posts disclosing medical conditions, personal or not. The authors focus on Twitter posts, associating tweets to the aforementioned categories by relying on a set of keywords representative of each category (see the sketch below). The limitation of this approach, and the difference with respect to ours, lies in the small number of information disclosure categories the authors consider, and in the arbitrary, fixed, and subjective set of keywords associated with each category. Our approach, on the other hand, relies on crowd wisdom to determine what should be considered sensitive information, rather than on a fixed set of keywords. As a consequence, we look at a larger spectrum of sensitive information disclosure possibilities, and at a more democratic way of classifying messages, one which is better aligned with society's perception of privacy.
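As a rough illustration of the keyword-matching strategy of (Mao et al., 2011), the sketch below tags a tweet with every category whose keyword list it matches; the category names come from the paper, while the specific keywords are hypothetical.

```python
import re

# Hypothetical keyword lists; (Mao et al., 2011) rely on curated
# sets for these same three categories.
CATEGORY_KEYWORDS = {
    "vacation": {"vacation", "flight", "airport", "hotel"},
    "drug": {"drunk", "wasted", "hungover"},
    "health": {"diagnosed", "surgery", "flu"},
}

def match_categories(tweet):
    """Return every category with at least one keyword in the tweet."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    return {cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws}

print(match_categories("Off to the airport, two weeks of vacation!"))
# -> {'vacation'}
```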
In (Sleeper et al., 2013; Wang et al., 2011), the authors study which messages users of Twitter and Facebook tend to regret. They surveyed a number of users from both platforms to classify regretted posts into categories, and analyzed the effort and time users spend in making amends for their posts, when possible. Both works analyze the aftermath of information disclosure, and focus on educating users on the use of social media and on the implications of underestimating information sharing. Differently, our work focuses more on providing users with insights about privacy leakage, as well as on supporting users with actionable mechanisms that can prevent these (regret) situations from happening altogether.
Hummingbird (Cristofaro et al., 2012) is a Twitter-like service providing users with a high degree of control over their privacy. The service offers fine-grained privacy controls, including the ability to define access control lists (ACLs) for each tweet, and protection against server-side identification of user behavior. With this approach, users are limited to a specific service, and have to proactively address the privacy issue by taking actions, such as defining ACLs, before using the service itself. With our approach, users are free to use any service, do not have to configure any tool, and can assess the amount of information leakage in a message before sharing it.
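The following is only a toy illustration of the per-tweet ACL property; Hummingbird enforces it cryptographically on the server side, without ever seeing tweet plaintexts, so the plain-dictionary check here is a deliberate simplification.

```python
# Toy per-tweet ACL check. Hummingbird provides this property
# cryptographically; here it is simulated with a plain set lookup.
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    acl: set  # usernames allowed to read this tweet

def visible_to(tweet, reader):
    return reader in tweet.acl

t = Tweet("off to the doctor", acl={"alice", "bob"})
print(visible_to(t, "alice"), visible_to(t, "eve"))  # True False
```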
The work in (Kongsgård et al., 2016) focuses on sensitive information leakage detection for corporate documents. The authors employ machine learning techniques to automatically classify a document as sensitive vs. non-sensitive. A curated training set of documents has to be provided to create the classification model: an administrator has to craft, select, and annotate which documents should be considered private vs. not private. This solution can provide a great degree of customization, which is ideal for corporate needs. On the other hand, it is impractical at the large scale our approach focuses on, where an administrator cannot possibly prepare the dataset(s) manually.
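The sketch below shows the general shape of such a supervised pipeline, a binary bag-of-words classifier trained on an administrator-annotated corpus; the model choice and the tiny labeled corpus are our assumptions, not the exact setup of (Kongsgård et al., 2016).

```python
# Minimal sketch of a supervised sensitive/non-sensitive document
# classifier. The hand-labeled corpus and model choice are
# illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "quarterly revenue projections, internal only",
    "employee salary and social security records",
    "cafeteria menu for next week",
    "public press release draft",
]
labels = [1, 1, 0, 0]  # 1 = sensitive, 0 = non-sensitive

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

print(model.predict(["list of employee salaries"]))  # likely [1]
```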
3 DATA SELECTION
Selecting the right data is a crucial task, both for creating the training dataset for our classification model and for validating our approach. Due to the lack of available privacy-related datasets, we had to devise a mechanism to collect and create our own privacy dataset. Our data source is Twitter, where we have been careful to select only tweets that users have marked as fully public (no restrictions). We have used Twitter as our data source due to its openness and public data policy. We have collected millions of tweets from the live sample Twitter stream, over multiple periods of time during Fall 2017 and Spring 2018, as sketched below.
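A minimal sketch of this collection step, using the tweepy library against the Twitter streaming API available at the time and keeping only tweets from unprotected accounts, could look as follows; credentials and the persistence helper are placeholders.

```python
# Sketch of sampling public tweets from the live Twitter stream with
# tweepy 3.x (the API available in 2017-2018). Credentials are
# placeholders; error handling and persistence are omitted.
import tweepy

class PublicTweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only tweets from accounts that are not protected,
        # i.e., tweets the user has marked as fully public.
        if not status.user.protected:
            save_tweet(status)  # hypothetical persistence helper

def save_tweet(status):
    print(status.id_str, status.text[:80])

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, PublicTweetListener())
stream.sample(languages=["en"])  # live random sample of public tweets
```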
Part of this dataset, after annotation, has been used to train the machine learning model and to run privacy leak analysis on historical data. Note that our application also uses the