A Step to Achieve Personalized Human Centric Privacy Policy Summary
Ivan Simon¹, Sherif Haggag¹ and Hussein Haggag²
¹Faculty of Science Engineering and Technology, The University of Adelaide, Adelaide, South Australia, Australia
²Department of Computing Science, Umeå University, Sweden
Keywords: Privacy Policy, Text Summarization, Human-Centric, Machine Learning.
Abstract: Online users continuously come across privacy policies for the services they use. Due to the complexity and verbosity of these policies, the majority of users skip the tedious task of reading the policy and simply accept it. Without reading and evaluating the document, users risk giving up all kinds of rights to their personal data and, for the most part, remain unaware of how their data are shared and handled. Efforts have been made to address the complex and lengthy structure of privacy policies by creating a standardized machine-readable format that web browsers can process automatically, by building a repository of crowdsourced summaries of some privacy policies, or by using natural language processing to summarize the policies. PrivacyInterpreter is a unique tool that acknowledges human-centric factors while summarizing the policy. It generates a personalized summary of the privacy policy, providing the user with information relevant to their privacy concerns. This paper presents the conceptualization of PrivacyInterpreter and implements a proof-of-concept model using a configured RoBERTa(base) model to classify a privacy policy and produce a summary based on privacy aspects that reflect users' privacy concerns.
1 INTRODUCTION
A privacy policy is a document that defines a company's data collection, use, and sharing practices. It legally binds a user who agrees to it. In the Internet-driven world, it is necessary to be wary of one's data as it could be used for economic, political, and even nefarious gains. Privacy policies are notoriously unclear, complex, and wordy (Sigmund, 2021) (Steinfeld, 2016) (McDonald and Cranor, 2008). A recent data breach at Optus exposed sensitive data of 10 million customers, including past and present users. The customers were unaware that their data would be retained for a duration of 6 years. Such information is usually detailed in a privacy policy, but in a fast-paced life such documents are overlooked by users. According to the Australian Community Attitudes to Privacy Survey 2020, only 20% of Australians read and understand the privacy policies they encounter online (Lonergan Research, 2020).
The length of policies, the time needed to read them, and the presence of obscure words have all increased in the last 10 years, especially with the changes introduced by regulations such as the GDPR and CCPA (Amos et al., 2021). Tools have been developed to summarize privacy policies using machine learning (ML) algorithms. Privee (Zimmeck and Bellovin, 2014), PrivacyGuide (Tesfay et al., 2018), and PrivacyCheck (Zaeem et al., 2018) are some examples of the available summarization tools, each with unique features. Privee assigns an overall grade to a policy, helping users understand it quickly (Zimmeck and Bellovin, 2014), while PrivacyGuide classifies privacy policies based on the GDPR (Tesfay et al., 2018). Polisis is an automated framework for privacy policy analysis using neural network classifiers (Harkous et al., 2018).
Current summarization tools have been found to have limited performance (Bracamonte et al., 2019), to lack personalization (Tesfay et al., 2018), and to fail to consider the diversity in privacy concerns among users (Bergström, 2015a) (Škrinjarić et al., 2017a). These drawbacks motivate the development of a new form of policy summarization tool that acknowledges human-centric factors. The main contributions of the paper are:
1. Study of the diversity in online users' privacy concerns.
2. Analysis of the current text classification models used to summarize privacy policies of websites.
3. Design of a personalized privacy policy summarization tool and development of its proof-of-concept model.
The paper is structured as follows: Section 3 contains a review of the literature exploring diverse privacy concerns and an analysis of the current text classification strategies used in summarization tools. Section 4 details the implementation of the proof-of-concept model. In Section 5, the results are presented and the output analysis is given. Lastly, Section 6 concludes the paper, and future work is proposed in Section 7.
2 AIM
The current paper aims to seed the development of a human-centric privacy policy summarization tool,
i.e., PrivacyInterpreter. The output of this tool would
address the concerns of the user and help them make
an informed decision about their data.
3 LITERATURE REVIEW
3.1 Diversity in Online Privacy Concerns
People are diverse and have different aspirations and expectations when choosing a digital service. To predict users' privacy concerns, researchers have looked into various factors such as demography (Bergström, 2015b) (Lee et al., 2019) (Yao et al., 2007), personality traits (Taylor et al., 2015) (Wortman et al., 2012) (Škrinjarić et al., 2017b) (Haggag et al., 2022), the influence of the environment (Lee et al., 2016), and culture. Therefore, this study investigated the role of demographic characteristics and personality traits in users' privacy concerns.
Age: Age positively influenced online privacy concerns; people became more concerned about their privacy as their age increased (Zhang et al., 2020). This may be attributed to increased awareness of information security as people grow older (McCormac et al., 2017). However, in a recent study on the awareness of Single Sign-On (SSO) security, Pratama et al. (Pratama et al., 2022) found that older people had less SSO security awareness than young users. More precise statistics were given by Lee et al. (Lee et al., 2016): information privacy concerns were seen to peak in the age group of 20-30-year-olds, and after the age of 40 the level of concern decreased. Privacy concerns about the use of social media decreased with age, while concerns regarding the use of debit cards for online transactions increased with age (Yao et al., 2007). Age was also closely related to the level of income and internet experience. The younger generation, exposed to the internet very early, was found to have more internet experience than the older generation (Zukowski and Brown, 2007). Being more active online, they were highly likely to be victims of cybercrime (Näsi et al., 2015).
Mohammad and Schreuders (Schreuders et al., 2019) researched cybercrime victimization using police data and found that unwanted contact and online harassment targeted the younger generation. The possibility of being a target of such crime decreased with increasing age (30% for those aged 25-34 and 17% for those aged 35-44), and the study showed that the most targeted age group was 25-34 years. It was also seen that the victims of sexual and indecent image reception mostly fell in the age group of 5-14 years. According to Karagiannopoulos et al. (Karagiannopoulos et al., 2021), people aged 60 years and above feared sharing personal information such as banking details and email addresses, and some of them were not well-versed in good cyber hygiene practices either. Young people who may be well educated about cyber hygiene were still not safe, because they engaged in various online routines that increased their probability of encountering cyber fraud (this was true for educated people as well) (Whitty, 2019). For cyber scams, older adults were more prone to investment-related scams while young people were more prone to consumer-related scams (Whitty, 2020). Many online activities invited computer viruses to infect the system; the probability of contracting a computer virus decreased with an increase in age (Ngo and Paternoster, 2011). This was true for cybercrime regarding online defamation too.
Gender: Research examining the role of gender in cyber security awareness suggested that females were more vulnerable than males to cyber threats, either due to the lower security efficacy found in females (Anwar et al., 2017) or due to the underrepresentation of females in IT-related majors or work (Pratama and Firmansyah, 2021). Very small or no statistically significant difference was reported between males and females in SSO security awareness (Pratama et al., 2022), privacy concerns (Näsi et al., 2015) and information security scores (McCormac et al., 2017). Interestingly, it was observed that women had higher information privacy concerns than men until 30 years of age, while men after the age of 40 showed more concern for their privacy than women (Lee et al., 2016). Women were more susceptible to phishing emails than men (McCormac et al., 2017) (Halevi et al., 2013), but it was also found that both men and women were equally skillful in identifying suspicious links in an email (Goel et al., 2017). Therefore, as a precautionary step, it is important to warn women more about phishing emails than men.
Cybercrimes related to unwanted contact, harassment, and obscene image reception victimized more females, while fraud and theft targeted more males (Schreuders et al., 2019). Defamation and threats were among the most frequent cybercrimes (Näsi et al., 2015), and males were their common targets. When it came to using credit or debit cards online, women were more concerned than men about their online use (Bergström, 2015a). Men had a higher tendency to fall for investment scams than women, and consumer scams victimized women more than men (Whitty, 2020). Whitty's (Whitty, 2020) research revealed that men and educated people were just as likely to be preyed on by cyber scams as women and less educated people. Therefore, it is necessary to educate and caution men and women equally about cyber scams. Using an ordered probit model, Škrinjarić et al. (Škrinjarić et al., 2017a) found that students were less concerned about their online privacy than self-employed people. As people became more educated, their level of privacy concern decreased. Education level and income positively influenced information privacy concerns (Lee et al., 2016).
3.1.1 Personality Traits and Privacy
Two of the Big Five personality traits were found to be closely related to online privacy concerns, namely "extraversion" and "neuroticism" (Škrinjarić et al., 2017a). Extraversion was inversely related to online privacy concerns: a more extraverted person was found to be less concerned. Openness was another driving trait that led people to publish information on social media sites and was correlated with less stringent privacy settings. Such activities suggested that the user may be less aware of the risks related to information leaks or may be blinded by the advantages of sharing the information on social media (Halevi et al., 2013). Therefore, openness can be related to the disclosure of extra information. Agreeableness, conscientiousness and neuroticism were the personality traits of users who gave importance to privacy risks (Tang et al., 2020). A person who was moodier, had a high tendency of being vexed, or was driven more by negative experiences was found to be more concerned about his/her online privacy (Tang et al., 2020). High SSO security awareness was found in people who were emotionally stable and conscientious (Pratama et al., 2022). More agreeable people were likely to have less awareness of security features like SSO, but if they were more concerned about their privacy, their behaviour was the opposite (Pratama et al., 2022). People who had high confidence in their ability to cope with any situation had fewer online privacy concerns (Yao et al., 2007). More conscientious and less risk-taking individuals had high information security awareness. Bergström (Bergström, 2015a) showed that trust in other people was an important determinant in understanding people's worries regarding the unethical use of personal information while using email.
Concerning cybercrime, an impulsive nature (Neupane et al., 2016) and emotional instability led to victimization by phishing and romance scams (Whitty, 2020). If self-control is described as the ability of an individual to resist in the face of temptation, then low self-control increased the chances of experiencing online harassment (Ngo and Paternoster, 2011). In another study, it was seen that vulnerability to phishing attacks was inversely related to extraversion and positively correlated with neuroticism and openness, especially for females (Halevi et al., 2013). High urgency and sensation-seeking behaviour (subcategories of impulsive behaviour) correlated with engagement in routine activities that increased the chances of victimization by cyber fraud (Whitty, 2019). Counter-intuitively, Whitty (Whitty, 2019) also found that people who had perseverance, were educated and believed that events happened because of their actions and not by fate were involved in online routines that made them more vulnerable to cyber fraud. Beyond the Big Five personality traits, variables like "privacy awareness and computer anxiety" positively drove online privacy concerns (Škrinjarić et al., 2017a). Privacy concerns regarding the collection of data were driven by the compound traits "need for cognition" and "risk orientation" (Taylor et al., 2015).
3.1.2 Culture and Privacy Concerns
The cultural dimensions at the country level can be expressed broadly using individualism versus collectivism. At a macroscopic level, collectivist culture values togetherness, the group, or the community, while individualistic culture gives importance to uniqueness and independence. It was found that collectivistic individuals gave more importance to group privacy and were more particular than individualistic people about the audience who could view their information (Li, 2022). In individualistic cultures, users were found to have a greater likelihood of sharing images and videos on public platforms, while collectivistic people were more comfortable sharing only in specific, already involved groups such as a workspace (Li, 2022).
Individualistic users were found to be more concerned about individual privacy and aspired to control strategies to protect personal privacy (Li, 2022). The authors showed that collectivistic individuals prioritise privacy risks in the use of Social Networking Sites (SNS) more than individualistic individuals (Trepte et al., 2017). Pertaining to the carefree characteristic of individualistic culture, the privacy implications of information shared by themselves or by others could be pointed out to such users (Li, 2022); for example, when a user shares co-owned data, will all the co-owners be notified? A study that surveyed participants from 38 countries found that a lower level of individualism, i.e., collectivism, corresponded to a higher level of concern for errors in database storage (Bellman et al., 2004). However, data collection by the government or by users' employers is more supported by collectivistic cultures than individualistic cultures; the latter tend to put more trust in entities they already have a link with or have paid for (Li, 2022). Collectivistic users are more comfortable than individualistic individuals with data collection practices for automatic decision-making or customization (Li, 2022). While information on the involvement of a third party in data collection could retract the consent of a user from a collectivistic culture, the same information could make an individualistic user feel more comfortable about data collection (Li, 2022).
3.1.3 Attitudes of Australians Towards Data Privacy
The current research aims to develop a human-centric tool specifically for Australia; therefore, it is necessary to understand what online privacy means to Australians. The Australian Community Attitudes to Privacy Survey 2020 (Lonergan Research, 2020) helped to understand the current dynamics of Australia's privacy concerns. Some of the key findings used in the research are detailed below.
Australians aged 18-24 years did not think much
about data privacy (54%) when selecting an online
service while those over the age of 65 years con-
sidered data privacy as one of the top 3 priorities
(80%). Over 49% of the Australians aged 50 years
and over were faced with unwanted communication
for marketing purposes. The trend reduced by a small
margin for individuals in the age group 35-49 years
(42%) and reduced further to 35% for those aged 18-
34 years. People aged 18-34 years actively reported
about unnecessary data collection for a service. From
a gender point of view, males (35%) reported more
about unnecessary hoarding of data as compared to
females (27%).
Identity theft, fraud and data breaches were con-
sidered the biggest risk by all the age groups with a
slight reduction in percentage across younger groups.
People even considered ”sending information over-
seas” as a major risk. This risk prevailed in all age
groups with 49% for 50 years and over, 35% for 35-49
years old and 34% for those aged 18-34 years. Digi-
tal practices like data sharing, location tracking, pro-
filing, and use of AI also caused a high level of dis-
comfort to the people. With location tracking, keep-
ing databases of online activities and targeted adver-
tisements in particular, older people (65 years and
above) were more uncomfortable (75%) when com-
pared to the younger generation (18-34 years old;
20%). But privacy concerns regarding social media
decreased with an increase in age (overall percent-
age hovers around 14%). 66% of individuals were
found to be more reluctant to provide biometric in-
formation compared to health information and loca-
tion data. Surprisingly, Australians were compara-
tively comfortable providing other biometric informa-
tion like voice prints (30%), facial images (35%) and
fingerprints (43%) through smart devices and for gov-
ernment purposes. This comfort was only limited to
a few organizations like the government and banking
sector. An equal proportion of comfort and discom-
fort was seen for digital practices by the government
like the use of facial recognition, video surveillance,
identification of suspects, and use of biometrics.
Knowledge about privacy varied across age
groups. 29% of the young generation (18-34 years of
age) and 53% of early adopters were confident about
their knowledge. Slightly fewer older Australians,
i.e., aged 50 - 64+ years, rated their knowledge to
be excellent or very good. Among the people who
claimed to have excellent knowledge of data protec-
tion, 79% of them faced problems regarding the han-
dling of PI. When it came to access to personal data,
older Australians were more dissatisfied compared
to those aged 18-34 years and 35-49 years. A greater
number of females and older Australians were un-
comfortable sharing their location data when com-
pared to males and those aged 18-49 years.
Australians considered the social media industry
the least trustworthy, followed by search engines and
apps. They had comparatively more trust in health
care services, federal government departments, fi-
nancial institutes and telecommunication providers.
Compared to the younger generation, fewer people
aged 35-49 years and 50+ years used privacy protec-
tion like the use of VPNs or privacy-preserving search
engines. Females were more likely than males to ad-
just privacy settings on social networking websites
and to change location sharing settings. But a greater
number of males used VPNs, ad blockers and incog-
nito mode than females.
The use of AI without the knowledge of the user was among the major privacy risks, especially for those aged 50 years and above (28%). This concern was slightly lower for those aged 35-49 years (23%) and even lower for 18-34-year-olds (20%). Younger Australians (31%) trusted decisions made by AI, and this trust gradually decreased with increasing age. An almost similar trend was seen in the use of AI for different purposes. People were more uncomfortable with the private sector than the public sector using AI.
Attitude towards children's privacy: Australian parents were very concerned about their children's online privacy and expected that online services or websites would take certain steps to protect the privacy of their children, such as verifying age before collecting data, setting the default privacy setting to a high-privacy mode, switching off location tracking for children, and collecting a minimum amount of data.
Parents also struggled to find out ways to protect
the personal information of their children whilst us-
ing the service. They found the task of comparing
different privacy policies to select the best service or
website difficult.
3.1.4 Cyber Threats and Privacy Policy
Common privacy-related cyber threats in Australia are identity theft, cyber fraud/scams (consumer or investment related), online harassment (unwanted contact, illicit message or image reception), defamation and phishing emails. One of the causes of the aforementioned threats is the leaking of data, especially personal information (PI) (Lonergan Research, 2020). In
any online activity a certain amount of user data is
saved by the hosting organization. When PI is shared
with a service provider it is always at a risk of being
exposed to the public (Goel et al., 2017). If the data
is not securely stored in the database and this data is
breached, users’ data is exposed to the world. Bad
actors can use this publicly available data against the
user to achieve their nefarious ends. Reyns (Reyns,
2013) through his research showed that routine activ-
ities like online shopping, banking, watching online
shows or listening to music or news and online activi-
ties that involve sharing of PI with third parties (Reyns
and Henson, 2016), are positively related to identity
theft victimization. Thus, handling of users’ data by
organizations is an essential factor when dealing with
identity theft (Milne et al., 2004), online fraud and
scams.
The organizations must use appropriate security
controls and impede violation of users’ privacy. Ac-
cording to the 10 Generally Accepted Privacy Princi-
ples (GAPP), a privacy policy should contain the se-
curity measures adopted by the organization. This in-
formation would help users to weigh the risk involved
in an activity (Balapour et al., 2020) and avoid deal-
ings that could lead to cyber scams/ frauds.
Therefore, an online service provider must at the
least use security controls like encryption (on both in-
transit and stored data), network security, data breach
response plan, access control (Culnan, 2019) and de-
identification of data (Chang et al., 2018). If the data
is outsourced for any purpose, it is necessary to im-
plement the same level of security by the associated
parties. Such measures could appease users’ online
concerns against activities like theft of data (or iden-
tity theft), online defamation and cyberbullying (data
could be a user’s possibly embarrassing facts) (Chang
et al., 2018).
3.2 Text Classification Strategies
Traditional text classification models like Naive Bayes, Random Forest, and Support Vector Machines emphasize the feature extraction process (Solovyeva and Abdullah, 2022). Bag-of-Words and its variant, term frequency-inverse document frequency, lack the ability to capture the semantics of words, create a sparse matrix, and are computationally expensive (Verma et al., 2021). Although static word embeddings like Word2Vec and GloVe learn an embedding by analyzing the surrounding words, they generate only one embedding per word irrespective of how the word is used in a sentence and ignore the possibility of words having more than one meaning (Verma et al., 2021). This led to state-of-the-art models like BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) that use embeddings like WordPiece (Wu et al., 2016). WordPiece and Byte-Pair Encoding learn the embeddings of subwords and capture the context in which a word is used. The embedding of subwords solves the problem of out-of-vocabulary (OOV) words by combining the representations of the subwords to form the OOV word.
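As a brief, hedged illustration (using the publicly available roberta-base tokenizer from the HuggingFace Transformers library, not a component of any of the tools below), subword tokenization splits a rare word into known units rather than mapping it to an unknown token:

```python
# Illustrative only: subword (Byte-Pair Encoding) tokenization of a rare word.
# Requires the `transformers` package; the roberta-base tokenizer is downloaded
# on first use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# A rare word is split into several known subword units, so the model still
# receives a meaningful representation instead of an unknown token.
print(tokenizer.tokenize("pseudonymisation"))

# A frequent word is typically kept as a single token.
print(tokenizer.tokenize("privacy"))
```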
Some of the strategies used for summarizing and classifying privacy policies are detailed below:
1) PrivacyGuide (Tesfay et al., 2018)
It outputs the privacy categories and associated risk levels, along with the corresponding snippets of the policy.
In the preprocessing step, the StringToWordVector filter and term frequency-inverse document frequency (TF-IDF) were used. In addition to removing stop words and punctuation, tokenization, and stemming, the authors used a set of keywords to filter policy fragments and reduce the load on the classifiers. These keywords for the privacy categories were discovered during the manual labelling of the training data set.
The authors used ML APIs from the Waikato Environment for Knowledge Analysis (WEKA) and Java as the programming language. The prediction takes place in two parts: initially, the privacy aspect is predicted, followed by the prediction of the risk levels. On comparing the performance of PrivacyGuide with four different classifiers (namely, Random Forest, Decision Tree, Naïve Bayes, and Support Vector Machine), the one with the Naïve Bayes engine performed the best. The authors also observed that the time for prediction decreased with every policy check, but the time to collect and pre-process the policy was constant.
2) PrivacyCheck (Zaeem et al., 2018)
PrivacyCheck is a browser extension that preprocesses the privacy policy captured from the input URL. The preprocessed text is sent to a data mining server to predict the risk level corresponding to 10 privacy factors (Zaeem et al., 2018). The client side was programmed using HTML and JavaScript (p. 53:5). A list of keywords was generated manually by analyzing 400 privacy policies (Zaeem et al., 2018). Eleven data mining models were trained: one for each of the ten privacy factors, and the eleventh to validate whether the URL corresponds to a privacy policy (Zaeem et al., 2018). The training data set was prepared by manually annotating 400 privacy policies. The Google Prediction API was used; therefore, the authors were unaware of the model used, as Google keeps the classification model confidential.
3) Privee (Zimmeck and Bellovin, 2014)
Privee is a browser extension written in JavaScript. The pre-processing stage is unique in that it saves the positions of the punctuation marks after removal and uses bigrams for better classification. Feature selection is conducted using regular expressions and term frequency-inverse document frequency (TF-IDF).
The text prediction takes place in three steps. Initially, the text is subjected to bigram regular expressions; if a bigram matches, the text is classified using a rule-based classifier. Otherwise, in step two, a more general regular expression is used to extract the feature, and stemming is done before input into the ML classifier. If no feature is extracted, it is concluded that the specific feature is not allowed or not present. For the design of the ML classifier, the WEKA library is used. After testing the various algorithms in WEKA, the multinomial version of the naïve Bayes algorithm was selected. The multi-label classifier is divided into multiple instances with one classifier per category.
4) Polisis (Harkous et al., 2018)
Polisis annotates privacy policies into 10 high-level and 122 fine-grained classes using hierarchical neural network classifiers. The design of Polisis utilizes an application layer that acts as a front end for the reception and resolution of a user query. A data link layer scrapes the privacy policy and pre-processes it before feeding the queried privacy policy to the ML layer. The authors confirmed that when a webpage contains dynamic content, this dynamic content is already loaded into the webpage with the rest of the static content; therefore, scraping a webpage after it is fully loaded gives the entire content of the webpage. Special steps were taken to prevent noisy annotation due to list aggregation. The acquired content is initially segmented according to HTML div-tags and p-tags, and then subdivided again using an unsupervised machine learning algorithm based on domain-specific word embeddings.
The creation of domain-specific word embeddings is considered the first stage of the machine learning layer. In the second stage, these word embeddings are used to train a neural network. To train the word embeddings, fastText (a model that considers the composition and derivation of words) was used on a corpus compiled by crawling 130K privacy policies of applications from the Google Play Store. Polisis uses Convolutional Neural Networks for multi-label classification at two levels: one level classifies high-level privacy categories and the other classifies values for each privacy attribute. The OPP-115 data set was used to train the classifiers. The categories and attributes correspond to the ones used in OPP-115.
5) BERT, RoBERTa, and PrivBERT
BERT (Bidirectional Encoder Representations from Transformers) is a multi-layer bidirectional transformer encoder. It is pretrained on unlabelled data using Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) (Devlin et al., 2018). In the MLM task the model must process the entire sequence before predicting the masked token, giving it a better understanding of the context and flow of the language. The NSP task helps the model learn the relation between two sentences. The pre-training is done on a large dataset containing BooksCorpus and English Wikipedia (Devlin et al., 2018). Fine-tuning involves adding a layer on top of the core model; task-specific input and output are given to the model, and the parameters are fine-tuned on the downstream task.
RoBERTa (Robustly optimized BERT) is a BERT model with an optimized pre-training method. It uses dynamic masking, i.e., the masking of a sequence changes every time it is fed to the model (Liu et al., 2019). Byte-Pair Encoding is used instead of WordPiece, the NSP task is removed, training is done over a larger dataset with bigger batches (a maximum of 500k steps with a batch size of 8K), and the input sequence length is longer than in the original BERT (Liu et al., 2019).
PrivBERT was developed by taking the pretrained RoBERTa(base) model and retraining it on privacy policy documents. The vocabulary produced by RoBERTa's byte-level tokenization is retained, and out-of-vocabulary words were later incorporated through fine-tuning to capture privacy jargon (Srinath et al., 2020). For fine-tuning the model, the OPP-115 annotated corpus was used. The performance evaluation of PrivBERT for the classification of data practices in a privacy policy showed an improvement over Polisis and RoBERTa (Srinath et al., 2020).
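As a hedged sketch of how a RoBERTa-family checkpoint can be configured for this kind of multi-class classification (the checkpoint name, label count and example segment below are assumptions for illustration; the PrivBERT checkpoint released by its authors can be substituted where available):

```python
# Sketch only: attaching a sequence-classification head to a RoBERTa-family
# checkpoint, as done conceptually when fine-tuning on OPP-115-style labels.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "roberta-base"  # or a PrivBERT checkpoint, if available
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=12,  # one label per privacy aspect in this proof of concept
)

segment = "We may share your personal information with our advertising partners."
inputs = tokenizer(segment, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, 12)
predicted_label = int(logits.argmax(dim=-1))
```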
3.3 Training Dataset
1) OPP-115 (Wilson et al., 2016).
The OPP-115 corpus contains 115 privacy policies with annotations at two levels, amounting to 23K annotated data practices. First, paragraph-sized segments are annotated with one of 10 high-level privacy categories, followed by annotations of attribute-value pairs (for example, information type: email address) for each privacy category. In total, there are 10 high-level categories and 20 attributes with 138 different values.
2) PrivaSeer (Srinath et al., 2020).
PrivaSeer is a corpus of 1,005,380 privacy policies. The authors used Scrapy to crawl websites and Langid (a Python package) to filter the documents into 97 languages (p. 6831). Another Python package, Boilerplate, was used to remove the extra content of a web page such as the header, footer, and navigation tabs. To identify English privacy policies among the 3.2 million scraped URLs, four machine learning models were trained: three random forest models and a fine-tuned pretrained transformer-based language model (RoBERTa) (p. 6832). The size of the corpus is approximately 64 GB when uncompressed.
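A small, illustrative sketch of the language-filtering step described above, assuming the langid package; the documents are placeholders:

```python
# Illustrative sketch: keeping only English documents with langid, as in the
# language-filtering step described for PrivaSeer. The documents are placeholders.
import langid

documents = [
    "This privacy policy explains how we collect and use your data.",
    "Diese Datenschutzerklärung beschreibt, wie wir Ihre Daten verarbeiten.",
]

english_docs = []
for doc in documents:
    lang, score = langid.classify(doc)  # returns (language code, confidence score)
    if lang == "en":
        english_docs.append(doc)

print(len(english_docs), "English document(s) kept")
```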
3) CC-NEWS (Nagel, 2016).
CC-NEWS is a dataset from Common Crawl. It crawls news websites around the world every day using StormCrawler (https://stormcrawler.net/). The data set contains news articles and is available on AWS S3. The raw crawled data are presented in WARC (Web ARChive) format, and the data set also stores metadata (WAT format) and text data (WET format) of the crawled sites. The size of the complete data set is in the petabyte range, and it contains data collected over 12 years.
Figure 1: Tool Design.
4 METHODOLOGY
4.1 Privacy Categories
Based on the findings detailed in the literature review, the following privacy categories have been created to add human-centric aspects to a privacy summarization tool:
Category 0: Australians
1. What Personal Information (PI) is being collected? How is it collected, stored and protected?
2. Whether or not PI is shared with overseas organi-
zations? Which country does the sharing organi-
zation belong to?
3. Reason for collecting the PI.
4. Possibility of user profiling for targeted advertise-
ment or business analysis purposes (it covers the
data sharing practices).
5. Duration of holding the PI.
6. How will a breach of data be handled?
7. How to access given personal data?
8. How to complain about a privacy breach? Is there
any practice that is exempted from The Privacy
Act?
9. How to deal with an organization without giving
away PI, i.e., in an anonymous manner?
Similarly, additional rights that Australians seek are:
1. Right to ask a business to delete their PI
2. Right to ask for compensation in court for a
breach of privacy
3. Right to know about the use of PI in automated
decision making
Figure 2: Policy Summarization Logic.
4. Right to object to certain data practices but still be
able to use the services of the organization
Necessary warnings against:
1. identity theft and cyber fraud
2. storing user’s information overseas
3. use of AI by the private sector
4. government sector sharing information with pri-
vate sector
5. insecure storage of user’s data
6. services related to social media and insurance companies
7. third-party data collection if present
Category 1: Young
As privacy is not their major concern when selecting
an online service or website, it is important to keep the
summary succinct, highlighting their major concerns.
Necessary details to be mentioned are:
1. irrelevant data collection
2. the possibility of targeted advertisements
3. whether and how the user can access their (sup-
plied) personal information
4. specific cybercrimes to be notified are:
Unwanted contact and online harassment, reception of illicit images, consumer-related scams, and the possibility of downloading a virus.
Category 2: Old
The majority of older people were found to have less exposure to the internet and more concerns compared to the younger generation. The focus points based on their concerns found in the literature are:
1. the possibility of unwanted communication for
marketing purposes and targeted advertisement
2. unexpected information collection
3. location and online activity tracking
4. the collection of banking details
5. combining and sharing personal data among orga-
nizations.
6. disclosure of data supplied to another user
7. storing of biometric information in the device or
organization’s database
8. the use of AI in decision making along with what
part of their PI will be used in the AI algorithm
and whether a human is reviewing the decision
made by the AI.
9. Specific cybercrime to be notified is:
Investment-related scams
Category 3 and 4: Women and Men
Only faint distinctions between the privacy concerns of men and women are found in the literature. This might be because a user's online privacy concerns are driven more by age and personality than by gender. Women are found to be vulnerable to privacy-related cyber threats similar to those faced by the young generation.
Necessary information for Women to be included are:
1. the collection of location details
2. the collection of banking details
3. the use of AI in the decision-making process
4. specific cybercrimes to be notified are:
Unwanted contact and online harassment, ob-
scene image reception, consumer-related scams
and phishing emails.
Necessary information for Men to be included are:
1. irrelevant data collection
2. status of default privacy setting
3. specific cybercrimes to be notified are:
Online fraud and theft, defamation and threat and
investment-related scams
Category 5: Parents
Parameters that parents need to know regarding their
children’s privacy are:
1. location tracking by organizations
2. profiling (includes both collecting and inferring
sensitive information) and selling data to third
parties
3. targeting advertisements based on online profiling
4. unnecessary information about children being
asked when a parent wants to use a service
Category 6: Personality Traits
1. Extraversion: A person with high extraversion
characteristics should be made aware of the con-
sequences of sharing too many details online as
they are less concerned about their privacy. They
should also be educated about security features
like SSO. A person with low extraversion should
be cautioned about the possibility of phishing at-
tacks.
2. Neuroticism: Warning about cyber scams and
phishing attacks should be given to people with
high neuroticism.
3. Openness: People who score high on this fea-
ture should be made aware of the risk of informa-
tion leak as they, like highly extroverted persons,
freely share information online. They should be
informed about the different ways to harden pri-
vacy settings. As openness is directly related to
phishing vulnerability, a person with high open-
ness should all the more be warned about phishing
attacks.
4. Impulsivity: Highly impulsive people should be
cautioned about phishing attacks, cyber scams and
cyber fraud.
5. Need for cognition: People who are highly cog-
nitive are more sensitive towards data collection
practices, hence they should be warned about the
same.
6. Risk orientation: Like the need for cognition, a
low-risk taker should also be warned about data
collection practices.
Category 7: Miscellaneous
1. Native/Non-Native: People who are non-native
should be cautioned about online fraud and scams
more prominently than native people.
2. Education level: Students should be educated and
informed about the security features like Single-
Sign-On and Two-Factor Authentication.
The high-level design of the policy summarization tool is shown in Figure 1. To develop the proof of concept, a mock customization unit is used in which four user types are present, each with 5 privacy aspects. The modified tool consists of three main components:
1) Policy Acquirement Unit (PAU):
The PAU scrapes the content of the privacy policy
using Selenium and the Beautiful Soup library from
the URL given by the user. After manually analyzing
several online privacy policies, it was found that the
main content is present under ’p’ tags (including the
’li’ and ’a’ tags within the p tags). Therefore, the data
within the ’p’ tags and consequently ’li’ and ’a’ tags
were scraped. To deal with dynamic pages, the unit
waits till all the JavaScript code is loaded and then
scrapes the content. At the same time, another ap-
proach was used, in which all embedded URLs were
collected first from the page and then crawled one by
one.
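A minimal sketch of this acquisition step, assuming Selenium with a local Chrome driver and Beautiful Soup; the fixed wait and the hypothetical URL are simplifications of the behaviour described above:

```python
# Sketch of the Policy Acquirement Unit: render the page (so JavaScript-driven
# content is loaded), then collect the text inside 'p' tags, which also covers
# nested 'li' and 'a' tags. Assumes Selenium with a Chrome driver and BeautifulSoup.
import time
from selenium import webdriver
from bs4 import BeautifulSoup

def scrape_policy(url: str, wait_seconds: int = 5) -> list[str]:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)          # crude wait for JavaScript to finish loading
        html = driver.page_source
    finally:
        driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    # get_text() on each 'p' tag also returns the text of nested 'li' and 'a' tags.
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Example (hypothetical URL):
# paragraphs = scrape_policy("https://example.com/privacy")
```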
2) Text Preprocessing Unit (TPU):
It takes the scraped data and converts it into a dictio-
nary of lists. The data are checked for duplicate sen-
tences and then tokenized into separate words. The
list of tokenized words is checked and sentences with
less than 3 words are removed. This procedure removes redundant sentences such as language options, "Go to top" and "Contact Us".
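A sketch of this preprocessing under the assumption that the input is the list of scraped paragraphs; the whitespace tokenization and the three-word threshold follow the description above:

```python
# Sketch of the Text Preprocessing Unit: drop duplicate sentences, tokenize into
# words, and discard very short fragments such as "Go to top" or "Contact Us".
def preprocess(paragraphs: list[str], min_words: int = 3) -> dict[str, list[str]]:
    seen = set()
    kept = []
    for sentence in paragraphs:
        normalized = sentence.strip()
        if not normalized or normalized in seen:
            continue                      # remove duplicate sentences
        seen.add(normalized)
        tokens = normalized.split()       # simple whitespace tokenization
        if len(tokens) < min_words:
            continue                      # drop fragments with fewer than 3 words
        kept.append(normalized)
    return {"sentences": kept}

# Example:
# cleaned = preprocess(["Contact Us", "We collect your email address to create an account."])
```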
3) Machine Learning Unit (MLU) (Figure 2):
It is the core unit that generates the summary of the privacy policy. The summarization takes place in two stages:
1. The scraped privacy policy is classified with the help of a text classification model. The output of the classification model contains segments of the privacy policy corresponding to different privacy aspects.
2. Only the segments of the privacy aspects that suit the user are selected and fed to a text summarization model to generate readable snippets of the information regarding those privacy aspects. The resultant snippets are combined to form the personalized summary of the privacy policy for the user.
Text Classification Model: For text classification, the paper uses the PrivBERT language model. It is based on the RoBERTa base model (Liu et al., 2019), which is pre-trained with a dynamic Masked Language Modelling (MLM) task on a large data set of approximately 160 GB (Srinath et al., 2020). Like the RoBERTa base, PrivBERT accepts a maximum of 512 tokens per input. To tokenize the training data, Byte-Pair Encoding is used with a vocabulary of size 50k. The HuggingFace Transformers (Wolf et al., 2019) library provides the tokenizer and the basic PrivBERT model.
To train this model, the OPP-115 corpus (Wilson et al., 2016) was used. The "consolidated" sub-directory in the OPP-115 corpus contains CSV files of the annotated privacy policies with redundancies removed (since more than one annotator worked on each policy). The corpus contains paragraph-sized segments classified into 11 privacy categories. The dataset is structured in Pandas dataframe format with columns representing the text and the privacy category. The data set was split into a 3:1:1 ratio for training, testing, and validation.
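A sketch of the 3:1:1 split, assuming the annotated segments have already been assembled into a dataframe with text and category columns (the toy rows and column names below are placeholders, and scikit-learn is assumed for the splitting):

```python
# Illustrative 3:1:1 split of annotated segments into train, test and validation
# sets. The dataframe contents and column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": ["We collect your email address.", "We share data with partners."] * 50,
    "category": ["First Party Collection/Use", "Third Party Sharing/Collection"] * 50,
})

train_df, rest_df = train_test_split(df, test_size=0.4, random_state=42,
                                     stratify=df["category"])
test_df, val_df = train_test_split(rest_df, test_size=0.5, random_state=42,
                                   stratify=rest_df["category"])
# Resulting proportions: 60% train, 20% test, 20% validation (a 3:1:1 ratio).
print(len(train_df), len(test_df), len(val_df))
```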
Text Summarization Model: An extractive summarization technique is used, employing Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and cosine similarity. The cosine similarity gives the similarity between two sentences in vector form by calculating the angle between the two vectors. First, the TF-IDF weights for each word in a sentence are calculated after excluding the stop words. The TF-IDF weights for each sentence are vectorized and used to calculate the cosine similarity between pairs of sentences. As cosine is calculated, two sentences with a smaller angle between them have a higher similarity score. The summary of a privacy aspect is created by selecting the top five sentences with the highest similarity scores. Similarly, the summaries of all the privacy aspects are generated and then compiled together to form the output summary.
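A compact sketch of one reasonable reading of this extractive step, assuming scikit-learn for the TF-IDF vectorization and cosine similarity; scoring each sentence by its summed similarity to the others is an interpretation, not the exact implementation:

```python
# Sketch of the extractive summarizer: TF-IDF vectorization of the sentences in
# a classified segment, pairwise cosine similarity, and selection of the five
# sentences most similar to the rest of the segment.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences: list[str], top_k: int = 5) -> list[str]:
    if len(sentences) <= top_k:
        return sentences
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    similarity = cosine_similarity(tfidf)          # pairwise sentence similarity
    np.fill_diagonal(similarity, 0.0)              # ignore self-similarity
    scores = similarity.sum(axis=1)                # overall similarity score per sentence
    top = np.argsort(scores)[-top_k:]
    return [sentences[i] for i in sorted(top)]     # keep the original sentence order
```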
5 RESULTS AND ANALYSIS
5.1 Results
At first, 600 samples from the training and validation data sets were taken to train and validate the model. The model was trained with a batch size of 8 and a learning rate of 2e-5 for 10 epochs. The loss and accuracy were calculated over the validation set after the completion of each epoch. An accuracy of 0.7117 was seen for all 10 epochs, with an average loss of 0.9121 over the validation set. The second time, 2638 samples were used to train and 647 samples to validate the model. The model was again trained for 10 epochs, but with a batch size of 12 and a learning rate of 2.5e-5 (inspired by PrivBERT (Srinath et al., 2020)). Improved performance was seen after the completion of 10 epochs: the model attained an accuracy of 0.7814 and a loss of 0.6652 over the validation set (Figure 4).
Sample run: The URL of the Stack Overflow website was provided and the user type was selected as "young female". The classification model took 1.50 minutes, and the time taken to extract similar text from the classified segments to form the summary was negligible. The scraped data consisted of 7199 words, and the output summary consisted of 908 words. The pre-selected privacy aspects for "young female" were "Policy Change", "International and Specific Audiences", "Practice Not Covered", and "Do Not Track". Sample runs of more websites are summarised in Figure 3. The average time to classify a policy into 12 labels and to select 5 essential privacy aspects was found to be less than 2 minutes. The Appendix contains the model metrics after each training epoch and the privacy policy summary output.
5.2 Output Analysis
The absence of a privacy policy corpus with micro-level annotations covering the wide range of privacy aspects used in the conceptualization of PrivacyInterpreter restrained the current implementation to classifying the policy into 12 privacy aspects. These privacy aspects reflect the labels present in the annotated OPP-115 corpus, on which the classification model (PrivBERT) was trained. To demonstrate customization of the policy summary, a dummy customization function was created. The function takes a string input representing the age and gender of the user and outputs a list of IDs corresponding to the privacy aspects.
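A hypothetical stand-in for that function; the user-type strings and aspect IDs below are placeholders, not the mapping used in the reported experiments:

```python
# Hypothetical stand-in for the dummy customization unit: map a user-type string
# to a list of privacy-aspect IDs. The four user types and the IDs shown here
# are placeholders for illustration only.
def customize(user_type: str) -> list[int]:
    mapping = {
        "young female": [3, 5, 8, 10, 11],
        "young male":   [1, 3, 5, 8, 11],
        "old female":   [0, 2, 4, 6, 9],
        "old male":     [0, 2, 4, 7, 9],
    }
    return mapping.get(user_type.lower(), list(range(5)))  # default: first five aspects

# Example: aspects = customize("young female")
```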
To calculate the execution time, the ipython-autotime module is used; it reports the execution time of the code in each cell of the Google Colab notebook. It was observed that the time taken to scrape a privacy policy and classify it into segments was the limiting factor in the overall time taken to generate the output summary. The other preprocessing steps and even the extractive summarization algorithm took less than a minute to execute. The time taken to scrape a webpage varied according to the size and design of the webpage. The output of the scraped data is essential, as it acts as the building block of the output policy summary. The website scraping code obtains the HTML document efficiently, but more refinement is required to obtain data between different tags (such as 'li' and 'a' tags) in a well-formatted way. Also, when crawling the embedded links within a page, many links were found to be redundant. The presence of links such as 'back to top' or links presenting the same content in a different language increased the size of the input data. Thus, the crawling process was removed and the page is scraped only after all the JavaScript code is loaded, to ensure that the complete content of the privacy policy page is captured.
Although state-of-the-art transformer models gen-
erate an abstractive summary, it is limited by the size
of the input data and the output summary. For ex-
ample, the Pegasus model (Zhang et al., 2020) gives
optimal results but is restricted to an input token size
of 512 and produces an output of 256 tokens. There-
fore, more steps would be required to pre-process and
split the classified segments obtained from the classi-
fication model into the acceptable input size of the Pe-
gasus model. The use of cosine similarity for summa-
rization overcomes this drawback. There is no restric-
tion on the number of sentences. The input sentences
are tokenized, vectorized, and the similarity score is
calculated. The sentences with the highest similarity
score are selected to form the output. Thus, the cosine
similarity algorithm for text summarization also gives
the flexibility to adjust the number of sentences in the
output summary.
Figure 3: Tool Performance Analysis.
5.3 Comparison with Existing Policy
Summarization Tools
1. PrivacyCheck (Zaeem et al., 2018)
This tool uses a classification data mining model
to inform the users about the level of risk involved
in factors relating to the privacy and security of
the user. It checks for 10 privacy factors and for
each factor it asks some basic questions. Aspects
covered by the privacy factors are based on a
study of privacy protection guidelines provided
by the Organization for Economic Co-operation
and Development (OECD), Federal Trade Com-
mission (FTC), and a survey interview of 16
employees and graduates (from the Center for
Identity at UT-Austin) working in the field of
security and privacy. It is logical to get advice from experts and to refer to the governing rules when designing a tool, but it is equally important to embody the diversity of the users in the design. For example, a website's data collection policy for children under 13 years might not interest a college student, and a young user might not consider sharing an email address with a service provider a risk. Since the length of the policy is, as mentioned earlier, one of the demotivating factors, it might be superfluous to add such information. Consider
another example, Australians are very concerned
about overseas data storage practices. If a privacy
summary did not address such essential concerns,
it is natural that an Australian user would deem
the summary not useful and would not resort to
using the privacy summarization tool again.
2. Privee (Zimmeck and Bellovin, 2014)
It combines the use of a crowdsourced repository of summarized privacy policies with rule-based and ML classifiers. The limitations of this tool are twofold. Firstly, if the summary is gathered from the crowdsourced repository, like "ToS;DR", it is subject to the competence and discretion of the person evaluating the policy for the repository. Secondly, the policy summary created using the rule-based and ML classifiers is based on six binary categories (evaluated as yes or no). Such binary classification fails to give details like the type of PI collected, the duration of data retention or the type of data protected, i.e., stored or in-transit data. Along with the summary, Privee also assigns an overall grade to the policy based on the outcome of the binary classifiers. The privacy categories just skim the surface of the privacy policy and do not analyse the tricky bits, for example, the purpose of collecting the PI, where the data is stored, how a user would access his/her data, information about data control, and options to opt out of a particular service after initial sign-up.
3. PrivacyGuide (Tesfay et al., 2018)
It uses ML and Natural Language Processing (NLP). The categories for the classification of the privacy policy are derived from the EU GDPR. Like PrivacyCheck, PrivacyGuide also uses a risk scale to inform the user about the level of risk involved. Based on the interpretation of the EU GDPR, the tool uses 11 privacy categories. These categories overlap with the privacy aspects used in this study. However, certain concerns specific to Australians are not covered by the privacy categories, for example, the possibility of targeted advertisement, the use of AI in decision making, the sectors involved in data sharing, overseas storage of data, and ways of dealing anonymously with the service provider. In none of the three privacy summarization tools is the user warned about the inherent cyber security attacks or threats a service might pose.
6 CONCLUSIONS
The paper presents the conceptualization of a personalized privacy policy summarization tool. To develop a proof-of-concept model, PrivBERT, a RoBERTa-based language model pre-trained on a large privacy policy corpus, is further trained on a downstream text classification task with the help of the OPP-115 (Wilson et al., 2016) privacy policy corpus. A dummy customization is used to represent the varied privacy aspects of online users. The classified segments of the privacy policy are condensed with the help of the cosine similarity algorithm to generate the output summary. Despite the limitations encountered due to the scarcity of fine-level annotated privacy policy corpora and computational resources, the proof-of-concept model generates a summary in approximately 2 minutes with a compression ratio of approximately 0.108 (where the compression ratio is defined as the ratio of the number of words in the output summary to the number of words in the input data). An online user could effortlessly read the generated summary, which would appease their privacy concerns. The paper also suggests that incorporating human-centric aspects into design improves the productivity and utility of technology. The tool acts as a basic introductory version of a personalized summarization tool that could be reproduced, experimented with, and modified to develop better human-centric technologies.
7 FUTURE WORK
From the current vantage point, a corpus of pri-
vacy policies annotated at the micro-level is re-
quired, i.e., incorporating more diverse privacy as-
pects. Such a data set would help the summariza-
tion tool cover a wider variety of privacy aspects
and produce a more personalized summary.
An efficient web crawler that is able to scrape data
while maintaining the structure of the sentence or
the format in which the sentence is presented on
the website would improve the semantics of the
output summary. Providing additional features,
such as a presentation unit, to make the output
more appealing or summarizing the ’Terms and
Conditions’ could also attract more users and en-
hance their awareness of online privacy.
Finally, the presented tool could be used as
a starting point for the development of a per-
sonalized summarization tool incorporating more
complex machine learning models like Genera-
tive Pre-trained Transformer models (version 3)
and abstractive summarization using BART (De-
noising Autoencoder for pretraining sequence-to-
sequence) models.
REFERENCES
Amos, R., Acar, G., Lucherini, E., Kshirsagar, M.,
Narayanan, A., and Mayer, J. (2021). Privacy poli-
cies over time: Curation and analysis of a million-
document dataset. In Proceedings of the Web Con-
ference 2021, pages 2165–2176.
Anwar, M., He, W., Ash, I., Yuan, X., Li, L., and Xu,
L. (2017). Gender difference and employees’ cyber-
security behaviors. Computers in Human Behavior,
69:437–443.
Balapour, A., Nikkhah, H. R., and Sabherwal, R. (2020).
Mobile application security: Role of perceived pri-
vacy as the predictor of security perceptions. In-
ternational Journal of Information Management,
52:102063.
Bellman, S., Johnson, E. J., Kobrin, S. J., and Lohse, G. L.
(2004). International differences in information pri-
vacy concerns: A global survey of consumers. The
Information Society, 20(5):313–324.
Bergström, A. (2015a). Online privacy concerns: A broad approach to understanding the concerns of different groups for different uses. Computers in Human Behavior, 53:419–426.
Bergström, A. (2015b). Online privacy concerns: A broad approach to understanding the concerns of different groups for different uses. Computers in Human Behavior, 53:419–426.
Bracamonte, V., Hidano, S., Tesfay, W. B., and Kiyomoto,
S. (2019). User study of the effectiveness of a pri-
vacy policy summarization tool. In International Con-
ference on Information Systems Security and Privacy,
pages 186–206. Springer.
Chang, Y., Wong, S. F., Libaque-Saenz, C. F., and Lee, H.
(2018). The role of privacy policy on consumers’ per-
ceived privacy. Government Information Quarterly,
35(3):445–459.
Culnan, M. J. (2019). Policy to avoid a privacy disaster.
Journal of the Association for Information Systems,
20(6):1.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Goel, S., Williams, K., and Dincelli, E. (2017). Got
phished? internet security and human vulnerability.
Journal of the Association for Information Systems,
18(1):2.
Haggag, O., Grundy, J., Abdelrazek, M., and Haggag, S.
(2022). A large scale analysis of mhealth app user
reviews. In Empir Software Eng 27, 196 (2022).
Halevi, T., Lewis, J., and Memon, N. (2013). A pilot study
of cyber security and privacy related behavior and per-
sonality traits. In Proceedings of the 22nd interna-
tional conference on world wide web, pages 737–744.
Harkous, H., Fawaz, K., Lebret, R., Schaub, F., Shin, K. G.,
and Aberer, K. (2018). Polisis: Automated analysis
and presentation of privacy policies using deep learn-
ing. In 27th USENIX Security Symposium (USENIX
Security 18), pages 531–548.
Karagiannopoulos, V., Kirby, A., Ms, S. O.-M., and Sug-
iura, L. (2021). Cybercrime awareness and victimisa-
tion in individuals over 60 years: A portsmouth case
study. Computer Law & Security Review, 43:105615.
Lee, H., Wong, S. F., and Chang, Y. (2016). Confirming the
effect of demographic characteristics on information
privacy concerns.
Lee, H., Wong, S. F., Oh, J., and Chang, Y. (2019). Infor-
mation privacy concerns and demographic character-
istics: Data from a korean media panel survey. Gov-
ernment Information Quarterly, 36(2):294–303.
Li, Y. (2022). Cross-cultural privacy differences. In Modern
Socio-Technical Perspectives on Privacy, pages 267–
292. Springer, Cham.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach. arXiv preprint arXiv:1907.11692.
Lonergan Research (2020). Australian community attitudes
to privacy survey 2020.
McCormac, A., Zwaans, T., Parsons, K., Calic, D., Butavi-
cius, M., and Pattinson, M. (2017). Individual differ-
ences and information security awareness. Computers
in Human Behavior, 69:151–156.
McDonald, A. M. and Cranor, L. F. (2008). The cost of
reading privacy policies. Isjlp, 4:543.
Milne, G. R., Rohm, A. J., and Bahl, S. (2004). Consumers’
protection of online privacy and identity. Journal of
Consumer Affairs, 38(2):217–232.
Nagel, S. (2016). CC-News.
Näsi, M., Oksanen, A., Keipi, T., and Räsänen, P. (2015). Cybercrime victimization among young people: a multi-nation study. Journal of Scandinavian Studies in Criminology and Crime Prevention, 16(2):203–210.
Neupane, A., Saxena, N., Maximo, J. O., and Kana, R.
(2016). Neural markers of cybersecurity: An fmri
study of phishing and malware warnings. IEEE
Transactions on information forensics and security,
11(9):1970–1983.
Ngo, F. T. and Paternoster, R. (2011). Cybercrime victim-
ization: An examination of individual and situational
level factors. International Journal of Cyber Crimi-
nology, 5(1):773.
Pratama, A. R. and Firmansyah, F. M. (2021). Until you
have something to lose! loss aversion and two-factor
authentication adoption. Applied Computing and In-
formatics, (ahead-of-print).
Pratama, A. R., Firmansyah, F. M., and Rahma, F. (2022).
Security awareness of single sign-on account in the
academic community: the roles of demographics, pri-
vacy concerns, and big-five personality. PeerJ Com-
puter Science, 8:e918.
Reyns, B. W. (2013). Online routines and identity theft vic-
timization: Further expanding routine activity theory
beyond direct-contact offenses. Journal of Research
in Crime and Delinquency, 50(2):216–238.
Reyns, B. W. and Henson, B. (2016). The thief with a
thousand faces and the victim with none: Identify-
ing determinants for online identity theft victimiza-
tion with routine activity theory. International jour-
nal of offender therapy and comparative criminology,
60(10):1119–1139.
Schreuders, C. et al. (2019). Understanding cybercrime
victimisation: modelling the local area variations in
routinely collected cybercrime police data using latent
class analysis. International Journal of Cyber Crimi-
nology, 13(2):493–510.
Sigmund, T. (2021). Attention paid to privacy policy state-
ments. Information, 12(4):144.
Škrinjarić, B., Budak, J., and Žokalj, M. (2017a). The effect of personality traits on online privacy concern. Radni materijali EIZ-a, (2):5–29.
Škrinjarić, B., Budak, J., and Žokalj, M. (2017b). The effect of personality traits on online privacy concern. Radni materijali EIZ-a, (2):5–29.
Solovyeva, E. B. and Abdullah, A. (2022). Comparison of
different machine learning approaches to text classi-
fication. In 2022 Conference of Russian Young Re-
searchers in Electrical and Electronic Engineering
(ElConRus), pages 1427–1430. IEEE.
Srinath, M., Wilson, S., and Giles, C. L. (2020). Privacy at
scale: Introducing the privaseer corpus of web privacy
policies. arXiv preprint arXiv:2004.11131.
Steinfeld, N. (2016). ”i agree to the terms and condi-
tions”:(how) do users read privacy policies online? an
eye-tracking experiment. Computers in human behav-
ior, 55:992–1000.
Tang, J., Akram, U., and Shi, W. (2020). Why people need
privacy? the role of privacy fatigue in app users’ in-
tention to disclose privacy: based on personality traits.
Journal of Enterprise Information Management.
Taylor, J. F., Ferguson, J., and Ellen, P. S. (2015). From trait
to state: Understanding privacy concerns. Journal of
Consumer Marketing.
Tesfay, W. B., Hofmann, P., Nakamura, T., Kiyomoto, S.,
and Serna, J. (2018). Privacyguide: towards an im-
plementation of the eu gdpr on internet privacy policy
evaluation. In Proceedings of the Fourth ACM Inter-
national Workshop on Security and Privacy Analytics,
pages 15–21.
Trepte, S., Reinecke, L., Ellison, N. B., Quiring, O., Yao,
M. Z., and Ziegele, M. (2017). A cross-cultural per-
spective on the privacy calculus. Social Media+ Soci-
ety, 3(1):2056305116688035.
Verma, V. K., Pandey, M., Jain, T., and Tiwari, P. K.
(2021). Dissecting word embeddings and language
models in natural language processing. Journal of
Discrete Mathematical Sciences and Cryptography,
24(5):1509–1515.
Whitty, M. T. (2019). Predicting susceptibility to cyber-
fraud victimhood. Journal of Financial Crime.
Whitty, M. T. (2020). Is there a scam for everyone? psy-
chologically profiling cyberscam victims. European
Journal on Criminal Policy and Research, 26(3):399–
409.
Wilson, S., Schaub, F., Dara, A. A., Liu, F., Cherivirala, S.,
Leon, P. G., Andersen, M. S., Zimmeck, S., Sathyen-
dra, K. M., Russell, N. C., et al. (2016). The creation
and analysis of a website privacy policy corpus. In
Proceedings of the 54th Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long
Papers), pages 1330–1340.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C.,
Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz,
M., et al. (2019). Huggingface’s transformers: State-
of-the-art natural language processing. arXiv preprint
arXiv:1910.03771.
Wortman, J., Lucas, R. E., and Donnellan, M. B. (2012).
Stability and change in the big five personality do-
mains: evidence from a longitudinal study of aus-
tralians. Psychology and aging, 27(4):867.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi,
M., Macherey, W., Krikun, M., Cao, Y., Gao, Q.,
Macherey, K., et al. (2016). Google’s neural ma-
chine translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144.
Yao, M. Z., Rice, R. E., and Wallis, K. (2007). Predict-
ing user concerns about online privacy. Journal of the
American Society for Information Science and Tech-
nology, 58(5):710–722.
Zaeem, R. N., German, R. L., and Barber, K. S. (2018). Pri-
vacycheck: Automatic summarization of privacy poli-
cies using data mining. ACM Transactions on Internet
Technology (TOIT), 18(4):1–18.
Zhang, J., Zhao, Y., Saleh, M., and Liu, P. (2020). Pegasus:
Pre-training with extracted gap-sentences for abstrac-
tive summarization. In International Conference on
Machine Learning, pages 11328–11339. PMLR.
Zimmeck, S. and Bellovin, S. M. (2014). Privee: An archi-
tecture for automatically analyzing web privacy poli-
cies. In 23rd USENIX Security Symposium (USENIX
Security 14), pages 1–16.
Zukowski, T. and Brown, I. (2007). Examining the influ-
ence of demographic factors on internet users’ infor-
mation privacy concerns. In Proceedings of the 2007
annual research conference of the South African in-
stitute of computer scientists and information technol-
ogists on IT research in developing countries, pages
197–204.
APPENDIX
Figure 4: Model metrics when trained with 2683 samples, a batch size of 12, and a learning rate of 2e-5 for 10 epochs.
Figure 5: Output of GitHub privacy policy for an Old Female.
Figure 6: Output of Google privacy policy for a Young Male.