A PUBLISH/SUBSCRIBE MODEL FOR PERSONAL DATA ON THE INTERNET

Mark Wallis, Frans Henskens and Michael Hannaford

Distributed Computing Research Group

School of Electrical Engineering and Computer Science, University of Newcastle, Callaghan, NSW, Australia

Keywords:

Publish/Subscribe, Distributed, Storage, Personal data, Web 2.0.

Abstract:

With the recent increase in web application reliance on user-generated content, issues such as data duplication,

data age and data ownership are becoming an increasing problem. It is now common to have multiple distinct

web applications storing duplicate copies of a user’s personal information in distinct storage formats and

locations. This paper proposes a change in paradigm that places the ownership of a user’s personal data back

into their own hands by moving the storage of that data away from web applications and onto private storage

nodes exposed by 3rd party providers. Web applications can then subscribe to various pieces of data under

electronic contracts that govern the data’s usage.

1 INTRODUCTION

In recent years, the design of Internet applications has

seen a move away from owner-generated content to

user-generated content (O’Reilly, 2005). The Web 1.0

era was dominated by websites in which content was

generated by the website owner. Next-generation ap-

plications, such as Twitter (Twitter Inc, 2009) (gen-

erally viewed as Web 2.0), in comparison, rely heav-

ily on user-generated content (Vickery and Wunsch-

Vincent, 2007). With the increase in application de-

pendence on user-generated content numerous issues

are becoming prevalent in relation to the maintenance

of this data. Problems of data duplication, data age

and data ownership occur because it is now common-

place to have multiple websites storing copies of a

user’s personal information in distinct storage formats

and storage locations. This is especially common for

personal information such as postal address and credit

card number.

This paper proposes a change in paradigm that

places the ownership and responsibility for the stor-

age of a user’s personal data back into their own

hands. The change involves moving the storage of

that information onto a private data storage service.

The storage service is hosted either locally by the user

(perhaps on a device connected to their home LAN) or

externally by a data storage provider. A data storage

provider makes available storage capacity and soft-

ware to allow a user to publish and distribute access

to their information. 3rd party applications then subscribe

to various pieces of data under electronic contracts

that define the rules governing the data’s usage.

Locating a specific user’s data storage provider is

addressed using extensions to existing Internet proto-

cols. Parallels are drawn showing how such a design

aligns with current advancements in Cloud Comput-

ing (Boss et al., 2007) and Service-Orientated Archi-

tectures (Bell, 2008).

2 PROBLEM DESCRIPTION

As explained below, three main issues were identified

with classic Web 2.0 applications in relation to per-

sonal data storage: data freshness, data duplication

and data ownership.

When the same piece of information is stored by

multiple distinct web applications the chance of data

becoming stale greatly increases. Take for example

a user’s postal address. When the user moves house

they must currently access each web application to

which they have provided their postal address, and

manually update it. Before they can complete this

task they need to recall and locate all the relevant web

applications. Such a task is increasingly non-trivial

because of the growing number of websites that request

personal information during registration. Currently

there is no system that allows a user to push

updates for such information to multiple web applications,

so a user is required to remember, locate, access

and update each application in turn.

Wallis M., Henskens F. and Hannaford M.

A PUBLISH/SUBSCRIBE MODEL FOR PERSONAL DATA ON THE INTERNET.

DOI: 10.5220/0002792401830186

In Proceedings of the 6th International Conference on Web Information Systems and Technology (WEBIST 2010), page

ISBN: 978-989-674-025-2

In addition to the issue of data freshness there is

the issue of the amount of data storage wasted by the

duplication of information. While the duplication of

a small piece of information such as a user’s address

may appear trivial, when one considers the increasing

inclusion of photos and video the gravity of the

problem increases. The operators of web applications

currently spend considerable money and effort on data

storage technologies to store data that duplicates what

is stored by other 3rd party systems.

Arguably the most concerning issue is the problem

of data ownership. When a user uploads information

to a 3rd party web application, an EULA (end

user licence agreement) is generally in place to describe

who owns the information once it has been

stored by the web application. It is common for the

web application owners themselves to take owner-

ship of the data, and it has been known for users to

fight back against such restrictions (AFP, 2009). Passing

the responsibility of storage and access of personal

data to a 3rd party provider exposes the user to

the possibility that 3rd parties could exert restrictions

over the usage of that data.

These three key issues can all be tied back to the

issue of a single piece of information, which logi-

cally belongs to the user, being stored by 3rd party

web applications. This paper describes a change in

paradigm which moves the storage responsibility to

the end-user, and in the process, addresses the above

three key issues.

3 RELATED WORK

As detailed below, multiple existing technologies attempt

to address a subset of the issues described

above by sourcing content from distributed data

repositories.

Web application developers are able to use such

HTML features as iframes (Raggett et al., 1999) to

build a single web page out of multiple components,

with each component being potentially provided from

a differing location. Such technologies do not scale to

large deployments due to the manual way in which

the web developer must link the various components.

iframe technology is limited to building pages from

static components that do not grow in number or relo-

cate dynamically over time.

Content Delivery Networks (CDN) (Hofmann,

2005) are able to address performance issues sur-

rounding centralised storage of information, but do

not address the data duplication, ownership or freshness

issues identified by this research. While nodes

in a CDN may be viewed similarly to the concept described

below, they do not provide the capability for

direct end-user management outside of the specific

web application to which they are tied.

Content Management Systems (CMS) (Mauthe

and Thomas, 2004) are software packages designed

to store and present information from a variety of

sources under a unified interface. CMS are generally

deployed in corporate environments where non-technical

users wish to have the ability to update content

presented on the web without having to learn

markup languages, such as HTML. The CMS back-

end is capable of sourcing data from multiple loca-

tions, but the access method involves aggregating the

data on the server before presenting it to the user.

The links between various pieces of data are manually

configured and the various data sources are non-transparent

to the data owner.

Single sign-on systems such as OpenID (Foundation,

2009) and Shibboleth (Cantor, 2005) attempt to

address data duplication issues in the user authentication

space by providing a single repository for user

authentication information that can be used by multiple

3rd party applications to authenticate users within

a single organisation and across multiple organisations. OpenID

currently focuses only on user authentication information.

Shibboleth contains extensions to support a

subset of user attributes such as address information,

but does not scale to large dynamic environments in

which there are no concrete relationships established

between entities.

Distributed storage has long been important in

the distributed computing space, where multiple distributed

entities require access to a shared storage solution.

Storage Area Networks (SAN) (Corporation,

2009) are capable of providing access for multiple en-

tities to a centralised data storage area and hence are

capable of addressing the data duplication and fresh-

ness issues. These deployments are currently stati-

cally tied to the applications that need to access the

SAN and no Internet-based protocol currently exists

to allow end users to dynamically locate distributed

storage over the Internet.

None of the systems presented above are capable

of addressing the three key issues identified in section

2 within a large Internet-scale deployment.

WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies

4 PROPOSED SYSTEM

The system proposed by this research aims to address

the identified problems by introducing additional

technology that allows storage of personal information

to be offloaded from the web application

and instead handled by 3rd party storage services.

These storage services will be leased by individual

data owners, and act as a ’single version of the truth’

for that user’s personal data. Enhancements to the

standard web browser design will allow this data to be

accessed seamlessly. An API will be established that

governs communication between the various actors in

the system. These web browser enhancements will

be a stepping stone between existing browser tech-

nologies and a complete Super-Browser implementa-

tion as presented in our parallel research (Henskens,

2007).

The first phase of unified personal information

storage involves the data owner subscribing to a pri-

vate storage service (PSS). This service is responsi-

ble for storing the data owner’s personal information

and is located either in-house or outsourced to a spe-

cialised data storage provider. The data owner then

publishes content to the storage service. Once the data

owner has stored shared information in their PSS the

next stage is to allow a web application to subscribe

to the information. Once a relationship is established

the data owner establishes a link between their per-

sonal data and the web application. This link is akin

to the data owner uploading content to the web ap-

plication in a standard Web 2.0 scenario, except that

in the PSS design the data owner provides only the

unique identifier of the information as opposed to uploading

the content itself. Once the link between a

piece of data referenced by the web application and

the storage location for that data in a PSS is estab-

lished, the web application is free to request that data

directly from the PSS. The final stage in the design is

the presentation stage. This will define how the information

is transparently presented to end users.
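
The publish, link and request stages described above can be illustrated with a minimal in-memory sketch. The paper does not yet define the PSS API, so every class and method name here (PrivateStorageService, WebApplication, publish, link, fetch, render) is a hypothetical stand-in, and a real PSS would be a networked service rather than a Python object.

```python
import uuid


class PrivateStorageService:
    """Toy in-memory stand-in for a PSS (hypothetical interface)."""

    def __init__(self):
        self._items = {}  # item identifier -> latest value
        self._links = {}  # item identifier -> names of linked applications

    def publish(self, value):
        """Data owner stores a new item and receives its unique identifier."""
        item_id = str(uuid.uuid4())
        self._items[item_id] = value
        self._links[item_id] = set()
        return item_id

    def update(self, item_id, value):
        """Data owner overwrites an item; its identifier does not change."""
        if item_id not in self._items:
            raise KeyError(item_id)
        self._items[item_id] = value

    def link(self, item_id, app_name):
        """Data owner allows a web application to subscribe to an item."""
        if item_id not in self._items:
            raise KeyError(item_id)
        self._links[item_id].add(app_name)

    def fetch(self, item_id, app_name):
        """A linked web application requests the current value of an item."""
        if app_name not in self._links.get(item_id, set()):
            raise PermissionError(f"{app_name} is not linked to {item_id}")
        return self._items[item_id]


class WebApplication:
    """Holds only identifiers, never a copy of the personal data itself."""

    def __init__(self, name, pss):
        self.name = name
        self._pss = pss
        self._references = {}  # field name -> item identifier

    def register_reference(self, field, item_id):
        self._references[field] = item_id

    def render(self, field):
        # Presentation stage: the value is sourced from the PSS on demand.
        return self._pss.fetch(self._references[field], self.name)


pss = PrivateStorageService()
address_id = pss.publish("1 Old Street")
shop = WebApplication("shop", pss)
pss.link(address_id, shop.name)
shop.register_reference("postal_address", address_id)
pss.update(address_id, "2 New Road")   # the owner moves house once
print(shop.render("postal_address"))   # prints "2 New Road"
```

Because the web application stores only the identifier, a single update at the PSS is visible to every linked application the next time it renders the field.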

The PSS design addresses the issue of data fresh-

ness by ensuring that all web applications present the

latest information stored in the PSS. Only the latest

information would be made accessible to 3rd party

users and applications. Data duplication issues are ad-

dressed as the PSS becomes the only system required

to store a user’s personal information. This will take

the burden of data storage away from web application

hosts and reduce their overall costs.

The PSS design also aids in addressing issues of

data ownership by ensuring that the system storing the

user’s personal data has a direct contractual relationship

with the owner of the data. Users will be able to

dictate the terms of usage for their data before agreeing

to use a specific PSS. At the moment, a user is

forced to accept the terms and conditions of a web application

if they wish to use that specific application.

In the PSS design, the user will be free to find another

PSS to use if they do not agree with the terms and conditions

of a specific provider. Web applications will

no longer have any sway over a user’s personal data

as they will not be responsible for collecting, storing

and presenting that data. Their responsibility will end

with providing a way of ‘linking’ 3rd party users to

data stored in various PSSs using a well-defined API.
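
One way to picture user-dictated terms of usage is as a small machine-checkable contract that the PSS consults before releasing data. The paper does not specify the form of its electronic contracts, so the fields below (subscriber, item_id, allowed_purposes, expires) and the authorise function are purely illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UsageContract:
    """Hypothetical electronic contract between a data owner and a subscriber."""
    subscriber: str              # web application the contract is granted to
    item_id: str                 # identifier of the published item
    allowed_purposes: frozenset  # purposes the owner has agreed to
    expires: date                # last day on which the contract is valid

    def permits(self, subscriber, item_id, purpose, today):
        """True only if every term of the contract is satisfied."""
        return (
            subscriber == self.subscriber
            and item_id == self.item_id
            and purpose in self.allowed_purposes
            and today <= self.expires
        )


def authorise(contracts, subscriber, item_id, purpose, today):
    """The PSS releases data only if at least one contract permits the request."""
    return any(c.permits(subscriber, item_id, purpose, today) for c in contracts)


contracts = [
    UsageContract("shop", "addr-1", frozenset({"delivery"}), date(2010, 12, 31)),
]
print(authorise(contracts, "shop", "addr-1", "delivery", date(2010, 6, 1)))   # True
print(authorise(contracts, "shop", "addr-1", "marketing", date(2010, 6, 1)))  # False
```

Under this sketch the terms are evaluated by the PSS, which is engaged by the data owner, rather than by the web application requesting the data.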

5 FUTURE WORK

Work is currently underway to implement a proof-of-

concept PSS design and collect performance statistics

comparing the proposed solution and a classic Web

2.0 approach. These performance statistics are expected to show

that such a solution can be designed with minimal

impact to the average user, while still providing so-

lutions to the three key issues presented. A detailed

comparison will also be presented comparing the PSS

design to the solutions described in section 3.

The initial PSS concept is targeted at informa-

tion stored by web applications for the purpose of

direct return on behalf of its users. For instance,

the initial prototype suits applications such as im-

age storage where the images are not altered by the

web application. This style of information has been

termed ‘inline’. Additional solutions are possible

to support ‘cacheable’ data in addition to ‘inline’ data.

Cacheable data is defined as data of which web applications

store a local copy. The publish/subscribe

model proposed will allow for the PSS to ’push’ up-

dates of such data to the linked web applications

whenever updates are made. This will allow the is-

sues of data freshness to be addressed for cacheable

data, although it does not address the issues of data

duplication and data ownership.
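
The push behaviour for cacheable data can be sketched as a conventional observer-style publish/subscribe loop. This is an assumption about how such a mechanism might look, not the authors' implementation; the names PushingStorageService, CachingWebApplication and on_update are hypothetical, and callbacks stand in for whatever network notification a real PSS would send.

```python
class PushingStorageService:
    """Toy PSS that pushes every update to its subscribers (hypothetical)."""

    def __init__(self):
        self._items = {}        # item identifier -> latest value
        self._subscribers = {}  # item identifier -> list of callbacks

    def subscribe(self, item_id, callback):
        """Register a callback and deliver the current value, if any."""
        self._subscribers.setdefault(item_id, []).append(callback)
        if item_id in self._items:
            callback(item_id, self._items[item_id])

    def publish(self, item_id, value):
        """Store the new value, then push it to every subscriber."""
        self._items[item_id] = value
        for callback in self._subscribers.get(item_id, []):
            callback(item_id, value)


class CachingWebApplication:
    """Keeps a local copy of cacheable data, refreshed by pushed updates."""

    def __init__(self):
        self.cache = {}

    def on_update(self, item_id, value):
        self.cache[item_id] = value


pss = PushingStorageService()
app = CachingWebApplication()
pss.publish("profile-photo", "photo-v1")
pss.subscribe("profile-photo", app.on_update)
pss.publish("profile-photo", "photo-v2")
print(app.cache["profile-photo"])  # prints "photo-v2"
```

The local copy stays fresh, but it is still a second copy held by the web application, which is why this variant leaves duplication and ownership unresolved.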

6 CONCLUSIONS

This paper identifies three concerns arising from the

growing popularity of Web 2.0 applications and describes how the PSS design addresses each:

• Data freshness is addressed using a pub-

lish/subscribe model and single version of the

truth. The data presented to the end user is always

the freshest version because it is sourced directly

from the user’s PSS.

• Data duplication is addressed by removing the

need for data to be stored by the web application.

Appropriate web application registration and link-

ing reduces the number of copies of a piece of data

to a single instance stored on the PSS.

• Data ownership is addressed by ensuring that stor-

age is the responsibility of the personal storage

service directly engaged by the end user. PSS

providers are liable to users, not to web applica-

tions, and hence the user has control over use of

their data. Data ownership is clear-cut because the

user is responsible for both the storage of, and ac-

cess to, the data.

The presented solution ties directly into the realms

of Cloud Computing (Boss et al., 2007), Service-

Orientated architectures (Bell, 2008) and SAAS

(Software-as-a-service) (Bennett et al., 2000). In a

sense, a PSS can be seen as an SSP (Storage Service

Provider) in a Storage-as-a-service (Foley, 2009)

cloud component that allows other web applications

to publish and subscribe to data within the cloud. The

PSS system, however, will provide the necessary additional

access and presentation layers on top of the

storage to ensure that the user experience is seamless.

It makes sense in a Cloud Computing landscape that

each application in the cloud is able to access a shared

storage repository rather than having to replicate the

same information for each web application deployed

in the cloud.

The primary output of the next stage of this research

will be a PSS API. A well-defined API will allow

multiple vendors to implement not only their own

PSSs, but also the required web browser enhancements

that are key to providing a seamless end-to-end

solution.

REFERENCES

AFP (2009). About-facebook: backflip on data ownership

changes. The Sydney Morning Herald.

Amazon Web Services (2009). Amazon Elastic Compute

Cloud Technical Guide. Amazon.

Bell, M. (2008). Introduction to Service-Oriented Mod-

eling, Service-Oriented Modeling: Service Analysis,

Design, and Architecture. Wiley and Sons.

Bennett, K., Layzell, P., Budgen, D., Brereton, P.,

Macaulay, L., and Munro, M. (2000). Service-based

software: the future for flexible software. In Seventh

Asia-Pacific Software Engineering Conference

(APSEC’00), volume 17th, page 214.

Boss, G., Malladi, P., Quan, D., Legregni, L., and Hall, H.

(2007). Cloud computing. Technical report, IBM Cor-

poration.

Cantor, S. (2005). Shibboleth Protocol Specifications.

Internet2.

Cockburn, C. and Wilson, T. (1995). Business use of the

world-wide web. Information Research.

Corporation, E. (2009). Information Storage and Manage-

ment: Storing, Managing, and Protecting Digital In-

formation. EMC.

Feldt, K. (2007). Programming Firefox: Building Rich In-

ternet Applications with XUL (Paperback). O’Reilly

Media, Inc.

Foley, J. (2009). How to get started with storage-as-

a-service. InformationWeek Business Technology

Network, http://www.informationweek.com / cloud-

computing / blog / archives / 2009 / 02 / how to get -

star.html.

Foundation, O. (2009). OpenID authentication 2.0 - final.

http://openid.net/specs/openid-authentication-2_0.html.

Henskens, F. (2007). Web service transaction management.

International Conference on Software and Data Tech-

nologies (ICSOFT).

Hofmann, M. (2005). Content Networking: Architecture,

Protocols and Practice. Morgan Kaufmann Publish-

ers.

Mauthe, A. and Thomas, P. (2004). Professional Content

Management Systems: Handling Digital Media As-

sets. Wiley.

OASIS (2007). Security assertion markup language (saml)

v2.0 technical overview. Technical report, Working

Group.

O’Reilly, T. (2005). What is web 2.0. O’Reilly Net.

Raggett, D., Hors, A. L., and Jacobs, I. (1999). HTML 4.01

Specification. W3C, W3C recommendation December

1999 edition.

Twitter Inc (2009). Twitter. www.twitter.com.

Vickery, G. and Wunsch-Vincent, S. (2007). Participative

Web And User-Created Content: Web 2.0 Wikis and

Social Networking. Organization for Economic.
