rently there is no system that allows a user to push
updates for such information to multiple web
applications, so a user is required to remember, locate,
access and update each application in turn.
In addition to the issue of data freshness there is
the issue of the amount of data storage wasted by the
duplication of information. While the duplication of
a small piece of information such as a user’s address
may appear trivial, the gravity of the problem increases
once the growing inclusion of photos and video is
considered. The operators of web applications
currently spend considerable money and effort on data
storage technologies to store data that duplicates what
is stored by other 3rd party systems.
Arguably the most concerning issue is the problem
of data ownership. When user information is uploaded
to a 3rd party web application, a EULA (end
user licence agreement) is generally in place to
describe who owns the information once it has been
stored by the web application. It is common for the
web application owners themselves to take owner-
ship of the data, and it has been known for users to
fight back against such restrictions (AFP, 2009). Pass-
ing the responsibility of storage and access of per-
sonal data to a 3rd party provider exposes the user to
the possibility that 3rd parties could exert restrictions
over the usage of that data.
These three key issues can all be tied back to the
issue of a single piece of information, which logi-
cally belongs to the user, being stored by 3rd party
web applications. This paper describes a change in
paradigm which moves the storage responsibility to
the end-user, and in the process, addresses the above
three key issues.
3 RELATED WORK
As detailed below, multiple existing technologies
attempt to address a subset of the issues described
above by sourcing content from distributed data
repositories.
Web application developers are able to use such
HTML features as iframes (Raggett et al., 1999) to
build a single web page out of multiple components,
with each component being potentially provided from
a differing location. Such technologies do not scale to
large deployments due to the manual way in which
the web developer must link the various components.
iframe technology is limited to building pages from
static components that do not grow in number or relo-
cate dynamically over time.
Content Delivery Networks (CDN) (Hofmann,
2005) are able to address performance issues sur-
rounding centralised storage of information, but do
not address the data duplication, ownership or fresh-
ness issues identified by this research. While nodes
in a CDN may be viewed similarly to the concept de-
scribed below, they do not provide the capability for
direct end-user management outside of the specific
web application to which they are tied.
Content Management Systems (CMS) (Mauthe
and Thomas, 2004) are software packages designed
to store and present information from a variety of
sources under a unified interface. CMS are gener-
ally deployed in corporate environments where non-
technical users wish to have the ability to update con-
tent presented on the web without having to learn
markup languages, such as HTML. The CMS back-
end is capable of sourcing data from multiple loca-
tions, but the access method involves aggregating the
data on the server before presenting it to the user.
The links between various pieces of data are manu-
ally configured and the various data sources are non-
transparent to the data owner.
Single sign-on systems such as OpenID (Foundation,
2009) and Shibboleth (Cantor, 2005) attempt to
address data duplication issues in the user
authentication space by providing a single repository for
user authentication information that can be used by
multiple 3rd party applications to authenticate users
within a single organisation or across multiple
organisations. OpenID currently only focuses on user
authentication information. Shibboleth contains extensions to support a
subset of user attributes such as address information,
but does not scale to large dynamic environments in
which there are no concrete relationships established
between entities.
Distributed storage has long been important in
the distributed computing space, where multiple
distributed entities require access to a shared storage
solution. Storage Area Networks (SAN) (Corporation,
2009) are capable of providing access for multiple en-
tities to a centralised data storage area and hence are
capable of addressing the data duplication and fresh-
ness issues. These deployments are currently stati-
cally tied to the applications that need to access the
SAN and no Internet-based protocol currently exists
to allow end users to dynamically locate distributed
storage over the Internet.
None of the systems presented above are capable
of addressing the three key issues identified in section
2 within a large Internet-scale deployment.
WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies