PROVIDING SCALABLE ACCESS TO LARGE XML DOCUMENTS

Arno Puder

San Francisco State University, Computer Science Department, 1600 Holloway Avenue, San Francisco, CA 94132, USA

Keywords:

XML, DOM, Working Set.

Abstract:

XML documents often tend to be voluminous and accessing them through a DOM (Document Object Model)

interface poses particular challenges. All the existing DOM implementations require an XML document to

be completely collocated before it can be parsed. This solution does not scale for huge XML documents.

In this paper we introduce an architecture, called VDOM (Virtual DOM) that allows scalable access to large

XML documents through a DOM interface. In the VDOM architecture, the actively used portions of an XML

document are transferred to the application. The application can begin to traverse this portion without requiring

that the complete DOM tree is collocated. As the application traverses the DOM tree, portions of the XML

document are loaded on-demand. Using the VDOM architecture is transparent to the application which uses a

standard DOM interface to access the DOM tree.

1 INTRODUCTION

Nowadays, the eXtensible Markup Language (XML)

is largely used in various domains to structure and

mark up data. XML allows data to be defined with a tree-like, extensible data structure using a software- and hardware-independent language, which led to its success. Numerous tools have been proposed to access

XML documents. Standardized APIs such as DOM

(Document Object Model (W3C, 2004)) and SAX

(Simple API for XML (SAX Project, 2004)) allow

applications to access XML documents. Other tools

such as XSL, XPath, and XQuery provide powerful

means to transform, reference, and query XML doc-

uments. These tools help to structure the XML tool-

chain into components which ultimately leads to their

better reuse (e.g., XPath is used in XSL and XQuery,

and DOM is used in some XSL implementations).

Programmatic access to an XML document from a

high-level programming language is possible through

either a SAX or a DOM interface. Both offer the con-

tents of an XML document through a standard API to

an application. The SAX interface allows an applica-

tion to read the content of an XML document from

beginning to end in the same sequence in which it ap-

pears in the document. While this is suitable in some

contexts, it cannot be used for applications that re-

quire random access to the contents of an XML doc-

ument; in such cases a DOM parser is required. A

DOM interface maps a whole XML document to a tree-like data structure of a programming

language. The application is then free to traverse the

contents in any order. There exist many DOM imple-

mentations for most programming languages. E.g.,

JDOM (JDOM, 2004) is one implementation for Java.

Many applications using XML documents as well as

higher-level XML tools such as some existing XSL

implementations are built on top of JDOM.

More and more applications have to deal with

XML documents of enormous size. We will introduce

one such application in the following section. While

a DOM API provides the convenient abstraction for a

programmer to access the contents of an XML doc-

ument, all the existing DOM implementations have

the limitation that they need to ﬁrst load the complete

XML document into main memory in order to parse

and build up the DOM tree. For huge XML docu-

ments this solution does not scale.

This paper introduces the notion of a Virtual DOM

(VDOM) architecture. It allows scalable access to an

XML document of arbitrary size through a DOM API.

The complete XML document does not have to be

collocated for building the DOM tree. Nodes of the

document are loaded on-demand as the application

traverses the data structure which is done transpar-

ently to the application. The application is also able


Puder A. (2007).

PROVIDING SCALABLE ACCESS TO LARGE XML DOCUMENTS.

In Proceedings of the Third International Conference on Web Information Systems and Technologies - Internet Technology, pages 178-183

DOI: 10.5220/0001270801780183

 SciTePress

to access the XML document using any standardized

DOM API.

This paper is organized as follows: Section 2 ﬁrst

introduces a use case from the geo-sciences to moti-

vate our work. Section 3 introduces our VDOM archi-

tecture. Section 4 concludes this paper and presents

an outlook to future work.

2 GEO-SCIENCE USE CASE

In this section we present a use case from the geo-

sciences to serve as one example of a domain where

large XML documents are commonplace. In a previous project called NetBEAMS (Networked Bay Environmental Assessment Monitoring System (San Francisco State University, 2003)) we devised an

end-to-end infrastructure to connect web browsers to

commercially available sensors that measure environ-

mental conditions of the San Francisco Bay (Zam-

brano and Puder, 2006).

In the course of the NetBEAMS project, we de-

veloped the Sensor Data Markup Language (SDML)

that allows the description of sensor data and its meta-

data via XML (W3C, 2006a). The following SDML

document shows some sample markup resulting from

our NetBEAMS application:

<sdml>

Temperature @ Romberg Tiburon Center (RTC)

</metadata>

<metadata id="unit">Celsius</metadata>

</metadata>

11/18/05 10:45:06 GMT

</timeStamp>

</measurement>

11/18/05 10:47:42 GMT

</timeStamp>

</measurement>

11/18/05 10:48:16 GMT

</timeStamp>

</measurement>

</sdml>

Different measurements are provided in the above

sample of the SDML document and they are denoted

by the <measurement> tag. Each measurement

is associated with some meta-data marked up via the

<metadata> tag. The purpose of the meta-data

in SDML is to attach a geographic location, a timestamp, and a unit to the individual measurements.

The XML excerpt above shows the temperature and

relative humidity of two different locations in the San

Francisco Bay: the Romberg Tiburon Center (RTC)

and the Golden Gate Bridge (GG). Their respective

meta-data deﬁnitions are referenced through identi-

ﬁers temperatureRTC, temperatureGG, and

relHumidityRTC.

One common application in geo-sciences is to

correlate individual measurements by a domain spe-

ciﬁc function. One example is the computation of

the probability of fog (Lowe, 1977). This func-

tion requires temperature and relative humidity read-

ings from two locations within a certain time win-

dow (e.g., temperature and relative humidity readings

from different days cannot be combined). The pro-

cess of the computation is sufﬁciently complex and

goes beyond the capability of the existing implemen-

tations of XQuery and XSL and requires a custom-

implementation based on a high-level programming

language. Different sensors often report their mea-

surements in different units and it is necessary to con-

vert them to a canonical format. The information

about the units can be derived from the meta-data

speciﬁcation of the SDML. For that reason, the imple-

mentation of the fog-function requires access to dif-

ferent parts of the SDML document during the com-

putation: the meta-data and the measurements. We

refer to those parts as localities.
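The interplay of the two localities can be illustrated with a minimal sketch. All names here (`METADATA`, `to_celsius`, `pair_within_window`), the Fahrenheit unit, and the sample values are hypothetical and not taken from NetBEAMS; the fog-probability function itself is not reproduced. Readings are first normalized using the unit found in the meta-data and then paired only if their timestamps fall within a given time window:

```python
from datetime import datetime, timedelta

# Hypothetical meta-data locality: identifier -> unit.
# The identifiers follow the SDML example; the Fahrenheit unit is an assumption.
METADATA = {"temperatureRTC": "Celsius", "temperatureGG": "Fahrenheit"}


def to_celsius(value, unit):
    """Convert a temperature reading to the canonical unit (Celsius)."""
    if unit == "Celsius":
        return value
    if unit == "Fahrenheit":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit}")


def pair_within_window(readings_a, readings_b, window):
    """Pair readings (meta_id, timestamp, value) whose timestamps differ by at most `window`.

    Each value is converted to Celsius via a lookup in the meta-data locality.
    """
    pairs = []
    for id_a, t_a, v_a in readings_a:
        for id_b, t_b, v_b in readings_b:
            if abs(t_a - t_b) <= window:
                pairs.append((to_celsius(v_a, METADATA[id_a]),
                              to_celsius(v_b, METADATA[id_b])))
    return pairs
```

Every pairing step touches the measurements and then jumps back to the meta-data, which is the out-of-order access pattern that motivates random access through a DOM interface.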

Because localities change in an out-of-order se-

quence during the computation of the fog-function, a

DOM interface is preferred to allow random access to

the content of an SDML document. Different DOM implementations exist for different programming languages, and scientists are free to choose their preferred implementation to access the measurements and meta-data from the SDML document. E.g., the aforementioned JDOM can be used to allow a Java program to access an SDML document, while other scientists may prefer C++ in combination with the DOM

implementation called Xerces. While the existence

of these various implementations leverages the skill-


set of developers, all the implementations possess the

same drawback: SDML documents can be very vo-

luminous and easily reach several gigabytes in size.

None of the existing DOM implementations scales to an XML document of this size.

While the use case described in this section is do-

main speciﬁc, we believe that huge XML documents

are also common in other areas. The work described

in this paper allows access to a large XML document

independent of its size using a familiar DOM API.

3 VDOM ARCHITECTURE

We propose a VDOM architecture that allows scalable

access to large XML documents through the DOM

interface. The VDOM architecture is based on the

client/server model where the required XML docu-

ments are at the server side and the applications ac-

cessing the documents are at the client side. Our gen-

eral idea is as follows. The XML document is par-

titioned into smaller portions that are transferred to

the application piece-by-piece. Without the need to

transfer the whole XML document, the application

can begin to explore the transferred portions of the

document through a DOM interface. The VDOM ar-

chitecture is transparent to the applications that can

access the portions using a DOM API of their choice.

In this section we ﬁrst explain the design goals of

the VDOM architecture (Section 3.1), followed by its

overall structure (Section 3.2). Section 3.3 describes

the portions of an XML document that can be transferred in the VDOM architecture. The protocol to request and transfer these portions is presented in Section 3.4.

3.1 Design Goals

We begin our discussion of the VDOM architecture

by formulating our design goals. The purpose of the

design goals is to deﬁne what we want to accomplish

with our VDOM architecture, but not how to accomplish it. Subsequent sections describing the architecture will focus on how the goals are achieved.

The following design goals guide the deﬁnition of

our VDOM architecture:

• Client/Server architecture.

• Read-only access.

• Access transparency.

• De-coupling of client and server side technolo-

gies.

The VDOM architecture is based on the

client/server model to allow access to XML

documents. The server stores the XML document

that is accessed by clients. As established earlier, transferring a complete XML document is not feasible, so only the actively used portions (the aforementioned localities) are transferred between client and server. At this stage, to limit

the complexity of the VDOM architecture, we only

allow read-only access to the document in order

to avoid synchronization issues between different

clients. While read-only access limits the scope of

applications using our architecture, we believe that in

many cases write-access is not required.

Another design goal is access transparency.

The VDOM architecture follows the client/server

paradigm in which there are two access points: on

the client side, an application will access the VDOM

architecture and on the server side we will need to

connect to a data-source that stores the XML docu-

ment. These access points should be designed such

that neither client nor server need to be modiﬁed in

any way in order to work with our VDOM architec-

ture. The applications are free to choose a particular

DOM API and they should not notice that the whole

XML document might not be loaded. The servers can

also store the XML document in various ways such as

an XML database or in a regular ﬁle. By doing so, we

achieve access transparency that will enable existing

applications and data-sources to be easily integrated

with our architecture.

The ﬁnal design goal of our VDOM architecture is

de-coupling of technologies. While there will be dif-

ferent ways to implement our VDOM architecture, we

want to make sure that there is no technology depen-

dency between the client and the server. It should be

possible to provide independent implementations of

the client and server of the VDOM architecture, each

freely choosing their respective technologies such as

the programming language. With this design goal we

want to achieve an open architecture with indepen-

dent, interoperable clients and servers.

Figure 1: VDOM Architecture.

WEBIST 2007 - International Conference on Web Information Systems and Technologies


3.2 Architecture

Figure 1 depicts the VDOM architecture. The archi-

tecture is based on the client/server paradigm. The

XML document is stored on the server side at some

data source (at the right side in Figure 1). We impose

no requirements on the data sources (e.g., a source

could be a relational database or an XML database).

A speciﬁc wrapper is required for each kind of data

source to allow the data retrieval from the sources.

The wrapper also transforms data to XML format

in case it is stored using another format at the data

server. The VDOM protocol deﬁnes the PDUs (Proto-

col Data Units) exchanged between the VDOM client

and the VDOM server; the VDOM protocol will be

explained later. The DOM API wrappers on the client

side offer the underlying services of the VDOM archi-

tecture to the application. The purpose of the DOM

API wrappers is to ensure that the application developer need not be aware of using the VDOM architecture.

With the help of these wrappers the application is un-

aware that only a part of the whole XML document

is locally available on the client side. They also al-

low the application to access the document portions

via a familiar API. The API of each speciﬁc wrapper

is identical to that of a given XML DOM API in the

given language.

One possible end-to-end scenario is as follows:

the application on the client side uses JDOM to tra-

verse the XML document. But instead of using

JDOM directly, the application uses the JDOM wrap-

per. The wrapper translates the request expressed us-

ing JDOM API into the internal request of the VDOM

architecture. The internal request is forwarded to

the VDOM client. Based on the application’s needs,

the VDOM client will request an appropriate portion of the XML document from the VDOM server. The VDOM server uses a wrapper to access the data source, to transform the relational data to XML format, and to retrieve the requested portion of the XML document. Once the portion has been sent back to the VDOM client, the JDOM wrapper returns the requested data to the application, expressed using JDOM.

The application is not aware that only a portion of

the requested document is returned. As the applica-

tion eventually reads nodes from the XML document

that lie outside of the returned portion, the VDOM

client will automatically retrieve neighboring portions

including the newly requested nodes. For now we are

only using some static heuristics to determine the size

of the XML document portions to be transferred (i.e.,

the depth and breadth of the portion). We also envision adapting the size by observing the pattern by which the application accesses the XML document.
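The on-demand behavior described in this section can be sketched as follows. This is a deliberately simplified illustration, not the VDOM implementation: the class `LazyNode`, the callback `fetch_children`, and the in-memory `SERVER_TREE` are hypothetical stand-ins for the DOM API wrapper, the VDOM client, and the VDOM server, and each request transfers only one level of children instead of a working set sized by heuristics.

```python
class LazyNode:
    """DOM-like node whose children are fetched from the server on first access."""

    def __init__(self, node_id, name, fetch_children):
        self.node_id = node_id
        self.name = name
        self._fetch_children = fetch_children  # callable: node_id -> list of (id, name)
        self._children = None                  # None means "not yet transferred"

    def get_children(self):
        # Transparent to the caller: only the first call triggers a (simulated) request.
        if self._children is None:
            self._children = [
                LazyNode(child_id, child_name, self._fetch_children)
                for child_id, child_name in self._fetch_children(self.node_id)
            ]
        return self._children


# Hypothetical server-side document: node id -> list of (child id, child name).
SERVER_TREE = {1: [(2, "B"), (3, "C")], 2: [(4, "D")], 3: [], 4: []}
request_log = []  # node ids requested from the simulated server


def fetch_children(node_id):
    """Simulated VDOM request: record it and return the children of `node_id`."""
    request_log.append(node_id)
    return SERVER_TREE[node_id]


root = LazyNode(1, "A", fetch_children)
```

The application only ever calls `get_children()`; whether the call is served locally or triggers a transfer is hidden behind the wrapper, which is the access transparency the architecture aims for.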

3.3 Working Set

Instead of transferring the whole XML document, the

VDOM architecture transfers only the smaller portions of the document that are actively used by the

application. We denote the portions of an XML doc-

ument that can be transferred in our VDOM architec-

ture as working sets. The term working set is inspired

by a concept from operating systems where it refers

to a set of pages in virtual memory actively used by a

process (Tanenbaum and Woodhull, 2006). We adopt

this principle of a working set and transfer it to XML

documents. The working set of an XML document

deﬁnes those portions of the document used by an ap-

plication during a certain time interval.

Just like the working set in operating systems de-

ﬁnes a certain locality in terms of pages in virtual

memory accessed by a process, our deﬁnition of a

working set for XML documents assumes certain lo-

calities in accessing its content. The localities are

determined by the required portions of the document

during a speciﬁc time interval.

Before deﬁning the working sets used in our

VDOM architecture, we ﬁrst give a representation

of an XML document as a tree in the mathematical sense consisting of nodes and edges, $XML = (V, E, n_{Root}, f)$, where:

1. V is a ﬁnite set of nodes in the tree representing

the XML document; they represent both the ele-

ments and the attributes of the document;

2. E represents the edges: every edge (c, p) ∈ E is

an edge from the parent p to the child c;

3. $n_{Root} \in V$ is the root node of the tree;

4. $f$ is an injective, partial function assigning an order number to all nodes except the root node $n_{Root}$. If a node has $n$ children, then those children are assigned order numbers in the interval $[0, \ldots, n - 1]$.

We define a working set $WS$ of an XML document $XML = (V, E, n_{Root}, f)$ as a set of localities of the document, $WS = \{L_1, L_2, \ldots, L_n\}$. Each locality $L_i$ with $0 < i \leq n$ is defined as a subset of the nodes $V$ with the following conditions:

1. $L_i \subseteq V$;

2. Graph $(L_i, E)$ is connected;

3. For every child $c \in V \setminus L_i$ with parent $p$, $(c, p) \in E$: there do not exist $c_1, c_2 \in L_i$ with $(c_1, p) \in E$ and $(c_2, p) \in E$ such that $f(c_1) < f(c) < f(c_2)$.

The conditions (1) and (2) basically state that each

locality is a connected sub-tree of the tree correspond-

ing to the XML document. Condition (3) states that


all children of a node in a locality must be immediate

siblings of each other with respect to the order func-

tion f. I.e., the node may have other children that are

not part of the locality, but they can only appear “at

the border” of the locality. Different localities of the

same working set have to be disjoint: for every two distinct localities $L_i$ and $L_j$ ($i \neq j$) of the same working set $WS$, the intersection $L_i \cap L_j$ has to be empty.
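The conditions on localities and working sets can be checked mechanically. The following is a minimal sketch under stated assumptions: the tree is given as two dictionaries (`PARENT` for the edges and `ORDER` for the order function $f$), and the small example tree is hypothetical rather than the one from Figure 2. Connectedness uses the fact that a node subset of a tree is connected exactly when one single node of the subset has no parent inside the subset.

```python
from itertools import combinations

# Hypothetical document tree: A has children B, C; B has children D, E, F, G.
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "B", "G": "B"}
ORDER = {"B": 0, "C": 1, "D": 0, "E": 1, "F": 2, "G": 3}


def is_locality(nodes, parent, order):
    """Check connectedness and sibling contiguity for a candidate locality."""
    nodes = set(nodes)
    if not nodes:
        return False
    # Connected sub-tree: exactly one node has no parent inside the subset.
    tops = [n for n in nodes if parent.get(n) not in nodes]
    if len(tops) != 1:
        return False
    # Contiguity: per parent, the included children must have gap-free order numbers.
    by_parent = {}
    for n in nodes:
        if n in parent:
            by_parent.setdefault(parent[n], []).append(order[n])
    for orders in by_parent.values():
        if sorted(orders) != list(range(min(orders), max(orders) + 1)):
            return False
    return True


def is_working_set(localities, parent, order):
    """A working set is a collection of pairwise disjoint localities."""
    sets = [set(loc) for loc in localities]
    return (all(is_locality(s, parent, order) for s in sets)
            and all(a.isdisjoint(b) for a, b in combinations(sets, 2)))
```

For instance, `{"B", "E", "F"}` is accepted because E and F are immediate siblings, whereas `{"B", "D", "F"}` is rejected because the excluded node E lies between D and F.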

Figure 2 shows an example of an XML document.

It also shows a working set containing one single lo-

cality whose root node is B. The locality as shown in

this example does not include all children of node B,

but note that all children belonging to the locality are

immediate siblings of each other without a gap (i.e., a

node that does not belong to the locality).

Figure 2: Working set of an XML document with one lo-

cality.

3.4 VDOM Protocol

The VDOM protocol deﬁnes the PDUs transferred be-

tween the VDOM client and the VDOM server shown

in Figure 1. The protocol is based on a simple re-

quest/response exchange, where the client makes a

request and the server responds with the appropriate

working set. We ﬁrst present the marshalling of the

working set, i.e., the PDU with which the server responds to a client’s request. We will explain the parameters of the

client request further below.

We use XML notation to represent working sets.

Apart from the document content included in the

working set, it is necessary also to encode addi-

tional information about the environment of the work-

ing set inside the whole document to facilitate de-

cisions within the VDOM client during the applica-

tion’s traversal of the XML document. E.g., it is

beneﬁcial to know if there are more children besides

the ones contained in a locality of the working set.

This contextual information is embedded via XML at-

tributes into the working set. For the example shown

in Figure 2, the marshalled XML document of the

working set is as follows:

<vdom:WorkingSet

xmlns:vdom="http://vdom.org/response/">

<B vdom:id="2"

vdom:prev-child="3"

vdom:num-prev-children="1"

vdom:next-child="6"

vdom:num-next-children="2">

</B>

</vdom:WorkingSet>

Every node in an XML document can be uniquely

identiﬁed via the vdom:id attribute. The value

of this attribute would usually be an XPath (W3C,

2006b) expression to make use of available XML

standards. To make the example more readable, we use the numbers shown in Figure 2 as

the node IDs. E.g., in Figure 2, the node having

the ID 2 can be identiﬁed using the XPath expres-

sion /A[1]/B[1]. As can be seen, the additional

markup tells the client about the context of the work-

ing set in the original XML document. Table 1 sum-

marizes the different attributes available to describe

the context.

Table 1: Attributes for describing context of a locality in a working set.

| Attribute | Description |
|---|---|
| id | XPath expression of the node within the XML document. |
| prev-child | XPath expression of the child bordering before the locality. This attribute is missing if there are no more previous children. |
| next-child | XPath expression of the child bordering after the locality. This attribute is missing if there are no more children following. |
| num-prev-children | Number of children appearing before the locality. This attribute is assigned if and only if the prev-child attribute is present. |
| num-next-children | Number of children appearing after the locality. This attribute is assigned if and only if the next-child attribute is present. |
| num-children | If none of a node’s children are part of the locality, this attribute specifies the number of children of the node. |
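As an illustration of how a client might read this context information, the following sketch parses the marshalled working set shown earlier with Python's standard `xml.etree.ElementTree` module. The helper names `locality_context` and `has_unloaded_siblings` are hypothetical and not part of the VDOM protocol; only the namespace URI and the attribute names are taken from the example response.

```python
import xml.etree.ElementTree as ET

VDOM_NS = "http://vdom.org/response/"

RESPONSE = """<vdom:WorkingSet xmlns:vdom="http://vdom.org/response/">
  <B vdom:id="2"
     vdom:prev-child="3"
     vdom:num-prev-children="1"
     vdom:next-child="6"
     vdom:num-next-children="2">
  </B>
</vdom:WorkingSet>"""


def locality_context(element):
    """Return the vdom:* context attributes of a node as a plain dict."""
    prefix = "{" + VDOM_NS + "}"  # ElementTree expands prefixes to {uri}name
    return {key[len(prefix):]: value
            for key, value in element.attrib.items()
            if key.startswith(prefix)}


def has_unloaded_siblings(context):
    """True if children exist before or after the locality that were not transferred."""
    return "prev-child" in context or "next-child" in context


working_set = ET.fromstring(RESPONSE)
context = locality_context(working_set[0])
```

With this information a client can decide, when the application steps past the last locally available child, which node to name in its next request.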

The working set is sent by the server in response

to a client’s request. The client can request any work-

ing set of the XML document. By doing so, the client


must provide the root node of every locality of the

working set (denoted by the node’s XPath expression)

as well as the breadth and depth of each locality. The

following request shows the markup for requesting

the working set highlighted in Figure 2:

</LocalityRoot>

</VDOMRequest>
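A request of this kind could be assembled as in the sketch below. Only the element names `VDOMRequest` and `LocalityRoot` come from the markup above; placing the XPath expression in the element text, the attribute names `breadth` and `depth`, and the values passed to `build_request` are assumptions made purely for illustration and may differ from the actual VDOM request format.

```python
import xml.etree.ElementTree as ET


def build_request(localities):
    """Build a request for localities given as (xpath, breadth, depth) tuples."""
    request = ET.Element("VDOMRequest")
    for xpath, breadth, depth in localities:
        # Assumed encoding: breadth and depth as attributes, XPath as element text.
        locality_root = ET.SubElement(
            request, "LocalityRoot",
            {"breadth": str(breadth), "depth": str(depth)})
        locality_root.text = xpath
    return ET.tostring(request, encoding="unicode")


# Illustrative values only: one locality rooted at the node with XPath /A[1]/B[1].
request_xml = build_request([("/A[1]/B[1]", 2, 1)])
```

A working set with several localities is requested by passing several tuples, one per locality root.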

The VDOM protocol also allows the reporting

of error conditions (e.g., when the client requests a

node with an invalid XPath expression). The VDOM

protocol must be mapped to some transport mecha-

nism. Since we use XML for the representation of

the VDOM PDUs, Web Services seem to be a natural

choice, although other transport mechanisms such as

CORBA or plain TCP-connections also are possible.

The working sets are identiﬁed depending on the

application at the client side and the XML docu-

ment at the server side. For now we only use some

static heuristics to determine the working sets but the

VDOM client can also make use of different parame-

ters to infer suitable working sets in order to minimize

communication overhead. The schema of an XML

document can be used to infer the working sets (e.g.,

the multiplicity of an element can give an indication

to the size of a working set). The application's access pattern can also be used to infer the size of working sets: e.g., the working sets delivered to the client will differ depending on whether it traverses the document breadth-first or depth-first. The usage history can also help in choosing working sets.

There are two possible strategies for delivering

working sets from the VDOM server to the VDOM

client. The ﬁrst consists in delivering working sets

only when requested by the VDOM client. Every time

the application reaches a portion of the tree that is

not locally available, the VDOM client automatically

forms a request to the VDOM server for a working

set containing the needed portion. The second strat-

egy consists in estimating the needs of the applica-

tion and delivering some potential working sets before

they are required. The first strategy issues requests only when the application requires new portions, so the response time may increase. The second strategy estimates the suitable working set in advance. It is more efficient if the estimates are mostly correct; otherwise, pre-fetching working sets that are never used may lower performance.
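The trade-off between the two strategies can be made concrete with a small simulation. It is a sketch under simplifying assumptions: working sets are identified by consecutive integers, the hypothetical function `simulate` counts how often the application has to wait and how many working sets are transferred, and the pre-fetching estimate is deliberately naive (on a miss for set i, set i + 1 is delivered as well).

```python
def simulate(accesses, prefetch):
    """Return (blocking_requests, transferred_sets) for a sequence of working-set ids.

    With prefetch=True, each miss on set i also speculatively transfers set i + 1.
    """
    cache = set()
    blocking = 0
    transferred = 0
    for ws in accesses:
        if ws not in cache:
            blocking += 1            # the application waits for this working set
            cache.add(ws)
            transferred += 1
            if prefetch and ws + 1 not in cache:
                cache.add(ws + 1)    # delivered before it is required
                transferred += 1
    return blocking, transferred


sequential = [0, 1, 2, 3]   # the estimate is always right
scattered = [0, 5, 10]      # the estimate is always wrong
```

For the sequential pattern, pre-fetching halves the number of blocking requests at no extra transfer cost; for the scattered pattern, it doubles the transferred data without saving a single request.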

4 CONCLUSION AND OUTLOOK

In this paper, we introduced the VDOM architecture

that allows applications to transparently access large

XML documents through a DOM API. In the VDOM

architecture, an XML document is partitioned into

working sets that are transferred individually to the

client. A protocol has been proposed to specify the

request and response PDUs of working sets. DOM

API wrappers are deﬁned to make the whole architec-

ture transparent to the user application. Server wrap-

pers have also been deﬁned to be able to connect to

different kinds of XML document data sources. We

are working on a prototype implementation that uses

JDOM as the client side DOM API and MySQL on

the server side.

Apart from validating our ideas by running some

benchmarks, we plan to generalize some internal pro-

cesses of the VDOM architecture. In particular deter-

mining the size of the working set needs to be further

investigated. We currently only use static (compile-

time) heuristics to determine the size of the requested

working set. One obvious extension would be to ob-

serve the application’s behavior (i.e., the way the ap-

plication traverses the DOM tree) to adapt the size

of the working set at runtime. Other extensions of

the work presented in this paper are read/write access

to the server, as well as generalizing the client/server

model to a peer-to-peer model where the XML docu-

ment is distributed among different peers.

REFERENCES

JDOM (2004). Java DOM-API. http://www.jdom.org/.

Lowe, P. (1977). An approximating polynomial for the

computation of saturation vapor pressure. Journal of

Applied Meteorology, 16:100–103.

San Francisco State University (2003). NetBEAMS - Net-

worked Bay Environmental Assessment Monitoring

System. http://www.netbeams.org/.

SAX Project (2004). Simple API for XML (SAX).

http://www.saxproject.org/.

Tanenbaum, A. and Woodhull, A. (2006). Operating Sys-

tems Design and Implementation. Prentice Hall, third

edition.

W3C (2004). Document Object Model (DOM).

http://www.w3.org/DOM/.

W3C (2006a). eXtensible Markup Language (XML).

http://www.w3.org/XML/.

W3C (2006b). XML Path Language 2.0.

http://www.w3.org/TR/xpath/.

Zambrano, B. and Puder, A. (2006). A ﬂexible system

for real-time oceanographic monitoring. Extended ab-

stract, San Francisco State University.
