RAPID XML DATABASE APPLICATION DEVELOPMENT

Albrecht Schmidt

Aalborg University

9220 Aalborg Øst, Denmark

Kjetil Nørvåg

Norwegian University of Science and Technology

7491 Trondheim, Norway

Keywords:

XML, databases, prototyping, design

Abstract:

This paper proposes a rapid prototyping framework for XML database application development. By splitting

up the development process into several reﬁnement steps while keeping the application programming interface

stable, the framework aims at rapid implementation of a prototype with a well-deﬁned interface and a subse-

quent implementation of more advanced concepts like business rules in several steps. The reﬁnement process

takes the form of incrementally adding domain-speciﬁc information to the application. This is achieved by

transitioning from general-purpose XML tools that do not support the deﬁnition and enforcement of constraints

to frameworks that support domain-speciﬁc models and constraints such as E/R modeling. We have

employed this method in the development of an example application, and we give performance numbers that

illustrate the incremental improvements of each step.

1 INTRODUCTION

Since XML assumed the role as the premier data ex-

change format on the Internet, application designers

have increasingly been showing interest in coupling

XML technologies and large scale data management

techniques. One path to achieving this goal in the de-

velopment of new Web services is to modularize tasks

and to build on mature components: because Web ser-

vices are accessed through a well-deﬁned interface

that hides the actual implementation of the service,

it is possible to split the front-end, i.e., the client ap-

plications, from the database back-end.

Internet information systems are usually imple-

mented as complex multi-tier architectures. Virtu-

ally any such system that has to deal with signiﬁ-

cant amounts of data will utilize some kind of mass

storage system, most probably a database manage-

ment system (DBMS). The overall system architec-

ture depends very much on the kind of services the

mass storage back-end can deliver. Because, unfor-

tunately, the implementation of this very back-end is

a time-consuming task, it is desirable to split up de-

velopment into several steps. The ﬁrst step usually

comprises the deﬁnition and export of the interface

that client applications can use. If a SOAP interface

(Box et al., 2000) is to be implemented, standard

XML tools can be leveraged; they greatly facilitate

setting up a prototype that provides the necessary ser-

vices but without taking into account issues like efﬁ-

ciency and consistency, which can be dealt with later.

We refer to such an approach as rapid prototyping.

According to (Kordon and Luqi, 2002) a prototype is

an executable model of a system that accurately re-

ﬂects a chosen subset of its properties, such as dis-

play formats, computed results, or response times. In

the context of our work, this implies that the proto-

type implements an abstract programming interface

(API) but uses only standard, non-performance ori-

ented tools for the back-end, which is treated as a

black-box. In subsequent steps, the back-end is then

improved until it scales up to production levels.

Furthermore, prototyping (Kordon and Luqi, 2002)

is a technique which is generally desirable in software

engineering for a number of reasons. It helps to ab-

stract from low-level details and to blend the different

components of a system to work together. A proto-

type is then reﬁned until it reaches production level.

To summarize:

1. Prototyping helps to understand the requirements

of a software system early in the development process:

unnecessary requirements can be removed or

altered while other desiderata might be discovered.

2. Prototyping permits early feedback from users and


Schmidt A. and Nørvag K. (2004).

RAPID XML DATABASE APPLICATION DEVELOPMENT.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 370-375

DOI: 10.5220/0002615303700375

 SciTePress

implementors alike; this can then be used for im-

provements in planning and implementation. Ide-

ally, the process will result in a feedback loop.

3. A point that is particularly important is that proto-

typing eases the integration of subsystems. Since

software systems, and especially Web-based ones,

consist of various logical and physical layers, it is

highly desirable to develop and improve the different

subsystems independently of each other as far as

possible.

In this paper we outline a framework for XML

database application development which addresses

these issues. In a stepwise reﬁnement prototyping

process, we move from general-purpose XML tools

deployed in the ﬁrst step to tools that allow domain

modeling in the reﬁnement steps. The ﬁrst step in the

development chain consists of using XQuery (Cham-

berlin et al., 2001) on a ﬁle-based storage back-

end. Subsequent steps include switching to a DBMS

with automated XML-to-database mapping, annota-

tions with domain knowledge, and, eventually, using

modeling techniques to ensure both efﬁcient data access

and data integrity through the speciﬁcation

of constraints. Thus, prototyping goes through several

reﬁnement steps and employs more and more speciﬁc

tools, which is made possible by increasingly draw-

ing beneﬁt from domain-knowledge. The ﬁnal step is

then a database which does not consist of automated

mappings with surrogate identiﬁers anymore but of an

E/R-type data model (Thalheim, 2000). To achieve

this, we have deﬁned a mapping language that en-

sures smooth interaction of XML tools and relational

databases. We have employed this development pro-

cess during development of an example application,

and we give performance numbers and improvements

for the prototype through the steps in the develop-

ment process. The implementation was carried out

in a number of student projects.

The organization of the rest of this paper is as fol-

lows. In Section 2 we give an overview of related

work. Section 3 describes the general layout of our

framework. In Section 4 we describe in detail the dif-

ferent steps of the development process. In Section 5

we present a number of measurements that reﬂect the

performance characteristics of the different steps and

hint at some trade-offs to be considered. In Section 6

we discuss the use of the framework in the context

of document-centric XML documents. Finally, in

Section 7, we conclude the paper and outline topics

for future research.

2 RELATED WORK

The general validity of rapid prototyping for XML ap-

plications has been demonstrated in various industry

Frontend

Backend

WWW

Figure 1: General setting of our research.

projects (see, e.g., (e-XMLmedia)), although usually

only one prototype is developed in order to demon-

strate both the feasibility of the undertaking and the

user interface. In our framework, this is equivalent to

the ﬁrst step as laid out below.

In (Orsini and Celentano, 2002), Orsini and Celen-

tano propose a development environment that can aid

data engineers in mapping between database schemas

and XML DTDs. This process is essentially bidirec-

tional: it enables data transfer between the two sides,

and the generation of programs and DTDs for execut-

ing, validating and safe-guarding the data exchange

process. Furthermore, in (Florescu et al., 2003),

the authors present a language for implementing

middleware functionality like Web services that

could also play a role in the API speciﬁcation that is

part of our framework.

With respect to databases, there have been sev-

eral studies on mapping from XML to relational ta-

bles, and how to query and store in a RDBMS based

on these mappings. For example, in (Florescu and

Kossmann, 1999), Florescu and Kossmann present

mappings from XML to general relational tables;

in (Schmidt et al., 2000), Schmidt et al. present a

data and an execution model that allow for efﬁcient

storage and retrieval of XML documents in a rela-

tional database based on binary associations. The

main problem of mapping from XML to relational tables

is that, in order to achieve good performance, different

mappings are needed for different data and workloads.

In order to solve this problem, Bohannon et

al. (Bohannon et al., 2002) developed a cost-based

XML storage mapping engine that, based on models

of XML schema, data statistics and workload, tries

to ﬁnd the best mapping for a given application according

to a cost model. In (Freire and Siméon, 2002)

the authors propose a framework for implementing

these considerations. In (Shanmugasundaram

et al., 1999), a mapping that ‘imitates’

E/R modeling on top of XML documents is presented;

it is a variation of one of the mappings we also use in

our implementation and performance study.

The reverse process, generating and publishing

XML data from relational sources is addressed, for

example, in the Agora system (Manolescu et al.,

2000); there, XML is employed as the user interface

format, while relational tuples are used to represent


the data inside the query processor. This approach

resembles the ﬁnal reﬁnement step of our architecture.

SilkRoute (Fernández et al., 2002) is a middleware

system for publishing XML data from relational

databases. The XML view is deﬁned using a declar-

ative query language. It accepts XML-QL queries

over the XML view, and translates them into SQL

queries. The results are tagged before being deliv-

ered to the user as XML data. An efﬁcient publish-

ing technique with a detailed discussion is also pre-

sented in (Shanmugasundaram et al., 2000). (Grabs

et al., 2002) present a complementary study of how to

extend an arbitrary XML-to-relations mapping with

transactions.

3 GENERAL SYSTEM ARCHITECTURE

This section describes the general setting of our re-

search. We are concerned with the implementation of

very general Web-service architectures. The general

model can then easily be adapted to more speciﬁc set-

tings.

We assume a general client-server architecture, as

illustrated in Figure 1: clients issue requests to servers

by means of XML-based SOAP documents. The im-

portant feature of our architecture is now that the

front-end is well-deﬁned: it is exactly the set of docu-

ments allowed by the XML request or input language.

The implementation of the back-end can now be done

in a black-box fashion; we are free to change it as long

as it still implements the speciﬁcation imposed by the

input documents. In the rest of the paper we will focus

on a particular way of successively and systematically

altering the implementation of the back-end.

The tasks of the individual components are as fol-

lows:

1. The front-end parses the incoming XML documents,

2. veriﬁes certain basic constraints such as those im-

posed by XML Schema or others that may be

checked without the application context, and

3. generates input for the back-end by pre-processing

the XML documents and converting them to cus-

tom data structures which are forwarded to the

back-end. In later steps, the front-end has to pro-

vide certain basic transactional services and thus

plays a role in ensuring the transactional integrity

of the distributed system.

4. The back-end provides storage of data, query capabilities,

and possibly additional database features.

Again, the back-end has to be front-end-aware to

some degree, so that, e.g., it is able to report back

whether a transaction was successful or not.

While the transmission of data is done entirely in

XML, we needed to extend XQuery slightly to add

basic transactional functionality.
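
The division of labour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Backend`, `InMemoryBackend` and `Frontend`, the `person` document type and its required ﬁelds are hypothetical, the SOAP envelope is omitted, and the in-memory back-end merely stands in for whichever development stage is currently deployed.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class Backend(ABC):
    """Hypothetical back-end contract; only this interface has to stay stable."""

    @abstractmethod
    def store(self, record: dict) -> bool:
        """Store one pre-processed record; return True if the transaction succeeded."""

    @abstractmethod
    def find(self, key: str, value: str) -> list:
        """Return all stored records whose field `key` equals `value`."""


class InMemoryBackend(Backend):
    """Stand-in for any of the development stages behind the same interface."""

    def __init__(self):
        self._records = []

    def store(self, record):
        self._records.append(dict(record))
        return True

    def find(self, key, value):
        return [r for r in self._records if r.get(key) == value]


class Frontend:
    """Parses requests, checks context-free constraints, forwards custom structures."""

    REQUIRED = ("name", "ssn")  # assumed schema, for illustration only

    def __init__(self, backend: Backend):
        self.backend = backend

    def handle(self, xml_text: str) -> bool:
        try:
            root = ET.fromstring(xml_text)  # task 1: parse the incoming document
        except ET.ParseError:
            return False
        record = {child.tag: (child.text or "").strip() for child in root}
        # task 2: constraints that can be checked without application context
        if root.tag != "person" or any(not record.get(f) for f in self.REQUIRED):
            return False
        # task 3: forward a custom data structure; the back-end reports success
        return self.backend.store(record)


frontend = Frontend(InMemoryBackend())
ok = frontend.handle("<person><name>Ada</name><ssn>123</ssn></person>")
rejected = frontend.handle("<person><name>Bob</name></person>")
```

Because `Frontend` depends only on the abstract `Backend`, a later stage can be swapped in without touching the request-handling code, which is the property the architecture relies on.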

4 THE PROTOTYPING PROCESS IN DETAIL

This section describes the step-wise reﬁnement pro-

cess in more detail. The basic idea is to start out with

a very general framework to which we add more and

more knowledge in order to obtain better performance

and ensure data integrity.

The prototyping process consists of a number of

steps which are sketched in Figure 2. The focal point

is to move from an architecture that fulﬁlls basic con-

formance requirements imposed by the SOAP lan-

guage and that is fast to implement, to an optimized

system with many bells and whistles that can be tuned

for maximum performance, maintainability, and in-

tegrity. In each step, the codebase of the prototype is

reﬁned by switching from general tools to tools that

require additional semantic modeling. Ideally, the ad-

ditional functionality results in more control over the

system and data. In the spirit of many software engi-

neering methodologies, it also provides opportunities

for identifying problems, bottlenecks and insufﬁcien-

cies of the architecture so that this feedback can be

used for constant improvement of the codebase even

before moving on to the next step. It can also be im-

plemented as an iterative process, where a solution to

the problems is proposed and evaluated at the current

step before entering the next step.

Technically, the four different development stages

of our framework can be outlined like this:

1. XML data are stored in ﬂat ﬁles; against these

an XQuery processor issues interface-compliant

queries as demanded by the API speciﬁcation.

Thus, the individual components are only very

lightly coupled.

2. This is the ﬁrst step where a relational database

management system is used as back-end. XML

documents are shredded and inserted into relations

using a mapping technique in the spirit of (Shan-

mugasundaram et al., 1999). Queries are executed

on the tables generated by the mapping.

3. The main purpose of this step is to add domain

knowledge for enforcing data integrity and optimizing

query execution. The range of technologies

that are employed in this step comprises indexes,

constraints, views and triggers. Note that

in contrast to Step 2 the generation of the database

schema is not fully automatic anymore. The do-

main knowledge has to be added by the database

administrator.

ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION


[Figure 2: System architecture in the different development steps. Diagram labels: XQuery, Flat Files, XML Shredder, Relations, Indexes, Constraints, Triggers, Mapping Language; steps 1) to 4).]

4. In this step, the database administrator uses the

knowledge gathered in the ﬁrst three steps to de-

sign a database lay-out that reﬂects the require-

ments of applications. In contrast to the ﬁrst step,

which only leverages native XML tools, there is

no use of XML internally; this step could there-

fore be called ‘native relational’ with full leverage

of relational tools and opportunities for storage and

query optimization techniques common in the re-

lational world. With this step, all automation with

respect to deriving database schemas is abandoned.

The database administrators have full control over

the database and are free to deploy the tools of their

choice.

Since the system uses relational back-ends in Steps

2–4, database administrators can use the knowledge

they gathered in previous projects and, especially in

Step 4, do not require additional training. So the over-

all idea of our framework results in a smooth transi-

tion from document-oriented XML technology to the

more scalable relational technology in the back-end

of the system. This proves useful since many XML

products currently do not scale to massive data and,

on the other hand, because in this way the integration

of legacy relational systems is facilitated.

In the remainder of this section we discuss the in-

dividual steps in more detail.

1. Flat-File Storage In our context, the solution that

is fastest to implement is to store the XML data in a

ﬂat ﬁle repository, and use a freely available XQuery

processor for querying the data. With the XML data

stored in ﬂat ﬁles, information is added to the

database by creating a ﬁle containing the data and

registering it. Updates are performed by deleting the old

ﬁle and creating a new one. This is simple to implement but

expensive in terms of performance and integrity main-

tenance. On the other hand, it captures well the spirit

of XML query languages like XQuery because the in-

dividual XML document is the point of reference in

these languages.

One of the most obvious disadvantages of this so-

lution is query efﬁciency. Every time a query is run,

the actual ﬁles have to be scanned and parsed into an

internal representation. Although indexing could be

possible, it is not well supported, and will typically

not be a part of this step. Transactions can be sup-

ported by the locking mechanisms that the repository

provides. However, the semantics and granularities of

the locking system are usually not aligned with the

requirements of XML.

An alternative or complementary approach to im-

plementing this step would be to take advantage of

RDBMS data types that are database equivalents of ﬁles,

i.e., BLOBs, CLOBs or even XML objects. This way,

the standard tools of RDBMSs can be leveraged to

implement recovery, indexing, replication, etc. How-

ever, updates still tend to be expensive because the

storage granularity is not ﬁne enough to update only

document fragments.
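
A minimal sketch of such a ﬂat-ﬁle back-end is given here in Python. It rests on assumptions: the class `FlatFileStore` and its method names are hypothetical, and the limited path syntax of the standard `xml.etree.ElementTree` module stands in for a real XQuery processor, which it does not replace.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class FlatFileStore:
    """Step-1 style repository: one XML document per file, no indexes."""

    def __init__(self, directory: str):
        self.directory = directory

    def _path(self, doc_id: str) -> str:
        return os.path.join(self.directory, doc_id + ".xml")

    def add(self, doc_id: str, xml_text: str) -> None:
        ET.fromstring(xml_text)  # raises ParseError if not well-formed
        with open(self._path(doc_id), "w", encoding="utf-8") as f:
            f.write(xml_text)

    def update(self, doc_id: str, xml_text: str) -> None:
        # Whole-document granularity: delete the old file, then create a new one.
        os.remove(self._path(doc_id))
        self.add(doc_id, xml_text)

    def query(self, path: str) -> list:
        # Every query scans and re-parses every file in the repository.
        hits = []
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".xml"):
                continue
            root = ET.parse(os.path.join(self.directory, name)).getroot()
            hits.extend(node.text for node in root.findall(path))
        return hits


with tempfile.TemporaryDirectory() as tmp:
    store = FlatFileStore(tmp)
    store.add("p1", "<person><name>Ada</name></person>")
    store.add("p2", "<person><name>Bob</name></person>")
    before = store.query("name")
    store.update("p2", "<person><name>Robert</name></person>")
    after = store.query("name")
```

The sketch also makes the integrity weakness visible: if `update` fails after the old ﬁle has been removed, the document is lost, because nothing ties the delete and the create into one transaction.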

2. Decomposition into Relations This reﬁnement

step features a different storage model. XML data

are now decomposed into relations. An XML shred-

der decomposes XML documents into rows and ta-

bles. Currently, the decomposition scheme is in-

dependent of the query workload. In the future,

this step may also generate workload-aware storage

schemes (Freire and Siméon, 2002).

The beneﬁts of the architecture in this step com-

pared with the architecture in Step 1, are the availabil-

ity of a query language like SQL which scales better,

and that the ACID properties are supported and can be

used. This includes support for recovery and support

for ﬁne-granularity locking of data and thus higher

concurrency.

A potential problem with the architecture in this

step is that, according to the storage scheme used, an

XML document is decomposed into many tables, and

many joins are needed in order to reconstruct a docu-

ment. However, in many cases reconstruction of doc-

uments will not be necessary, so that this problem will

not be an issue.
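
To make the decomposition and the resulting joins concrete, the following Python sketch shreds documents into a single generic table using the standard `sqlite3` module. Note the assumptions: this is a simple edge-style decomposition chosen for brevity, not the mapping of (Shanmugasundaram et al., 1999) used in our implementation, and the table `node`, its columns and the function `shred` are hypothetical names.

```python
import sqlite3
import xml.etree.ElementTree as ET

SCHEMA = """
CREATE TABLE node (
    id     INTEGER PRIMARY KEY,          -- surrogate identifier
    doc    TEXT NOT NULL,                -- document the element came from
    parent INTEGER REFERENCES node(id),  -- NULL for the root element
    tag    TEXT NOT NULL,
    text   TEXT
);
"""


def shred(conn, doc_id, xml_text):
    """Insert one row per element; each document is loaded in one transaction."""
    root = ET.fromstring(xml_text)
    with conn:  # commits on success, rolls back on error
        stack = [(root, None)]
        while stack:
            elem, parent_id = stack.pop()
            cur = conn.execute(
                "INSERT INTO node (doc, parent, tag, text) VALUES (?, ?, ?, ?)",
                (doc_id, parent_id, elem.tag, (elem.text or "").strip()),
            )
            # reversed() keeps document order, since the stack pops last-in first
            for child in reversed(list(elem)):
                stack.append((child, cur.lastrowid))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
shred(conn, "p1", "<person><name>Ada</name><ssn>123</ssn></person>")
shred(conn, "p2", "<person><name>Bob</name><ssn>456</ssn></person>")

# One self-join per navigation step: person -> name and person -> ssn.
rows = conn.execute(
    """
    SELECT n.text, s.text
    FROM node p
    JOIN node n ON n.parent = p.id AND n.tag = 'name'
    JOIN node s ON s.parent = p.id AND s.tag = 'ssn'
    WHERE p.tag = 'person'
    ORDER BY p.id
    """
).fetchall()
```

Even this two-ﬁeld lookup needs two joins, which illustrates why reconstructing whole documents from a ﬁne-grained decomposition can become expensive.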

At this stage in the prototyping process a few

queries will have been issued so that further optimiza-

tion will be possible in the next step:


3. Optimization of Decomposition The architec-

ture of the previous step has potential for both

high concurrency and scalability. However, in or-

der to achieve high query performance and consis-

tency, additional features supported by the RDBMS

should be employed. These include indexes, con-

straints and triggers. This step offers opportuni-

ties for adding them. Typically constraints like

database-wide uniqueness of XML attributes and

reference constraints are candidates for declaring

domain-knowledge to the database and thus ensure

some important integrity constraints. Although in-

dexes probably are also to be used in this step, their

main use is not to improve query performance but

constraint enforcement. In this sense, they are used

on an ad-hoc basis.
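
As an illustration of the kind of domain knowledge added in this step, the sketch below declares a database-wide uniqueness constraint and a reference constraint on shredded attribute values, again with Python's `sqlite3` module. The table `attr`, the attribute names `id` and `idref`, and the helper `try_insert` are assumptions made for the example; a production RDBMS would offer its own constraint and trigger dialect.

```python
import sqlite3

SCHEMA = """
CREATE TABLE attr (
    elem  INTEGER NOT NULL,   -- surrogate id of the owning element
    name  TEXT NOT NULL,      -- attribute name
    value TEXT NOT NULL       -- attribute value
);
-- Database-wide uniqueness of every 'id' attribute. The index is there
-- for constraint enforcement first and query speed second.
CREATE UNIQUE INDEX attr_id_unique ON attr(value) WHERE name = 'id';
-- Reference constraint: an 'idref' must point at an existing 'id'.
CREATE TRIGGER attr_idref_check BEFORE INSERT ON attr
WHEN NEW.name = 'idref'
 AND NOT EXISTS (SELECT 1 FROM attr WHERE name = 'id' AND value = NEW.value)
BEGIN
    SELECT RAISE(ABORT, 'dangling idref');
END;
"""


def try_insert(conn, elem, name, value):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO attr VALUES (?, ?, ?)", (elem, name, value))
        return True
    except sqlite3.DatabaseError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
results = [
    try_insert(conn, 1, "id", "a1"),
    try_insert(conn, 2, "id", "a1"),     # duplicate id value
    try_insert(conn, 3, "idref", "a1"),  # points at an existing id
    try_insert(conn, 4, "idref", "zz"),  # dangling reference
]
```

The schema from the previous step is unchanged; only declarations are added, which is why this step can be carried out incrementally by the database administrator.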

4. No Use of XML Internally For some projects,

the prototype developed in Step 3 will be the ﬁnal one.

It will satisfy many requirements. However, in gen-

eral it will not scale up to the requirements of a pro-

duction system. When an application manages large

amounts of data or features a query-intensive work-

load, it is probable that more ﬁne-tuning is needed

than possible in the framework of Steps 1–3. In such

a case, it will be necessary to develop the prototype

into a system that does not use XML internally but

that makes semantic data modeling possible and that

can take advantage of the semantics.

To bridge this gap, we have designed a map-

ping language between the XML and the relational

database schemas. In practice, the process resembles

the way E/R CASE tools are used and can be sup-

ported by a Graphical User Interface (GUI). The lan-

guage is used to glue the relational database schema to

the elements of XML documents and, at the same time, to

enforce database-wide constraints on the documents.

For example, if information from an XML person

record is to be inserted into the RDBMS but the So-

cial Security Number of the person is already present

in the database, then the XML document has to be

rejected. In this way the language is used to enforce

constraints that are difﬁcult to enforce in XML-only

scenarios.
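
The person example can be sketched as follows. The syntax of our mapping language is not reproduced here; instead, a plain Python dictionary (`PERSON_MAPPING`, a hypothetical stand-in) ties XML paths to relational columns and their constraints, and the functions `create_table` and `load` are likewise illustrative names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping: XML path -> (column name, SQL type and constraints).
PERSON_MAPPING = {
    "element": "person",
    "table": "person",
    "columns": {
        "ssn": ("ssn", "TEXT NOT NULL UNIQUE"),
        "name": ("name", "TEXT NOT NULL"),
        "address/city": ("city", "TEXT"),
    },
}


def create_table(conn, mapping):
    """Derive the relational schema from the mapping."""
    cols = ", ".join(f"{col} {decl}" for col, decl in mapping["columns"].values())
    conn.execute(f"CREATE TABLE {mapping['table']} ({cols})")


def load(conn, mapping, xml_text):
    """Insert one document; return False if it violates a database constraint."""
    root = ET.fromstring(xml_text)
    if root.tag != mapping["element"]:
        return False
    names, values = [], []
    for path, (col, _decl) in mapping["columns"].items():
        names.append(col)
        values.append(root.findtext(path))  # None when the element is absent
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        mapping["table"], ", ".join(names), ", ".join("?" for _ in names)
    )
    try:
        with conn:
            conn.execute(sql, values)
        return True
    except sqlite3.IntegrityError:
        return False  # e.g. the Social Security Number is already present


conn = sqlite3.connect(":memory:")
create_table(conn, PERSON_MAPPING)
first = load(conn, PERSON_MAPPING,
             "<person><ssn>123</ssn><name>Ada</name>"
             "<address><city>Aalborg</city></address></person>")
second = load(conn, PERSON_MAPPING,
              "<person><ssn>123</ssn><name>Eve</name></person>")
rows = conn.execute("SELECT ssn, name, city FROM person").fetchall()
```

The second document is rejected by the database rather than by XML validation, since the uniqueness of the Social Security Number spans all documents and cannot be checked from a single document alone.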

5 PERFORMANCE IMPRESSIONS

Figure 3 sketches some performance numbers from

Steps 1–3 when different database schemas are im-

plemented, illustrating the cost of loading data into

different schemas (top ﬁgure) and the cost of query-

ing (bottom ﬁgure).

Note that adding domain knowledge to ensure integrity

does not always enhance performance by itself.

In particular, Figure 3(a) shows that automatic

constraint enforcement brings about additional update

costs. However, the gain is the certainty that the database

is in a consistent state. Transaction-oriented applications

are a prominent case where this is useful. In

practice, domain-speciﬁc modeling in Step 4 is where

performance gains are most probable.

6 DOCUMENT-CENTRIC DOCUMENTS

XML documents are frequently divided into two

categories: document-centric and data-centric.

Document-centric XML documents are often documents

meant for human consumption, like books,

papers, etc., while data-centric documents are typically

highly structured documents meant for computer

consumption or data transport.

Our focus in this paper has been data-centric docu-

ments. However, it should be noted that the proposed

framework is also applicable in the case of mainly

document-centric documents and/or repository ser-

vices.

In that case, it can be beneﬁcial to store the

documents in BLOBs in the database (as one of the

alternatives in Step 1), and rely on associated indexes

to improve performance. Thus, Step 2 is not appli-

cable, but instead performance improvements simi-

lar to some of those proposed for Step 3 can be im-

plemented. For XML documents special indexes are

provided by current commercial database systems.

Supported indexes typically include path indexes, as

well as text-index variants.
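
The idea of storing whole documents and relying on an associated index can be sketched as follows. This is an assumption-laden illustration: commercial systems ship their own XML index types, whereas here a hand-built side table (`path_index`, a hypothetical name) plays the role of a simple path index, and text indexes are not shown.

```python
import sqlite3
import xml.etree.ElementTree as ET

SCHEMA = """
CREATE TABLE document (id TEXT PRIMARY KEY, body BLOB NOT NULL);
CREATE TABLE path_index (
    path TEXT NOT NULL,
    doc  TEXT NOT NULL REFERENCES document(id)
);
CREATE INDEX path_index_path ON path_index(path);
"""


def paths(elem, prefix=""):
    """Yield the root-to-node tag path of every element in the document."""
    here = prefix + "/" + elem.tag
    yield here
    for child in elem:
        yield from paths(child, here)


def put(conn, doc_id, xml_text):
    """Store the document unshredded and record which paths occur in it."""
    root = ET.fromstring(xml_text)
    with conn:
        conn.execute("INSERT INTO document VALUES (?, ?)",
                     (doc_id, xml_text.encode("utf-8")))
        conn.executemany("INSERT INTO path_index VALUES (?, ?)",
                         [(p, doc_id) for p in set(paths(root))])


def docs_with_path(conn, path):
    """Repository-style lookup: return the stored documents containing `path`."""
    rows = conn.execute(
        "SELECT d.id, d.body FROM document d "
        "JOIN path_index i ON i.doc = d.id "
        "WHERE i.path = ? ORDER BY d.id",
        (path,),
    ).fetchall()
    return [(doc_id, body.decode("utf-8")) for doc_id, body in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
put(conn, "paper",
    "<article><title>XML</title><section><para>Text</para></section></article>")
put(conn, "memo", "<memo><para>Hi</para></memo>")
hits = docs_with_path(conn, "/article/section/para")
```

The lookup returns the stored document as it was submitted, without parsing any document at query time; only the index table is consulted to ﬁnd the candidates.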

7 CONCLUSIONS AND FUTURE WORK

We have described a framework for XML database

application development. The focal point of the de-

velopment framework is that it enables rapid proto-

typing by deploying easy-to-setup general purpose

tools in the ﬁrst steps and then reﬁnes the application

by adding more and more domain knowledge in sub-

sequent steps until it is possible to use semantic mod-

eling. Technically, the four steps comprise ﬂat-ﬁle

back-end storage, automatic XML document shred-

ding, custom XML document shredding, and, as a ﬁ-

nal step, the transition to a relational back-end.

Directions for future research include investiga-

tions into how to utilize XML Schema and Semantic

Web information for optimizing Steps 2–4 in our

[Footnote: In a repository service a stored document is returned, in contrast to a query-generated document as in the more general case.]


[Figure 3: Two performance ﬁgures: (a) Bulkload, (b) Sequential Scan.]

framework. Of particular importance is the veriﬁca-

tion of queries and mappings in the fourth step where

manual interaction can introduce errors. Furthermore,

the mapping language mentioned in Step 4 currently

produces more locks than necessary and therefore impedes

parallelism. A more detailed analysis of

the declarative locking mechanism could lead to im-

proved code after a query rewriting phase.

ACKNOWLEDGMENTS

We would like to thank our students Jens Gorm

Rye-Andersen, Lasse Jensen, Jimmy Nielsen, Søren

Nøhr Christensen and Mads Wiederholt Jensen for

their implementation work and the performance mea-

surements.

REFERENCES

Bohannon, P., Freire, J., Roy, P., and Siméon, J. (2002).

From XML Schema to Relations: A Cost-Based Ap-

proach to XML Storage. In Proceedings of the IEEE

International Conference on Data Engineering.

Box, D., Ehnebuske, D., Kakivaya, G., Layman, A.,

Mendelsohn, N., Nielsen, H., Thatte, S., and Winer,

D. (2000). Simple Object Access Protocol (SOAP)

1.1. Available at http://www.w3.org/TR/SOAP/.

Chamberlin, D., Florescu, D., Robie, J., Siméon, J., and

Stefanescu, M. (2001). XQuery: A Query Language

for XML. Available at http://www.w3.org/TR/xquery.

e-XMLmedia. Services summary. Version 3.0. Available at

http://www.e-xmlmedia.com/sol/.

Fernández, M., Kadiyska, Y., Suciu, D., Morishima, A., and

Tan, W.-C. (2002). SilkRoute: a framework for pub-

lishing relational data in XML. ACM TODS, 27(4).

Florescu, D., Grünhagen, A., and Kossmann, D. (2003).

XL: a platform for Web Services. In Biennial Conference on

Innovative Data Systems Research.

Florescu, D. and Kossmann, D. (1999). Storing and Query-

ing XML Data using an RDMBS. IEEE Data Engi-

neering Bulletin, 22(3).

Freire, J. and Siméon, J. (2002). Adaptive XML Shredding:

Architecture, Implementation, and Challenges. In Ef-

ﬁciency and Effectiveness of XML Tools and Tech-

niques and Data Integration over the Web, VLDB

2002 Workshop EEXTT and CAiSE 2002 Workshop

DTWeb. Revised Papers, volume 2590 of Lecture

Notes in Computer Science. Springer.

Grabs, T., Böhm, K., and Schek, H.-J. (2002). XMLTM:

efﬁcient transaction management for XML documents.

In Proceedings of the Eleventh International Confer-

ence on Information and Knowledge Management,

pages 142–152.

Kordon, F. and Luqi (2002). An Introduction to Rapid Sys-

tem Prototyping. IEEE Transactions on Software En-

gineering, 28(9).

Manolescu, I., Florescu, D., Kossmann, D., Xhumari, F.,

and Olteanu, D. (2000). Agora: Living with XML

and Relational. In Proceedings of the International

Conference on Very Large Data Bases.

Orsini, R. and Celentano, A. (2002). A workbench for pro-

totyping XML data exchange. In Proceedings of Sis-

temi Evoluti per Basi di Dati (SEBD).

Schmidt, A., Kersten, M., Windhouwer, M., and Waas,

F. (2000). Efﬁcient relational storage and retrieval

of XML documents. In The World Wide Web and

Databases, Third International Workshop WebDB

2000.

Shanmugasundaram, J., Shekita, E., Barr, R., Carey, M.,

Lindsay, B., Pirahesh, H., and Reinwald, B. (2000).

Efﬁciently Publishing Relational Data as XML Docu-

ments. In 2000, pages 65–76.

Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., De-

Witt, D. J., and Naughton, J. F. (1999). Relational

Databases for Querying XML Documents: Limita-

tions and Opportunities. In Proceedings of the Inter-

national Conference on Very Large Data Bases, pages

302–314.

Thalheim, B. (2000). Fundamentals of Entity-Relationship

Modeling (Foundations of Database Technology).

Springer.
