ON ANALYZING THE DATABASE PERFORMANCE FOR

DIFFERENT CLASSES OF XML DOCUMENTS BASED ON THE

USED STORAGE APPROACH

Hagen H

opfner, J

org Schad and Essam Mansour

International University Bruchsal, Campus 3, 76646 Bruchsal, Germany

Keywords:

XML Data Management, XML Performance.

Abstract:

With the increasing popularity of XML data also the need for permanent XML document storage grows.

As of today there exist number of different XML storage alternatives ranging from XML enabled relational

database systems over a new class of hybrid database systems providing native storage for XML and relational

data to pure native XML systems. This paper examines how these different storage approaches perform in

respect to the different classes of XML data by devising a new benchmark HYBE with special consideration

to certain features of hybrid database systems. First results indicate that hybrid database systems can deliver

performance which is (almost) equivalent to native XML database systems making them the optimal choice

for small to mid-size companies with the need for both XML and relational data storage.

1 INTRODUCTION

In recent years the amount and usage of XML data

grew rapidly and today it is used for many purposes

such as data transfer, conﬁguration ﬁles or to store

information. So, the safe, efﬁcient and reliable stor-

age for such documents becomes more and more im-

portant. Until a few years ago there existed - be-

sides classical ﬁle systems - only two options for stor-

ing XML documents: either native XML databases

(e.g. Apache Xindice or Tamino (Sch

oning, 2003)) or

XML enabled relational databases providing an XML

data type. In the second case XML documents are

either stored as a character large object (clobbing)

or shredded into relational database tables (object re-

lational mapping). Recently large database vendors

such as Oracle or IBM developed another alternative,

the so called hybrid database systems. They store

both, relational data and XML data natively by pro-

viding two separate storage systems. However, all

alternatives have certain advantages and drawbacks.

Choosing the appropriate technique depends on the

application and is a non trivial task. Even though

there exist a number of XML benchmarks they are

usually focused on a speciﬁc application domain (e.g.,

ﬁnancial data) and so far have not considered the ad-

vantages of hybrid systems. Therefore, we here aim

at devising a benchmark considering different char-

acteristics of projects while explicitly including the

features of hybrid systems.

Paper Structure. Section 2 brieﬂy describes related

work. Section 3 discusses the nature of XML data.

Section 4 explains the storage alternatives. Section 5

outlines our newly devised benchmark HYBE. Sec-

tion 6 shows our preliminary results for the different

domains and storage approaches considered.

2 RELATED WORK

As of today there already exists a range of bench-

marks for XML database systems and also a range of

performance analysis has been performed so far.

XBench (Yao et al., 2004) is a family of XML

benchmarks. It generates various types of XML

documents (data-oriented and varying in their size)

and simulates different applications by inserting and

reading data using XQuery. The Michigan Bench-

mark (Runapongsa et al., 2006) runs 45 different

queries (loading, inserting, deleting and updating)

on one large XML document with a recursive struc-

ture. It mostly measures the performance of the

implementation but is not suited to measure across

different systems involving relational databases and

SQL queries. TPox (Nicola et al., 2007) is a do-

main speciﬁc benchmark for the ﬁnancial sector us-

ing FIXML (Cover, 2002). It contains a content gen-

243

Höpfner H., Schad J. and Mansour E. (2009).

ON ANALYZING THE DATABASE PERFORMANCE FOR DIFFERENT CLASSES OF XML DOCUMENTS BASED ON THE USED STORAGE

APPROACH.

In Proceedings of the 4th International Conference on Software and Data Technologies, pages 243-248

DOI: 10.5220/0002252802430248

 SciTePress

erator for different document sizes. The database op-

erations tests include read (XQuery and SQL/XML),

insert, delete and update queries. The systems simu-

lates up to 1,000,000 users to test multiuser perfor-

mance. XMach-1 (B

ohme and Rahm, 2001) is an

XQuery based benchmark for native XML databases

and XML-enabled databases focusing on b2b applica-

tions. It simulates a Web application for handling text

ﬁles consisting mostly of several document-oriented

XML ﬁles. However, as the meta data is handled

in a data-oriented fashion, XMach-1 is not restricted

to document-oriented ﬁles only. XMark (Schmidt

et al., 2001) models the database in the back end of

an Internet auction platform. It provides a document

generator for different document types mainly hav-

ing data-oriented characteristics but descriptive text

features include some document oriented features as

well. XMark uses only retrieval queries and does not

include any database updates. As Xmark is executed

as a single user application the execution time of each

query is measured individually but thereby focuses on

the query processing and not the entire database sys-

tem. X007 (Li et al., 2001) is the XML extension of

the 007 benchmark for object-oriented database sys-

tems (Carey et al., 1993). X007 does not model a

speciﬁc application. It provides one rather generic

but a customizable (size, complexity, depth) complex

XML ﬁle that is fairly regular. So, X007 focuses on

data-oriented features. Issued queries are a means of

retrieving data and do not include update queries.

(Lu et al., 2005; Nicola and Rodrigues, 2006)

analyze the performance with regard to the storage

models and different data requirements. (Nicola and

Rodrigues, 2006) focuses on IBM DB2 and com-

pares its pure XML storage models against shredding

or clobbing. Those studies indicate that clobbing is

best performing if XML documents are always con-

sidered as a whole. Shredding should be used for

ﬁxed schema data as it is often the case with data-

oriented documents. Native XML storage seems to

be the best choice for document-oriented documents

or if the schemata evolve. This view is also supported

by (Serna and Gerrikagoitia, 2005).

3 NATURE OF XML DATA

Different usage scenarios for XML data require

different types of XML documents. They are

classiﬁed into document-oriented and data-oriented

(DuCharme, 2004). Some authors also mix these cat-

egories as semistructured XML documents (Bourret,

2005). This differentiation is quite important as each

class poses different requirements to a database and

generally different database systems might be ade-

quate for each class. For example, previous research

has shown that data-oriented XML documents can be

stored efﬁciently in relational database systems while

document-oriented XML documents should be stored

in a native manner (Bourret, 2005). This paper shows

how hybrid database systems ﬁt into this picture.

Data-oriented XML Documents usually focuses

on data for machine processing. Examples include

ﬂight orders or stock quotes. Data-orientation is often

used to transport certain information in a predeﬁned

format. This usually results in a relatively ﬂat ele-

ment hierarchy and regular structure that conforms to

simple schema information. As this schema informa-

tion can be deﬁned precisely for data-oriented XML

data and does not change frequently, it can be easily

mapped into relational database systems.

The focus of Document-oriented XML Documents

is more similar to a document in the usual sense. Doc-

uments often contain much free text. At this, usu-

ally the entire document is of interest. Often they

are made for human consumptions; examples include

books and advertisements (Bourret, 2005). It is usu-

ally less regularly structured. Consequently it is less

precise and frequent schema updates prohibit (or at

least hinder) a mapping into relations. Hence, na-

tive XML database systems are often used for storing

document-oriented XML documents.

Semistructured XML Documents mix the two

other classes (Bourret, 2005). Here part of the doc-

ument is data-oriented while some subparts might

be document-oriented. Access to the data-oriented

segments is still possible for machines while the

document-oriented part might contain additional in-

formation for human readers.

4 XML DATA STORAGE

As mentioned in the previous sections, there exist

various alternatives for storing, retrieving and main-

taining XML documents. A straight forward solution

would be to treat the data as ﬁles and storing them in

the normal ﬁle system. Collections of different XML

documents could be achieved by a folder hierarchy.

Unfortunately this approach is accompanied by sev-

eral drawbacks such as a complicated search for par-

tial documents or a non appropriate access control.

4.1 Clobbing

Another simple solution is to store the XML doc-

uments as simple streams of characters using a re-

lational database system. For this purpose either a

ICSOFT 2009 - 4th International Conference on Software and Data Technologies

244

varchar column could be utilized. Some database

system such as Oracle and DB2 provide an explicit

character large objects (clob) type for storing larger

character streams

. Therefore, the data is outsourced

from the columns itself and just referenced. As clob-

bing interprets XML documents unparsed it faces

similar problems as the ﬁle system storage; even

while insertion and retrieval of full documents are

simple and fast, searching or accessing individual el-

ements often requires parsing of the data.

In order to make those operations more feasible,

database developers utilized another idea from rela-

tional databases: indexing (Nicola and Rodrigues,

2006). Those indexes or side tables, as called in DB2,

capture part of the structure of the XML document

and thereby make lookup and search over those ele-

ments independent from re-parsing the data. Unfor-

tunately, those tables require parsing on document in-

sertion. According to (Nicola and Rodrigues, 2006)

this slows down insertions by a factor 1.3. Indexing

also requires additional storage space. In the extreme

case the entire document is indexed

, but this requires

at least double the storage space of the original data.

Update queries (if supported at all) are usually

very slow as often the entire document has to rein-

serted into the database (cf. (Nicola and Rodrigues,

2006; Chen et al., 2007)). When comparing the dif-

ferent XML characteristics we do not expect much

difference for the non-indexed clobbing as there the

schema or characteristics or not fully considered.

For the indexed clobbing we expect the data-oriented

XML documents to deliver better performance as the

index is built based on schema information which is

more regular for data-oriented XML documents.

Despite those issues clobbing is regularly used for

storing XML Data especially in cases where only en-

tire documents are of relevance. Also with clobbing

the original form is preserved which might be re-

quired by company or legal policies.

4.2 Shredding

Shredding is also referred to as object-relational map-

ping or structured storage of XML documents. It is

the practice of converting XML documents into a re-

lational data format. This approach exploits the ex-

isting object-relational database technology and pro-

vides an interface for XML on top. Most commercial

database systems support shredding. Unfortunately, it

varchar usually supports data up to 3–5 KB while

clob columns in Oracle and DB2 support up to 2 GB of

data (Chen et al., 2007).

This would be equivalent to the document being stored

in using an object relational mapping.

has some drawbacks as well. It requires a complex

table structure to represent one to many relationships.

A similar problem occurs with nested and/or recursive

XML structures (Nicola and van der Linden, 2005, p.

2). Also it usually requires a ﬁxed schema to derive

the XML to relational mapping which is hard to mod-

ify later. Queries usually involve many time consum-

ing joins to collect all requested data. This directly

inﬂuence the performance.

Shredding requires more storage space due to the

representation of sparse XML documents (not all sub-

elements are presented). However, shredding simple

data-oriented XML documents can sometimes even

reduce the storage consumptions as tags are not re-

peated. So, shredding is a suitable storage model for

small, ﬁxed-schema XML documents that result in a

simple relational table mapping. However, it gener-

ally reduces the ﬂexibility of the XML data format

and results in worse performance when compared to

other approaches.

In this setting support for XQuery or XPath

queries requires those queries to be rewritten as SQL

queries on the underlying tables. The result then has

to be re-transformed to XML which is often done, e.g.

in Oracle 11g (Lee, 2007, p. 16 ff.), via SQL/XML

functionality.

4.3 Native XML Storage

As there exists no formal deﬁnition for “Native XML

Databases” (XML DB) the term is often used for mar-

keting reason for very different products. It was prob-

ably coined by the German Software AG for their ﬁrst

release of the Tamino XML Server in 1999 (Sch

oning,

2003). A widely used and cited deﬁnition is the one

by the XMLDB organization. It is based on the (log-

ical) model for an XML document – as opposed to

the data in that document – and stores and retrieves

documents according to that model (Bourret, 2005).

At a minimum, the model must include elements, at-

tributes, PCDATA, and document order. Examples of

such models are the XPath data model, the XML In-

foset, and the models implied by the DOM and the

events in SAX 1.0. An XML DB has an XML docu-

ment as its fundamental unit of (logical) storage, just

as a relational database has a row in a table as its

fundamental unit of (logical) storage. However, it is

not required to have any particular underlying phys-

ical storage model. For example, it can be built on

a relational, hierarchical, or object-oriented database,

or use a proprietary storage format such as indexed,

compressed ﬁles. A native XML database can store

XML documents without any limitation to the under-

lying structure and allows performant access to this

ON ANALYZING THE DATABASE PERFORMANCE FOR DIFFERENT CLASSES OF XML DOCUMENTS BASED

ON THE USED STORAGE APPROACH

245

XML structure.

Many of those systems use a hierarchical data

model so for example Tamino (Sch

oning, 2003) or

DB2 (Nicola and van der Linden, 2005). There-

fore, the XML documents are parsed and the data is

stored according to the underlying structural model.

This natural representation makes it suitable for both

document- and data-oriented XML documents and

provides ﬂexibility regarding the schema information

as this is most often derived from the data itself.

As the derivation of such model requires a parsed

data representation insertion times are often slower as

compared to clobbing without indexing (Nicola and

Rodrigues, 2006, p. 4), but this approach results in

superior query performance.

4.4 Hybrid XML Storage

Hybrid XML storage does not deﬁne another ap-

proach for storing XML documents. It is done com-

parably to native XML data storage but actually in-

troduces two separate storage engines into a database

system: One for relational data and another for native

XML storage. This results in a combination, which

is invisible to the user. Often they support different

models specifying how the XML document should

be stored for different needs. Oracle, for example,

supports three different storage models for different

needs. The “Hybrid Idea” was ﬁrst introduced by Or-

acle with their 10g release

and IBM with DB2 9. As

of today hybrid systems are also available from other

vendors (e.g., Microsoft SQL Server 2007).

The hybrid database architecture is illustrated in

Figure 1. It usually provides distinct interfaces for

XML and for relational data. These are then com-

piled into one common query format which can then

be executed against both the relational and XML stor-

age. There is no translation of XQuery into SQL as

in the case of shredding. Both data formats can be

be queried via both interfaces (e.g., XQuery on rela-

tional data and SQL on XML documents). It is also

possible to join XML documents and relational data

(Nicola and van der Linden, 2005).

The underlying storage is often based on hierar-

chical models to match the XML structure. Here of-

ten the same set of features is supported as in native

XML databases. The XML data type can be used as

a normal data type in table column deﬁnitions but the

XML documents are stored separately in a postparsed

representation. Consequently, a parsing at insertion

time is required. However, there is no need to parse

the data at the point of query execution.

Even though we would not consider the storage concept

of 10g to be native as it utilizes shredding.

Client

XQuery

Interface

SQL

Interface

Data

Management

Server

XML

Relational

Data

Storage

XQuery

XML

SQL

Relations

Figure 1: General Hybrid Database Architecture.

5 BECHMARKING

To evaluate the performance of the different XML

data storage approaches we devised a new bench-

mark called HYBE. It explicitly includes hybrid

systems in the performance analysis and supports

several query alternatives such as XPath, XQuery,

SQL/XML, XUpdate, W3C XQuery Update Facility.

HYBE consists of two parts: 1) a feature list con-

sidering the general functionality of the storage ap-

proach or database system and 2) a performance anal-

ysis with a ﬁxed set of queries measuring the perfor-

mance. The considered features were: maximal com-

plexity of XML documents, support for schema infor-

mation, support for different query languages, support

for updating queries, and support for combining XML

and relational data.

The query sets used for performance analyses

comprised: inserting 100 documents into the database

(sequentially and batch), retrieving a full document

speciﬁed by an ID, retrieving a speciﬁed value via

XPath and XQuery expressions (indexed and non-

indexed) and via SQL/XML, updating a single spec-

iﬁed value using XUpdate and W3C Update Facility,

and queries that combine relational and XML data.

For a more detailed description of the features and

query sets of HYBE please consider (Schad, 2008).

Due to the differences in database performance

and requirements of different XML document cat-

egories we considered the two major classes in

our benchmark described in Section 3: data- and

document-oriented XML documents.

The data-oriented documents were generated by

using Toxgene (Barbosa et al., 2002). The generated

documents are highly regular and the corresponding

schema information can simply be derived automat-

ically as the element hierarchy is ﬂat and does not

change between different documents. Each document

has an average size of 25 to 50 Byte and a maximum

element depth of four.

For document-orientation we used the XML gen-

erator of XMark. Still we were not looking for one

ICSOFT 2009 - 4th International Conference on Software and Data Technologies

246

very large ﬁle containing all the data but rather split

it into 500 smaller documents. Each resulting doc-

ument had a size between 150 and 700 Byte and an

element depth of less than ten elements. Furthermore,

some schema variation (when derived automatically)

is present as not all elements are mandatory, but we

kept the schema variation intentionally low to keep

the results interpretable. However, we assume a per-

formance advantage with more evolving schemata for

hybrid and native systems.

6 PRELIMINARY RESULTS

As the target of this research was to compare the per-

formance of storage alternatives rather than the per-

formance of systems, we decided to use only a sin-

gle DBMS. We choose IBM’s DB2 9.5 as it supports

clobbing, shredding, and the hybrid storage.

pure XML storage (data oriented)

pure XML storage (document oriented)

clobbing (data oriented)

clobbing (document oriented)

hybrid storage (data oriented)

100

1000

10000

100000

1000000

Insert

(100

documents)

Retrieve

(100

documents)

QueryXpath

(100

documents)

Update

element

(100

document)

Figure 2: Average time per operation per storage alterna-

tive.

We ran each query of our query set (compare Sec-

tion 5) 100 times and in order to check for and in-

corporate speedup effects over time (i.e., caused by

DBMS optimization and buffering as it is done in real

world applications) we repeated each run. These re-

sulting average times are shown in Figure 2.

To compare the different storage approaches two

different views were used: ﬁrst the different queries

were compared for each storage option visualizing the

different characteristics proposed above. As a next

step we compared the different storage options for

each query visualizing the performance characteris-

tic for each query type. This is especially interesting

when it known before hand that the main operation

for XML data in a certain project will be inserting.

The analysis of different queries per storage op-

tion shows that the hybrid XML storage has per-

formance characteristics comparable to native XML

storage (c.f., (Schad, 2008)). It performs better

for data-oriented documents (less complex parsing)

but still reasonably well (especially retrieval) for

document-oriented documents. The execution times

differed around a factor of 3 to 5 when the entire doc-

ument was considered. For node speciﬁc operations

the factor was roughly 1.2. Remember, the document-

oriented data was a factor of 40 larger, which is a com-

mon ratio in real world applications as well.

The XML Extender Shredding storage could not

handle our document-oriented documents as the map-

ping probably resulted in too many table columns.

Therefore, we could not compare execution times of

data and document-oriented data. Still insertion times

were relatively long as they require a parsed repre-

sentation and creation of a (possibly) complex table

structure and full document retrieval consumes the

most time when compared to the other alternatives.

This results from number of join operations which are

required. Still for a number of small data-oriented

documents it performed reasonably well.

For document insertion we could conﬁrm that

non-indexed clobbing is fasted if concerned with full

documents insertion and retrieval. It is about 1.5 to

2 times faster than the hybrid XML storage and al-

most 3.5 times faster than the XML Extender. Still

considering the effort of parsing XML data, this gap

is still considerably well and is the case where we re-

quire parsing in the clobbing setting as well it actually

performs worse than the hybrid XML solution. The

execution times for full document retrieval also left a

similar picture. However, the differences were a lit-

tle lower (between 1.3 and 1.5) when compared to the

hybrid storage but signiﬁcantly higher for schredding

(around 2 to 3) as here a full document retrieval re-

quires a number of join operations. However, without

indexes XML speciﬁc queries such as XQuery con-

sume a lot of time. Due to the XML parsing at query

time XPath queries to a certain node was a factor of

more than 20 times slower when compared to the hy-

brid and shredded storage. Here the hybrid storage

performs best followed by shredding (still around two

times slower). Both approaches can naturally access

individual nodes and therefore the difference between

document and data-oriented documents is here quite

low. Updating of individual nodes shows the same

picture as a parsed representation is required, too.

7 CONCLUSIONS

In this paper we presented a performance comparison

of the currently existing storage alternatives for XML

data in databases. We ﬁrstly described the characteris-

tics of the most important XML document classes that

build the basis of HYBE. To the best of our knowl-

edge, HYBE is the ﬁrst benchmark that evaluates the

hybrid storage approach. It was then used for rela-

ON ANALYZING THE DATABASE PERFORMANCE FOR DIFFERENT CLASSES OF XML DOCUMENTS BASED

ON THE USED STORAGE APPROACH

247

tively measuring the differences between the storage

alternatives regarding different query types.

Our results indicate that it is important to con-

sider the XML storage approach individually for each

project. A main distinction is whether the database

has to “understand” the data or can just consider them

as one large ﬁle. Other important points to consider

are, e.g., read vs. write or the update frequency.

Clobbing and shredding are limited to certain

characteristics and more or less suitable for data-

oriented documents. Hybrid systems are a good

choice for document-oriented documents. Also hy-

brid systems offer often more ﬂexibility as they sup-

port a wider range of features especially in cases

where XML and relational data must be combined.

As the focus of this study was the comparison of

different approaches it is interesting to compare dif-

ferent implementations of those approaches by differ-

ent products such as Oracle’s XML DB, IBM’s DB2

and Microsoft’s SQL Server. Furthermore, we plan to

extend the query set and extend the XML data used

(e.g., different document sizes). Another issue is the

evolution of XML schemata when performing update

queries.

Finally, we plan to observe index related perfor-

mances including speed-up and time for index cre-

ation. Here we assume that those results will vary

signiﬁcantly across different storage options.

REFERENCES

Barbosa, D., Mendelzon, A. O., Keenleyside, J., and Lyons,

K. (2002). ToXgene: An extensible template-based

data generator for XML. In WebDB 2002 Proc., pages

49–54. ACM.

ohme, T. and Rahm, E. (2001). XMach-1: A Benchmark

for XML Data Management. In BTW 2001 Proc.,

pages 264–273. Springer.

Bourret, R. (2005). XML and Databases. online article.

http://www.rpbourret.com/xml/XMLAndDa

tabases.htm.

Carey, M. J., DeWitt, D. J., and Naughton, J. F. (1993). The

007 Benchmark. ACM SIGMOD Record, 22(2):12–

21.

Chen, W.-J., Sammartino, A., Goutev, D., Hendricks, F.,

Komi, I., Wei, M.-P., and Ahuja, R. (2007). DB2 9

pureXML Guide. IBM redbooks.

Cover, R. (2002). FIXML - A Markup Language for

the FIX Application Message Layer. Cover pages.

xml.coverpages.org/ﬁxml.html.

DuCharme, B. (2004). Documents vs. Data, Schemas vs.

Schemas. XML 2004, pages 1554–4648.

Lee, G. (2007). Oracle Database 11g XML DB Technical

Overview. Oracle Corportion.

Li, Y. G., Bressan, S., Dobbie, G., Lacroix, Z., Lee, M. L.,

Nambiar, U., and Wadhwa, B. (2001). XOO7: apply-

ing OO7 benchmark to XML query processing tool.

In CIKM 2001, pages 167–174.

Lu, H., Yu, J., Wang, G., Zheng, S., Jiang, H., Yu, G., and

Zhou, A. (2005). What makes the differences: bench-

marking XML database implementations. TOIT,

5(1):154–194.

Nicola, M., Kogan, I., and Schiefer, B. (2007). An XML

transaction processing benchmark. In SIGMOD 2007,

pages 937–948. ACM.

Nicola, M. and Rodrigues, V. (2006). A performance com-

parison of DB2 9 pureXML and CLOB or shredded

XML storage.

Nicola, M. and van der Linden, B. (2005). Native XML

support in DB2 universal database. In VLDB 2005,

pages 1164–1174.

Runapongsa, K., Patel, J. M., Jagadish, H. V., Chen, Y.,

and Al-Khalifa, S. (2006). The Michigan benchmark:

towards XML query performance diagnostics. Infor-

mation Systems, 31(2):73–97.

Schad, J. (2008). XML-Document Management in

Databases — A Performance Evaluation for Hybrid

Database Systems. Bachelor thesis, IU in Germany,

School of IT, Bruchsal, Germany.

Schmidt, A. R., Waas, F., Kersten, M. L., Florescu, D.,

Manolescu, I., Carey, M. J., and Busse, R. (2001).

The XML Benchmark Project. Technical Report INS-

R0103, CWI, Amsterdam.

Sch

oning, H. (2003). Tamino-Software AG’s Native XML

Server. In XML Data Management, chapter 2.

Addison-Wesley.

Serna, A. and Gerrikagoitia, J. K. (2005). David & Go-

liath: A Comparison Of XML-Enabled and native

XML Data Management Techniques. XML Journal.

xml.sys-con.com/node/104980.

Yao, B. B.,

Ozsu, M. T., and Khandelwal, N. (2004).

XBench Benchmark and Performance Testing of

XML DBMSs. In ICDE 2004, pages 621–632.

ICSOFT 2009 - 4th International Conference on Software and Data Technologies

248