These systems can roughly be divided into the following
three categories:
• The flat streams approach handles the documents
as byte streams (Peng & Chawathe, 2003) (Barton
et al., 2003). Obviously, accessing the structure
of the documents requires parsing and consumes
a lot of time, but this option might still be viable
in a setting where the documents to be stored and
queried are small.
• In the metamodeling approach, the XML doc-
uments are first represented as trees which are
then stored in a database. With suitable indexes,
this approach provides fast access to subtrees,
and thus the metamodeling approach is by
far the most popular in the field of XML man-
agement research. Unlike in the flat streams ap-
proach, however, rebuilding large parts of origi-
nal documents from a large number of individual
nodes can be rather expensive. The partitioning
method discussed in this paper also falls into this
category.
• The mixed approach aims at combining the previ-
ous two approaches. In some systems, the data is
stored in two redundant repositories, one flat and
one metamodeled, which allows the stored doc-
uments to be queried and the result documents
to be built efficiently, but obviously creates some
storage overhead. This problem could be tackled
by compression, to which XML data is often very
amenable. There are also examples of a hybrid
approach in which coarser structures of the XML
documents are modeled as trees and finer struc-
tures as flat streams (Fiebig et al., 2003).
A lot of work has been carried out to accelerate
structural joins (Al-Khalifa et al., 2002), i.e., operations
which determine all occurrences of parent/child,
ancestor/descendant, preceding/following, etc. relationships
between node sets. Some approaches are
based on indexes built on XML trees, whereas oth-
ers aim at designing more efficient join algorithms.
The former category includes the so-called proxy index
(Luoma, 2005b) (Luoma, 2006), which effectively
partitions the nodes into overlapping partitions so that
the ancestors of any given node are contained within
the same partition. Thus, when the ancestors of a
given node are retrieved it is sufficient to check the
partitions to which the node is assigned. Conversely,
descendants can also be found efficiently since there
is no need to check the partitions to which the node
is not assigned. However, this approach is suitable
for accelerating only ancestor/descendant operations.
The method discussed in this paper, by contrast,
can also accelerate preceding/following operations.
The other group of methods includes, for example,
the staircase join (Grust & van Keulen, 2003). The
idea of the staircase join is to prune the set of context
nodes, i.e., the initial nodes from which the axis
is evaluated. For instance, for any two context nodes
n and m such that m is a descendant of n, all descendants
of m are also descendants of n, and thus m can be
pruned before evaluating the descendant axis. An-
other method based on preprocessing the nodes was
proposed in (Tang et al., 2005). In this approach
the nodes were partitioned somewhat similarly to the
method discussed in this paper. However, both of
these methods require a considerable amount of pro-
gramming effort since they work by preprocessing the
data rather than by building indexes. In the context of
relational databases, for example, this would mean ei-
ther reprogramming the DBMS internals or program-
ming a collection of external classes to implement
the join algorithms. Our method, by contrast, requires
very little programming effort even when implemented
using a relational database, which we regard
as the main advantage of our approach.
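
The context-node pruning described above for the staircase join can be illustrated with a minimal sketch. It covers only the pruning step for the descendant axis, not the join itself, and it assumes that each context node is represented as a hypothetical (pre, post) pair of preorder and postorder numbers. The function name and the sample pairs are illustrative and are not taken from the cited work.

```python
def prune_for_descendant_axis(context):
    """Drop context nodes that are descendants of another context node.

    `context` is an iterable of (pre, post) pairs (an assumed encoding).
    A node m is a descendant of n exactly when pre(n) < pre(m) and
    post(m) < post(n).
    """
    kept = []
    max_post = -1
    # Visit the context nodes in document (preorder) order.
    for pre, post in sorted(context):
        if post > max_post:
            # Not inside the subtree of any node kept so far.
            kept.append((pre, post))
            max_post = post
        # Otherwise the node lies inside a kept node's subtree, so its
        # descendants are already covered and it can be skipped.
    return kept


# Hypothetical context set: (4, 2) lies inside the subtree of (3, 3).
print(prune_for_descendant_axis([(4, 2), (3, 3), (5, 5)]))
```

Because the kept nodes are pairwise unrelated, their subtrees do not overlap, so evaluating the descendant axis from them yields no duplicate result nodes.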
3 XPATH BASICS
As mentioned earlier, XPath (W3C, 2006b) is based
on a tree representation of a well-formed XML docu-
ment, i.e., a document that conforms to the syntactic
rules of XML. A simple example of an XML tree cor-
responding to the XML document <b><c d="y"/><c
d="y"><e>kl </e></c><c><e>ez</e></c></b> is
presented in Figure 1. The nodes are identified using
their preorder and postorder numbers which encode a
lot of structural information¹.
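
As a rough illustration of how such numbers can be assigned, the following sketch computes preorder and postorder ranks for the example document with Python's standard xml.etree.ElementTree module. It is a simplification under stated assumptions: only element nodes are numbered (attributes and text are ignored), ranks start from 1, and the helper names number_nodes and is_descendant are hypothetical, so the values need not match those shown in Figure 1.

```python
import xml.etree.ElementTree as ET

DOC = '<b><c d="y"/><c d="y"><e>kl </e></c><c><e>ez</e></c></b>'


def number_nodes(root):
    """Return {element: (pre, post)} with 1-based traversal ranks."""
    numbers = {}
    pre_counter = 0
    post_counter = 0

    def visit(node):
        nonlocal pre_counter, post_counter
        pre_counter += 1            # numbered before its subtrees
        pre = pre_counter
        for child in node:          # child elements, left to right
            visit(child)
        post_counter += 1           # numbered after its subtrees
        numbers[node] = (pre, post_counter)

    visit(root)
    return numbers


def is_descendant(numbers, m, n):
    """True if m is a descendant of n, decided from the numbers alone."""
    pre_m, post_m = numbers[m]
    pre_n, post_n = numbers[n]
    return pre_n < pre_m and post_m < post_n


root = ET.fromstring(DOC)
numbers = number_nodes(root)
for node, (pre, post) in numbers.items():
    print(node.tag, pre, post)
```

The is_descendant check shows one piece of the structural information the numbers carry: a descendant always has a larger preorder number and a smaller postorder number than its ancestor.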
The tree traversals in XPath are based on 12 axes
which are presented in Table 1. In simple terms, an
XPath query can be thought of as a series of location
steps of the form /axis::nodetest[predicate]
which start from a context node - initially the root of
the tree - and select a set of related nodes specified by
the axis. A node test can be used to restrict the name
or the type of the selected nodes. An additional predi-
cate can be used to filter the resulting node set further.
The location step n/child::c[child::*], for example,
selects all children of the context node n which
are named "c" and have one or more child elements. As
a shorthand for axes child, descendant-or-self,
and attribute, one can use /, //, and /@, respec-
tively.
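
The three parts of a location step (axis, node test, predicate) can be made concrete with a small hand-written sketch for the step child::c[child::*], applied to the root of the example document. It uses Python's standard xml.etree.ElementTree module; the function child_step and its parameters are hypothetical names introduced for illustration and are not part of any XPath API.

```python
import xml.etree.ElementTree as ET

DOC = '<b><c d="y"/><c d="y"><e>kl </e></c><c><e>ez</e></c></b>'


def child_step(context, name, has_element_child=False):
    """Evaluate context/child::name, optionally filtered by [child::*]."""
    selected = []
    for child in context:                 # axis: child elements of context
        if child.tag != name:             # node test: element name
            continue
        if has_element_child and len(child) == 0:
            continue                      # predicate [child::*] not satisfied
        selected.append(child)
    return selected


root = ET.fromstring(DOC)
all_c = child_step(root, "c")
c_with_children = child_step(root, "c", has_element_child=True)
print(len(all_c), len(c_with_children))
```

In this document the first c element is empty, so the predicate removes it while the node test alone keeps all three c elements.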
¹ In preorder traversal, a node is visited before its subtrees
are recursively traversed from left to right, and in postorder
traversal, a node is visited after its subtrees have been
recursively traversed from left to right.