expression can be represented as a tree structure as
shown in Figure 1.
Figure 1: A query tree.
In this tree structure, a node v is labeled with an
element name or a string value, denoted as label(v).
In addition, there are two kinds of edges: child edges
(c-edges) for parent-child relationships, and
descendant edges (d-edges) for ancestor-descendant
relationships. A c-edge from node v to node u is
denoted by v → u in the text, and represented by a
single arc; u is called a c-child of v. A d-edge is
denoted v ⇒ u in the text, and represented by a
double arc; u is called a d-child of v. In addition, a
node in Q can be a wildcard ‘*’ that matches any
element in T. Such a query is often called a twig
pattern. In the following discussion, we use
startElement and node interchangeably since each
startElement event in S exactly corresponds to a
node in T.
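The correspondence between startElement events and nodes can be illustrated with a small sketch based on Python's standard xml.sax module. The sample document, the class name TreeBuilder and the helpers count_nodes and parse are hypothetical and chosen only for illustration. Only element nodes are counted here; string values are ignored.

```python
import xml.sax


class TreeBuilder(xml.sax.ContentHandler):
    """Builds a tree of element nodes from a stream of SAX events (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.open_nodes = []    # elements whose endElement has not yet been seen
        self.start_events = 0   # number of startElement events received

    def startElement(self, name, attrs):
        # Each startElement event creates exactly one node of the tree.
        self.start_events += 1
        node = {"label": name, "children": []}
        if self.open_nodes:
            self.open_nodes[-1]["children"].append(node)
        else:
            self.root = node
        self.open_nodes.append(node)

    def endElement(self, name):
        self.open_nodes.pop()


def count_nodes(node):
    """Number of element nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node["children"])


def parse(xml_text):
    handler = TreeBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


# Hypothetical sample document, not taken from the paper.
handler = parse("<a><b/><c><b/></c></a>")
print(handler.start_events, count_nodes(handler.root))
```

For this sample, the number of startElement events and the number of element nodes in the reconstructed tree coincide.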
- XML query evaluation and tree matching
In any DAG (directed acyclic graph), a node u is
said to be a descendant of a node v if there exists a
path (sequence of edges) from v to u. In the case of a
twig pattern, this path could consist of any sequence
of c-edges and/or d-edges. Based on these concepts,
the tree embedding can be defined as follows.
Definition 1. An embedding of a twig pattern Q into
an XML document T is a mapping f: Q → T, from
the nodes of Q to the nodes of T, which satisfies the
following conditions:
(i) Preserve node label: For each u ∈ Q, label(u) =
label(f(u)).
(ii) Preserve c/d-child relationships: If u → v in Q,
then f(v) is a child of f(u) in T; if u ⇒ v in Q,
then f(v) is a descendant of f(u) in T.
If there exists such a mapping from Q into T, we say
that Q can be embedded into T, or that T contains Q.
The purpose of XML query evaluation is to find all the
subtrees of T that contain Q.
Notice that an embedding could map several
nodes of the query (of the same label) to the same
node of the database. It also allows a tree to be mapped
to a path. This definition is quite different from the tree
matching defined in (Hoffmann and O’Donnell,
1982).
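Definition 1 and the remark above can be illustrated with a brute-force sketch. It is not the streaming algorithm developed in this paper; it only checks the two conditions of the definition recursively on an in-memory tree. The class names TNode and QNode, the function names, and the sample trees are assumptions made for illustration. A wildcard label '*' is treated as matching any label, and matches reports the nodes of T to which the root of Q can be mapped, which is one possible reading of "subtrees that contain Q".

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    """A node of the document tree T (hypothetical representation)."""
    label: str
    children: list = field(default_factory=list)


@dataclass
class QNode:
    """A node of the twig pattern Q (hypothetical representation)."""
    label: str                                       # element name, string value, or "*"
    c_children: list = field(default_factory=list)   # targets of c-edges (u -> v)
    d_children: list = field(default_factory=list)   # targets of d-edges (u => v)


def descendants(t):
    """All proper descendants of document node t, in document order."""
    out = []
    for child in t.children:
        out.append(child)
        out.extend(descendants(child))
    return out


def embeds_at(q, t):
    """True if the subpattern rooted at q has an embedding f with f(q) = t."""
    # Condition (i): labels agree (a wildcard matches any label).
    if q.label != "*" and q.label != t.label:
        return False
    # Condition (ii), c-edges: each c-child must map to some child of t.
    for qc in q.c_children:
        if not any(embeds_at(qc, tc) for tc in t.children):
            return False
    # Condition (ii), d-edges: each d-child must map to some descendant of t.
    for qd in q.d_children:
        if not any(embeds_at(qd, td) for td in descendants(t)):
            return False
    return True


def matches(q_root, t_root):
    """Nodes t of T such that the root of Q can be mapped to t."""
    return [t for t in [t_root] + descendants(t_root) if embeds_at(q_root, t)]


# Document: a path a / c / b.
doc = TNode("a", [TNode("c", [TNode("b")])])
# Pattern with two d-children of the same label: a => b and a => b.
twig_d = QNode("a", d_children=[QNode("b"), QNode("b")])
# Pattern with a c-edge: a -> b.
twig_c = QNode("a", c_children=[QNode("b")])

print(embeds_at(twig_d, doc))   # both b nodes of Q map to the single b of T
print(embeds_at(twig_c, doc))   # b is not a child of a in T
```

Because the children of a query node are matched independently, nothing prevents two query nodes from being mapped to the same document node. In the sample, the branching pattern twig_d is embedded into a document that is a single path, whereas twig_c fails because its c-edge requires a direct child.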
Recently, many strategies have been proposed to
evaluate XPath queries in an XML streaming
environment (Avila et al., 2002; Chen et al., 2006;
Ives et al., 2002; Koch et al., 2004;
Ludascher et al., 2002; Peng and Chawathe, 2003;
Peng et al., 2003). The methods discussed in (Avila
et al., 2002; Ives et al., 2002) are based on finite
state automata (FSA), but can handle only single-path
queries; i.e., a query containing branching
cannot be processed, as observed in (Peng and
Chawathe, 2003). The method proposed in (Peng
and Chawathe, 2003) is a general strategy, but
requires exponential time (O(|T| × 2^|Q|)) in the worst
case, as analyzed in (Peng et al., 2003). The methods
discussed in (Koch et al., 2004; Ludascher et al.,
2002) do not support d-edges; if they are extended to
the general case, exponential time is required. To
date, this line of research culminates in TwigM, presented in
(Chen et al., 2006). It is not only a general-case
algorithm, but also runs in polynomial time. In the
worst case, its time complexity is bounded by
O(T_h Q_d |Q||T| + |Q|^2 |T|), where T_h is the height of T
and Q_d is the largest outdegree of a node in Q. In
this method, each node q of Q is associated with a
boolean array of length Q_d and a stack of size T_h, in
which each element is a node v from T whose
relationship with the nodes in the stack associated
with q's parent q′ satisfies the relationship between
q and q′. Therefore, each time a stack is determined
and a node is pushed onto it, O(T_h Q_d |Q|) time is required,
leading to a time complexity of O(T_h Q_d |Q||T| +
|Q|^2 |T|). See Theorem 4.4 in (Chen et al., 2006).
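To give a feel for the gap between the two worst-case bounds quoted above, the following sketch simply evaluates both expressions for one set of sizes. The function names and the sample values are hypothetical, and asymptotic bounds hide constant factors, so the figures indicate growth only and say nothing about measured running times.

```python
def exponential_bound(t_size, q_size):
    """Evaluates |T| * 2^|Q|, the exponential worst-case expression."""
    return t_size * 2 ** q_size


def twigm_bound(t_size, q_size, t_height, q_outdegree):
    """Evaluates T_h * Q_d * |Q| * |T| + |Q|^2 * |T|, the polynomial expression."""
    return t_height * q_outdegree * q_size * t_size + q_size ** 2 * t_size


# Hypothetical sizes: |T| = 1000, |Q| = 20, T_h = 10, Q_d = 3.
print(exponential_bound(1000, 20))      # 1048576000
print(twigm_bound(1000, 20, 10, 3))     # 1000000
```

With these sample values the first term of the polynomial expression contributes 600000 and the second 400000.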
The remainder of the paper is organized as
follows. In Section 2, we discuss an algorithm for
the simple case in which a twig pattern contains only
d-edges, as well as wildcards and branches. In Section
3, we extend this algorithm to the general case. Finally,
a short conclusion is given in Section 4.
2 ALGORITHM FOR SIMPLE CASES
In this section, we describe an algorithm for the simple
case in which a twig pattern contains only d-edges,
wildcards and branches. First, we give a basic
algorithm in 2.1. Then, in 2.2, we prove the
correctness of the algorithm and analyze its
computational complexity.