STACK ENCODING REVISITED
Yangjun Chen
Dept. Applied Computer Science, University of Winnipeg, Canada
Keywords: XML databases, Trees, Paths, XML pattern matching, Twig joins.
Abstract: The twig join, which is used to find all occurrences of a twig pattern in an XML database, is a core
operation for XML query processing. A great many strategies for handling this problem have been proposed
and can be roughly classified into two groups. The first group decomposes a twig pattern (a small tree) into
a set of binary relationships between pairs of nodes, such as parent-child and ancestor-descendant relations;
and transforms a tree matching problem into a series of simple relation look-ups. The second group
decomposes a twig pattern into a set of paths. Among all this kind of methods, the approach based on the
so-called stack encoding [N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Hoins: Optimal XML
Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002,
pp. 310-321] is very interesting, which can represent in linear space a potentially exponential (in the number
of query nodes) number of matching paths. However, the available processes for generating such
compressed paths suffer some redundancy and can be significantly improved. In this paper, we analyze this
method and show that the time complexities of path generation in its two main procedures: TwigStack and
TwigStackXB can be reduced from O(m
2
n) to O(mn), where m and n are the sizes of the query tree and
document tree, respectively. Experiments have been done to compare TwigStackXB and ours, which shows
that using our method much less time is needed to generate matching paths.
1 INTRODUCTION
In XML (World Wide Web Consortium, 1991,
2001), data is represented as a tree; associated with
each node of the tree is an element type from a finite
alphabet . The children of a node are ordered from
left to right, and represent the content (i.e., list of
subelements) of that element.
To abstract from existing query languages for
XML, e.g. XPath (World Wide Web Consortium,
1991), XQuery (World Wide Web Consortium,
2001), XML-QL (Deutch and et al, 1999), and Quilt
(Chamberlin and et al, 1999; Chamberlin and et al,
2000), we express queries as tree patterns where
nodes are types from ∑ ∪ {*} (* is a wildcard,
matching any node type) and string values, and
edges are parent-child or ancestor-descendant
relationships. As an example, consider the query tree
shown in Figure 1, which asks for any node of type
b (node 2) that is a child of some node of type a
(node 1). In addition, the b type (node 2) is the
parent of some c type (node 4) and an ancestor of
some d type (node 5). Type b (node 3) can also be
the parent of some e type (node 7). The query
corresponds to the following XPath expression:
a[b[c and //d]]/b[c and e//d].
In Figure 1, there are two kinds of edges: child
edges (c-edges) for parent-child relationships, and
descendant edges (d-edges) for ancestor-descendant
relationships. A c-edge from node v to node u is
denoted by v u in the text, and represented by a
single arc; u is called a c-child of v. A d-edge is
denoted v u in the text, and represented by a
double arc; u is called a d-child of v.
Finding all occurrences of a twig pattern in a
database has been considered as a core operation in
querying tree structured XML data, both in
relational implementation of XML databases, and in
native XML databases.
Recently this problem has received much
attention in database research community and
different strategies have been proposed. Most of
4 c d 5
2 b
d 8
1 a
6 c
b 3
e 7
Figure 1: A query tree.
5
Chen Y. (2007).
STACK ENCODING REVISITED.
In Proceedings of the Third International Conference on Web Information Systems and Technologies - Internet Technology, pages 5-14
DOI: 10.5220/0001260800050014
Copyright
c
SciTePress
them, for example, all the strategies proposed in (Al-
Khalifa and et al, 2002; Florescu, Kossman, 1999;
McHugh, Widom, 1999; Shanmugasundaram and et
al, 1999; Tukwila System, 2000; Niagara System,
2000; Zhang and et al, 2001), typically decompose a
twig pattern into a set of binary relationships
between pairs of nodes, such as parent-child and
ancestor-descendant relations; and the sizes of
intermediate relations tend to be very large, even
when the input and final result sizes are much more
manageable. Another kind of strategies bases on
path decomposition, such as those discussed in
(Bruno and et al, 2002; Wang and et al, 2003; Wang
and Meng, 2005). In (Wang and et al, 2003; Wang
and Meng, 2005), all the possible paths of an XML
document are explicitly stored and indexed using
B+-trees as well as trie structures. In (Bruno and et
al, 2002), a document is also decomposed, but
dynamically depending on the given queries. This
method is of special interest since the decomposed
paths are not simply stored but compressed by using
the so-called stack encoding. It reduces the number
of intermediate matching paths dramatically.
Although the idea of compressing intermediate
results is very attractive, the process suggested in
(Bruno and et al, 2002) for producing compact paths
is not so efficient and can be substantially improved.
In this paper, we analyze the method described in
(Bruno and et al, 2002) and show that the matching
paths can be produced in a more efficient way.
Particularly, two new algorithms are presented,
which improve the two main procedures of this
method: TwigStack and TwigStackXB, by one order
of magnitude. In (Bruno and et al, 2002), TwigStack
is utilized to generate matching paths for queries
containing only d-edges while TwigStackXB is for
queries containing both c- and d-edges.
The remainder of the paper is organized as
follows. In Section 2, we review the concept of stack
encoding and the algorithm TwigStack presented in
(Bruno and et al, 2002), which is necessary for the
subsequent discussion. In Section 3, we propose a
new algorithm RefinedTwigStack to do the same task
as TwigStack, but using much less time. In Section
4, we extend RefinedTwigStack to general cases.
Finally, a short conclusion is set forth in Section 5.
2 ON THE TWIGSTACK
ALGORITHM
In this section, we review the main procedure
TwigStack given in (Bruno and et al, 2002), which is
used to evaluate a special kind of queries that
contain only d-edges. However, by using a variant
structure of B-tree, called XB-tree, TwigStack can be
easily extended to general cases with both c-edges
and d-edges involved.
In the following, we first review what is a stack
encoding in 2.1. Then, we describe the TwigStack
algorithm (Bruno and et al, 2002) and analyze its
time complexity in 2.2. In (Bruno and et al, 2002), a
theoretical time analysis is not delivered.
2.1 On the Stack Encoding
Let T be a document tree. Let q = q
1
q
2
... q
m-1
q
m
be a query path containing only d-edges. We
associate each q
i
(i = 1, ..., m) with a stack, denoted
S(q
i
), in which each entry is a pair (v, p) with v being
a node in T and p is a pointer to an entry in
S(parent(q
i
)), where parent(q
i
) represents the parent
of q
i
.
At every point during the computation, all S(q
i
)’s
have the following properties
(i) The entries in S(q
i
) (from bottom to top) are
guaranteed to lie on a root-to-leaf path in T.
(ii) The set of stacks contains a compact encoding of
matching paths.
As an example, consider T and q shown in
Figure 2(a).
Obviously, T has four subpaths that match q, as
shown in Figure 2(b). By using the stack encoding,
they can be stored in a way as shown in Figure 2(c),
using much less space.
First, we notice that the matching path v
3
v
4
v
5
v
6
is encoded since v
6
points to v
5
, v
5
to v
4
,
and v
4
to v
3
. Also, the matching path v
1
v
4
v
5
v
6
is encoded since v
1
is below v
3
on the stack
S(q
1
). For the same reason, v
1
v
2
v
5
v
6
is a
matching path since v
2
is below v
4
on the stack S(q
2
)
and has a pointer to v
1
. Finally, since v
3
is below v
5
on the stack S(q
3
) and has a pointer to v
2
, v
1
v
2
v
3
v
6
is also an answer. However, the nodes v
3
,
v
2
, v
5
, v
6
do not make up a matching path since v
3
is
above v
1
on S(q
1
), to which v
2
points.
2.2 Description of TwigStack
Now we describe the algorithm TwigStack, which
stores the intermediate results in a way of stack
encoding, and analyze its time complexity. For this
purpose, we first show a tree encoding method
(Zhang and et al, 2001) and define some notations
that are used in the description of TwigStack.
Let T be a document tree. We associate each
node v in T with a quadruple (DocId, LeftPos,
RightPos, LevelNum), denoted as α(v), where DocId
WEBIST 2007 - International Conference on Web Information Systems and Technologies
6
is the document identifier; LeftPos and RightPos are
generated by counting word numbers from the
beginning of the document until the start and end of
the element, respectively; and LevelNum is the
nesting depth of the element in the document. (See
Figure 3 for illustration.) By using such a data
structure, the structural relationship between the
nodes in an XML database can be simply
determined (Zhang and et al, 2001):
(i) ancestor-descendant: a node v
1
associated with
(d
1
, l
1
, r
1
, ln
1
) is an ancestor of another node v
2
with (d
2
, l
2
, r
2
, ln
2
) iff d
1
= d
2
, l
1
< l
2
, and r
1
> r
2
.
(ii) parent-child: a node v
1
associated with (d
1
, l
1
,
r
1
, ln
1
) is the parent of another node v
2
with (d
2
,
l
2
, r
2
, ln
2
) iff d
1
= d
2
, l
1
< l
2
, r
1
> r
2
, and ln
1
= ln
2
+ 1.
(iii) from left to right: a node v
1
associated with (d
1
,
l
1
, r
1
, ln
1
) is to the left of another node v
2
with
(d
2
, l
2
, r
2
, ln
2
) iff d
1
= d
2
, r
1
< l
2
.
Let q be a query tree containing only d-edges.
We associate each q
i
in q with a data stream L(q
i
),
which contains the quadruples of the database nodes
that match q
i
as illustrated in Figure 4. Such a list
can be established by using an efficient access
mechanism, such as an index structure. In addition,
the quadruples in a list are sorted by their (DocId,
LeftPos) values.
Finally, we notice that in both S(q
i
) and L(q
i
), a
node v is referenced by α(v). But we will refer to v
and α(v) interchangeably in the subsequent
discussion if no confusion will be caused.
In terms of the data structure described above,
we can now specify some operations that are used in
TwigStack.
- next(L(q
i
)): return the next element in L(q
i
).
Initially, the pointer is to the position before the
first element in L(q
i
).
- advance(L(q
i
)): move to the next element in L(q
i
);
- LeftPos(α): return the LeftPost value of α;
- RightPos(α): return the RightPost value of α.
Algorithm TwigStack operates in two phases. In
the first phase, all paths matching individual query
root-to-leaf paths are produced (lines 1 - 14). In the
second phase, these matching paths are merge-joined
to create the answers to the query twig pattern (line
15).
In order to generate all the matching paths, the
query tree q is accessed repeatedly and each time a
node q
i
, which has in its L(
j
i
q ) a node v with the
least LeftPos value among all the nodes in all
L(q
j
)’s, is chosen, satisfying the following
conditions:
(i) Let
1
i
q , ...,
k
i
q
be the children of q
i
. Let v be the
next node in L(q
i
) to be handled. Then, for each
j
i
q (1 j k), v has a descendant u such that
α(u) is in L(
j
i
q ).
(ii) Each
j
i
q recursively satisfies the first property.
Such a node is selected by executing a function,
called getNext(q), which is repeatedly invoked. In
this way, each solution to each individual query
root-to-leaf path is guaranteed to be merge-joinable
with at least one solution to each of other root-to-
leaf paths.
Once such a node, denoted q
act
, is found, the
quadruple α = next(L(q
act
)) (which represents a node
v in T) will be pushed onto S(q
act
) as follows:
1. If q
act
is the root of q, remove any α(u) in S(q
act
)
with RightPos(α(u)) < LeftPos(α(q
act
)). Then, put
next(L(q
act
)) on the top of S(q
act
).
2. If q
act
is not the root of q, remove any α(u) in
S(parent(q
act
)) with RightPos(α(u)) <
LeftPos(next(L(q
act
))). If S(parent(q
act
)) remains
unempty, put next(L(q
act
)) on the top of S(q
act
)
after all the v in S(q
act
) with RightPos(α(v)) <
LeftPos(α(q
act
)) are removed.
Figure 3: Illustration for tree encoding.
A v
1
B v
2
v
6
B
C v
3
v
4
B
v
5
C
(1, 1, 9, 1)
(1, 2, 7, 2)
(1, 3, 3, 3)
(1, 4, 6, 3)
(1, 5, 5, 4)
(1, 8, 8, 2)
Figure 4: Illustration for for L(q
i
)’s.
(v
2
, v
4
, v
6
)
(v
3
, v
5
)
T:
A v
1
B v
2
A v
3
B v
4
A v
5
C v
6
q:
A q
1
B q
2
A q
3
C q
4
A B A C
v
3
v
4
v
5
v
6
v
1
v
4
v
5
v
6
v
1
v
2
v
5
v
6
v
1
v
2
v
3
v
6
v
5
v
3
v
1
v
6
v
4
v
2
v
5
v
3
v
1
(a) (b) (c)
Figure 2: Illustration for stack encoding
S(q
4
) S(q
3
) S(q
2
)
S(q
1
)
A q
1
B q
2
q
5
B
C q
3
q
4
C
(v
1
)
(v
3
, v
5
)
(v
2
, v
4
, v
6
)
STACK ENCODING REVISITED
7
If q
act
is a leaf node, store the corresponding
matching paths.
For the purpose of self-contentment, we show
here the algorithm TwigStack in a format slightly
different from (Bruno and et al, 2002). Then, we
conduct a sample trace and analyze its time
complexity. (In the original paper (Bruno and et al,
2002), the time complexity analysis was not
available.)
Algorithm TwigStack(q)
(*phase 1*)
1. while ¬
end(q) do
2. {
q
act
getNext(q);
3. if (
q
act
is not the root) then
4.
cleanStack(S(parent(q
act
)), LeftPos(next(L(q
act
)));
5. if (
q
act
is the root of q) ∨ ¬ empty(S(parent(q
act
)))
6. then
7. { cleanStack(S(q
act
), LeftPos(next(L(q
act
)));
8.
moveStreamToStack(L(q
act
), S(q
act
),
pointer to top(S(q
act
)));
9. if (
q
act
is a leaf node) then
10. {output all the matching paths (stored in
stacks) in the compact form;
11. pop(
S(q
act
));}
12. }
13. else
advance(L(q
act
));
14.}
(*phase 2*)
15.
mergeAllPathSolutions();
Function getNext(q)
1. if (
q is a leaf node) then return q;
2. let
q
1
, ..., q
k
be the children of q;
3. for
i = 1 to k do
4. {
n
i
getNext(q
i
);
5. if (
n
i
q
i
) then return n
i
;}
6.
n
min
min{LeftPos(n
1
), ..., LeftPos(n
k
)};
7.
n
max
max{LeftPos(n
1
), ..., LeftPos(n
k
)};
8.
while (RightPost(next(L(q)) < LeftPost(next(L(n
max
)) do
9.
advance(L(q));
10. if (
LeftPost(next(L(q)) < LeftPost(next(L(n
min
))
11. then return
q;
12. else return
n
min
;
Function
end(q)
1. if for any leaf node
q
leaf
, L(q
leaf
) is empty
2. then return
true
3
. else return false;
Procedure
cleanStack(S, actL)
1. while (¬ empty(
S) (RightPos(top(S) < actL) do
2. pop(
S);
Procedure moveStreamToStack(L, S, p)
1. push(
S, next(L), p);
2.
advance(L);
By each iteration of the main while-loop of
TwigStack(q), getNext(q) is called to find a node q
act
to handle (see line 2). Then, by executing lines 3 - 8,
next(L(q
act
)) is pushed onto S(q
act
) in the way as
described by (1) and (2) above. If q
act
is a leaf node,
all the matching paths (in their stack encoding) will
be stored in the compact form (see line 9 - 11). In
addition, no matter whether next(L(q
act
)) can be put
onto S(q
act
), the pointer for L(q
act
) will be shifted to
the next element (see line 2 in moveStreamToStack
and line 13 TwigStack).
getNext(q) is a recursive algorithm, by which the
whole q is searched top-down. In this way, any node
returned has always the least preorder number with
the conditions (i) and (ii) above satisfied. This can
be seen from lines 8 - 9, as well as lines 10 - 12.
Finally, we notice that the algorithm terminates
when all L(q
leaf
)’s become empty (see Function
end(q)).
The following example helps for illustration. It is
a detailed sample trace, which not only facilitates the
analysis of the algorithm’s time complexity, but also
reveals a possibility of improvements.
Example 1. Consider the document tree T shown in
Figure 3 and the query tree q shown in Figure 4.
Corresponding to the three leaf nodes in q, we have
three paths: P
1
: q
3
q
2
q
1
; P
2
: q
4
q
2
q
1
; and
P
3
: q
5
q
1
. When we apply TwigStack to T and q,
the stacks associated with the nodes in q will be
changed as follows.
Step 1 - 3: By the first iteration of the main while-
loop, q
1
is selected and then v
1
is pushed onto
S(q
1
). By the second iteration, q
2
will be
chosen since after the first iteration L(q
1
)
becomes empty and so we cannot find a v in
L(q
1
), which is an ancestor of next(L(q
2
)) =
v
2
. (See line 12 in getNext.) Therefore, v
2
goes into S(q
2
). For the same reason, q
3
will
be chosen by the third iteration and
next(L(q
3
)) = v
3
goes into S(q
3
). Since q
3
is a
leaf node, we get the first matching path (for
P
1
): v
3
v
2
v
1
. See Figure 5(a) for
illustration.
Step 4: By the fourth iteration, q
4
is selected and
next(L(q
4
)) = v
3
is pushed onto S(q
4
)) as
shown in Figure 5(b). We get the second
matching path (for P
2
): v
3
v
2
v
1
.
Step 5: By the fifth iteration, q
2
is chosen again.
Then, next(L(q
2
)) = v
4
is put on the top of
S(q
2
) as shown in Figure 5(c). Remember
that after each iteration, the pointer for the
WEBIST 2007 - International Conference on Web Information Systems and Technologies
8
corresponding L(q
act
) is shifted to the next
element.
Step 6: By the sixth iteration, q
3
is selected once
again and next(L(q
3
)) = v
5
will be put onto
S(q
3
). But before that, v
3
is popped out since
RightPos(v
3
) < LeftPos(v
3
) (see line 7 in
TwigStack.) The stacks are changed as shown
in Figure 5(d), which shows another two
paths matching P
1
: v
5
v
4
v
1
and v
5
v
2
v
1
. They are represented in the compact
form.
Step 7: By the seventh iteration, q
4
is selected for the
second time and next(L(q
4
)) = v
5
is pushed
onto S(q
4
). Before this operation, v
3
is first
popped out. The new stacks are shown in
Figure 5(e), from which we will get two new
matching paths (for P
2
): v
5
v
4
v
1
and v
5
v
2
v
1
.
Step 8: By the eighth iteration, q
5
is chosen and
next(L(q
5
)) = v
2
is put on the top of S(q
5
) as
shown in Figure 9(f). It shows the first
matching path for P
3
: v
2
v
1
.
Step 9: By the ninth iteration, q
5
is chosen once
again and next(L(q
5
)) = v
4
is put on the top of
S(q
5
) as shown in Figure 5(g). It shows the
second matching path for P
3
: v
4
v
1
.
Step 10:By the ninth iteration, q
5
is chosen for the
third time and next(L(q
5
)) = v
6
is put on the
top of S(q
5
) as shown in Figure 5(h). It shows
the second matching path for P
3
: v
6
v
1
. We
notice that in this step, q
2
will not be selected
although L(q
5
) = {v
6
} is not empty. It is
because both L(q
3
) and L(q
4
) are empty and
the condition (i) in the previous section
cannot be satisfied.
The time complexity of the algorithm can be
analyzed as follows.
Let n
i
be the size of L(q
i
). Then, the main while-
loop in TwigStack will be iterated
=
m
i
i
n
1
times since
the termination condition of this while-loop is when
all the elements in all L(q
leaf
)’s are exhausted. In
each iteration, the procedure getNext will be invoked
and all the nodes in the query tree will be accessed.
Let λ
ijk
be the number of elements in L(q
k
) checked
when node q
k
is visited during the (i, j)-th execution
of getNext. Then, the worst-case cost is bounded by
O(
()
∑∑
== =
+
m
i
n
j
m
k
ijk
i
11 1
1
λ
)
= O(
∑∑
== =
m
i
n
j
m
k
i
11 1
1 ) + O(
∑∑
== =
m
i
n
j
m
k
ijk
i
11 1
λ
)
= O(m
2
n) + O(mn) = O(m
2
n).
Here we should remark that O(
∑∑
== =
m
i
n
j
m
k
ijk
i
11 1
λ
)
cannot be larger than mn since at most mn elements
may be pushed on to the stacks.
Applying the above method to another algorithm
TwigStackXB in (Bruno and et al, 2002), which is an
extension of TwigStack for general cases, we get the
same time complexity.
3 REMOVING REDUNDANCY
FORM TWIGSTACK
Now we begin to discuss how TwigStack can be
improved. As with TwigStack, we will associate
each node q
i
in q with a data stream L(q
i
), but with
the following conditions satisfied:
(i) For each v L(q
i
), v matches the predicate at q
i
.
(ii) Let
1
i
q , ...,
k
i
q be the children of q
i
. v has a
descendant v matching
j
i
q for j {1, ..., k}.
(iii)Each
j
i
q recursively satisfies (ii).
Obviously, these three conditions correspond to
the two properties (i) and (ii) given in the previous
section, for any node going onto a stack. Nothing is
new. However, not like getNext in TwigStack, which
chooses nodes from q to handle and each time finds
a next v in T to be put in some stack (by multiple
S(q
5
) S(q
4
) S(q
3
) S(q
2
) S(q
1
)
v
3
v
2
v
1
v
3
v
3
v
2
v
1
S(q
5
) S(q
4
) S(q
3
) S(q
2
) S(q
1
)
(a) (b)
(c)
(d)
(e) (f)
v
3
v
3
v
4
v
2
v
1
v
5
v
3
v
4
v
2
v
1
v
5
v
5
v
4
v
2
v
1
v
5
v
5
v
4
v
2
v
1
v
2
(g) (h)
Figure 5: Sample trace.
v
5
v
5
v
4
v
2
v
1
v
4
v
2
v
5
v
5
v
4
v
2
v
1
v
6
STACK ENCODING REVISITED
9
executions), we generate all L(q
i
)’s in one scan,
which enables us to avoid a great number of
repeated accesses to query nodes.
In the following, we will use T’ to represent a
subtree of T, which contains only those nodes
matching some node in q.
We will maintain two m × n (m = |q|, n = |T’|)
matrices, defined as below.
1. The nodes in both q and T’ are numbered in
postorder, and the nodes v are then referred to
by their postorder numbers, denoted as post(v).
2. In the first matrix, each entry c
ij
(i {1, ..., m},
j {1, ..., n}) has value 0 or 1. If c
ij
= 1, it
indicates that i L(j) and for each child of i, j
has a descendant satisfying the predicate at it.
Otherwise, c
ij
= 0. This matrix is denoted by
c(q, T’).
3. In the second matrix, each entry d
ij
(i {1, ...,
m}, j {1, ..., n}) is defined as follows. If j has
a descendant j’ such that c
ij
= 1, then d
ij
= 1;
otherwise d
ij
= 0. This matrix is denoted by d(q,
T’). In addition, a node itself is considered to be
one of its ancestors.
These two matrices can be established by using
an algorithm called matrixGeneration(T’, q),
presented below.
Initially, c
ij
= 0 and d
ij
= 0 for all i and j. During
the execution of the algorithm, the values of c
ij
’s will
be changed according to (2) and (3) described
above; and d
ij
’s will be changed to record whether a
node j in T’ has a descendant j’ that matches a
certain node i in q.
Algorithm matrixGeneration(T’, q)
Input: tree T’ (with nodes 1, ..., n) and tree q (with
nodes 1, ..., m)
Output: c(q, T’) with values created.
begin
1. for u := 1, ..., m do {
2. for v := 1, ..., n do
3. {if v satisfies the predicate at u then
4. let u
1
, ..., u
k
be the children of u;
5. if
vu
d
1
...
vu
k
d = 1 then c
uv
1;
6. }
7. let v
1
, v
2
, ..., v
h
be the nodes such that
pu
v
c = 1 (1
p
h);
8. let {w
1
, ..., w
r
} be a set such that each node in it
is an ancestor of some v
p
(1 p h). Set
l
uw
d = 1
for each w
l
(1 l r).
9. }
end
To see how the above algorithm works, we
should first notice that both T’ and q are both
postorder-numbered. Therefore, the algorithm
proceeds in a bottom-up way (see line 1 and 2). For
any node u in q and any node v in T’, if v satisfies
the predicate at u, we will check each child u
i
of u to
see whether there exists a descendant of v that
matches u
i
(see line 5). If it is the case, c
uv
will be set
to 1.
In line 7 and 8, we change d
ij
’s according to the
newly changed c
ij
’s.
Example 2. As an example, consider the trees T and
q shown in Figure 3 and 4 once again. Since each
node in T matches a node in q, we have T’ = T. In
addition, the nodes of T and q are postorder
numbered as shown in Figure 6(a) and (b),
respectively.
When we apply the above algorithm to these two
trees, c(q, T) and d(q, T) will be created and changed
in the way as illustrated in Figure 7, in which each
step corresponds to an execution of the outmost for-
loop.
In step 1, we show the values in c(q, T) and d(q,
T) after node 1 in q is checked against every node in
T. Since node 1 in q matches node 1 and 2 in T, c
11
and c
12
are all set to 1. Meanwhile, for all those
nodes that are an ancestor of 1 or 2 in T, the
corresponding entries in d(q, T) will be changed. So
we have all d
11
, d
12
, d
13
, d
14
, and d
16
set to 1 (see line
7 and 8).
In step 2, the algorithm generates the matrix
entries for node 2 in q, which is done in the same
way as for node 1 in q.
In step 3, node 3 in q will be checked against
every node in T, but matches only node 4 in T. Since
it is an internal node, its children will be further
checked. For this purpose, we will check both d
14
and d
24
since node 3 in q has two child nodes
postorder-numbered with 1 and 2, respectively.
Since d
14
= d
24
= 1, we set c
34
to 1. Accordingly, d
34
and d
36
are also set to 1.
In step 4, we will set c
43
, c
44
and c
45
to 1 since
node 4 in q is just a leaf node and matches node 3, 4,
and 5 in T. So d
43
, d
44
, d
45
, and d
46
will be
accordingly set to 1.
In step 5, since node 5 in q matches node 6 in T,
and both d
36
and d
46
are equal to 1 (we remark that
A v
1
B v
2
v
6
B
C v
3
v
4
B
v
5
C
Figure 6: Postorder numbering.
1
2
3
4
5
6
4
5 A q
1
B q
2
q
5
B
C q
3
q
4
C
1
2
3
WEBIST 2007 - International Conference on Web Information Systems and Technologies
10
node 3 and 4 in q are the child nodes of node 5), we
set c
56
and then d
56
to 1.
In the following discussion, we use T instead of
T’ to simplify the notation. The reader should notice
that it refers to the subtree of T containing only those
nodes that match some node in q.
Proposition 1. Algorithm matrixGeneration(T, q)
computes the values in c(q, T) and d(q, T) correctly.
Proof. We prove the proposition by induction on the
sum of the heights of T and q, h. Without loss of
generality, assume that height(T) 1 and height(q)
1.
Basic step. When h = 2, the proposition trivially
holds.
Induction hypothesis. Assume that when h = l, the
proposition holds.
Consider T = <t; T
1
, ..., T
k
> and q = <p; P
1
, ...,
P
g
> with height(T) + height(q) = l + 1, where t (p) is
the root of T (q, resp.) and T
1
, ..., T
k
(P
1
, ..., P
g
) are
the subtrees of t (p, resp.). Obviously, we have
height(T
i
) + height(q) l and height(T) + height(P
j
)
l. Therefore, in terms of the induction hypothesis,
the algorithm correctly computes the values in c(q,
T
i
) and d(q, T
i
), as well as the values in c(P
j
, T) and
d(P
j
, T) (i = 1, ..., k; j = 1, ..., g). Assume that 1, ...,
m are the postorder numbers of the nodes in q, and 1,
..., n are the postorder numbers of the nodes in T.
Then, the values for c
ij
(i = 1, ..., m - 1; j = 1, ..., n -
1) and d
ij
(i = 1, ..., m - 1; j = 0, ..., n - 2) are all
correctly generated. Now we will check c
in
and d
i(n-1)
(i = 1, ..., m), as well as c
mj
(j = 1, ..., n) and d
mj
(j =
0, ..., n - 1) to see whether they can be correctly
produced. Let i
1
, ..., i
s
be the children of i. If i
matches n, for each i
f
(1 f s),
ni
f
d ’s will be
checked. If we have
ni
d
1
...
ni
s
d = 1, we set c
in
to
1; otherwise 0 (see line 5). According to the
induction hypothesis, all such
ni
f
d ’s are correctly
generated. Therefore, c
in
(i = 1, ..., m) is correctly
created, so is d
i(n-1)
(i = 1, ..., m). A similar analysis
applies to c
mj
(j = 1, ..., n) and d
mj
(j = 0, ..., n - 1).
Proposition 2. Algorithm matrixGeneration(T, q)
requires O(nm) time and space, where n = |T| and m
= |q|.
Proof. During the whole process, against each node
u in q, all the nodes v in T is checked and for each v
all its children will be examined. Therefore, this part
of time is bounded by
O(
∑∑
==
m
u
n
v
v
d
11
) = O(
=
m
u
n
1
) = O(nm),
where d
v
represents the outdegree of node v in T.
In addition, after each u in q is checked, for all
those nodes in T, which are an ancestor of some
node that matches u, the corresponding matrix
entries in d(q, T) will be established. But this
operation needs only O(n) time if we proceeds as
follows. Each time we search T bottom-up from a
node v that matches u to find all its ancestors, we
mark each node encountered and stop whenever we
meet such a mark (made by a previous searching).
So at most O(n) nodes will be checked and the total
time of this part of operations is bounded by O(nm).
Obviously, to maintain c(q, T) and d(q, T), we
need O(nm) space.
In terms of the matrix c(q, T), it is an easy task to
create L(q
i
) for each q
i
in q as illustrated in Figure
8(a).
Figure 8(b) is the same as Figure 8(a). But in this
figure we use node names in L(q
i
) instead of their
step 1:
1 2 3 4 5 6
1
2
3
4
5
1 1 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
c(q, T):
1 2 3 4 5 6
1
2
3
4
5
1 1 1 1 0 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
d(q, T):
step 2:
1 2 3 4 5 6
1
2
3
4
5
1 1 0 0 0 0
1 1 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
c(q, T):
1 2 3 4 5 6
1
2
3
4
5
1 1 1 1 0 1
1 1 1 1 0 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
d(q, T):
step 3:
1 2 3 4 5 6
1
2
3
4
5
1 1 0 0 0 0
1 1 0 0 0 0
0 0 1 1 0 0
0 0 0 0 0 0
0 0 0 0 0 0
c(q, T):
1 2 3 4 5 6
1
2
3
4
5
1 1 1 1 0 1
1 1 1 1 0 1
0 0 1 1 0 1
0 0 0 0 0 0
0 0 0 0 0 0
d(q, T):
step 4:
1 2 3 4 5 6
1
2
3
4
5
1 1 0 0 0 0
1 1 0 0 0 0
0 0 1 1 0 0
0 0 1 1 1 0
0 0 0 0 0 0
c(q, T):
1 2 3 4 5 6
1
2
3
4
5
1 1 1 1 0 1
1 1 1 1 0 1
0 0 1 1 0 1
0 0 1 1 1 1
0 0 0 0 0 0
d(q, T):
step 5:
1 2 3 4 5 6
1
2
3
4
5
1 1 0 0 0 0
1 1 0 0 0 0
0 0 1 1 0 0
0 0 1 1 1 0
0 0 0 0 0 1
c(q, T):
1 2 3 4 5 6
1
2
3
4
5
1 1 1 1 0 1
1 1 1 1 0 1
0 0 1 1 0 0
0 0 1 1 1 1
0 0 0 0 0 1
d(q, T):
Figure 7: Sample trace.
STACK ENCODING REVISITED
11
postorder numbers. We will use the node names in
the subsequent discussion to avoid any confusion.
Concerning L(q
i
), we should pay attention to the
following:
(1) The nodes (represented by their quadruple) in
L(q
i
) are sorted by their (DocId, LeftPos) values
(not according to their postorder numbers).
(2) Each node in L(q
i
) satisfies the condition (i) and
(ii) given in 2.2.
Using such a data structure, the algorithm
TwigStack can be substantially improved. The main
idea is a depth-first searching of q. To this end, we
use a stack to control the process. Each entry in the
stack is a pair (q
i
, v
j
), where q
i
q and v
j
T.
Finally, we notice that getNext( ) will not be used
since all the values to be produced by executing
getNext( ) are pre-calculated and incorporated into
L(q
i
)’s. In addition, each node q
i
in q is associated
with its preorder number, denoted as pre(q
i
), which
will be used in the following algorithm. In Figure 9,
we show the preorder numbering of q.
Algorithm
RefinedTwigStack(q)
(*phase 1*)
1.Repeat the following until all
L(q
i
) become empty;
2. { let
pre(q
i
) be the least such that L(q
i
) is not empty;
3. push(
stack, (q
i
, next(L(q
i
)));
4. while ¬ empty(
stack) do
{(
u, v) pop(stack);
6. if (
u
is not the root) then
7.
cleanStack(S(parent(u)), LeftPos(v));
8. if (
u
is the root of q) ∨ ¬ empty(S(parent(u)))
9. then
10
. {cleanStack(S(u), LeftPos(v));
11. push(
S(u), v, pointer to top(S(parent(u)));
advance(L(u));
12. if (
u is a leaf node) then
13. { output all the matching paths (stored in stacks)
in the compact form; pop(
S(u));}
14. }
15. else advance(L(u));
16. let
q
1
, ..., q
l
be the children of u;
17. for
j = l to 1 do
18. {while
next(L(q
j
) is not a descendant of v do
adavance(L(q
j
);
19. push(stack, (
q
j
, next(L(q
j
)));}
20.}}
(*phase 2*)
21.
mergeAllPathSolutions();
Example 3. Continue with Example 2.
By using our method, we will first generate
L(q
i
) for
each
q
i
as shown in Figure 8(b). Then, we will search the
twig pattern
q as follows.
Step 0: At the very beginning, the node
q
1
has the least
LeftPos value and
L(q
1
) is not empty. Push (q
1
,
v
1
) into stack.
Step 1 - 3: In the following while-loop, the whole query
tree will be traversed in depth-first fashion.
When we meet
q
3
, the stacks will be changed as
shown in Figure 10(a). Since
q
3
is a leaf node, we
get the first matching path (for
P
1
): v
3
v
2
v
1
.
Step 4: When we meet
q
4
, another leaf node, the stacks
will be changed as shown in Figure 10(b). We get
the second matching path (for
P
2
): v
3
v
2
v
1
.
Step 5:
q
5
is visited. The stacks are changed as shown in
Figure 10(c). Since
q
4
is a leaf node, we get the
third matching path (for
P
3
): v
2
v
1
.
Step 6: Now
stack (used to control the searching of q) is
empty. We will try to find another node (in
q)
with the least LeftPos value and a non-empty list.
It is q
2
. In L(q
2
), we have one element left: {v
4
}.
Push (
q
2
, v
4
) into stack.
q
2
is visited once again. The stacks will be
changed as shown in Figure 10(d).
Step 7:
q
3
is visited once again. The stacks will be
changed as shown in Figure 10(e). (We notice that
before
v
5
is pushed onto S(q
3
), v
3
is popped out.)
From this, we get another two paths matching
P
1
:
v
5
v
4
v
1
and v
5
v
2
v
1
. They are
represented in the compact form.
Step 8: q
4
is visited once again. The stacks will be
changed as shown in Figure 10(f). (We notice that
before
v
5
is pushed onto S(q
4
), v
3
is popped out.)
From this, we get another two paths matching P
2
:
v
5
v
4
v
1
and v
5
v
2
v
1
. They are
represented in the compact form.
Step 9:
Stack becomes empty once again. This time q
5
is
chosen and in
L(q
5
) we still have two elements:
{
v
4
, v
6
}. Push (q
5
, v
4
) into stack.
3
{3, 4}
{6}
{1, 2}
4
5 A q
1
B q
2
q
5
B
C q
3
q
4
C
1
2
{1, 2}
{3, 4, 5}
{v
2
, v
4
, v
6
}
{v
2
, v
4
}
{v
3
, v
5
} {v
3
, v
5
}
{v
1
}
A q
1
B q
2
q
5
B
C q
3
q
4
C
Figure 8: Illustration for L(q
i
)’s.
5
1
A q
1
B q
2
q
5
B
C q
3
q
4
C
3
4
2
Figure 9: Preorder numbering.
WEBIST 2007 - International Conference on Web Information Systems and Technologies
12
q
5
is visited for the second time. The new status
of the stacks is shown in Figure 10(g). From this,
we get the second path matching
P
3
: v
4
v
1
.
Step 10: Stack becomes empty for the third time, and q
5
is
chosen once again since we still have an element
in
L(q
5
): {v
6
}. Push (q
5
, v
6
) into stack.
q
5
is visited for the second time. The new status
of the stacks is shown in Figure 10(h) (before
v
6
is pushed onto
S(q
5
), v
2
and v
4
are popped out.)
From this, we get the third path matching P
3
: v
6
v
1
.
From the above example, we can see that our
algorithm generates the same paths as TwigStack
although the order of such paths’s generation is
different. More importantly, in our algorithm
getNext is not used, which is replaced with
matrixGeneration that is performed only once.
In the following, we prove the algorithm’s
correctness and analyze its time complexity.
Proposition 3. Let T and q be the document and
query tree, respectively. RefindTwigStack generates
all the matching paths (in T) for every root-to-leaf
path in q.
Proof. In order to prove the proposition, we need to
explain
(i) Every path (in T) found by RefinedTwigStack
must match a root-to-leaf path in q.
(ii) Any path (in T), if it matches a root-to-leaf path
in q and each of its nodes satisfies the condition
(i) and (ii) given in 2.2, must be found by
RefinedTwigStack.
Proof of (i). To see that (i) holds, we notice the
following two properties of the algorithm:
(1) Any node v in any L(q
i
) satisfies the condition (i)
and (ii) given in 2.2.
(2) Any node v put in a S(q
i
) satisfies the condition
below:
Let q’ be the parent of q
i
. The node on the top of
S(q’) must be an ancestor of v.
So each time when we meet a leaf node q’’, all
the paths found must match the path from the root of
q to q’’.
Proof of (ii). Let P = q
1
q
2
...
q
m
be a path in
q. Let {t
1
, t
2
, ..., t
m
} be a set of nodes lying on a path
in T, which makes up a matching path of P with
each t
i
satisfying the condition (i) and (ii) given in
2.2. But this matching path has not been found by
RefinedTwigStack. Then, there exists a k such that
all t
j
(k
j
m) do not have a chance to be put onto
the corresponding stacks. First, we notice that k > 1
since t
1
must appear in L(q
1
) and will be definitely
put onto S(q
1
) during the computation process. Now
we consider t
k
with k > 1, which does not have a
chance to be put onto S(q
k
). In terms of line 8 in
RefinedTwigStack, we must have S(q
k+1
) = {} when
we try to put t
k
onto S(q
k+1
). This implies that t
k-1
must have been popped out at a earlier time point
when we try to put another node t’ onto S(q
k
) with
RightPos(t
k-1
) < LeftPos(t’). But we have obviously
LeftPos(t’) < LeftPos(t
k
). So we have RightPos(t
k-1
)
< LeftPos(t
k
), which contradicts fact that t
k-1
is an
ancestor of t
k
. From this analysis, we know that t
k
has a chance to be put onto S(q
k
). The same analysis
applies to t
k+1
, ..., t
m
. This completes the proof.
The time complexity of RefinedTwigStack is
easy to analyze. In the whole process, each node v in
a L(q
i
) is accessed only once. So the total cost is
bounded by
O(
()
=
m
i
i
qL
1
) = O(mn)
4 GENERAL CASES
The method discussed in Section 3 can be easily
extended to handle general cases that a query tree
contains both c-edges and d-edges. For this purpose,
we define a third matrix p(q, T) as follows.
An entry p
ij
= 1 indicates that there exists some
child k of j, which ‘matches’ i, i.e., c
ik
= 1;
otherwise, p
ij
= 0.
S(q
5
) S(q
4
) S(q
3
) S(q
2
) S(q
1
)
v
3
<
v
2
v
1
v
3
v
3
v
2
v
1
S(q
5
) S(q
4
) S(q
3
) S(q
2
) S(q
1
)
(a) (b)
(c) (d)
(e) (f)
v
3
v
3
v
2
v
1
v
2
v
3
v
3
v
4
v
2
v
1
v
2
v
5
v
3
v
4
v
2
v
1
v
2
v
5
v
5
v
4
v
2
v
1
v
2
(g) (h)
Figure 10: Sample trace.
v
5
v
5
v
4
v
2
v
1
v
4
v
2
v
5
v
5
v
4
v
2
v
1
<
v
6
STACK ENCODING REVISITED
13
Accordingly, the algorithm matrixGeneration
should be slightly changed so that the manipulation
of p(q, T) is involved.
Algorithm generalMatrixGeneration(T, q)
Input: tree
T (with nodes 1, ..., n) and tree q (with nodes 1,
...,
m)
Output:
c(q, T) with values created.
begin
1. for u := 1, ..., m do {
2. for
v := 1, ..., n do
3. {if
v satisfies the predicate at u then
4. let
u
1
, ..., u
k
be the c-children of u;
5. let
u
1
’, ..., u
g
’ be the d-children of u;
6. if
vu
p
1
...
vu
k
p = 1 and
vu
d
'
1
...
vu
k
d
'
= 1
7. then c
uv
1;}
8. let v
1
, v
2
, ..., v
h
be the nodes such that
p
uv
c = 1 (1 p
h);
9. let {
w
1
, ..., w
r
} be a set such that each node in it is an
ancestor of some
v
p
(1 p h). Set
l
uw
d
= 1 for each w
l
(1
l r).
10. let {
t
1
, ..., t
s
} be a set such that each node in it is a
parent of some
v
p
(1 p h). Set
l
ut
d = 1 for each t
l
(1
l s).
11. }
end
Since each node u in q may have both c- and d-
children, each time when checking it against a node
v in T we need to check the corresponding entries in
both d(q, T) and p(q, T) (see line 6). In addition,
besides the computation of new values for some
entries in d(q, T) in each step, we need also to
compute new values for the corresponding entries in
p(q, T) (see line 10).
5 CONCLUSION
In this paper, a new method is discussed, which
substantially improves the method proposed in
(Bruno and et al, 2002) for doing twig joins that are
identified as a core operation for query evaluation in
XML databases. Concretely, our method improves
the algorithm TwigStack and TwigStackXB presented
in (Bruno and et al, 2002) from O(m
2
n) to O(mn),
where m and n are the sizes of the query tree and
document tree, respectively. In addition, a system
implementation and experiments are reported, which
shows that our method uniformly outperforms
TwigStack, completely conforming to the conducted
theoretic analysis.
ACKNOWLEDGEMENTS
The author is supported by NSERC 239074-01
(242523) (Natural Sciences and Engineering Council
of Canada).
REFERENCES
S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, D.
Srivastava, and Y. Wu (2002). Structureal Joins:
Aprimitive for efficient XML query pattern matching,
in
Proc. of IEEE Int. Conf. on Data Engineering.
N. Bruno, N. Koudas, and D. Srivastava (2002). Holistic
Twig Hoins: Optimal XML Pattern Matching, in
Proc.
SIGMOD Int. Conf. on Management of Data
,
Madison, Wisconsin, June 2002, pp. 310-321.
D. D. Chamberlin, J.Clark, D. Florescu and M. Stefanescu
(1999). XQuery1.0: An XML Query Language,
http://www.w3.org/ TR/query-datamodel/.
D. D. Chamberlin, J. Robie and D. Florescu (2000). Quilt:
An XML Query Language for Heterogeneous Data
Sources,
WebDB 2000.
A. Deutch, M. Fernandex, D. Florescu, A. Levy, D.Suciu
(1999). A Query Language for XML, WWW'99.
D. Florescu and D. Kossman, Storing and Querying
(1999). XML data using an RDMBS, IEEE Data
Engineering Bulletin
, 22(3):27-34.
J. McHugh, J. Widom (1999) Query optimization for
XML, in Proc. of VLDB.
J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D.J.
Dewitt, and J.F. Naughton (1999). Relational
databases for querying XML documents: Limitations
and opportunities, in Proc. of VLDB.
The Tukwila System (1999), available from U. of
Washington http:/ /data.cs.washington.edu/integration
/tukwila/.
The Niagara System (2000), available from U. of
Wisconsin http:// www.cs.wisc.edu/niagara/.
H. Wang, S. Park, W. Fan, and P.S. Yu (2003). ViST: A
Dynamic Index Method for Querying XML Data by
Tree Structures,
SIGMOD Int. Conf. on Management
of Data
, San Diego, CA., June 2003.
H. Wang and X. Meng (2005) On the Sequencing of Tree
Structures for XML Indexing, in
Proc. Conf. Data
Engineering
, Tokyo, Japan, April, 2005, pp. 372-385.
World Wide Web Consortium (1999). XML Path
Language (XPath), W3C Recommendation, Version
1.0, November 1999. See
http://www.w3.org/TR/xpath.
World Wide Web Consortium (2001) XQuery 1.0: An
XML Query Language, W3C Recommendation,
Version 1.0, Dec. 2001. See http://www.w3.org/TR/
xquery.
C. Zhang, J. Naughton, D. Dewitt (2001) Q. Luo, and G.
Lohman, on Supporting containment queries in
relational database management systems, in
Proc. of
ACM SIGMOD
, 2001.
WEBIST 2007 - International Conference on Web Information Systems and Technologies
14