them, for example, all the strategies proposed in
(Al-Khalifa et al., 2002; Florescu, Kossman, 1999;
McHugh, Widom, 1999; Shanmugasundaram et al.,
1999; Tukwila System, 2000; Niagara System,
2000; Zhang et al., 2001), typically decompose a
twig pattern into a set of binary relationships
between pairs of nodes, such as parent-child and
ancestor-descendant relations. As a result, the
intermediate relations tend to be very large, even
when the input and the final result are of much more
manageable size. Another kind of strategy is based on
path decomposition, such as those discussed in
(Bruno et al., 2002; Wang et al., 2003; Wang
and Meng, 2005). In (Wang et al., 2003; Wang
and Meng, 2005), all the possible paths of an XML
document are explicitly stored and indexed using
B+-trees as well as trie structures. In (Bruno et
al., 2002), a document is also decomposed, but
dynamically, depending on the given queries. This
method is of special interest since the decomposed
paths are not simply stored but compressed by using
the so-called stack encoding, which reduces the number
of intermediate matching paths dramatically.
Although the idea of compressing intermediate
results is very attractive, the process suggested in
(Bruno et al., 2002) for producing compact paths
is not very efficient and can be substantially improved.
In this paper, we analyze the method described in
(Bruno et al., 2002) and show that the matching
paths can be produced in a more efficient way.
In particular, two new algorithms are presented,
which improve the two main procedures of this
method, TwigStack and TwigStackXB, by one order
of magnitude. In (Bruno et al., 2002), TwigStack
is used to generate matching paths for queries
containing only d-edges, while TwigStackXB is used for
queries containing both c- and d-edges.
The remainder of the paper is organized as
follows. In Section 2, we review the concept of stack
encoding and the algorithm TwigStack presented in
(Bruno et al., 2002), which is necessary for the
subsequent discussion. In Section 3, we propose a
new algorithm RefinedTwigStack to do the same task
as TwigStack, but using much less time. In Section
4, we extend RefinedTwigStack to general cases.
Finally, a short conclusion is set forth in Section 5.
2 ON THE TWIGSTACK ALGORITHM
In this section, we review the main procedure
TwigStack given in (Bruno et al., 2002), which is
used to evaluate a special kind of query that
contains only d-edges. However, by using a variant
of the B-tree, called the XB-tree, TwigStack can be
easily extended to general cases with both c-edges
and d-edges involved.
In the following, we first review the stack
encoding in 2.1. Then, we describe the TwigStack
algorithm (Bruno et al., 2002) and analyze its
time complexity in 2.2. No theoretical time analysis
is given in (Bruno et al., 2002).
2.1 On the Stack Encoding
Let T be a document tree. Let
$q = q_1 \Rightarrow q_2 \Rightarrow \dots \Rightarrow q_{m-1} \Rightarrow q_m$
be a query path containing only d-edges. We
associate each $q_i$ ($i = 1, \dots, m$) with a stack,
denoted $S(q_i)$, in which each entry is a pair $(v, p)$,
with $v$ being a node in T and $p$ being a pointer to an
entry in $S(\mathit{parent}(q_i))$, where $\mathit{parent}(q_i)$
represents the parent of $q_i$.
At every point during the computation, all the stacks
$S(q_i)$ have the following properties:
(i) The entries in $S(q_i)$ (from bottom to top) are
guaranteed to lie on a root-to-leaf path in T.
(ii) The set of stacks contains a compact encoding of
matching paths.
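Property (i) can be illustrated with a small sketch. It is not the procedure of (Bruno et al., 2002) but a minimal illustration under two assumptions: nodes are supplied in document order, and each node carries a hypothetical (left, right) interval such that u is an ancestor of v exactly when u.left < v.left and v.right < u.right. The names Entry, clean, push and on_one_path are introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    node: str    # a node of T
    left: int    # assumed start position of the node's interval
    right: int   # assumed end position of the node's interval
    ptr: int     # index of an entry in the parent stack, -1 if there is none


def clean(stack: List[Entry], left: int) -> None:
    """Pop entries that cannot be ancestors of a node starting at `left`."""
    while stack and stack[-1].right < left:
        stack.pop()


def push(stacks: List[List[Entry]], i: int, node: str, left: int, right: int) -> None:
    """Push a node onto stack i; nodes must be supplied in document order."""
    for s in stacks:
        clean(s, left)
    ptr = len(stacks[i - 1]) - 1 if i > 0 else -1  # current top of the parent stack
    stacks[i].append(Entry(node, left, right, ptr))


def on_one_path(stack: List[Entry]) -> bool:
    """Check property (i): every entry is an ancestor of the entry above it."""
    return all(a.left < b.left and b.right < a.right
               for a, b in zip(stack, stack[1:]))


# A tiny hypothetical tree: x1 spans (1, 10) and has two children,
# x2 spanning (2, 5) and x3 spanning (6, 9); y spanning (7, 8) is a child of x3.
stacks: List[List[Entry]] = [[], []]
push(stacks, 0, "x1", 1, 10)
push(stacks, 0, "x2", 2, 5)
push(stacks, 0, "x3", 6, 9)   # x2 is popped first: it is not an ancestor of x3
push(stacks, 1, "y", 7, 8)    # y points to x3, the top of the parent stack
```

Because every entry that ends before the incoming node starts is popped, the entries left on a stack are all ancestors of that node and therefore lie on one root-to-leaf path.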
As an example, consider T and q shown in
Figure 2(a).
T has four subpaths that match q, as
shown in Figure 2(b). By using the stack encoding,
they can be stored as shown in Figure 2(c),
using much less space.
First, we notice that the matching path
$v_3 \rightarrow v_4 \rightarrow v_5 \rightarrow v_6$ is encoded since $v_6$
points to $v_5$, $v_5$ to $v_4$, and $v_4$ to $v_3$. Also, the
matching path $v_1 \rightarrow v_4 \rightarrow v_5 \rightarrow v_6$ is
encoded since $v_1$ is below $v_3$ on the stack $S(q_1)$.
For the same reason, $v_1 \rightarrow v_2 \rightarrow v_5 \rightarrow v_6$
is a matching path since $v_2$ is below $v_4$ on the stack
$S(q_2)$ and has a pointer to $v_1$. Finally, since $v_3$ is
below $v_5$ on the stack $S(q_3)$ and has a pointer to $v_2$,
$v_1 \rightarrow v_2 \rightarrow v_3 \rightarrow v_6$ is also an answer.
However, the nodes $v_3$, $v_2$, $v_5$, $v_6$ do not make up a
matching path since $v_3$ is above $v_1$ on $S(q_1)$, to
which $v_2$ points.
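The decoding rule used in this example can be sketched in a few lines. The sketch is only an illustration, not the output procedure of (Bruno et al., 2002): an entry may be combined with the entry its pointer refers to, or with any entry below that one on the parent stack. The stack contents in FIG2C_STACKS are reconstructed from the description above, since Figure 2(c) itself is not reproduced here, and the names decode and FIG2C_STACKS are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

# A stack is a list ordered from bottom (index 0) to top. An entry is
# (node, ptr), where ptr is the index of an entry in the parent stack
# (None for the stack of the first query node).
Entry = Tuple[str, Optional[int]]


def decode(stacks: List[List[Entry]]) -> List[Tuple[str, ...]]:
    """Enumerate every matching path encoded by the stacks."""

    def expand(level: int, max_idx: int) -> Iterator[List[str]]:
        # Use each entry of stacks[level] at position max_idx or below.
        for idx in range(max_idx + 1):
            node, ptr = stacks[level][idx]
            if level == 0:
                yield [node]
            else:
                for prefix in expand(level - 1, ptr):
                    yield prefix + [node]

    last = len(stacks) - 1
    return [tuple(p) for p in expand(last, len(stacks[last]) - 1)]


# Stacks as described in the text (an assumption about Figure 2(c)).
FIG2C_STACKS: List[List[Entry]] = [
    [("v1", None), ("v3", None)],  # S(q1): v1 is below v3
    [("v2", 0), ("v4", 1)],        # S(q2): v2 -> v1, v4 -> v3
    [("v3", 0), ("v5", 1)],        # S(q3): v3 -> v2, v5 -> v4
    [("v6", 1)],                   # S(q4): v6 -> v5
]

for path in decode(FIG2C_STACKS):
    print(" -> ".join(path))
```

Under these assumptions the sketch enumerates exactly the four matching paths listed above, and the combination $v_3$, $v_2$, $v_5$, $v_6$ is never produced because $v_2$ points to $v_1$, which lies below $v_3$.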
2.2 Description of TwigStack
Now we describe the algorithm TwigStack, which
stores the intermediate results using the stack
encoding, and analyze its time complexity. For this
purpose, we first show a tree encoding method
(Zhang et al., 2001) and define some notation
that is used in the description of TwigStack.
Let T be a document tree. We associate each
node v in T with a quadruple (DocId, LeftPos,
RightPos, LevelNum), denoted as α(v), where DocId
WEBIST 2007 - International Conference on Web Information Systems and Technologies