only for the queries that contain only d-edges. In the
case that a query contains both c-edges and d-edges,
some useless path matchings have to be performed.
In addition, in the worst case, TwigStack needs
O(|D|^|Q|) time for doing the merge joins, as shown by
Chen et al. (see page 287 in (Chen et al., 2006)),
where D is a largest data stream associated with a
node q in Q, which contains all the document nodes
that match q. Since then, several methods that
improve TwigStack in some way have been reported.
For instance, iTwigJoin (Chen et al., 2005)
exploits different data partition possibilities, while
TJFast (Lu et al., 2005) accesses only leaf nodes
of document trees by using Dewey IDs. But both of
them still need to do some useless matchings, as
shown by the theoretical analysis in (Choi et al.,
2003). Twig²Stack (Chen et al., 2006) is
the most recent method that improves TwigStack. By
this method, the stack encoding is replaced with the
hierarchical stack encoding, by which each stack
associated with a query node contains an ordered
sequence of stack trees. In this way, the path joins
are replaced by the so-called result enumeration. In
(Chen et al., 2006), it is claimed that Twig²Stack
needs only O(|D|⋅|Q| + |subTwigResults|) time. But a
careful analysis shows that the time complexity of
the method is actually bounded by
O(|D|⋅|Q|² + |subTwigResults|). This is because each
time a node is inserted into a stack associated with
a node in Q, not only must the position of this node
in a tree within that stack be determined, but a link
from this node to a node in some other stack must
also be constructed, which requires searching all the
other stacks. The number of these stacks is |Q| (see
Fig. 4 in (Chen et al., 2006) for the working
process).
The method discussed in (Aghili et al., 2006)
incorporates a binary labeling as a pre-processing
filtration step to reduce the search space. This
method is effective only when selective
keywords at leaf nodes are specified in queries.
Finally, we point out that the bottom-up tree
matching was first proposed in (Hoffmann and
O’Donnell, 1982). But it concerns a very strict tree
matching, by which the matching of an edge to a
path is not allowed. In (Gottlob et al., 2005), an
XPath expression is transformed into a parse tree and
then evaluated bottom-up or top-down. Both the
bottom-up and top-down strategies need O(|T|⁵⋅|Q|²)
time and O(|T|⁴⋅|Q|²) space. In (Miklau and Suciu, 2004),
an algorithm for tree homomorphism is discussed,
which is able to check whether a tree contains
another and returns only a Boolean answer. In contrast, our
algorithms report all the subtrees that match a given
twig pattern query.
In comparison with the above methods, our
methods have the following advantages:
- Our first algorithm needs less time than
Twig²Stack. Concretely, our algorithm runs in
O(|D|⋅|Q|) time.
- Neither matching paths nor tree stacks are
generated. Therefore, neither the costly path joins
(Aghili et al., 2006) nor the result enumeration,
a join-like operation (Chen et al., 2006), is needed.
- The runtime memory usage is minimal. During
the process, our algorithm (dynamically) transforms
the data streams into a tree structure
T with all the matching patterns recognized. To
represent the results, each node v in T is
associated with a set of nodes in Q (denoted as
M(v)) such that for each q ∈ M(v) the subtree
rooted at q can be embedded in the subtree rooted
at v. If M(v) contains the root of Q, it indicates an
answer, and v is stored in a global variable (or
the subtree rooted at v is reported as an answer).
M(v) is removed once M of v's parent is
established, since M(v) will not be accessed any
more.
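To make the role of M(v) concrete, the following Python sketch computes these sets bottom-up over a small in-memory document tree. It is only an illustration under assumed semantics (a c-edge must be matched by a parent-child pair, a d-edge by an ancestor-descendant pair) and is not the algorithm developed in Section 3. The names TNode, QNode and twig_answers are hypothetical, and the sketch additionally keeps, for each node, the set of query nodes matched strictly below it, which the description above does not mention.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity hashing, so nodes can live in sets
class QNode:
    label: str
    # Each entry is (edge_type, child); edge_type is "c" (child) or "d" (descendant).
    children: list = field(default_factory=list)


@dataclass(eq=False)
class TNode:
    label: str
    children: list = field(default_factory=list)


def query_nodes(q):
    """Collect all nodes of the query tree rooted at q."""
    out = [q]
    for _, child in q.children:
        out.extend(query_nodes(child))
    return out


def twig_answers(t_root, q_root):
    """Return every document node v whose M(v) contains the query root."""
    all_q = query_nodes(q_root)
    answers = []

    def visit(v):
        child_m = set()  # query nodes matched at some child of v
        below = set()    # query nodes matched at some proper descendant of v
        for c in v.children:
            m_c, below_c = visit(c)
            child_m |= m_c
            below |= m_c | below_c
        # q belongs to M(v) if the labels agree and every child of q is
        # matched at a child of v (c-edge) or anywhere below v (d-edge).
        m_v = set()
        for q in all_q:
            if q.label != v.label:
                continue
            if all((qc in child_m) if edge == "c" else (qc in below)
                   for edge, qc in q.children):
                m_v.add(q)
        if q_root in m_v:
            answers.append(v)
        # M of the children is dropped here; only m_v and below travel upward.
        return m_v, below

    visit(t_root)
    return answers
```

For example, on a document a(b(c)), the query a[/b][//c] yields the node a, whereas a[/c] yields no answer because c is not a child of a.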
3 ALGORITHM
In this section, we discuss our algorithm according
to Definition 1. The main idea of this algorithm is to
search both T and Q bottom-up and check the
subtree embedding by generating dynamic data
structures. In the process, a tree labeling technique is
used to facilitate the recognition of relationships
between nodes. Therefore, in the following, we
first present the tree labeling in 3.1. Then, in 3.2, we
discuss the main algorithm. In 3.3, we prove the
correctness of the algorithm and analyze its
computational complexity.
3.1 Tree Labeling
Before we give our main algorithm, we first restate
how to label a tree to speed up the recognition of the
relationships among the nodes of trees.
Consider a tree T. By traversing T in preorder,
each node v will obtain a number (it can be an integer
or a real number) pre(v) to record the order in which
the nodes of the tree are visited. In a similar way, by
traversing T in postorder, each node v will get
another number post(v). These two numbers can be
used to characterize the ancestor-descendant
relationships as follows.
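A minimal Python sketch of this labeling is given here for illustration. It assumes the tree is stored as a hypothetical dictionary mapping each node name to the list of its children, and it uses the standard characterization that u is a proper ancestor of v exactly when pre(u) < pre(v) and post(u) > post(v). The function names label_tree and is_ancestor are illustrative only.

```python
def label_tree(children, root):
    """Assign preorder and postorder numbers (both starting at 0) to every node.

    `children` maps a node name to the ordered list of its child names;
    leaves may be omitted from the mapping.
    """
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre)            # number v when it is first reached
        for c in children.get(v, []):
            visit(c)
        post[v] = len(post)          # number v after all its descendants

    visit(root)
    return pre, post


def is_ancestor(u, v, pre, post):
    """True if u is a proper ancestor of v under the two labelings."""
    return pre[u] < pre[v] and post[u] > post[v]


# Example tree: a has children b and e; b has children c and d.
tree = {"a": ["b", "e"], "b": ["c", "d"]}
pre, post = label_tree(tree, "a")
print(is_ancestor("a", "d", pre, post))  # a is above d
print(is_ancestor("b", "e", pre, post))  # b and e are siblings
```

With the two numbers stored per node, each ancestor-descendant test costs two integer comparisons and needs no traversal of the tree.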
ICEIS 2007 - International Conference on Enterprise Information Systems