already both in (Milo, T. and Suciu, D., 1998) and
(Kaushik, R. et al., 2002). However, neither paper
elaborates on the algorithmic implications of this
idea, particularly in the context of the A(k)-Index.
This algorithmic elaboration is the main contribution of
this paper (see also section 2).
The rest of the paper is organized as follows: We
review existing work in section 2, then we elaborate
on the data model and other related issues in section
3 (preliminaries). Afterwards we describe
A(k)-Simplified and A(k)-Relevant in section 4, and then
we give a way to remove false drops in section 5.
We then conclude with experimental results (section
6) and further research (section 7).
2 RELATED WORK
In semi-structured databases, indexes are structural
summaries built to reduce query evaluation time.
Some indexes are approximate, which means that
for some queries verification is needed in order to
remove false drops (by false drops we mean false
positives: some of the results returned by the index
are incorrect). Examples of such
indexes are: Approximate DataGuide (Goldman, R.
and Widom, J., 1999), which is based on Strong
DataGuide in which “similar” nodes are merged;
APEX (Chung, C. W. et al., 2002), an index that
adjusts its structure using data mining strategies in
order to efficiently evaluate queries with common
suffixes; D(k)-Index (Qun, C., Lim A. and Ong, K.
W., 2003), which uses different local bisimilarity
values in order to emphasize important tags; and
A(k)-Index. In our work we are interested in the
A(k)-Index and the local bisimilarity on which it is
based. We say that two nodes are bisimilar if their
sets of incoming label paths are similar. Every
node in a bisimilarity-based index is an equivalence
class of the bisimilarity relation. The first
bisimilarity-based index was the 1-Index, which is safe
and accurate but tends to be very large, especially with
“complex” documents. A(k)-Index offers a way to reduce
the index size by relaxing the requirement that
bisimilarity consider all incoming label paths, and
defining a new relation, local bisimilarity. Two nodes
are k-local bisimilar if their sets of incoming label
paths of length at most k are similar. Local
bisimilarity is based on the assumption that long
queries are rare, and therefore we can build a
structure that gives accurate results for short queries,
while for long queries the results are approximate
and sometimes need verification (i.e. A(k)-Index is
safe, but for long queries not necessarily accurate).
Experiments show that A(k)-Index is much
smaller than 1-Index.
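The k-local bisimilarity described above can be illustrated with a small partition-refinement sketch. It assumes the usual inductive formulation: two nodes are 0-bisimilar if they have the same label, and (i+1)-bisimilar if they are i-bisimilar and their parents fall into the same i-bisimilarity classes. The function name, the dictionary-based graph encoding and the toy graph are hypothetical and serve only as an illustration; this is not the index construction algorithm of any of the cited papers.

```python
from collections import defaultdict


def k_bisimulation_classes(labels, parents, k):
    """Partition nodes into k-local-bisimilarity classes (illustrative sketch).

    labels  -- dict: node id -> label
    parents -- dict: node id -> iterable of parent node ids
    k       -- number of refinement rounds (the locality parameter)
    """
    # Round 0: nodes are equivalent iff they carry the same label.
    block = dict(labels)
    for _ in range(k):
        # A node's signature is its current block plus the set of blocks
        # of its parents; nodes with equal signatures stay together.
        signature = {
            n: (block[n], frozenset(block[p] for p in parents.get(n, ())))
            for n in labels
        }
        ids = {}
        block = {n: ids.setdefault(sig, len(ids)) for n, sig in signature.items()}
    classes = defaultdict(set)
    for n, b in block.items():
        classes[b].add(n)
    return {frozenset(c) for c in classes.values()}


# Hypothetical toy graph: r -> a1 -> c1 -> d1 and r -> b1 -> c2 -> d2.
LABELS = {"r": "root", "a1": "a", "b1": "b",
          "c1": "c", "c2": "c", "d1": "d", "d2": "d"}
PARENTS = {"a1": ["r"], "b1": ["r"], "c1": ["a1"], "c2": ["b1"],
           "d1": ["c1"], "d2": ["c2"]}
```

In this toy graph the two d nodes still share a class for k = 1, since their parents carry the same label, and they separate only for k = 2, when the incoming paths a.c.d and b.c.d are told apart. This is the sense in which a small k yields a smaller but approximate index.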
In A(k)-Index both relevant and non-relevant tags
get the same treatment, because one k is defined for
the whole index. In D(k)-Index we can emphasize
relevant tags by giving the nodes with the desired
tags a different k (local similarity) value. However,
even when some tags are known to be entirely
irrelevant, D(k)-Index still includes all the nodes,
whether their tags are relevant or not.
As seen in (Kaushik, R. et al., 2002) and (Milo, T.
and Suciu, D., 1998), using partial data can improve
evaluation time. The schema in (Milo, T. and Suciu,
D., 1998) is based on simulation and is used for
creating efficient regular path expressions, which
improves DB scanning time. "other" edges are used
as a replacement for the edges of unknown tags that
exist in the DB. However, the goal of the data
structure in (Milo, T. and Suciu, D., 1998) differs
from the goal of (Kaushik, R. et al., 2002) and ours.
We use a bisimulation data structure to efficiently
evaluate the query, rather than to create a more
efficient query. Both (Kaushik, R. et al., 2002) and we
offer a method to reduce the index size by replacing
irrelevant tags with the tag "other" and removing
isolated "other" nodes. This method allows an
efficient evaluation of queries based on relevant tags
only. (Kaushik, R. et al., 2002) uses this method to
reduce the F&B index size, but does not give
detailed algorithms for constructing the index and
evaluating the query using "other". We present these
algorithms in depth, especially in the context of
the A(k)-Index structure, and exploit local
bisimulation in treating "other" nodes.
A(k)-Simplified uses this idea and adds a method to
remove the paths of "other" nodes. A(k)-Relevant
exploits local bisimilarity and relaxes the requirement
that queries contain relevant tags only: it requires
just a relevant tag at the end of the query, and therefore
supports all queries which return relevant data. Both
indexes support the collapse of "other" paths without
introducing false results.
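The replacement of irrelevant tags by "other" can be sketched as a preprocessing step on the data graph. The sketch below is a rough illustration under stated assumptions and is not the construction algorithm of A(k)-Simplified or A(k)-Relevant. In particular, it assumes that an "isolated" "other" node is one from which no node with a relevant tag can be reached; the function name and the graph encoding are likewise hypothetical.

```python
from collections import deque

OTHER = "other"


def relabel_and_prune(labels, children, relevant):
    """Relabel irrelevant tags to "other" and drop isolated "other" nodes.

    labels   -- dict: node id -> tag
    children -- dict: node id -> iterable of child node ids
    relevant -- set of tags considered relevant
    Assumption: an "other" node is isolated if no relevant node is
    reachable from it.
    """
    relabeled = {n: (tag if tag in relevant else OTHER)
                 for n, tag in labels.items()}

    # Build reverse edges so we can walk upwards from relevant nodes.
    parents = {n: set() for n in labels}
    for u, vs in children.items():
        for v in vs:
            parents[v].add(u)

    # Keep every relevant node and every ancestor of a relevant node.
    keep = {n for n, tag in relabeled.items() if tag != OTHER}
    queue = deque(keep)
    while queue:
        n = queue.popleft()
        for p in parents[n]:
            if p not in keep:
                keep.add(p)
                queue.append(p)

    new_labels = {n: relabeled[n] for n in keep}
    new_children = {n: {c for c in children.get(n, ()) if c in keep}
                    for n in keep}
    return new_labels, new_children


# Hypothetical toy document: db -> book -> {title, price}, db -> note.
DOC_LABELS = {0: "db", 1: "book", 2: "title", 3: "price", 4: "note"}
DOC_CHILDREN = {0: [1, 4], 1: [2, 3]}
```

With the relevant tags {"book", "title"}, the price and note nodes are relabeled to "other" and dropped, while the db node is relabeled but kept because it leads to relevant data.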
3 PRELIMINARIES
XML or other semi-structured databases are
modeled as a directed, labeled graph G, with each
edge indicating an object-subobject or object-value
relationship. Each node in G has a label, and an OID
(as described in (Papakonstantinou, Y., 1995)), with
simple objects having a distinguished label,
VALUE. Figure 1 shows an example of a
semi-structured data graph. The id-idref edges are
drawn with dotted lines, but in our model we
treat all edges in the same manner. Queries
in semi-structured databases are based on regular
path expressions. Let G be a data graph, and Σ_G be
WEBIST 2005 - INTERNET COMPUTING