
 
already both in (Milo, T. and Suciu, D., 1998) and 
(Kaushik, R. et al., 2002). However, neither paper 
elaborates on the algorithmic implications of this 
idea, in particular in the context of the A(k)-Index. 
This algorithmic elaboration is the main contribution 
of this paper (see also section 2). 
The rest of the paper is organized as follows. We 
review existing work in section 2, and then elaborate 
on the data model and other related issues in section 
3 (preliminaries). We describe A(k)-Simplified and 
A(k)-Relevant in section 4, and give a way to remove 
false drops in section 5. We conclude with 
experimental results (section 6) and further research 
(section 7).  
2 RELATED WORK 
In semi-structured databases, indexes are structural 
summaries built to reduce query evaluation time. 
Some indexes are approximate, which means that for 
some queries verification is needed in order to 
remove false drops (by false drops we mean false 
positives: some of the results returned by the index 
are incorrect). Examples of such indexes are: 
Approximate DataGuide (Goldman, R. and Widom, J., 
1999), which is based on Strong DataGuide and 
merges “similar” nodes; APEX (Chung, C. W. et al., 
2002), an index that adjusts its structure using data 
mining strategies in order to efficiently evaluate 
queries with common suffixes; D(k)-Index (Qun, C., 
Lim A. and Ong, K. W., 2003), which uses different 
local bisimilarity values in order to emphasize 
important tags; and A(k)-Index. 
In our work we are interested in the 
A(k)-Index and local bisimilarity on which it is 
based. We say that two nodes are bisimilar if their 
two sets of incoming label paths are similar. Every 
node in a bisimilarity-based index is an equivalence 
class of the bisimilarity relation. The first 
bisimilarity-based index was 1-Index, which is safe 
and accurate but tends to be very large, especially 
with “complex” documents. A(k)-Index offers a way to 
reduce the index size by relaxing the bisimilarity 
requirement of considering all the incoming label 
paths, and defining a new relation: local 
bisimilarity. Two nodes are k-local bisimilar if their 
two sets of incoming label paths of length at most k 
are similar. Local bisimilarity is based on the 
assumption that long queries are rare, and therefore 
we can build a structure that gives accurate results 
for short queries, while for long queries the results 
are approximate and sometimes need verification 
(i.e. A(k)-Index is safe, but accurate only for short 
queries). Experiments show that A(k)-Index is much 
smaller than 1-Index. 
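To make the notion of local bisimilarity concrete, the following is a minimal sketch of how the k-local bisimilarity classes of a small data graph could be computed by iterative refinement. It assumes the usual recursive formulation (two nodes are 0-bisimilar if they carry the same label, and k-bisimilar if they are (k-1)-bisimilar and their parents fall into the same (k-1)-classes). The function name ak_partition and the input format (a label dictionary and a list of parent-child pairs) are hypothetical and serve illustration only; this is not the construction algorithm of any of the cited papers.

```python
from collections import defaultdict


def ak_partition(labels, edges, k):
    """Group nodes into k-local bisimilarity classes (illustrative sketch).

    labels: dict mapping node id -> label (assumed input format)
    edges:  iterable of (parent, child) pairs (assumed input format)
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    # Round 0: nodes are equivalent if and only if they share a label.
    block = dict(labels)

    for _ in range(k):
        # A node's new signature is its current class plus the set of
        # classes its parents currently belong to.
        signature = {
            node: (block[node], frozenset(block[p] for p in parents[node]))
            for node in labels
        }
        # Renumber signatures to small integers so they do not grow with k.
        ids = {}
        block = {node: ids.setdefault(sig, len(ids))
                 for node, sig in signature.items()}

    groups = defaultdict(set)
    for node, cls in block.items():
        groups[cls].add(node)
    return sorted(groups.values(), key=min)


# Toy graph: root -> a -> c -> d and root -> b -> c -> d.
labels = {1: "root", 2: "a", 3: "b", 4: "c", 5: "c", 6: "d", 7: "d"}
edges = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

for k in range(3):
    print(k, ak_partition(labels, edges, k))
```

In this toy graph the two d nodes (6 and 7) share a class for k = 1, because both have a parent labeled c, but are separated for k = 2, because their incoming paths of length 2 differ (a/c/d versus b/c/d). A larger k therefore yields a finer, larger index.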
In A(k)-Index both relevant and non-relevant tags 
get the same treatment, because a single k is defined 
for the whole index. In D(k)-Index we can emphasize 
relevant tags by giving the nodes with the desired 
tags a different k (local similarity) value. However, 
even though some tags are known to be completely 
irrelevant, D(k)-Index still includes all the nodes, 
whether their tags are relevant or not.  
As seen in (Kaushik, R. et al., 2002) and (Milo, T. 
and Suciu, D., 1998), using partial data can improve 
evaluation time. The schema in (Milo, T. and Suciu, 
D., 1998) is based on simulation and is used for 
creating efficient regular path expressions, which 
improves DB scanning time. "other" edges are used 
as a replacement for edges with unknown tags that 
exist in the DB. However, the goal of the data 
structure in (Milo, T. and Suciu, D., 1998) differs 
from the goal of (Kaushik, R. et al., 2002) and ours. 
We use a bisimulation data structure to efficiently 
evaluate the query, rather than to create a more 
efficient query. Both (Kaushik, R. et al., 2002) and 
we offer a method to reduce the index size by 
replacing irrelevant tags with the tag "other" and 
removing 
isolated "other" nodes. This method allows an 
efficient evaluation of queries based on relevant tags 
only. (Kaushik, R. et al., 2002) uses this method to 
reduce the F&B index size, but does not give 
detailed algorithms for constructing the index and 
evaluating the query using "other". We present these 
algorithms in depth, especially in the context of 
the A(k)-Index structure, and exploit local 
bisimulation in treating "other" nodes. 
A(k)-Simplified uses this idea and adds a method to 
remove paths of "other" nodes. A(k)-Relevant 
exploits local bisimilarity and relaxes the 
requirement that queries contain relevant tags only: 
it requires just a relevant tag at the end of the 
query, and therefore supports all queries which 
return relevant data. Both indexes support the 
collapse of "other" paths without introducing false 
results. 
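The idea of replacing irrelevant tags with "other" can be sketched in a few lines. The sketch below relabels every node whose tag is not in a given set of relevant tags, and then drops "other" nodes that cannot lead to any relevant node. Treating such nodes as the "isolated" ones is our assumption for the purpose of illustration, and the function name simplify and the input format are hypothetical; the sketch is not the A(k)-Simplified or A(k)-Relevant construction itself.

```python
from collections import defaultdict

OTHER = "other"


def simplify(labels, edges, relevant):
    """Relabel irrelevant tags as "other" and prune dead-end "other" nodes.

    labels:   dict mapping node id -> tag (assumed input format)
    edges:    iterable of (parent, child) pairs (assumed input format)
    relevant: set of tags the queries are expected to use
    """
    edges = list(edges)
    new_labels = {node: (tag if tag in relevant else OTHER)
                  for node, tag in labels.items()}

    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    # Keep every relevant node, plus every node from which a relevant
    # node can be reached (found by walking edges backwards).
    keep = {node for node, tag in new_labels.items() if tag != OTHER}
    stack = list(keep)
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in keep:
                keep.add(parent)
                stack.append(parent)

    kept_labels = {node: new_labels[node] for node in keep}
    kept_edges = [(p, c) for p, c in edges if p in keep and c in keep]
    return kept_labels, kept_edges


# Toy document: "comment" and "note" lead nowhere relevant, while
# "wrapper" is irrelevant itself but has a relevant "title" child.
doc_labels = {1: "db", 2: "movie", 3: "title", 4: "comment",
              5: "note", 6: "wrapper", 7: "title"}
doc_edges = [(1, 2), (2, 3), (2, 4), (4, 5), (1, 6), (6, 7)]

kept_labels, kept_edges = simplify(doc_labels, doc_edges,
                                   {"db", "movie", "title"})
print(kept_labels)
print(kept_edges)
```

In this toy document the comment and note nodes disappear, while the wrapper node survives under the tag "other" because a relevant title node lies below it. A query that uses only relevant tags, or that ends in a relevant tag, can still reach every relevant node in the reduced graph.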
3 PRELIMINARIES 
XML or other semi-structured databases are 
modeled as a directed, labeled graph G, with each 
edge indicating an object-subobject or object-value 
relationship. Each node in G has a label, and an OID 
(as described in (Papakonstantinou, Y., 1995)), with 
simple objects having a distinguished label, 
VALUE. Figure 1 shows an example of a 
semi-structured data graph. The id-idref edges are 
drawn with dotted lines, but in our model we 
treat all the edges in the same manner. Queries 
in semi-structured databases are based on regular 
path expressions. Let G be a data graph, and Σ_G be 
WEBIST 2005 - INTERNET COMPUTING