we compute a production rule of the form Al(_, _) →
l(_, _) that produces an element with label l and
two parameters representing the first-child and the
next-sibling of e. The start production rule S con-
tains the whole structure of the binary XML tree, but
each label of a non-leaf node in the binary XML tree
is replaced with the nonterminal of the production
rule introduced for this non-leaf element. For exam-
ple, the start production rule S generated for the
DAG of the XML tree of Figure 1 given in Grammar
2 is transformed into the start production rule S of
the following Grammar 5:
S → Ar(Ak1(Ab(Ac(D,E),),Ak2(Ah(Ac(D,G),),
Ak3(Ab(Ac(F,E),),Ak4(Ah(Ac(F,G),),
Ak5(Ab(Ac(I,E),), Ak6(Ah(Ac(I,G),),)))))),)
Ar(_, _) → r(_, _).
Ak1(_, _) → k1(_, _).
...
Ac(_, _) → c(_, _).
D → d(,)
E → e(,)
F → f(,)
G → g(,)
I → i(,)
Grammar 5: After initialization, each non-leaf DAG node
of Grammar 2 is substituted with a production rule.
Our approach to sharing similar substructures
within an XML document was inspired by cluster-
ing. Clustering groups similar objects, where simi-
larity of objects is measured by using a distance
function d. Depending on the distance function d,
this process results in different clustering techniques
and different clustering results.
Similarly, in each sharing step, we search within
the start rule S and all other rules for matching pat-
terns p1, …, pn that can be shared by introducing a
new production rule. To find the pattern whose
sharing achieves the highest benefit, we examine all
matches of each possible pattern within the
production rules. For example, consider the
nesting of nonterminals in the start rule of Grammar
5. Each of the patterns Ac(_,E) and Ac(_,G), where
the parameter ‘_’ matches anything, occurs three
times, i.e. has three matches in the start rule,
whereas e.g. the pattern Ac(D,_) occurs only twice,
i.e. has two matches in the start rule. Each clustering
distance function d calculates the benefit of applying
each of these possible patterns, based on
substituting their matches with their corresponding
instantiations (as defined in Section 2.3), and finally
uses the pattern that achieves the highest benefit.
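The match counts quoted above can be reproduced with a small sketch. It assumes a representation that is not part of the paper: a term is a Python tuple (label, first parameter, second parameter), a leaf nonterminal is a string, and None stands in for an empty parameter. A pattern with a single fixed child, such as Ac(_,E), is encoded as the hypothetical triple (parent label, parameter position, child label).

```python
from collections import Counter

def label(node):
    """Label of a node: a leaf is a string, an inner node is a tuple."""
    return node if isinstance(node, str) else node[0]

def count_patterns(term, counts=None):
    """Count, for every (parent, position, child label) edge, its matches in term."""
    if counts is None:
        counts = Counter()
    if isinstance(term, tuple):
        head, *args = term
        for pos, child in enumerate(args):
            if child is not None:  # None marks an empty parameter (assumption)
                counts[(head, pos, label(child))] += 1
                count_patterns(child, counts)
    return counts

def entry(k, h, x, y, rest):
    """Build Ak(Ah-or-Ab(Ac(x, y), empty), rest), as in the start rule of Grammar 5."""
    return (k, (h, ("Ac", x, y), None), rest)

# Start rule S of Grammar 5 in the assumed tuple encoding.
start = ("Ar",
         entry("Ak1", "Ab", "D", "E",
         entry("Ak2", "Ah", "D", "G",
         entry("Ak3", "Ab", "F", "E",
         entry("Ak4", "Ah", "F", "G",
         entry("Ak5", "Ab", "I", "E",
         entry("Ak6", "Ah", "I", "G", None)))))),
         None)

counts = count_patterns(start)
print(counts[("Ac", 1, "E")])  # matches of Ac(_,E): 3
print(counts[("Ac", 1, "G")])  # matches of Ac(_,G): 3
print(counts[("Ac", 0, "D")])  # matches of Ac(D,_): 2
```

Only single-edge patterns are counted here; larger patterns would need a richer encoding.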
We follow a greedy approximation: in each
step, we apply the pattern that ‘locally’
achieves the highest benefit, which in general
does not necessarily lead to the highest ‘global’ benefit.
The benefit is negative if storing a rule L → P for
a pattern P needs more space than the replacement of
the matches of P by their instantiations saves within
all production rules.
However, when the benefit is positive, we store a
rule L → P and we replace all matches of P within all
production rules with their corresponding
instantiations. We repeat this optimization step until no more
patterns with positive benefit can be found.
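The greedy choice in each step can be sketched as follows. The function names and the size model in toy_benefit are assumptions made for illustration only; the concrete benefit depends on the clustering strategy in use. The sketch takes a mapping from patterns to match counts and returns the pattern with the highest benefit, or None once no pattern has a positive benefit, which is the stopping condition described above.

```python
def pick_pattern(match_counts, benefit):
    """Return the pattern with the highest positive benefit, or None."""
    best = max(match_counts,
               key=lambda p: benefit(p, match_counts[p]),
               default=None)
    if best is None or benefit(best, match_counts[best]) <= 0:
        return None  # nothing worth sharing: the optimization loop stops
    return best

def toy_benefit(pattern, num_matches, rule_cost=2):
    """Assumed size model, NOT taken from the paper: each replaced match
    saves one symbol, and storing the new rule costs rule_cost symbols."""
    return num_matches - rule_cost

# Match counts taken from the start rule of Grammar 5.
counts = {"Ac(_,E)": 3, "Ac(_,G)": 3, "Ac(D,_)": 2}
print(pick_pattern(counts, toy_benefit))            # Ac(_,E)
print(pick_pattern({"Ac(D,_)": 2}, toy_benefit))    # None
```

Ties are broken by insertion order here, which is an arbitrary choice of this sketch.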
For example, starting with Grammar 2, we will
first find the matches C(D,E), C(F,E), and C(I,E)
that all have an edge connecting nodes with labels C
and E and match the pattern C(_,E). As no other
possible pattern achieves a higher benefit, we add
the production rule C_E → C(_,E) to the set of
productions and, within the start rule, replace the match
C(D,E) by the instantiation C_E(D), the match
C(F,E) by the instantiation C_E(F), and the match
C(I,E) by the instantiation C_E(I). Within the next
iteration, we find the matches B(C_E(D),),
B(C_E(F),), and B(C_E(I),), which all have an
edge connecting nodes with label B to nodes with
label C_E. The production rule BC_E(_) →
B(C_E(_),) is added and the above
matches are replaced by BC_E(D), BC_E(F), and
BC_E(I), as can be seen in Grammar 3.
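The two sharing steps of this example can be traced with a minimal sketch. The encoding is an assumption, not the paper's data structure: terms are tuples (label, parameters...), leaves are strings, None stands in for the empty parameter, and "_" marks a pattern parameter. The function share replaces every match of a pattern by an instantiation of a new nonterminal whose arguments are the subterms bound to the parameters.

```python
HOLE = "_"  # pattern parameter that matches anything

def match(pattern, term):
    """Return the subterms bound to the holes, or None if term does not match."""
    if pattern == HOLE:
        return [term]
    if isinstance(pattern, tuple):
        if (not isinstance(term, tuple) or len(term) != len(pattern)
                or term[0] != pattern[0]):
            return None
        bindings = []
        for p, t in zip(pattern[1:], term[1:]):
            b = match(p, t)
            if b is None:
                return None
            bindings.extend(b)
        return bindings
    return [] if pattern == term else None  # leaf or empty parameter

def share(term, pattern, name):
    """Replace every match of pattern in term by the instantiation name(...)."""
    if not isinstance(term, tuple):
        return term
    # Rewrite the parameters first, then try to match at this node.
    term = (term[0], *(share(a, pattern, name) for a in term[1:]))
    bindings = match(pattern, term)
    return (name, *bindings) if bindings is not None else term

fragment = ("B", ("C", "D", "E"), None)                 # B(C(D,E),)
step1 = share(fragment, ("C", HOLE, "E"), "C_E")        # rule C_E -> C(_,E)
step2 = share(step1, ("B", ("C_E", HOLE), None), "BC_E")
print(step1)  # ('B', ('C_E', 'D'), None)
print(step2)  # ('BC_E', 'D')
```

The sketch rewrites a single term; applying it to all production rules would mean calling share on each right-hand side.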
Finally, for those production rules for which stor-
ing the rule has a negative benefit, we delete the rule
and replace each occurrence of a rule instantiation
with a corresponding instantiation of the right-hand
side of the production rule.
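This final clean-up step is the inverse of sharing and can be sketched in the same assumed tuple encoding (terms as tuples, leaves as strings, None for the empty parameter, "_" for a rule parameter). The helper names are hypothetical. Each instantiation of the deleted rule is replaced by the rule's right-hand side, with the parameters filled in from left to right.

```python
HOLE = "_"  # parameter position in a rule's right-hand side

def instantiate(rhs, args):
    """Copy rhs, filling its holes from left to right with the iterator args."""
    if rhs == HOLE:
        return next(args)
    if isinstance(rhs, tuple):
        return (rhs[0], *[instantiate(p, args) for p in rhs[1:]])
    return rhs  # leaf or empty parameter

def inline(term, name, rhs):
    """Replace every instantiation name(...) in term by the instantiated rhs."""
    if not isinstance(term, tuple):
        return term
    head, *args = term
    args = [inline(a, name, rhs) for a in args]
    if head == name:
        return instantiate(rhs, iter(args))
    return (head, *args)

# Undo the rule C_E -> C(_,E) inside B(C_E(D),).
restored = inline(("B", ("C_E", "D"), None), "C_E", ("C", HOLE, "E"))
print(restored)  # ('B', ('C', 'D', 'E'), None)
```

Nonterminals without parameters are plain strings in this encoding and are not inlined by the sketch.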
3.2 Clustering Strategies
Just as clustering results depend on the concrete
distance function d, the ‘quality’ of our CluX
clustering strategies depends on what we count. We
have implemented four clustering strategies
that influence the choice of patterns to be shared and
have evaluated them to find out which
strategy achieves the highest compression ratio and
which strategy achieves the smallest runtime. These
strategies are called ‘minimize edges’, ‘minimize
rule size’, ‘minimize succinct storage’, and ‘random’,
and are described in the following subsections.
3.2.1 Minimize Edges
For each possible pattern PP, we count the number
of matches within all rules. The pattern PP that has
the most matches is chosen for the next optimization
step, i.e., a new rule L → PP is added to the rule
ICEIS 2010 - 12th International Conference on Enterprise Information Systems