The last category (avoiding external redundancies given by schema information) includes approaches such as XCQ (Ng et al., 2006) and DTD subtraction (Böttcher, Steinmetz, and Klein, 2007). Both separate the structural information from the textual information and then subtract the given schema information from the structural information. Instead of a complete XML structure stream or tree, they generate and output only the information not already contained in the schema information (e.g., the chosen alternative for a choice-operator or the number of repetitions for a *-operator within the DTD). Both approaches, XCQ and DTD subtraction, are queryable and applicable to XML streams, but they can only be used if schema information is available.
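The idea can be illustrated with a minimal sketch (the encoding below is ours, not the XCQ or DTD-subtraction wire format). Assume the hypothetical DTD fragment `<!ELEMENT list (item*)>` with `<!ELEMENT item (name, price?)>` is known to both encoder and decoder; then only the decisions the schema cannot predict need to be transmitted:

```python
# Illustrative sketch of schema subtraction: with the DTD known to both
# sides, only the non-derivable decisions are emitted -- here, the
# repetition count of the *-operator and one flag per optional price.
def encode_decisions(items):
    """items: list of (name, price) pairs; price may be None."""
    stream = [len(items)]                             # count for item*
    for _name, price in items:
        stream.append(1 if price is not None else 0)  # choice for '?'
    return stream

def decode_decisions(stream):
    """Rebuild the structural skeleton from the decision stream; the
    text values come from separate data containers (omitted here)."""
    count, rest = stream[0], stream[1:]
    return [("item", bool(flag)) for flag in rest[:count]]

encoded = encode_decisions([("apple", "1.00"), ("pear", None)])
# encoded == [2, 1, 0]: two repetitions; price present, price absent
```

The complete element structure never appears in the output; it is reconstructed by replaying the DTD with the recorded decisions.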
XQzip (Cheng and Ng, 2004) and the approach presented in (Buneman, Grohe, and Koch, 2003) belong to the second category (avoiding structural redundancies). They compress the data structure of an XML document bottom-up by combining identical sub-trees. Afterwards, the data nodes are attached to the leaf nodes, i.e., one leaf node may point to several data nodes. The data is compressed by an arbitrary compression approach. These approaches allow querying compressed data, but they are not directly applicable to infinite data streams.
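The bottom-up combination of identical sub-trees can be sketched as follows (a hash-consing toy model of the idea, not the data layout of either cited system): each distinct sub-tree is assigned one canonical entry, so repeated sub-trees collapse into a single shared node.

```python
# Sketch of bottom-up sharing of identical sub-trees: structurally equal
# sub-trees receive the same canonical id, turning the tree into a DAG.
def share_subtrees(node, table):
    """node = (tag, [children]); table maps canonical keys to ids."""
    tag, children = node
    # A sub-tree is identified by its tag plus its children's ids.
    key = (tag, tuple(share_subtrees(c, table) for c in children))
    if key not in table:
        table[key] = len(table)       # first occurrence: new shared node
    return table[key]

table = {}
root_id = share_subtrees(("a", [("b", []), ("b", [])]), table)
# the two identical <b/> sub-trees collapse into one entry: len(table) == 2
```

Data values would then hang off the shared leaf nodes, which is why one leaf may point to several data nodes.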
An extension of (Buneman, Grohe, and Koch, 2003) and (Cheng and Ng, 2004) is the BPLEX algorithm (Busatto, Lohrey, and Maneth, 2005). This approach not only combines identical sub-trees, but also recognizes patterns within the XML tree that may span several levels, and therefore allows a higher degree of compression. In comparison to BSBC, this approach does not explicitly define how to compress text constants and attribute values contained in XML data, nor how to distinguish the two in the compressed XML format.
The first category (avoiding textual redundancies by tokenization) allows for much faster compression than the second one, as only local data has to be considered during compression, as opposed to comparing different sub-trees as in the second category.
The XMill algorithm (Liefke and Suciu, 2000) is an example of the first category. It compresses the structural information separately from the data. The data is grouped according to its enclosing element and collected into several containers, and each container is compressed afterwards. The structure is compressed by assigning each tag name a unique and short ID. Each end-tag is encoded by the symbol ‘/’. This approach does not allow querying the compressed data.
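A minimal sketch of this container grouping (the event representation is ours, and a generic compressor stands in for XMill's back end): the structure becomes a stream of tag IDs with ‘/’ for end-tags, while text values are grouped per enclosing element and compressed per container.

```python
import zlib

# Sketch of XMill-style separation: structure as tag IDs plus '/' marks,
# text grouped by enclosing element into independently compressed containers.
def compress_events(events):
    """events: ('start', tag), ('text', value), and ('end',) tuples."""
    tag_ids, structure, containers, stack = {}, [], {}, []
    for ev in events:
        if ev[0] == 'start':
            tag = ev[1]
            tag_ids.setdefault(tag, len(tag_ids))   # short ID per tag name
            structure.append(str(tag_ids[tag]))
            stack.append(tag)
        elif ev[0] == 'text':
            containers.setdefault(stack[-1], []).append(ev[1])
        else:                                       # end-tag
            structure.append('/')
            stack.pop()
    packed = {tag: zlib.compress('\0'.join(vals).encode())
              for tag, vals in containers.items()}
    return ' '.join(structure), packed

structure, packed = compress_events([
    ('start', 'book'), ('start', 'title'), ('text', 'XML'), ('end',),
    ('start', 'title'), ('text', 'Streams'), ('end',), ('end',),
])
# structure == '0 1 / 1 / /'; both title texts share one container
```

Because similar values end up in the same container, the per-container compressor sees far more redundancy than it would in document order.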
XGrind (Tolani and Haritsa, 2002), XPRESS (Min, Park, and Chung, 2003) and XQueC (Arion et al., n.d.) are extensions of the XMill approach. Each of these approaches compresses the tag information using dictionaries and Huffman encoding (Huffman, 1952) and replaces the end-tags by either a ‘/’-symbol or by parentheses. All three approaches allow querying the compressed data, and, although not explicitly mentioned, they all seem to be applicable to data streams.
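The dictionary-plus-Huffman idea can be sketched as follows (a textbook Huffman construction over tag frequencies, not the exact code layout of any of the three systems): frequent tag names receive shorter bit codes.

```python
import heapq
from collections import Counter

# Sketch of Huffman-coding tag names: build codes bottom-up by merging
# the two least frequent entries; frequent tags end up with short codes.
def huffman_codes(tags):
    heap = [(freq, i, {tag: ''}) for i, (tag, freq)
            in enumerate(Counter(tags).items())]   # i breaks frequency ties
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, i, c2 = heapq.heappop(heap)
        merged = {t: '0' + c for t, c in c1.items()}
        merged.update({t: '1' + c for t, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

codes = huffman_codes(['p', 'p', 'p', 'p', 'div', 'span'])
# the most frequent tag 'p' gets the shortest code
assert len(codes['p']) < len(codes['div'])
```

In the cited approaches the resulting codes replace the tag names in the stream, with a dictionary mapping codes back to names for decompression.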
The approaches of (Bayardo et al., 2004), (Cheney, 2001), and (Girardot and Sundaresan, 2000) are based on tokenization. (Cheney, 2001) replaces each attribute and element name by a token, where each token is defined the first time it is used. (Bayardo et al., 2004) and (Girardot and Sundaresan, 2000) use tokenization as well, but they enrich the data with additional information that allows for fast navigation (e.g., number of children, pointer to the next sibling, existence of content and attributes). All three use a reserved byte to encode the end-tag of an element. They are all applicable to data streams and allow querying the compressed data.
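First-use token definition can be sketched as follows (an illustrative model of the scheme, not the concrete byte format of any of the three approaches): a name travels in full only once, together with its newly assigned token, and every later occurrence is replaced by the token alone.

```python
# Sketch of first-use tokenization: each name is defined on first
# occurrence (token, name) and afterwards emitted as its token only.
def tokenize(names):
    table, out = {}, []
    for name in names:
        if name in table:
            out.append(table[name])           # known name: token only
        else:
            table[name] = len(table)
            out.append((table[name], name))   # first use: define token
    return out

tokenize(['item', 'name', 'item'])
# → [(0, 'item'), (1, 'name'), 0]
```

Because definitions are interleaved with the data, no dictionary has to be transmitted up front, which is what makes the scheme suitable for streams.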
The approach in (Ferragina et al., 2006) does not belong to any of the three categories. It is based on Burrows-Wheeler block sorting (Burrows and Wheeler, 1994), i.e., the XML data is rearranged in such a way that compression techniques such as gzip achieve higher compression ratios. This approach is not applicable to data streams, but it allows querying the compressed data if it is enriched with additional index information.
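The rearrangement step is the classic Burrows-Wheeler transform, sketched naively below (quadratic in the input length; production implementations use suffix arrays, and this is not the cited paper's XML-specific construction): sorting all rotations of the input clusters characters with similar contexts, which a generic compressor then exploits.

```python
# Naive Burrows-Wheeler transform: sort all rotations of the input and
# keep the last column. Assumes the terminator never occurs in the input.
def bwt(s, end='\0'):
    s = s + end                                  # unique end marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

bwt('banana')
# → 'annb\x00aa' -- the repeated 'an' contexts cluster the a's together
```

The transform is invertible, so the original document can be recovered; the extra index information mentioned above is what makes querying the transformed data possible.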
The approach in (Zhang, Kacholia, and Özsu, 2004) is another succinct representation of XML. It does not separate the raw data structure that describes the document tree from the tokens representing the elements. Therefore, one byte is required to represent an end-tag, whereas our approach, BSBC, only needs one bit. Furthermore, our separation of structural data from element names not only allows for better compression, as shown in the evaluation; it also enables a more efficient evaluation of path queries, because raw bit data can be compared more efficiently than tokens. A second difference is our use of inverted element lists instead of token dictionaries, which additionally speeds up path query evaluation significantly, because the number of possible path hits can be reduced quickly with a simple lookup within the inverted list.
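The pruning effect of an inverted element list can be sketched as follows (the data layout is ours for illustration, not the BSBC format): mapping each element name to the positions where it occurs lets a path step discard all impossible hits with a single lookup.

```python
# Sketch of an inverted element list: tag name -> sorted positions of its
# start events, so a path step only inspects candidate positions.
def build_inverted_list(events):
    """events: sequence of (position, tag) pairs for element starts."""
    inverted = {}
    for pos, tag in events:
        inverted.setdefault(tag, []).append(pos)
    return inverted

inv = build_inverted_list([(0, 'a'), (1, 'b'), (3, 'c'), (5, 'b')])
# a query step on 'b' inspects only positions [1, 5] instead of all four
```

A token dictionary, by contrast, only translates names to codes; it does not by itself narrow down where in the document a queried element can occur.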
To the best of our knowledge, the separation of
an XML stream into different compressed streams
linked by a bit stream that is also used to evaluate
path queries is unique to our compression technique,
WEBIST 2008 - International Conference on Web Information Systems and Technologies