use a reserved byte for encoding the end-tag of an
element. They all allow querying the compressed
data.
The encoding-based compression approach in
(Zhang, Kacholia, and Özsu, 2004) defines a succinct
representation of XML that stores each start-tag
in the form of a token and each end-tag in the form of a
special token (e.g. ‘)’). The authors enrich this compressed
XML representation with additional index data
that allows more efficient query evaluation. This
approach allows querying of compressed data.
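The token idea can be illustrated with a minimal sketch. It is not the encoding of (Zhang, Kacholia, and Özsu, 2004); the function name `encode_structure`, the one-character tokens starting at 'A', and the decision to ignore text and attributes are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

END = ")"  # special end-tag token, as in the example given in the text


def encode_structure(xml_text):
    """Return (token string, tag dictionary) for the element structure only.

    Illustrative assumption: every distinct tag name gets a one-character
    token ('A', 'B', ...) in order of first appearance. Text content and
    attributes are ignored.
    """
    tag_to_token = {}
    out = []

    def visit(elem):
        # Assign a fresh token the first time a tag name is seen.
        if elem.tag not in tag_to_token:
            tag_to_token[elem.tag] = chr(ord("A") + len(tag_to_token))
        out.append(tag_to_token[elem.tag])  # start-tag token
        for child in elem:
            visit(child)
        out.append(END)                     # one shared end-tag token

    visit(ET.fromstring(xml_text))
    return "".join(out), tag_to_token


tokens, dictionary = encode_structure("<a><b/><b/><c><b/></c></a>")
print(tokens)      # AB)B)CB)))
print(dictionary)  # {'a': 'A', 'b': 'B', 'c': 'C'}
```

Because every end-tag maps to the same token, the element names of closing tags never need to be stored; they can be recovered from the nesting.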
XQzip (Cheng and Ng, 2004) and the approaches
presented in (Adiego, Navarro, and de la Fuente, 2004)
and (Buneman, Grohe, and Koch, 2003) belong to
grammar-based compression. They compress the
structure of an XML document by combining
identical sub-trees. Afterwards, the data nodes are
attached to the leaf nodes, i.e., one leaf node may
point to several data nodes. The data itself is compressed
by an arbitrary compression approach. These approaches
allow querying compressed data.
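Combining identical sub-trees can be sketched as follows. This is a generic sub-tree sharing (hash-consing) example, not the algorithm of any of the cited systems; the name `share_subtrees` and the node table layout are assumptions, and text content is left out.

```python
import xml.etree.ElementTree as ET


def share_subtrees(xml_text):
    """Turn the element tree into a DAG in which identical sub-trees share a node.

    Returns (root id, node table); node table entry i is (tag, child ids).
    Illustrative assumption: only element structure is considered.
    """
    nodes = []   # node id -> (tag, tuple of child ids)
    index = {}   # (tag, tuple of child ids) -> node id

    def visit(elem):
        # Children are processed first, so equal sub-trees yield equal keys.
        key = (elem.tag, tuple(visit(child) for child in elem))
        if key not in index:
            index[key] = len(nodes)
            nodes.append(key)
        return index[key]

    root = visit(ET.fromstring(xml_text))
    return root, nodes


root, nodes = share_subtrees("<r><p><n/><v/></p><p><n/><v/></p></r>")
print(len(nodes))   # 4 shared nodes instead of 7 elements
print(nodes[root])  # ('r', (2, 2))
```

In this example both `p` sub-trees collapse into node 2, so the root simply refers to the same node twice.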
An extension of (Buneman, Grohe, and Koch,
2003) and (Cheng and Ng, 2004) is the BPLEX algorithm
(Busatto, Lohrey, and Maneth, 2005). This
approach not only combines identical sub-trees
but also recognizes similar patterns within the XML, and
therefore allows a higher degree of compression. It
allows querying of compressed data.
Schema-based compression comprises approaches
such as XCQ (Ng et al., 2006), XAUST (Subramanian
and Shankar, 2005), Xenia (Werner et al.,
2006) and DTD subtraction (Böttcher, Steinmetz,
and Klein, 2007). They subtract the given schema
information from the structural information. Instead
of a complete XML structure stream or tree, they
generate and output only information not already
contained in the schema information (e.g., the chosen
alternative for a choice-operator or the number
of repetitions for a *-operator within the DTD).
These approaches are queryable and applicable to
XML streams, but they can only be used if schema
information is available.
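The following minimal sketch illustrates the idea of emitting only what the schema leaves open. It is not the encoding used by XCQ, XAUST, Xenia, DTD subtraction or XSDS; the tuple-based schema notation (`elem`, `seq`, `choice`, `star`), the function names, and the example document are hypothetical, and the sketch assumes that every alternative can be recognized by its first element name.

```python
def first_names(model):
    """Element names that may start a match of model (assumes simple, non-empty models)."""
    kind = model[0]
    if kind == "elem":
        return {model[1]}
    if kind == "seq":
        return first_names(model[1][0])
    if kind == "choice":
        return set().union(*(first_names(m) for m in model[1]))
    return first_names(model[1])  # "star"


def subtract(model, children, pos, out):
    """Match children[pos:] against model; append only non-inferable facts to out."""
    kind = model[0]
    if kind == "elem":
        name, kids = children[pos]
        assert name == model[1], "document does not match schema"
        if model[2] is not None:
            assert subtract(model[2], kids, 0, out) == len(kids)
        return pos + 1                      # element itself is implied by the schema
    if kind == "seq":
        for part in model[1]:
            pos = subtract(part, children, pos, out)
        return pos
    if kind == "choice":
        for i, alt in enumerate(model[1]):
            if pos < len(children) and children[pos][0] in first_names(alt):
                out.append(i)               # record which alternative was chosen
                return subtract(alt, children, pos, out)
        raise ValueError("no alternative matches")
    if kind == "star":
        slot, count = len(out), 0
        out.append(0)                       # placeholder for the repetition count
        while pos < len(children) and children[pos][0] in first_names(model[1]):
            pos = subtract(model[1], children, pos, out)
            count += 1
        out[slot] = count
        return pos
    raise ValueError(kind)


def encode(schema, doc):
    out = []
    subtract(schema, [doc], 0, out)
    return out


# Hypothetical schema: order = (id, (card | iban), item*)
SCHEMA = ("elem", "order", ("seq", [
    ("elem", "id", None),
    ("choice", [("elem", "card", None), ("elem", "iban", None)]),
    ("star", ("elem", "item", None)),
]))
DOC = ("order", [("id", []), ("iban", []), ("item", []), ("item", []), ("item", [])])
print(encode(SCHEMA, DOC))  # [1, 3]
```

For this document the whole element structure reduces to two numbers: alternative 1 of the choice and 3 repetitions of `item`. Everything else follows from the schema.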
XSDS follows the same basic idea of deleting information
that is redundant because of a given
schema. In contrast to XCQ, XAUST and DTD subtraction,
which can only remove schema information
given by a DTD, XSDS works on XML Schema,
which is significantly more complex than DTDs.
Furthermore, XSDS uses a counting scheme for repetitions
that compresses more strongly than, e.g., the ones
used in XCQ or Xenia.
The approach in (Ferragina et al., 2006) does not
belong to any of the three categories. It is based on
the Burrows-Wheeler Transform (Burrows and Wheeler,
1994), i.e., the XML data is rearranged in such a
way that compression techniques such as gzip
achieve higher compression ratios. This approach
allows querying the compressed data only if it is
enriched with additional index information.
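To show what "rearranging" means here, the sketch below implements the plain Burrows-Wheeler Transform on a byte string in its naive sorted-rotations form. It is not the XML-specific transform of (Ferragina et al., 2006); the function names and the use of a zero byte as end marker are assumptions made for illustration.

```python
def bwt(data: bytes) -> bytes:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations.

    Illustrative assumption: a zero byte, which must not occur in the input,
    is appended as an end marker so that the transform can be inverted.
    """
    assert b"\x00" not in data
    s = data + b"\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)


def inverse_bwt(last: bytes) -> bytes:
    """Invert bwt() by repeatedly prepending the last column and sorting."""
    table = [b""] * len(last)
    for _ in range(len(last)):
        table = sorted(bytes([c]) + row for c, row in zip(last, table))
    # The original string is the rotation that ends with the end marker.
    original = next(row for row in table if row.endswith(b"\x00"))
    return original[:-1]


print(bwt(b"banana"))               # b'annb\x00aa'
print(inverse_bwt(bwt(b"banana")))  # b'banana'
```

The transform only permutes the input; it tends to group equal symbols into runs, which is why a general-purpose compressor (for example `zlib.compress` from the Python standard library) is applied afterwards. This naive version is quadratic in memory and is meant for small inputs only.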
In comparison to all other approaches, XSDS is
the only approach that combines the following
advantageous properties: XSDS removes XML data
nodes that are fixed by the given XML schema, it
encodes choices, repetitions, and ‘all’-groups in an
efficient manner, and it allows for efficient query
processing on the compressed XML data.
To the best of our knowledge, no other XML
compression technique combines such compression
performance for SEPA data with such query
processing speed on compressed data.
6 CONCLUSIONS
We have presented XSDS (XML schema subtraction),
an XML compressor that performs especially
well for electronic payment data in SEPA format.
XSDS removes all data that can be inferred from
the given schema information of the XML document.
Thereby, XSDS provides two major advantages.
First, XSDS generates a strongly compressed
document representation, which may save costs and
energy by reducing the bandwidth needed for data transfer,
the main memory required to process data, and the
secondary storage needed to archive compressed
XML data. Second, XSDS supports fast
query evaluation on the compressed document without
prior decompression.
Our experiments have shown that XSDS compresses
SEPA messages down to a size of 11% of
the original SEPA document size on average, which
outperforms the other compressors, i.e. gzip, XMill
and bzip2, by a factor of 3. Furthermore, query evaluation
directly on the compressed SEPA data is not
only possible, but in our experiments, query
processing reaches throughput rates that are higher
than those of ADSL2+. Therefore, we consider the
XSDS compression technique to be highly beneficial
in all SEPA applications for which the data volume
is a bottleneck.
REFERENCES
J. Adiego, G. Navarro, P. de la Fuente: Lempel-Ziv Compression
of Structured Text. Data Compression Conference
2004
ICEIS 2010 - 12th International Conference on Enterprise Information Systems