plications cannot afford the performance penalty that
has previously been the price of converting to XML.
There have been several efforts to reduce the stor-
age impact of converting to or otherwise utilizing
XML. XMill (Liefke and Suciu, 2000) is an XML
document compressor designed to alleviate the con-
cern of data expansion due to XML conversion. The
WAP Binary XML Format (World Wide Web Consortium,
1999) and its extension Millau (Girardot and
Sundaresan, 2000) have been introduced for the
efficient encoding and streaming of XML structure,
particularly in a wireless environment. These
compression schemes address the data expansion concern
but do not provide explicit support for querying
compressed documents. There have been subsequent
efforts to provide query support over compressed XML
documents with systems that compress the content of
the document (Min et al., 2003) or its structure (Bune-
man et al., 2003).
This paper presents XPack, a document compres-
sion system providing both good compression and
strong query support for compressed documents. The
design goal of XPack is to support the acceptance of
XML as a viable data exchange mechanism by min-
imizing the performance penalty incurred by appli-
cations that use it. XPack’s compression and query
techniques are built upon these fundamental design
principles:
1. Redundancy Elimination. XPack reduces much of
the redundancy found in XML documents, yielding
a smaller document footprint.
2. Binary Format. XPack’s binary encoding requires
no parsing when loading a document from disk.
Document parsing and verification are
resource-intensive operations, so applications using
XPack can expect better performance when loading
documents.
3. Compartmentalization. XPack separates various
document components to provide faster access to
interesting aspects of a document, such as its tree
structure information or node tag list.
4. Compressed Data Access. Unlike other widely
used document compression systems, XPack pro-
vides general query facilities that operate over the
compressed documents. Compressed data access
allows applications to store data in a compact,
space-saving format without sacrificing the ability
to do ad hoc querying.
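
To make the redundancy elimination and compartmentalization principles concrete, the following is a minimal Python sketch and not XPack's actual encoding. It assumes a hypothetical three-part layout (a tag dictionary, a structure stream, and a content list); the function name `compartmentalize` and the layout itself are illustrative assumptions. Each distinct tag name is stored once, and the tree shape is kept apart from the text values.

```python
import xml.etree.ElementTree as ET

def compartmentalize(xml_text):
    """Split a document into a tag dictionary, a structure stream,
    and a content list (hypothetical layout, not the XPack format)."""
    root = ET.fromstring(xml_text)
    tags = []        # each distinct tag name, stored exactly once
    tag_ids = {}     # tag name -> index into `tags`
    structure = []   # (tag id, child count) per node, in pre-order
    content = []     # leading text of each node, in the same order

    def visit(node):
        if node.tag not in tag_ids:
            tag_ids[node.tag] = len(tags)
            tags.append(node.tag)
        structure.append((tag_ids[node.tag], len(node)))
        # Simplification: attributes and tail text are ignored.
        content.append((node.text or "").strip())
        for child in node:
            visit(child)

    visit(root)
    return tags, structure, content

doc = ("<lib><book><title>A</title><year>1999</year></book>"
       "<book><title>B</title><year>2003</year></book></lib>")
tags, structure, content = compartmentalize(doc)

# A structural question answered without touching the content list.
book_count = sum(1 for tag_id, _ in structure if tags[tag_id] == "book")
```

In this sketch the tag name `book` appears once in `tags` however many times it occurs in the document, and counting `book` nodes reads only the structure stream.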
2 RELATED WORK
The XPack document encoding leverages ideas from
our previous work on the Page Digest system (Rocco
et al., 2003) for efficient representation of HTML
Web pages. The Page Digest was designed to support
efficient document processing in applications such as
Web change monitoring (Buttler et al., 2004; Buttler
et al., 2003). XPack extends this work to target XML
documents in particular, providing a more flexible
framework that supports containerized document
compression and efficient path querying.
The foundational work for modern data compression
was done by Ziv and Lempel (Ziv and Lempel, 1977),
who proposed the idea of the dictionary compressor.
Dictionary compressors operate by substituting
repeated occurrences of a given string with a shorter
sequence; the original file can be reconstructed by
reversing the substitution. This technique and its
variants have become integral components of many
standard computing tools such as the Gzip file
compression tool (Gailly and Adler, 2004). For a more
general introduction to data compression, we refer the
reader to (Sayood, 2000; Witten et al., 1999).
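
The substitution idea can be illustrated with a toy sliding-window compressor in the style of the 1977 Ziv and Lempel scheme. This is a simplified sketch for exposition, not the algorithm used by Gzip or any production tool; the function names, the window size, and the brute-force match search are assumptions made for brevity. Each repeated string is replaced by a back-reference (offset and length) followed by one literal character.

```python
def lz77_compress(data, window=255):
    """Encode `data` as a list of (offset, length, next_char) triples."""
    out = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may overlap position i, but must leave one
            # character free to serve as the literal in the triple.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    """Reverse the substitution by replaying each back-reference."""
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])  # copy one char at a time (handles overlap)
        buf.append(ch)
    return "".join(buf)

sample = "<a>x</a><a>y</a><a>z</a>"
encoded = lz77_compress(sample)
restored = lz77_decompress(encoded)
```

Here `restored` equals `sample`, and the repeated markup is covered by back-references rather than stored again, which is why markup-heavy XML tends to respond well to dictionary compression.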
Recently, there have been several efforts to de-
sign XML-specific compression algorithms. The first,
XMill (Liefke and Suciu, 2000), was designed to pro-
mote standardized document storage and transmission
formats while alleviating the concern of data expan-
sion that is often the penalty of converting data to
XML. The XMill compressor achieves this goal by
creating a container for each document tag and plac-
ing the data values for each tag into the same con-
tainer. The containers are then compressed using
a standard dictionary compression library. The in-
tuition behind this approach is that grouping values
by their tag names would arrange repetitious sections
of the document closer to each other, which would
yield greater compression efficiency from the
dictionary compressor. The fundamental problem with the
XMill approach is its opaque nature: data compressed
with XMill can be used only after it has been
decompressed, a costly step that adds to the overhead
of parsing the text document.
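
The container idea described above can be sketched in a few lines of Python. This is an illustration of grouping values by tag, not XMill's implementation: `zlib` stands in for the "standard dictionary compression library", the helper names are hypothetical, and the sketch omits the document structure that a real compressor must also store.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def containerize(xml_text):
    """Group text values by tag name and compress each group separately."""
    root = ET.fromstring(xml_text)
    containers = defaultdict(list)
    for node in root.iter():
        text = (node.text or "").strip()
        if text:
            containers[node.tag].append(text)
    # One compressed blob per tag. Values are newline-separated, so this
    # sketch assumes that no value contains a newline.
    return {tag: zlib.compress("\n".join(values).encode("utf-8"))
            for tag, values in containers.items()}

def read_container(compressed, tag):
    """Recover the values for one tag by decompressing its whole container."""
    return zlib.decompress(compressed[tag]).decode("utf-8").split("\n")

doc = ("<lib><book><title>A</title><year>1999</year></book>"
       "<book><title>B</title><year>2003</year></book></lib>")
compressed = containerize(doc)
years = read_container(compressed, "year")
```

Note that `read_container` must decompress an entire container before any single value can be inspected, which mirrors the opacity problem discussed above.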
The Millau (Girardot and Sundaresan, 2000) bi-
nary format—an extension to the WAP Binary XML
Format (World Wide Web Consortium, 1999)—has
been introduced for the efficient encoding and stream-
ing of XML structure, particularly in a wireless envi-
ronment. A technique using multiplexed hierarchical
PPM models was introduced in (Cheney, 2001). It
has been shown to be slower than XMill, but in some
cases achieves a higher degree of compression.
There have been several recent efforts to provide
query support over compressed XML documents,
typically by making a trade-off between the degree
of compression and support for queries. The first
of these techniques, XGRIND (Tolani and Haritsa,
2002), compresses XML documents by using Huff-
man encoding for non-enumerated types. If a doc-
ument conforms to a known DTD, additional com-
XPACK: A HIGH-PERFORMANCE WEB DOCUMENT ENCODING