nearly the same as in our case (e.g., text values are
encoded differently). However, whereas our approach is
capable of sequentially compressing infinite data
streams (and of performing path queries on the
compressed data), the approach presented in
(Sundaresan, 2001) is limited by the size of main
memory. According to (Sundaresan, 2001), they had
problems with a file of 281 kB (the smallest test file
in our data set), as it exceeded their time threshold,
whereas we were able to compress a document of
more than 300 MB in decent time.
Another approach (Ng, 2006) also follows the
idea of omitting information that is redundant because of
the DTD. It also uses a SAX parser, i.e., this
approach is capable of compressing infinite data
streams. However, this approach uses the paths
allowed by a non-recursive DTD to define a set of
buffers, and stores each constant found in the XML
document in the buffer defined for its path before
compressing the whole buffer. As the number of
buffers must be finite, this approach does not
support recursive DTDs. Furthermore, (Ng, 2006)
leaves open how to treat constants other than
numbers. In comparison, our approach can store text
constants and can handle recursive DTDs.
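The limitation of path-defined buffers can be illustrated with a minimal
sketch. It is not the implementation of (Ng, 2006); the class
PathBufferHandler and the function collect_buffers are hypothetical names
introduced here, and the sketch only assumes that each distinct
root-to-element path receives its own buffer.

```python
import xml.sax
from collections import defaultdict


class PathBufferHandler(xml.sax.ContentHandler):
    """Hypothetical SAX handler: one buffer per root-to-element path."""

    def __init__(self):
        super().__init__()
        self.path = []                    # element names from the root down
        self.buffers = defaultdict(list)  # path string -> constants seen there

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()

    def characters(self, content):
        # A SAX parser may deliver one text node in several chunks;
        # this sketch simply stores every non-blank chunk.
        text = content.strip()
        if text:
            self.buffers["/" + "/".join(self.path)].append(text)


def collect_buffers(xml_text):
    """Parse xml_text and return {path: [constants]}."""
    handler = PathBufferHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return dict(handler.buffers)
```

For the recursive structure `<a><v>1</v><a><v>2</v></a></a>`, the sketch
creates one buffer for /a/v and another for /a/a/v. Every further nesting
level adds a path, so the set of buffers cannot be fixed in advance when
the DTD is recursive.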
7 SUMMARY AND CONCLUSIONS
Structure-preserving compression of XML data
based on DTD subtraction reduces the verbose
structural parts of XML documents that are
redundant with respect to the structure of a given
DTD, but preserves enough information to search
for certain paths in the compressed XML data. Our
approach uses the given DTD element declarations to
generate a set of grammar rules, which is then
augmented to an attribute grammar for one of three
tasks: compressing an XML document to a KST and a
CXML document, decompressing a KST and
a CXML document, or translating XPath queries
into index positions on CXML data. Our approach
can be extended to arbitrary DTDs, to
other schema languages from which a set of
grammar rules can be derived (e.g., XML Schema and
RelaxNG), and to XML streams.
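The first step, deriving grammar rules from DTD element declarations, can
be sketched as follows. This is an illustration under simplifying
assumptions, not the implementation described in this paper: the functions
dtd_to_rules, referenced_elements and is_recursive are hypothetical, each
rule merely maps an element name to its content model, and attribute
declarations, entities and the attribute-grammar augmentation are ignored.

```python
import re

# Matches declarations of the form <!ELEMENT name content-model>.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+(.*?)\s*>", re.DOTALL)


def dtd_to_rules(dtd_text):
    """Return {element name: content model}, one rule per declaration."""
    rules = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        rules[name] = " ".join(model.split())  # normalise whitespace
    return rules


def referenced_elements(model, rules):
    """Names in a content model that are themselves declared elements."""
    return {tok for tok in re.findall(r"[\w.:-]+", model) if tok in rules}


def is_recursive(rules, start):
    """True if element `start` can occur inside itself according to the rules."""
    seen = set()
    stack = list(referenced_elements(rules[start], rules))
    while stack:
        elem = stack.pop()
        if elem == start:
            return True
        if elem not in seen:
            seen.add(elem)
            stack.extend(referenced_elements(rules[elem], rules))
    return False
```

For a DTD declaring `book (title, section+)`, `title (#PCDATA)` and
`section (title, section*)`, the sketch yields three rules and reports
`section` as recursive, which is the case that a grammar-based approach
has to handle explicitly.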
Of course, our structure-preserving compression
technique can be combined with an ordinary
compression and decompression technique, i.e., an
ordinary technique could be
applied to the data generated by our
structure-preserving compression in order to send even
less data from a sender to a recipient.
Note, however, that our overall goal is not only to
achieve a small compressed size; the main focus of our
contribution is to enable path query processing on
compressed data.
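Such a combination can be sketched with a general-purpose compressor. The
sketch assumes that the output of the structure-preserving compression is
available as a byte string; the function names pack_for_transfer and
unpack_after_transfer are hypothetical, and zlib merely stands in for any
ordinary compression technique.

```python
import zlib


def pack_for_transfer(structure_compressed: bytes) -> bytes:
    """Sender side: apply an ordinary compressor to the already
    structure-compressed data (assumed to be given as bytes)."""
    return zlib.compress(structure_compressed, 9)


def unpack_after_transfer(payload: bytes) -> bytes:
    """Recipient side: undo the ordinary compression and recover the
    structure-compressed data unchanged."""
    return zlib.decompress(payload)
```

In this arrangement the ordinary compression presumably serves only the
transfer: the recipient would first undo it and then evaluate path queries
on the recovered structure-compressed data.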
As XPath forms the major part of other query
languages like XQuery and XSLT, we are optimistic
that our approach could also be applied to XQuery
and XSLT.
REFERENCES
Arion, A., Bonifati, A., Costa, G., D’Aguanno, S.,
Manolescu, I., Pugliese, A., 2003. XQueC: Pushing
queries to compressed XML data. In Proc. VLDB.
Buneman, P., Choi, B., Fan, W., Hutchison, R., Mann, R.,
Viglas, S., 2005. Vectorizing and Querying Large
XML Repositories. In ICDE 2005.
Buneman, P., Grohe, M., Koch, C., 2003. Path Queries on
Compressed XML. In VLDB 2003.
Busatto, G., Lohrey, M., Maneth, S., 2005. Efficient
Memory Representation of XML Documents. In
DBPL 2005.
Cheng, J., Ng, W., 2004. XQzip: Querying Compressed
XML Using Structural Indexing. In EDBT 2004.
Clocksin, W.F., Mellish, C.S., 1997. Programming in
Prolog, Springer. Berlin, 4th Edition.
Fredkin, E., 1960. Trie Memory. In Communications of
the ACM.
Huffman, D., 1952. A Method for Construction of
Minimum-Redundancy Codes. In Proc. of IRE.
Liefke, H., Suciu, D., 2000. XMill: An Efficient
Compressor for XML Data. In Proc. of ACM
SIGMOD.
Min, J.K., Park, M.J., Chung, C.W., 2003. XPRESS: A
Queriable Compression for XML Data. In
Proceedings of SIGMOD.
Ng, W., Lam, W.-Y., Wood, P.T., Levene, M., 2006. XCQ:
A Queriable XML Compression System. In
Knowledge and Information Systems, Springer-Verlag.
Olteanu, D., Meuss, H., Furche, T., Bry, F., 2002. XPath:
Looking Forward. In EDBT Workshops 2002.
Schmidt, A., Waas, F., Kersten, M., Carey, M., Manolescu,
I., Busse, R., 2002. XMark: A benchmark for XML
data management. In VLDB 2002.
Su, H., Rundensteiner, E.A., Mani, M., 2005. Semantic
Query Optimization for XQuery over XML Streams.
In VLDB 2005.
Sundaresan, N., Moussa, R., 2001. Algorithms and
programming models for efficient representation of
XML for Internet applications. In WWW 2001.
Tolani, P.M., Haritsa, J.R., 2002. XGRIND: A query-friendly
XML compressor. In Proc. ICDE 2002.
Yao, B.B., Ozsu, M.T., Keenleyside, J., 2002. XBench - A
family of benchmarks for XML DBMSs. In
Proceedings of EEXTT.
Extensible Markup Language (XML) 1.0, 2000.
http://www.w3.org/TR/2000/REC-xml-20001006
Ziv, J., Lempel, A., 1977. A Universal Algorithm for
Sequential Data Compression. In IEEE Transactions
on Information Theory.
ICEIS 2007 - International Conference on Enterprise Information Systems