CLUX - Clustering XML Sub-trees

Stefan Böttcher, Rita Hartel, Christoph Krislin

Abstract

XML has become the de facto standard for data exchange in enterprise information systems. However, whenever XML data is stored or processed, e.g., in the form of a DOM tree representation, the XML markup inflates memory consumption far beyond the size of the data itself, i.e., the text and attribute values contained in the XML document. In this paper, we present CluX, an XML compression approach based on clustering XML sub-trees. CluX uses a grammar to share similar substructures within the XML tree structure and a cluster-based heuristic to greedily select the best compression options in the grammar. CluX thereby allows XML data to be stored and exchanged in a space-efficient and still queryable way. We evaluate different strategies for XML structure sharing, and we show that CluX often compresses better than XMill, Gzip, and Bzip2, which makes CluX a promising technique for XML data exchange whenever the exchanged data volume is a bottleneck in enterprise information systems.
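
To give a rough feel for what structure sharing means, the following minimal Python sketch stores each distinct structural sub-tree of a document only once. It is not the CluX algorithm: it shares only identical sub-trees (plain DAG-style sharing), ignores text and attribute values, and does not model CluX's grammar for similar substructures or its cluster-based heuristic. The function names `share_subtrees` and `count_nodes` and the sample document are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def share_subtrees(xml_text):
    """Assign one rule id to every distinct structural sub-tree.

    Hypothetical helper for illustration only: two sub-trees are shared
    only if they are structurally identical (same tag, same child rules).
    """
    rules = {}  # (tag, tuple of child rule ids) -> rule id

    def visit(elem):
        # Children are processed first, so a sub-tree is identified by
        # its tag plus the rule ids of its children, in document order.
        key = (elem.tag, tuple(visit(child) for child in elem))
        if key not in rules:
            rules[key] = len(rules)
        return rules[key]

    root_id = visit(ET.fromstring(xml_text))
    return rules, root_id


def count_nodes(elem):
    """Number of element nodes in the unshared tree."""
    return 1 + sum(count_nodes(child) for child in elem)


# Assumed sample document: two identical <book> sub-trees and one variant.
doc = (
    "<lib>"
    "<book><title/><author/></book>"
    "<book><title/><author/></book>"
    "<book><title/></book>"
    "</lib>"
)

rules, root_id = share_subtrees(doc)
total = count_nodes(ET.fromstring(doc))
print(f"{total} element nodes, {len(rules)} distinct sub-trees")
```

In this sample, the two identical `book` sub-trees collapse into a single rule, while the third `book`, which lacks an `author` child, still needs a rule of its own. Sharing such similar but non-identical substructures is the case the abstract attributes to the CluX grammar.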


Paper Citation


in Harvard Style

Böttcher S., Hartel R. and Krislin C. (2010). CLUX - Clustering XML Sub-trees. In Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8425-04-1, pages 142-150. DOI: 10.5220/0002877901420150


in Bibtex Style

@conference{iceis10,
author={Stefan Böttcher and Rita Hartel and Christoph Krislin},
title={CLUX - Clustering XML Sub-trees},
booktitle={Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2010},
pages={142-150},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002877901420150},
isbn={978-989-8425-04-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - CLUX - Clustering XML Sub-trees
SN - 978-989-8425-04-1
AU - Böttcher S.
AU - Hartel R.
AU - Krislin C.
PY - 2010
SP - 142
EP - 150
DO - 10.5220/0002877901420150