Authors:
Stefan Böttcher, Rita Hartel and Christoph Krislin
Affiliation:
University of Paderborn, Computer Science, Germany
Keyword(s):
XML Compression, Grammar-based Compression, XML Sub-tree Clustering.
Related Ontology Subjects/Areas/Topics:
Databases and Information Systems Integration; e-Business; Enterprise Information Systems; Middleware Integration; Middleware Platforms; Technology Platforms; Web Databases
Abstract:
XML has become the de facto standard for data exchange in enterprise information systems. However, whenever XML data is stored or processed, e.g., in the form of a DOM tree representation, the XML markup causes a huge blow-up in memory consumption compared to the actual data, i.e., the text and attribute values contained in the XML document. In this paper, we present CluX, an XML compression approach based on clustering XML sub-trees. CluX uses a grammar to share similar substructures within the XML tree structure and a cluster-based heuristic to greedily select the best compression options in the grammar. CluX thereby allows XML data to be stored and exchanged in a space-efficient yet still queryable way. We evaluate different strategies for XML structure sharing, and we show that CluX often compresses better than XMill, Gzip, and Bzip2, which makes CluX a promising technique for XML data exchange whenever the exchanged data volume is a bottleneck in enterprise information systems.
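The abstract does not spell out the CluX grammar or its clustering heuristic, so neither is reproduced here. As a rough illustration of the general idea of structure sharing only, the following minimal sketch replaces repeated, exactly identical sub-trees with a single grammar rule. The function name `share_subtrees`, the rule names `N0`, `N1`, ... and the sample document are hypothetical, and sharing of merely similar sub-trees, which is what CluX targets, is not covered.

```python
import xml.etree.ElementTree as ET


def share_subtrees(xml_text):
    """Return (start_symbol, rules) in which identical sub-trees share one rule.

    Assumption: only the element structure is considered; text and
    attribute values are ignored. This is NOT the CluX algorithm.
    """
    rules = {}      # rule name -> (tag, tuple of child rule names)
    by_shape = {}   # (tag, tuple of child rule names) -> rule name

    def visit(elem):
        # Children are processed first, so a sub-tree's shape is expressed
        # in terms of the rules already assigned to its children.
        shape = (elem.tag, tuple(visit(child) for child in elem))
        if shape not in by_shape:
            name = f"N{len(by_shape)}"
            by_shape[shape] = name
            rules[name] = shape
        return by_shape[shape]

    start = visit(ET.fromstring(xml_text))
    return start, rules


# Hypothetical sample document with two structurally identical <book> sub-trees.
doc = "<lib><book><title/><author/></book><book><title/><author/></book></lib>"
start, rules = share_subtrees(doc)
node_count = sum(1 for _ in ET.fromstring(doc).iter())
print(f"{node_count} element nodes are represented by {len(rules)} rules")
```

Each rule stores a tag and references to its child rules, so a repeated sub-tree costs one rule plus one reference per occurrence. The original tree structure can be recovered by expanding the rules from the start symbol.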