NEW APPROACHES FOR XML DATA COMPRESSION
Márlon A. C. Teixeira¹, Rodrigo S. Miani², Gean D. Breda², Bruno B. Zarpelão² and Leonardo S. Mendes²
¹ Acre Federal Institute of Education, Science and Technology (IFAC), Rio Branco, Brazil
² School of Electrical and Computer Engineering, University of Campinas (UNICAMP), Campinas, Brazil
Keywords: Data Compression, Data Exchange, Extensible Markup Language (XML), Interoperability.
Abstract: Integration of information systems is essential to organizations. Therefore, it is necessary to make different
technologies interoperate. Extensible Markup Language (XML) is often used for data exchange because it is
self-descriptive and platform-independent. However, XML is a verbose language, which can lead to
excessively large documents. This work proposes two new approaches for XML data
compression and compares them with three existing algorithms: WAP Binary Extensible Markup Language
(WBXML), XMill and Efficient XML Interchange (EXI). The comparison is based on compression rate and
compression time for files of different sizes.
1 INTRODUCTION
Extensible Markup Language (XML) is a de facto
standard for exchanging data in heterogeneous
environments. However, XML documents are often
large because of the redundancy in their repeated
tags (Ng et al., 2006) (Augeri et al., 2007). One way
to reduce their size is to apply data compression
algorithms. Yet XML-conscious compressors such as
WBXML (WAP Binary Extensible Markup Language)
(XBXML, 1999), XMill (Liefke and Suciu, 2000) and
EXI (Efficient XML Interchange) (EXI, 2011) produce
compressed documents that are no longer in the XML
format. As a result, one of the great advantages of
XML, namely the wide range of APIs (Application
Programming Interfaces) available for parsing and
querying XML data, is lost.
In the present work, we propose two approaches
for the compression of XML documents. The first,
called the Schema-aware algorithm, is based on the
elimination of redundant tags from the document
structure. The compressed output is still a
document in the XML format, so it can be handled
with the various XML parsing and querying tools
available. The second is a hybrid algorithm that
applies the general-purpose compressor GZIP
(Gailly and Addler, 2011) to the output of the
Schema-aware algorithm.
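The second stage of the hybrid approach described above can be sketched as follows. This is only an illustration under stated assumptions: the Schema-aware stage is not implemented here, so `schema_aware_output` is a hypothetical stand-in (an ordinary well-formed XML document), and `hybrid_compress` and `hybrid_decompress` are illustrative names rather than the authors' implementation. The sketch uses Python's standard `gzip` module.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the output of the Schema-aware stage:
# any well-formed XML document serialized to bytes.
schema_aware_output = (
    b"<orders>"
    + b"".join(
        b"<order><id>%d</id><status>shipped</status></order>" % i
        for i in range(200)
    )
    + b"</orders>"
)


def hybrid_compress(xml_bytes: bytes) -> bytes:
    """Apply a general-purpose compressor (GZIP) to an XML byte string."""
    return gzip.compress(xml_bytes)


def hybrid_decompress(blob: bytes) -> bytes:
    """Recover the XML byte string produced by the first stage."""
    return gzip.decompress(blob)


compressed = hybrid_compress(schema_aware_output)
restored = hybrid_decompress(compressed)

# After decompression the document is XML again, so a standard parser works.
root = ET.fromstring(restored)
print(len(schema_aware_output), len(compressed), root.tag)
```

Note that the GZIP output itself is a binary stream rather than XML; the document becomes parseable with XML tools again only after decompression.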
To assess both proposals, we carried out a
comparison with the WBXML, XMill and EXI
algorithms, taking into account two metrics:
compression rate and compression time. These two
metrics were collected during the compression of
files of eight different sizes.
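As a rough illustration of how these two metrics might be collected, consider the sketch below. It rests on assumptions not stated in this section: compression rate is taken to be the fraction of the original size saved (1 minus compressed size over original size), GZIP is used only as a stand-in compressor, and `measure` and `sample` are hypothetical names.

```python
import gzip
import time


def measure(compress, data: bytes):
    """Return (compression rate, elapsed seconds) for one compressor run.

    Assumption: rate = 1 - compressed_size / original_size.
    """
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    rate = 1.0 - len(out) / len(data)
    return rate, elapsed


# Hypothetical test input with highly repetitive tags.
sample = b"<items>" + b"<item><name>x</name></item>" * 1000 + b"</items>"

rate, seconds = measure(gzip.compress, sample)
print(f"compression rate: {rate:.2%}, time: {seconds * 1000:.2f} ms")
```

In practice a timing measurement like this would be repeated several times per file, since a single run is sensitive to system noise.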
The remainder of the paper is structured as
follows. Section 2 presents the Schema-aware
algorithm. Section 3 presents the Hybrid algorithm.
Section 4 compares the two proposed solutions with
the compressors WBXML, XMill and EXI. Finally,
Section 5 presents the final considerations on this
work.
2 SCHEMA-AWARE ALGORITHM
The Schema-aware algorithm eliminates redundant
tags that are at the same level of the document,
without compressing either the data or the tags
present in the document. To demonstrate how this
redundancy occurs, we shall use the following XML
document:
A. C. Teixeira M., S. Miani R., D. Breda G., B. Zarpelão B. and S. Mendes L..
NEW APPROACHES FOR XML DATA COMPRESSION.
DOI: 10.5220/0003896202330237
In Proceedings of the 8th International Conference on Web Information Systems and Technologies (WEBIST-2012), pages 233-237
ISBN: 978-989-8565-08-2
Copyright © 2012 SCITEPRESS (Science and Technology Publications, Lda.)