amount of internal memory allocated to a process can
never exceed twice the file size. Not only did EXI exceed
this bound, it exceeded three times the original file size.
Section 5 briefly discusses why we think this occurs and
what can be done to improve it. However, our online
compression algorithms, which do not incur a substantial
memory footprint, scale well to data of all sizes.
Relating each individual algorithm to the baseline
transfers (Offline & GZIP), we see that in every case all
online algorithms outperform GZIP. Moreover, the more
we relax the "online constraint" (i.e., the more data we
are allowed to buffer), the better the results we obtain.
Comparing Offline XSAQCT, which requires processing
the entire document before transfer, to Lightweight SAX,
which processes on the fly, Offline XSAQCT performs on
average 15% better. However, compared to the Path-centric
algorithm, Offline XSAQCT performs on average only 2%
better, which is significant because the Path-centric
algorithm does not require full document scope and would
therefore be well suited to the domain of large-scale XML
streaming. In comparison to Treechop, the original and
lightweight SAX variants also prove quite competitive, and
once again the Path-centric algorithms beat Treechop.
Finally, in comparison to EXI, the Path-centric algorithms
tend to perform similarly, except for documents larger
than 1 GB. For example, for 1gig.xml, a document
containing a lot of character data, EXI beats all
algorithms by a substantial margin.
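The trade-off behind the "online constraint" can be illustrated with a generic stream compressor. This is only a sketch using zlib, not any of the algorithms evaluated above; the chunking and function names are hypothetical. An online encoder must flush after each buffered chunk so the receiver can decode immediately, and each flush costs some compression ratio compared to an offline encoder that buffers the whole document:

```python
import zlib

def online_compress(chunks, level=9):
    """Compress a stream chunk-by-chunk, flushing after each chunk so the
    receiver can decode without waiting for the whole document."""
    comp = zlib.compressobj(level)
    out = []
    for chunk in chunks:
        out.append(comp.compress(chunk))
        # Z_SYNC_FLUSH emits all pending output at a byte boundary,
        # at the cost of a slightly worse compression ratio per flush.
        out.append(comp.flush(zlib.Z_SYNC_FLUSH))
    out.append(comp.flush(zlib.Z_FINISH))
    return b"".join(out)

def offline_compress(chunks, level=9):
    """Buffer the entire document first, then compress it in one pass."""
    return zlib.compress(b"".join(chunks), level)
```

Relaxing the constraint (buffering larger chunks before each flush) moves the online encoder's ratio toward the offline one, which mirrors the pattern observed above.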
5 CONCLUSIONS AND FUTURE WORK
This paper presented the design and implementation of
four new online XML compression algorithms, which exploit
local structural redundancies in pre-order traversals of
an XML tree and focus on reducing the overhead of sending
packets while balancing the encoding and decoding load
between the sender and the receiver. To test the various
algorithms, a suite of 11 XML files with varying
characteristics was designed and analyzed. To take network
issues, such as bottlenecks, into account, two measures of
the encoding process were used to compare ten encoding
techniques, based on GZIP, EXI, Treechop, XSAQCT and its
improved variant, and our new algorithms. Our experiments
indicated that our new algorithms perform similarly to or
better than existing online algorithms such as XSAQCT and
Treechop, and are outperformed only by EXI (a non-queryable
XML compressor), and only for files larger than 1 GB.
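As a rough illustration of what "exploiting local structural redundancies of pre-order traversals" can mean, the following sketch walks an XML tree in pre-order and groups character data by its root-to-node path. This is a hypothetical simplification, not the paper's actual algorithms; the event encoding and path convention are invented for illustration. Text grouped by path tends to be homogeneous, so a general-purpose compressor handles each group better than the interleaved document:

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def path_centric_split(xml_text):
    """Pre-order walk: emit the structure as a compact event stream and
    group character data by its root-to-node path."""
    root = ET.fromstring(xml_text)
    structure = []                  # pre-order open/close events
    containers = defaultdict(list)  # path -> list of text values

    def walk(node, path):
        path = path + "/" + node.tag
        structure.append("<" + node.tag)
        if node.text and node.text.strip():
            containers[path].append(node.text.strip())
        for child in node:
            walk(child, path)
        structure.append(">")       # generic close marker
    walk(root, "")
    return structure, dict(containers)
```

Because sibling subtrees with the same tag produce identical event subsequences, the structure stream itself is highly repetitive and compresses well.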
For future work, we will consider the overhead of
performing each encoding, i.e., the amount of resources
required for encoding. Recall the missing data in Table 4.
The extraordinary resource requirement can be attributed
to the fact that EXI also performs manipulations on the
text to increase compression ratios, which appears to be
resource-heavy. In addition, more techniques will be
introduced to alleviate the overhead of sending ASCII zero
characters to delimit text and to satisfy the full mixed
content property. Next, context-modeling techniques (such
as those used in the PPM and PAQ families of data
compressors) will be introduced as a pre-processor on
character data to achieve higher compression ratios.
Finally, a more concise and meaningful definition of
Structure Regularity, which takes into account the skewness
of occurrences by using the annotated tree representation,
will be defined and studied.
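To make the zero-character overhead concrete, the following sketch (a hypothetical simplification, not the actual wire format) transmits one NUL-delimited text slot per structural event, so every empty slot still costs one byte on the wire; this is the overhead the planned techniques would alleviate:

```python
# Sketch: to satisfy "full mixed content", a text slot is transmitted
# for every structural event, even when empty, delimited by NUL bytes.

def encode_text_stream(texts):
    """Join text slots with NUL delimiters; empty slots still cost a byte."""
    return b"\x00".join(t.encode() for t in texts) + b"\x00"

def decode_text_stream(data):
    """Recover the slots, including the empty ones."""
    return [t.decode() for t in data.split(b"\x00")[:-1]]
```

For a document with little mixed content, most slots are empty, so the NUL delimiters can dominate the text stream.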
REFERENCES
Arion, A., Bonifati, A., Manolescu, I., and Pugliese, A.
(2007). XQueC: a query-conscious compressed XML
database. ACM Transactions on Internet Technology,
7(2).
Baseball.xml (2012). baseball.xml, retrieved October 2012
from http://rassyndrome.webs.com/cc/baseball.xml.
enwiki dumps (2012). enwiki-latest.xml, retrieved October
2012 from http://dumps.wikimedia.org/enwiki/latest/.
EXI (2012). Efficient XML Interchange (EXI) Format
1.0, Retrieved October 2012 from http://www.w3.org/
TR/exi/.
GZIP (2012). The gzip home page, retrieved October 2012
from http://www.gzip.org.
Liefke, H. and Suciu, D. (2000). XMill: an efficient com-
pressor for XML data. ACM Special Interest Group on
Management of Data (SIGMOD) Record, 29(2):153–
164.
HTTP (2012). HTTP RFC 2616, retrieved Octo-
ber 2012 from http://www.w3.org/protocols/rfc2616/
rfc2616.html.
Leighton, G. and Barbosa, D. (2009). Optimizing XML
compression. XML Database Symposium (XSym)
’09, pages 91–105, Berlin, Heidelberg. Springer-
Verlag.
Leighton, G., Müldner, T., and Diamond, J. (2005). TREE-
CHOP: A Tree-based Query-able Compressor for
XML. The Ninth Canadian Workshop on Information
Theory, pages 115–118.
Lin, Y., Zhang, Y., Li, Q., and Yang, J. (2005). Supporting
efficient query processing on compressed XML files.
Proceedings of the Symposium on Applied Comput-
ing (SAC) ’05, pages 660–665, New York, NY, USA.
ACM.
WEBIST 2013 - 9th International Conference on Web Information Systems and Technologies