sparse if there is an advantage in exploiting its
zeros” (Duff et al., 1986). George et al. (1993)
also define a sparse matrix as one that is
populated mainly with zeros, while some references
are more specific and limit the definition to
matrices with a certain proportion of zeros, e.g.
50% of the entries being 0 as in (Mackay and Neal,
1995).
According to the above, the matrices we use to
represent the XML structures are sparse. This is
reflected by the analysis given in (Al-Badawi,
2010), which shows that the number of zeros in the
childOf and nextOf matrices reaches $n^2 - n$, and
the number of zeros in the descOf matrix may exceed
$n^2 - h \times n$, where n is the number of nodes
and h is the number of levels in the underlying XML
tree. When n is large, the number of 0 entries
easily exceeds 90% of the total entries.
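The zero counts above can be checked on a small example. The following is a minimal sketch, not part of PACD itself: the function name `relationship_sparsity`, the parent-array encoding of the tree, and the 111-node sample tree are all assumptions made for illustration. It builds the childOf and descOf relationships as sets of node pairs and counts how many of the $n^2$ cells stay zero.

```python
def relationship_sparsity(parent):
    """parent[i] is the index of node i's parent, or None for the root.

    Hypothetical helper: counts the zero cells of the n x n childOf and
    descOf matrices without materialising the matrices themselves.
    """
    n = len(parent)
    # childOf holds one (child, parent) pair per non-root node.
    child_of = {(i, p) for i, p in enumerate(parent) if p is not None}
    # descOf holds one (descendant, ancestor) pair per ancestor of each node.
    desc_of = set()
    for i in range(n):
        a = parent[i]
        while a is not None:
            desc_of.add((i, a))
            a = parent[a]
    total = n * n
    return {
        "n": n,
        "childOf_zeros": total - len(child_of),
        "descOf_zeros": total - len(desc_of),
        "childOf_zero_fraction": (total - len(child_of)) / total,
        "descOf_zero_fraction": (total - len(desc_of)) / total,
    }


# Assumed sample tree with h = 3 levels: a root, 10 children, 100 grandchildren.
parent = [None] + [0] * 10 + [1 + (k // 10) for k in range(100)]
stats = relationship_sparsity(parent)
print(stats)
```

For this sample tree (n = 111, h = 3) both matrices are more than 98% zeros, and the descOf zero count stays above $n^2 - h \times n$.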
From a technical point of view, storing matrices
of this size in a computer system is a trade-off
between storage size and storage performance
(Tarjan and Yao, 1979). For example, storing a
matrix with $n = 10^6$ in a two-dimensional array of
type character (one byte per character) requires
$10^{12}$ bytes of memory, which may exceed the
hardware/software capabilities. One way to address
this issue is to use sparse matrix compression (SMC)
techniques to compact the matrix’s storage.
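The storage arithmetic can be made concrete with a rough estimate. The sketch below assumes, for illustration only, that the compact form stores each non-zero entry as a (row, column) pair of 4-byte integer indices, and that a childOf-like matrix has n - 1 non-zero entries (one per non-root node). The helper names `dense_bytes` and `pair_list_bytes` are hypothetical.

```python
def dense_bytes(n, bytes_per_cell=1):
    """Bytes needed for a full n x n array with one cell per node pair."""
    return n * n * bytes_per_cell


def pair_list_bytes(nonzeros, bytes_per_index=4):
    """Bytes needed when only non-zeros are kept as (row, column) pairs.

    The 4-byte index width is an assumption for this estimate.
    """
    return nonzeros * 2 * bytes_per_index


n = 10**6
dense = dense_bytes(n)          # full character array
pairs = pair_list_bytes(n - 1)  # childOf-like matrix: n - 1 non-zeros
print(dense, pairs, dense // pairs)
```

Under these assumptions the pair list needs about 8 MB where the dense array needs $10^{12}$ bytes. The exact ratio depends on the index width and on the relationship being stored.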
The architecture of any SMC technique depends
on the computation to be performed, the pattern of
the non-zero entries, and even the architecture of the
computer system itself (Duff et al., 1986; Willcock
and Lumsdaine, 2006). Among these three factors,
we are only concerned with the computation
constraints at this stage of our research, given that
achieving optimum storage with good performance
is the main goal of the compression process. The
investigation of the other factors is a subject for
further research.
To align the choice of SMC technique with the
cost reduction of the XML querying and updating
operations, PACD categorizes the existing
sparse-matrix techniques into two groups. The first
includes the techniques that do not require any
decompression/recompression during XML querying
and updating, so the overhead of these processes is
avoided. The second contains the techniques that
penalize the XML operations with the cost of
decompressing/recompressing the underlying
storage. A detailed discussion of this aspect,
together with the empirical evidence, lies outside
the scope of this paper due to space limitations.
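The difference between the two groups can be illustrated with two toy representations. This is only a sketch: the paper does not say which concrete techniques fall into which group, so the pair-set store and the naive run-length encoding below (and every name in the code) are assumptions chosen to show the contrast. The pair set is queried and updated in place, while the naive run-length form decodes and re-encodes a row on every update.

```python
class PairSetMatrix:
    """Stores only the non-zero cells; no decompression step is needed."""

    def __init__(self):
        self.pairs = set()

    def get(self, i, j):
        return (i, j) in self.pairs

    def put(self, i, j, value):
        if value:
            self.pairs.add((i, j))
        else:
            self.pairs.discard((i, j))


def rle_encode(bits):
    """Encode a row of 0/1 values as [value, run_length] entries."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs


def rle_decode(runs):
    return [b for b, count in runs for _ in range(count)]


def rle_update(runs, index, value):
    """Naive update: decompress the row, change one cell, recompress."""
    bits = rle_decode(runs)
    bits[index] = value
    return rle_encode(bits)


m = PairSetMatrix()
m.put(2, 0, 1)                      # direct update, no decode step
row = rle_encode([0, 0, 0, 1, 0, 0])
row = rle_update(row, 1, 1)         # update pays decode + re-encode
```

More elaborate run-length schemes can update runs in place; the naive version is used here only because it makes the decompress/recompress cost visible.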
4 CONCLUSIONS
To conclude, this paper described PACD’s DCM,
which uses three data compression processes to
compact the XML structures. As introduced in
(Al-Badawi et al., 2009), the XML structures are
theoretically encoded into ten n×n matrices, each of
which represents a structural relationship that
corresponds to an XPath axis or an extension. Each
structural relationship is encoded as the set of node
pairs between which the relationship holds. Each
matrix therefore represents the corresponding
structural relationship between all nodes in the XML
tree.
PACD’s matrices occur in invertible pairs and
inclusive pairs, and are sparse. The first
compression phase uses the first characteristic:
each invertible pair is represented by only one
matrix. This can reduce the number of matrices
from ten to five. The next compression phase uses
the inclusiveness characteristic: two or more
matrices are clustered into a single matrix such that
the full architecture of all constituent matrices is
preserved in the clustered matrix. The last
compression phase applies one or more
sparse-matrix compression techniques to compact
the layout of the matrix resulting from the first two
phases.
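The first two phases can be sketched on a three-node tree. The code below rests on assumptions that are not spelled out in this section: that an invertible pair consists of two relationships that are transposes of each other (so the inverse can be answered by swapping indices), that childOf is included in descOf, and that a clustered cell can carry a small code (2 for child, 1 for descendant only). All function names and the cell codes are hypothetical.

```python
def inverse_lookup(pairs, i, j):
    """Answer the inverse relationship from the stored one by swapping indices."""
    return (j, i) in pairs


def cluster(child_of, desc_of):
    """Merge two inclusive relationships into one sparse map of cell codes.

    Assumed codes: 2 = child (and therefore descendant), 1 = descendant only.
    """
    clustered = {}
    for pair in desc_of:
        clustered[pair] = 2 if pair in child_of else 1
    return clustered


def is_child(clustered, i, j):
    return clustered.get((i, j), 0) == 2


def is_desc(clustered, i, j):
    return clustered.get((i, j), 0) >= 1


# Assumed sample chain: node 0 is the root, 1 is its child, 2 is the child of 1.
child_of = {(1, 0), (2, 1)}
desc_of = {(1, 0), (2, 1), (2, 0)}
merged = cluster(child_of, desc_of)

print(inverse_lookup(child_of, 0, 1))  # is node 0 the parent of node 1?
print(is_child(merged, 2, 0), is_desc(merged, 2, 0))
```

In this sketch both original relationships can still be read back from the clustered map, which is the property the second phase requires.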
The strength and efficiency of PACD’s overall
storage are determined by the specification of the
clustering and SMC methods used. A complete
discussion of this topic, including the
experimental evidence, is the subject of further
publications.
REFERENCES
Al-Badawi, M. (2010) ‘A Performance Evaluation of a
New Bitmap-based XML Processing Approach’, PhD
Thesis, University of Sheffield, UK.
Al-Badawi, M., Eaglestone, B., and North, S. (2009)
‘PACD: A Bitmap-based Framework for Processing
XML Data’, In the proceedings of the WebIST’09,
Lisbon, Portugal, pages 66-71.
Berglund, A., Boag, S., Chamberlin, D., Fernández, M.,
Kay, M., Robie, J., and Siméon, J. (2010) XML Path
Language (XPath) 2.0 (2nd Ed.), [Online] Avail:
http://www.w3.org/TR/xpath20/ [15/11/2011].
Boag, S., Chamberlin, D., Fernández, M., Florescu, D.,
Robie, J., and Siméon, J. (2010) XQuery 1.0: An XML
Query Language (2nd Ed.), [Online] Avail: http://www
WEBIST 2012 - 8th International Conference on Web Information Systems and Technologies