exchange, and application objects in memory in
order to access information, a situation that occurs in
all existing applications, causes inefficiency in
processing XML documents.
The proposed Contiguous Memory Tree (CMT)
and its XML API (Guseynov, 2006) completely
resolve the problem of parsing and processing
efficiency by creating an efficient interchange
format for XML. It
is based on the representation of XML documents as a
tree structure that contiguously resides in memory
and is simultaneously a stream that can be directly
copied as a message and an application object that
can be directly accessed through the CMT XML
API. CMT is a universal way to exchange XML
documents and any other hierarchical information,
regardless of operating system or language: it works
in languages such as C++ that have direct access to
memory, and in Java, Visual Basic, Perl, and others
that can allocate arrays contiguously in memory.
CMT and its XML API have all the features of
existing formats: compatibility
with the XML document at the XML Information
Set level, serialization, parsing as DOM and SAX,
XML schema independence and self-description,
support for sequential or fragment processing
(Streamability), indexing of repeated strings,
preservation of the state for documents with the
same schema and vocabulary, platform and language
neutrality, reduced document size, and fast
processing speed. In addition, CMT and its XML
API have a significant advantage: the CMT XML API
does not need to read and evaluate markup or decode
information items, operations that consume much
CPU time during processing. It is therefore
significantly more efficient than any known parser
in terms of the elapsed time that a parser needs to
parse an input stream before actual processing.
2 CONTIGUOUS MEMORY TREE
We may define CMT in two ways: with pointers, for
languages such as C++ that offer direct access to
and explicit allocation of memory, or with arrays,
for languages such as Java that can allocate arrays
contiguously in memory.
The approach based on arrays is universal because
almost all programming languages have the ability
to contiguously allocate memory arrays of the basic
types, integer and character.
To build CMT for an XML document or any
hierarchical information we need three arrays: the
array of integers, Hierarchy[], to hold the
hierarchical (tree) structure of the document; the
array of characters, SchemaComponents[], to hold
tag names, attribute names, and other components
for all documents with the same XML schema; the
array of characters, DocumentValues[], one per
document, to hold the element and attribute values
of the whole document.
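The three arrays can be sketched in C++ as follows. This is a minimal illustration, not the CMT XML API itself: the names Cmt, appendString, and stringAt are hypothetical, std::vector stands in for a contiguously allocated array, and terminating each string with '\0' is an assumption, since the text does not state how strings are delimited.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container for the three arrays named in the text.
struct Cmt {
    std::vector<int>  hierarchy;         // Hierarchy[]: tree structure of the document
    std::vector<char> schemaComponents;  // SchemaComponents[]: tag and attribute names
    std::vector<char> documentValues;    // DocumentValues[]: element and attribute values
};

// Appends s followed by '\0' (assumed delimiter) to a character array and
// returns the start position of s, which is what Hierarchy[] would store.
int appendString(std::vector<char>& array, const std::string& s) {
    int start = static_cast<int>(array.size());
    array.insert(array.end(), s.begin(), s.end());
    array.push_back('\0');
    return start;
}

// Reads back the '\0'-terminated string that starts at the given position.
std::string stringAt(const std::vector<char>& array, int start) {
    return std::string(array.data() + start);
}
```

Under these assumptions, a string is identified only by its integer start position, so the character arrays can be moved or copied without invalidating any reference to them.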
The Hierarchy[] array is built with blocks of six
integers:
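struct SchemaNode
{ int parent;          // position of the parent element in Hierarchy[]
  int firstChild;      // position of the first child element in Hierarchy[]
  int nextSibling;     // position of the next sibling element in Hierarchy[]
  int tagName;         // start of the element name in SchemaComponents[]
  int offset;          // start of the element content in DocumentValues[]
  int numOfAttributes; // number of attribute pairs that follow this block
};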
The first four integers allow CMT to be built
from any hierarchical information. For each element
E1 from the hierarchy of an XML document, the
parent, firstChild, and nextSibling members of
SchemaNode are positions in the Hierarchy[] array
that are start positions, respectively for parent, first
child, and next sibling elements for E1. The member
tagName is the start position in the
SchemaComponents[] array for the element name.
The last two integers in the struct SchemaNode
pertain to XML documents. The member offset is
the start position in DocumentValues[] for the
content or value that is the text between two tags in
an XML element. numOfAttributes represents the
number of attributes in an XML element.
Each CMT element in memory consists of a
SchemaNode followed by numOfAttributes pairs of
integers: the first element of a pair is the start
position of the attribute name in the
SchemaComponents[] array and the second is the
start position of the attribute value in the
DocumentValues[] array. The next three tables present
an example of CMT for the following XML document:
<Product bottles="12" size="9oz">
<ItemName>Chartreuse verte</ItemName>
<ItemPrice>$18.00</ItemPrice>
</Product>
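The layout described above can be sketched for this document as follows. The code is an illustration under stated assumptions rather than the author's implementation: the names Cmt, addElement, and buildProductExample are hypothetical, -1 is an assumed marker for "no parent, child, sibling, or text content", strings are assumed to be '\0'-terminated, names are appended to SchemaComponents[] without reuse, and whitespace between child elements is ignored.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

const int NONE = -1;  // assumed marker: no such node, or no text content

// Positions of the six integers inside one SchemaNode block.
enum Field { PARENT, FIRST_CHILD, NEXT_SIBLING, TAG_NAME, OFFSET, NUM_ATTRS, NODE_SIZE };

struct Cmt {
    std::vector<int>  hierarchy;         // Hierarchy[]
    std::vector<char> schemaComponents;  // SchemaComponents[]
    std::vector<char> documentValues;    // DocumentValues[]
};

// Appends s plus '\0' and returns the start position of s.
int appendString(std::vector<char>& array, const std::string& s) {
    int start = static_cast<int>(array.size());
    array.insert(array.end(), s.begin(), s.end());
    array.push_back('\0');
    return start;
}

// Appends one element (SchemaNode block plus attribute pairs) to Hierarchy[],
// links it to its parent and previous sibling, and returns its start position.
int addElement(Cmt& cmt, int parent, const std::string& tag, const std::string& text,
               const std::vector<std::pair<std::string, std::string>>& attrs = {}) {
    int pos = static_cast<int>(cmt.hierarchy.size());
    int tagName = appendString(cmt.schemaComponents, tag);
    int offset = text.empty() ? NONE : appendString(cmt.documentValues, text);
    cmt.hierarchy.insert(cmt.hierarchy.end(),
                         {parent, NONE, NONE, tagName, offset,
                          static_cast<int>(attrs.size())});
    for (const auto& a : attrs) {
        // Each pair: attribute name position, then attribute value position.
        cmt.hierarchy.push_back(appendString(cmt.schemaComponents, a.first));
        cmt.hierarchy.push_back(appendString(cmt.documentValues, a.second));
    }
    if (parent != NONE) {
        int child = cmt.hierarchy[parent + FIRST_CHILD];
        if (child == NONE) {
            cmt.hierarchy[parent + FIRST_CHILD] = pos;
        } else {
            // Walk to the last existing child and chain the new element after it.
            while (cmt.hierarchy[child + NEXT_SIBLING] != NONE)
                child = cmt.hierarchy[child + NEXT_SIBLING];
            cmt.hierarchy[child + NEXT_SIBLING] = pos;
        }
    }
    return pos;
}

// Builds the three arrays for the Product document shown above.
Cmt buildProductExample() {
    Cmt cmt;
    int product = addElement(cmt, NONE, "Product", "",
                             {{"bottles", "12"}, {"size", "9oz"}});
    addElement(cmt, product, "ItemName", "Chartreuse verte");
    addElement(cmt, product, "ItemPrice", "$18.00");
    return cmt;
}
```

In this sketch Product starts at position 0 and occupies ten integers (six for the SchemaNode and two pairs for its attributes), so ItemName starts at position 10 and ItemPrice at position 16.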
To define CMT we build the elements of the
XML document one by one into the Hierarchy[],
SchemaComponents[], and DocumentValues[]
arrays. Because the integers in these arrays are
positions rather than memory addresses, the arrays
remain valid when copied to any location. We may
also store them contiguously on disk or any other
medium as a stream of bytes in order to exchange the
XML document with other applications. After copying
the three arrays back into memory, an application
can access all the information in the XML document
without parsing, starting from any position in the
Hierarchy[] array.
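The copy-and-access property can be sketched as a round trip through a byte stream. Everything specific here is an assumption of the sketch, not a format defined by CMT: the header of three 32-bit lengths, the names toBytes, fromBytes, and childNames, the -1 end marker, and '\0'-terminated strings. The sketch also assumes that sender and receiver use the same integer size and byte order, and it performs no validation of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

const int FIRST_CHILD = 1, NEXT_SIBLING = 2, TAG_NAME = 3;  // SchemaNode field positions
const int NONE = -1;                                        // assumed end marker

struct Cmt {
    std::vector<int>  hierarchy;         // Hierarchy[]
    std::vector<char> schemaComponents;  // SchemaComponents[]
    std::vector<char> documentValues;    // DocumentValues[]
};

// Appends count raw bytes starting at src to the end of out.
static void put(std::vector<char>& out, const void* src, std::size_t count) {
    const char* p = static_cast<const char*>(src);
    out.insert(out.end(), p, p + count);
}

// Flattens the three arrays into one stream: [3 lengths][ints][chars][chars].
std::vector<char> toBytes(const Cmt& cmt) {
    std::uint32_t sizes[3] = {
        static_cast<std::uint32_t>(cmt.hierarchy.size()),
        static_cast<std::uint32_t>(cmt.schemaComponents.size()),
        static_cast<std::uint32_t>(cmt.documentValues.size())};
    std::vector<char> out;
    put(out, sizes, sizeof sizes);
    put(out, cmt.hierarchy.data(), sizes[0] * sizeof(int));
    put(out, cmt.schemaComponents.data(), sizes[1]);
    put(out, cmt.documentValues.data(), sizes[2]);
    return out;
}

// Copies the arrays back into memory; no markup is read or evaluated.
Cmt fromBytes(const std::vector<char>& in) {
    std::uint32_t sizes[3];
    const char* p = in.data();
    std::memcpy(sizes, p, sizeof sizes);
    p += sizeof sizes;
    Cmt cmt;
    cmt.hierarchy.resize(sizes[0]);
    std::memcpy(cmt.hierarchy.data(), p, sizes[0] * sizeof(int));
    p += sizes[0] * sizeof(int);
    cmt.schemaComponents.assign(p, p + sizes[1]);
    p += sizes[1];
    cmt.documentValues.assign(p, p + sizes[2]);
    return cmt;
}

// Returns the tag names of the children of the element that starts at pos.
std::vector<std::string> childNames(const Cmt& cmt, int pos) {
    std::vector<std::string> names;
    for (int c = cmt.hierarchy[pos + FIRST_CHILD]; c != NONE;
         c = cmt.hierarchy[c + NEXT_SIBLING]) {
        names.emplace_back(cmt.schemaComponents.data() + cmt.hierarchy[c + TAG_NAME]);
    }
    return names;
}
```

After fromBytes returns, childNames can be called with the start position of any element in Hierarchy[], because the stored integers are positions within the arrays and do not depend on where the arrays were placed in memory.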
WEBIST 2009 - 5th International Conference on Web Information Systems and Technologies