Recently, StAX, a promising pull API for XML
processing, has been implemented by several XML
parsers, for example the BEA StAX RI (BEA, 2003),
the Sun Java Streaming XML Parser (Sun, 2005),
the Oracle StAX Pull Parser (Oracle, 2003) and
Woodstox (Codehaus, 2006). However, all of them
have serious defects. The BEA StAX RI cannot
recognize invalid characters in the character data of
a document, cannot parse entity references in default
attribute values, and does not report the correct
character text in its characters events. The Sun Java
Streaming XML Parser cannot correctly handle
external parameter entity references in the
Document Type Definition (DTD), cannot recognize
legal characters in the range #x10000 to #x10FFFF
or their surrogate pairs, and does not read
attribute-list declarations in the external subset. The
Oracle StAX Pull Parser is built on the SAX parser
of the Oracle XDK, which makes it inefficient.
Woodstox does not fully support UTF-8, especially
surrogate pairs, and fails to recognize some invalid
names. Moreover, although Woodstox is the only
validating parser among these StAX parsers, it does
not report errors when it meets an invalid XML
document; instead it throws an exception and stops
parsing.
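The pull model that distinguishes StAX from SAX can be illustrated with a minimal fragment against the standard javax.xml.stream API; the document string and class name below are invented for illustration:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StaxPullDemo {
    // Unlike SAX, the application drives the parser: it pulls the
    // next event only when it is ready to handle it.
    static List<String> collectEvents(String doc) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(doc));
        List<String> events = new ArrayList<>();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    events.add("start:" + r.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (!r.isWhiteSpace()) events.add("text:" + r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    events.add("end:" + r.getLocalName());
                    break;
            }
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectEvents("<root><item>42</item></root>"));
    }
}
```

It is in exactly such event reporting that the defects above show up, for example the wrong character text in a CHARACTERS event.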
Besides these general-purpose XML parsers, some
research also focuses on automatically generating
XML parsers from an XML Schema. XML
Screamer (Kostoulas et al., 2006) is such an
experimental system: its parsing is integrated with
Schema-based validation and deserialization to
achieve high performance. However, these
automatically generated parsers are not
general-purpose XML parsers; each generated
parser handles only documents that are valid under
one particular XML Schema.
OnceXMLParser is a general-purpose XML parser
that fully implements XML 1.0/1.1 and Namespaces
in XML (W3C, 1999). It passes all the API tests for
StAX (Tatu, 2004), DOM (W3C, 2004) and SAX
(David, 2001) as well as all the XML conformance
tests (W3C, 2003), and it achieves outstanding
parsing performance through several efficient
performance-tuning algorithms.
3 SYSTEM ARCHITECTURE
OnceXMLParser adopts a lightweight architecture
and consists of the components shown in Figure 1.
Like the lexical analysis used for traditional
programming languages, Scanner is implemented as a
Deterministic Finite Automaton (DFA) that
recognizes terminals in the input stream. Some well-
formedness constraints (WFCs), such as checking
for valid name characters and valid attribute values,
are enforced while these tokens are recognized.
Before the regular lexical analysis, however,
Scanner must decode the characters in the scanning
buffer according to the encoding signature and put
the decoded characters into a pre-allocated character
buffer. A smaller character buffer requires more
buffer-filling operations and more saving of
unparsed tokens at the end of the buffer, whereas a
larger buffer increases the cost of normalization
operations, so we tune the buffer size to balance
these costs and improve performance.
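The idea of folding WFC checks into DFA-style scanning over a decoded character buffer can be sketched as follows; the character classes are deliberately simplified (ASCII letters, digits, ':', '_', '-', '.') and all names are invented for illustration, not OnceXMLParser's actual code:

```java
// A minimal sketch of DFA-style token scanning over a decoded
// character buffer: recognizing an XML Name while simultaneously
// enforcing the valid-name-character WFC.
public class NameScanner {
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    // Returns the index one past the end of the Name starting at 'pos',
    // or -1 if no well-formed Name starts there (the WFC check is
    // folded into lexical analysis instead of being a separate pass).
    static int scanName(char[] buf, int pos, int end) {
        if (pos >= end || !isNameStart(buf[pos])) return -1; // state 0: reject
        int i = pos + 1;                                     // state 0 -> state 1
        while (i < end && isNameChar(buf[i])) i++;           // state 1 self-loop
        return i;                                            // accept in state 1
    }
}
```

A real scanner would use the full Unicode name-character ranges of the XML specification and handle tokens that straddle the end of the buffer, which is exactly the saving cost mentioned above.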
CoreParser is the core syntax processor. It consumes
the tokens produced by Scanner and returns the
information of interest to applications through the
standard APIs. In addition, CoreParser checks most
WFCs. For example, to conform to WFC: Element
Type Match (W3C, 1998), CoreParser must check
that the element name in each end tag equals the
name in the corresponding start tag. CoreParser
resolves markup and character data in order of their
statistical frequency of occurrence, testing the most
frequent case first.
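The Element Type Match check described above is naturally implemented with a stack of open element names; the sketch below shows this standard technique, with invented names rather than OnceXMLParser's actual classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of WFC: Element Type Match checking: each start-tag name is
// pushed on a stack, and each end tag must match the name on top.
public class ElementMatcher {
    private final Deque<String> open = new ArrayDeque<>();

    void startElement(String name) {
        open.push(name);
    }

    // Returns false if the end tag violates WFC: Element Type Match,
    // i.e. it does not match the most recently opened element.
    boolean endElement(String name) {
        return !open.isEmpty() && open.pop().equals(name);
    }

    // At end of document the stack must be empty: every element closed.
    boolean documentBalanced() {
        return open.isEmpty();
    }
}
```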
DTDParser is a syntax analyzer for the declarations
in the DTD. It is also responsible for building the
validity constraint (VC) rules. To make this building
process efficient, especially for the VC rules that
describe relations among elements, DTDParser
needs the efficient algorithms discussed in the next
section. In addition, to speed up both the building
and the checking of VC rules, efficient data
structures, such as a light-weight hash map, are
implemented.
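The text does not specify the design of the light-weight hash map, but one plausible shape for a parser-internal map is open addressing with linear probing, String keys, no removal, and no per-entry node objects; the sketch below illustrates that idea with invented names:

```java
// Illustrative light-weight hash map: open addressing with linear
// probing, no entry objects, no removal. This is an assumption about
// what such a structure could look like, not OnceXMLParser's code.
public class LightMap<V> {
    private String[] keys = new String[16];
    private Object[] vals = new Object[16];
    private int size;

    public void put(String key, V val) {
        if (size * 2 >= keys.length) grow();   // keep load factor < 0.5
        int i = slot(key, keys);
        if (keys[i] == null) { keys[i] = key; size++; }
        vals[i] = val;
    }

    @SuppressWarnings("unchecked")
    public V get(String key) {
        int i = slot(key, keys);
        return keys[i] == null ? null : (V) vals[i];
    }

    // Find the slot holding 'key', or the first empty slot on its
    // probe sequence.
    private int slot(String key, String[] table) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null && !table[i].equals(key))
            i = (i + 1) % table.length;        // linear probing
        return i;
    }

    private void grow() {
        String[] nk = new String[keys.length * 2];
        Object[] nv = new Object[vals.length * 2];
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] != null) {             // rehash live entries
                int j = slot(keys[i], nk);
                nk[j] = keys[i];
                nv[j] = vals[i];
            }
        }
        keys = nk;
        vals = nv;
    }
}
```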
The Validation module checks the VC rules
collected by DTDParser. The design of this module
is crucial for performance, because these checks are
burdensome tasks: VC rules for elements usually
concern the relationships among elements and the
format of a particular element's content, while rules
for attributes concern the type and the default value
of each attribute. Moreover, the efficient checking of
VC: Element Valid (W3C, 1998) is the largest
challenge, as we will discuss soon, so we present
some key algorithms to solve it in the next section.
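Checking VC: Element Valid amounts to verifying that an element's sequence of children is accepted by an automaton derived from its declared content model. The sketch below hand-compiles the content model (head, item*) into a DFA to show the principle; in a real parser, DTDParser would build such automata from the declarations, and all names here are illustrative:

```java
import java.util.List;
import java.util.Map;

// Sketch: check VC: Element Valid by running the child-name sequence
// of an element through a DFA compiled from its content model.
public class ContentModelChecker {
    // transitions.get(state) maps a child element name to the next state
    private final List<Map<String, Integer>> transitions;
    private final boolean[] accepting;

    ContentModelChecker(List<Map<String, Integer>> t, boolean[] acc) {
        this.transitions = t;
        this.accepting = acc;
    }

    boolean accepts(List<String> children) {
        int state = 0;
        for (String child : children) {
            Integer next = transitions.get(state).get(child);
            if (next == null) return false;   // this child is not allowed here
            state = next;
        }
        return accepting[state];              // sequence must end in an accept state
    }

    // Hand-built DFA for the content model (head, item*):
    // state 0 --head--> state 1; state 1 --item--> state 1; accept in 1.
    static ContentModelChecker headThenItems() {
        return new ContentModelChecker(
                List.of(Map.of("head", 1), Map.of("item", 1)),
                new boolean[]{false, true});
    }
}
```

The cost of this check is linear in the number of children once the automaton exists, so the real challenge, addressed by the algorithms in the next section, is building the automata from the content-model declarations efficiently.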
Figure 1: Architecture of OnceXMLParser.
ENSURING HIGH PERFORMANCE IN VALIDATING XML PARSER