control over character-level parsing. The following
text works with an abstract representation of the doc-
ument type definition, independent of its origin. The
first lines in the fragments from Figure 6 can give an
impression of the d2d input and the resulting XML
document.
1.2 Incremental Specifications, Incomplete Documents and Robustness
In practice, two features turned out to be of high value
and have been considered in the latest revision of d2d:
Firstly, the syntax of d2d has been carefully re-
vised for a more convenient support of incremental
refinement of document type definitions. Introduc-
ing new child elements, changing an element’s con-
tent from empty to optional children, from optional to
required or vice versa; all these operations require a
certain robustness and redundancy, esp. in tokeniza-
tion.
Secondly, the treatment of incomplete documents
turned out to be highly desirable: The new seman-
tics and the new algorithm realize a total translation
function, which recovers from illegal input as early as
possible. This allows convenient diagnostics in case
of erroneously incomplete documents, as well as a
well-defined way of creating incomplete documents
intentionally as preliminary versions.
To this end, the user may simply leave out required
child elements of a content model, or mark elements
as semantically incomplete by a special “brute-force”
closing tag, even if they are syntactically complete.
Both cases result in special meta-tags which are in-
serted in the generated XML model and should affect
further processing. In case of error, input may not
only be missing; the opposite case also occurs, namely
superfluous tags or character data which cannot be parsed
according to the current content model. This kind of
input is transferred verbatim to the generated output,
wrapped in a meta-element.
In all these cases, the parsing algorithm tries to
resume work as soon as possible, maximizing the di-
agnostic output in one single run. The input syntax
and the complete operational semantics of this new,
robust parsing algorithm are the subject of this paper.
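The recovery behaviour described above can be illustrated by a minimal sketch. It is not the algorithm of this paper: it assumes a flat content model consisting only of a sequence of required child elements, and the marker names "missing" and "skipped" are hypothetical stand-ins for the meta-tags, whose actual names are not given in this section. The point is only that the function is total: every input yields an output, with omissions and superfluous input both recorded.

```python
def recover(required, tags):
    """Align input tags with a sequence of required child elements.

    Simplifying assumption: the content model is a plain sequence of
    required elements. The markers "missing" and "skipped" are
    hypothetical names standing in for the meta-tags.
    """
    out = []
    pos = 0  # index of the next required element still expected
    for tag in tags:
        if tag in required[pos:]:
            hit = required.index(tag, pos)
            # Required elements that were left out before this tag.
            out.extend(("missing", r) for r in required[pos:hit])
            out.append(("element", tag))
            pos = hit + 1
        else:
            # Superfluous input is kept verbatim, wrapped in a marker.
            out.append(("skipped", tag))
    # Required elements never supplied by the input.
    out.extend(("missing", r) for r in required[pos:])
    return out


result = recover(["title", "author", "body"], ["title", "x", "body"])
```

Here `result` records the element `title`, the superfluous tag `x`, the omitted `author`, and the element `body`, all obtained in one single pass over the input.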
2 THE PARSING ALGORITHM
There are three distinct layers of parsing in the d2d
architecture: Tokenization, tag-based parsing, and di-
rect character-based parsing for user-defined embed-
ded syntax. This text focusses on the second layer.
Explicit tagging and the LL(1) criterion for gram-
mar determinism can easily be taught to domain
experts without training in formal language theory.
Since the design of the formal and semi-formal doc-
ument structures for a new project or a certain pro-
duction context will happen in mixed teams of com-
puter scientists and domain experts, this is the ade-
quate level for communication.
2.1 Tokenization
As preprocessing to tag-based parsing, tokenization
effectively is the identification of the different kinds
of tags and of character data. Tokenization is de-
scribed informally in Figure 2. Its result is an instance
of the data type D from Figure 1.
Basically, all tags start with the command lead-
in character. This can be re-defined by the user, and
defaults to “#”. It is followed by an identifier. Like
in XML, a leading slash “/” indicates a closing tag,
a trailing slash an empty element. For situations in
which a closing tag cannot be inferred, d2d supports
a short-hand notation, similar to the “\verb κ...κ”
construct known from LaTeX, abbreviating the closing
tag to the single character κ. Additionally there
are forms with triple slashes, which are “brute-force”
closing and empty tags. Using them indicates that the
contents of the element are intentionally left incom-
plete.
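The tag forms just described can be sketched as a small tokenizer. This is an illustration under stated assumptions, not the d2d implementation: the identifier syntax, the exact placement of the triple slashes (leading for closing tags, trailing for empty tags), and the names `Token` and `tokenize` are all assumed here. The short-hand notation with a single closing character and any whitespace handling are left out.

```python
import re
from dataclasses import dataclass

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"  # assumed identifier syntax


@dataclass(frozen=True)
class Token:
    kind: str   # open, close, empty, close_force, empty_force, chars
    value: str  # tag identifier, or the character data itself


def tokenize(text, lead_in="#"):
    """Split text into tags and character data (hypothetical sketch)."""
    lead = re.escape(lead_in)  # the lead-in character is user-definable
    tag = re.compile(
        rf"{lead}(?:(?P<cslash>///|/)(?P<cname>{IDENT})"
        rf"|(?P<name>{IDENT})(?P<eslash>///|/)?)"
    )
    tokens, pos = [], 0
    for m in tag.finditer(text):
        if m.start() > pos:
            # Everything between two tags is character data.
            tokens.append(Token("chars", text[pos:m.start()]))
        if m.group("cname"):
            # Leading slash(es): closing tag, "brute-force" if tripled.
            kind = "close_force" if m.group("cslash") == "///" else "close"
            tokens.append(Token(kind, m.group("cname")))
        else:
            # Trailing slash(es): empty element, "brute-force" if tripled.
            kinds = {None: "open", "/": "empty", "///": "empty_force"}
            tokens.append(Token(kinds[m.group("eslash")], m.group("name")))
        pos = m.end()
    if pos < len(text):
        tokens.append(Token("chars", text[pos:]))
    return tokens


tokens = tokenize("#title Hello#/title#br/#sec///")
```

Passing a different `lead_in` corresponds to the user re-defining the command lead-in character.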
2.2 Content Model Declarations
For the purpose of this article we simply put all tag
identifiers into one global name space. Then every
element is typed by a simple identifier as its tag. Each
such identifier used as a tag is mapped to one content
model. A content model is an extended regular expression T
from Figure 1.
The meaning is fairly standard: Any ident stands
for an element with that identifier as its tag. #empty
stands for empty content. #chars stands for character
data. In contrast to W3C DTD and other formats,
we have full compositionality of all operators. The
three unary operators “?”, “+” and “*” stand for op-
tional, repeated and optional-repeated occurrence, as
usual. The three binary operators “,”, “|”, “&” mean
sequence, alternative and permutation. Note that, in
contrast to Relax-NG (Clark and Murata, 2008), the
operator “&” stands for permutation of its contiguous
sub-terms, not for interleaving.
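The difference between permutation and interleaving can be made concrete with a small sketch. The class names below are hypothetical, the encoding of T is only modelled on the operators listed above, and the matcher is a naive backtracking check over a flat list of child tag names, not the LL(1)-based algorithm of this paper. Character data is assumed to appear as the pseudo-token "#chars".

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Elem:
    name: str

@dataclass(frozen=True)
class Empty:
    pass

@dataclass(frozen=True)
class Chars:
    pass

@dataclass(frozen=True)
class Opt:      # "?"
    body: object

@dataclass(frozen=True)
class Plus:     # "+"
    body: object

@dataclass(frozen=True)
class Star:     # "*"
    body: object

@dataclass(frozen=True)
class Seq:      # ","
    left: object
    right: object

@dataclass(frozen=True)
class Alt:      # "|"
    left: object
    right: object

@dataclass(frozen=True)
class Perm:     # "&": permutation of contiguous sub-terms
    parts: tuple


def ends(m, toks, i):
    """Return all positions at which m can stop when started at i."""
    if isinstance(m, Empty):
        return {i}
    if isinstance(m, (Elem, Chars)):
        want = m.name if isinstance(m, Elem) else "#chars"
        return {i + 1} if i < len(toks) and toks[i] == want else set()
    if isinstance(m, Opt):
        return {i} | ends(m.body, toks, i)
    if isinstance(m, (Plus, Star)):
        reached, frontier = set(), {i}
        while frontier:  # iterate the body until no new position appears
            new = set().union(*(ends(m.body, toks, j) for j in frontier))
            frontier = new - reached
            reached |= new
        return reached | {i} if isinstance(m, Star) else reached
    if isinstance(m, Seq):
        return {k for j in ends(m.left, toks, i)
                for k in ends(m.right, toks, j)}
    if isinstance(m, Alt):
        return ends(m.left, toks, i) | ends(m.right, toks, i)
    if isinstance(m, Perm):
        result = set()
        for order in permutations(m.parts):  # each sub-term stays contiguous
            cur = {i}
            for part in order:
                cur = {k for j in cur for k in ends(part, toks, j)}
            result |= cur
        return result
    raise TypeError(f"unknown content model: {m!r}")


def matches(m, toks):
    return len(toks) in ends(m, toks, 0)


model = Perm((Seq(Elem("a"), Elem("b")), Elem("c")))  # (a, b) & c
```

For `(a, b) & c` this accepts the sequences a b c and c a b, but rejects a c b, which an interleaving operator would accept.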
As mentioned above, the following description
works on an instance of the abstract data type T_D of
KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development