from the point of view of subject area. Clarke
(Clarke, 1994) supposes that every structured
document is labeled, and he defines a metric for it.
He models a text database as a string of
concatenated symbols drawn from a text alphabet
and a stoplist alphabet. The goal of his
investigations is to define operations over
structured documents. He presents a query algebra
that expresses searches over structured texts.
Calvanese, De Giacomo and Lenzerini (Calvanese,
1999 and Groß-Hardt, 2002) consider structured
documents in terms of XML and DTD. Again, the
goal of their research is the definition of operations
over structured documents using description logic.
1.2 Article structure
The article consists of four parts. The first part
gives a formal presentation of a structured document
model. The model describes a document without
taking into account the operations over it, since the
operations over different types of documents differ
considerably. In the second part the model is
transformed into the language of conceptual graphs.
The third part presents a summary algorithm for
converting a CG into a related DTD.
2 STRUCTURED DOCUMENT MODEL – A FORMAL PRESENTATION
Information that can be stored and
reproduced in an electronic format is known as
document information. A document is assumed to be
an arbitrary text, stored on a storage medium as a
single information unit.
Every arbitrary text consists of a set of sentences,
which in turn consist of words. A sentence begins
with an upper-case letter and ends with a full stop.
A sentence with no words and no full stop is an
empty sentence. Sentences are grouped into logical
structures called paragraphs. Each paragraph is a
separate part of the document. Paragraphs separate
information into logically distinct units. This
separation introduces a logical order into a
document and supports information search.
Sentences exist only in paragraphs. A paragraph is
called an empty paragraph if it contains no
sentences or only empty sentences. Empty
paragraphs have no informational (meaningful)
content. Therefore a document is an empty
document if it contains no paragraphs or only
empty paragraphs.
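The notions of empty sentence, empty paragraph and empty document can be illustrated with a minimal Python sketch. The class names `Sentence`, `Paragraph` and `Document` and the method `is_empty` are illustrative assumptions, not part of the model's notation; a sentence is represented simply as a list of words.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    # Words of the sentence; punctuation is not modelled in this sketch.
    words: List[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        # An empty sentence contains no words.
        return len(self.words) == 0


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)

    def is_empty(self) -> bool:
        # Empty if it has no sentences or only empty sentences
        # (all() over an empty list is True).
        return all(s.is_empty() for s in self.sentences)


@dataclass
class Document:
    paragraphs: List[Paragraph] = field(default_factory=list)

    def is_empty(self) -> bool:
        # Empty if it has no paragraphs or only empty paragraphs.
        return all(p.is_empty() for p in self.paragraphs)
```

In this sketch a document built only from paragraphs with empty sentences reports `is_empty()` as true, while a single non-empty sentence anywhere makes the whole document non-empty.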
There are two document categories:
• documents that have a predefined logical
order of their paragraphs;
• free-format documents, i.e. documents that
have no predefined logical order of their
paragraphs.
We are interested in creating a model for
documents that have a predefined logical order of
their paragraphs.
There is no exact definition for the term
structured document. There are only intuitive
assumptions, which are used depending on the
desired goals. For the goals of the present article we
shall introduce the following definitions:
Definition 1: A structured document is a
document whose paragraphs can be named and on
which a relation ‘order’ is defined.
The relation ‘order’ on paragraphs can be a
strict one or a free one. Each document whose
paragraphs have a free order relation can be
transformed into a new document with a strict
order relation by fixing the sequence of the
paragraphs. Paragraph naming and paragraph
ordering are together called document labeling.
Document labeling consists in splitting the
document into sets of tags, each of which contains
a part of the document text. In general, the same
tag, containing the same part of the text, may be
repeated.
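As an illustration of Definition 1, the following Python sketch represents a labeled document as a sequence of named paragraphs and shows one possible way of fixing the paragraph sequence to obtain a strict order. The names `LabeledParagraph`, `StructuredDocument` and `fix_order`, as well as the convention of ordering by a list of tag names, are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class LabeledParagraph:
    name: str  # the tag that names the paragraph
    text: str  # the part of the document text it contains


@dataclass(frozen=True)
class StructuredDocument:
    paragraphs: Tuple[LabeledParagraph, ...]
    strict_order: bool  # True: sequence is fixed; False: free order


def fix_order(doc: StructuredDocument,
              names: Optional[Sequence[str]] = None) -> StructuredDocument:
    """Return a copy of doc with a strict order relation.

    If names is given, paragraphs are arranged by the position of
    their tag in names (assumed convention; every tag must appear in
    names). Otherwise the current sequence is simply frozen.
    """
    paragraphs = list(doc.paragraphs)
    if names is not None:
        rank = {name: i for i, name in enumerate(names)}
        # list.sort is stable, so repeated tags keep their relative order.
        paragraphs.sort(key=lambda p: rank[p.name])
    return StructuredDocument(tuple(paragraphs), strict_order=True)
```

Repeated tags are allowed here, in line with the remark that the same tag may occur more than once.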
Let D be the set of structured documents.
Let X be the set of all elements (tags) for a given
set of documents D.
X = { nomenclature of elements (tags)
of a given document }
Tags are the metadata of a document; they
describe the document structure. This means that,
besides its own information tags, every document
can hold an information part taken from an external
document of another type. In general, the external
document need not be a structured one.
Hyperlinks can point to a part of the document
itself.
Let T be the set of all paragraphs or texts for a
given set of documents D.
T = { paragraphs }
We assume a document to be a subset of the set
$T \times X$. Then for every document $d$ that is a
member of the set $D$ it holds that $d \subseteq T \times X$.
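The set-theoretic view of a document as a set of (paragraph, tag) pairs drawn from T × X can be sketched in Python as follows. The function names `is_document` and `tags_of` are hypothetical, and paragraphs and tags are represented as plain strings for simplicity.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (paragraph text from T, tag from X)


def is_document(d: Set[Pair], T: Set[str], X: Set[str]) -> bool:
    """Check that d is a subset of the Cartesian product T x X."""
    return all(t in T and x in X for t, x in d)


def tags_of(d: Set[Pair]) -> Set[str]:
    """Collect the separate tags that actually occur in d."""
    return {x for _, x in d}
```

For example, with `T = {"p1", "p2"}` and `X = {"title", "body"}`, the set `{("p1", "title"), ("p2", "body")}` passes the check, whereas a pair whose paragraph is not in `T` does not.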
We denote by $X_d$ the set containing the
separate tags of a given document $d$:
$$X_d = \{\, x_i^d \mid x_i^d \in X \wedge i \in \mathbb{N} \,\}, \quad \forall d \in D \qquad [1]$$
or
ICETE 2004 - GLOBAL COMMUNICATION INFORMATION SYSTEMS AND SERVICES