HYPERSET/WEB-LIKE DATABASES AND THE EXPERIMENTAL

IMPLEMENTATION OF THE QUERY LANGUAGE DELTA

Current State of Affairs

Richard Molyneux and Vladimir Sazonov

Department of Computer Science, University of Liverpool, Ashton Street, Liverpool, United Kingdom

Keywords:

Web-like, semistructured, distributed databases, hypersets, bisimulation, query language Delta.

Abstract:

The hyperset approach to WEB-like or semistructured databases is outlined. WDB is presented either (i) as a

finite edge-labelled graph or, equivalently, (ii) as a system of (hyper)set equations or (iii) in a special XML-WDB

format convenient both for distributed WDB and for including arbitrary XML elements in this framework. The

current state of affairs on experimental implementation of a query language ∆ (Delta) to such databases—the

main result of this paper—is described, with consideration of further implementation work to be done.

1 INTRODUCTION

It appears that the great success of Codd’s relational

approach to databases (Codd, 1983) was based on

taking the most fundamental concepts of logic and

set theory as its foundation. Thus, any relation is a

set of tuples, each tuple being also represented by a

set of a special kind. From the second half of the

1990s a new idea of semistructured databases (SSDB)

emerged; see (Abiteboul et al., 2000) as a general

reference. In the age of the Internet and the World-

Wide Web, allowing accessibility of remote and het-

erogeneous databases, the relational paradigm has be-

come too narrow and restrictive. The structure of the

data over the Internet is typically non-ﬁxed or non-

uniform. The idea of graph representation of data

was introduced with interpretation of graph (directed)

edges like links to browse the Web. Furthermore, be-

cause of such “browsing”, considering the graph as a

binary (or ternary, if taking labels on edges into ac-

count) relation is not fully adequate in this context.

That is, intuitively this is something more than just

a relation. Indeed, interpreting end nodes of outgo-

ing edges from any given node n as “children” or

even as “elements” of n becomes more appropriate.

In particular, the latter is the terminology of XML—

the widely adopted approach to semistructured data

(however, tree-like rather than graph-like).

This also leads us again to a set theoretic idea of representation of data (now semistructured data), a far-reaching generalization of the relational approach. It is also worth noting that in the foundations of mathematics, the previous century was marked by the triumph of the set theoretic approach for representing mathematical data (the concepts) as well as the style of mathematical language and reasoning. Mathematical logicians also developed generalized computability theory over abstract sets (of sets of sets, etc.) in the form of admissible set theory (Barwise, 1975). In computer science, a set theoretic programming language SETL was created (quite naturally, for the case of finite sets only). Some theoretical work on computability and query languages over hereditarily finite sets was also done in (Dahlhaus and Makowsky, 1986; Dahlhaus and Makowsky, 1992; Sazonov, 1987; Sazonov, 1993) with the perspective of generalized set-theoretically presented databases (in fact semistructured ones) even before the term "semistructured databases" had arisen. Moreover, the set theoretic approach is closely related to a special version of the graph approach. Probably the first mathematical result relating both approaches was Mostowski's Collapsing Lemma, which allows graph nodes to be interpreted as sets of sets corresponding to children of these nodes. This, however, worked properly only for well-founded graphs and sets (which in the finite case, especially interesting for database applications, means the absence of cycles). But arbitrary graphs, even with cycles, can also be "collapsed" into sets (interrelated by the membership relation) in the more general non-well-founded set theory, also called hyperset theory (Aczel, 1988; Barwise and Moss, 1996). Here, for example, the set Ω = {Ω} consisting of itself and corresponding to the loop graph (a single node with an edge to itself) is quite a legal and meaningful thing.

Molyneux R. and Sazonov V. (2007). HYPERSET/WEB-LIKE DATABASES AND THE EXPERIMENTAL IMPLEMENTATION OF THE QUERY LANGUAGE DELTA - Current State of Affairs. In Proceedings of the Second International Conference on Software and Data Technologies - Volume ISDM/WsEHST/DC, pages 29-37. DOI: 10.5220/0001343200290037. SciTePress.

These two trends (from set theory to semistructured databases as graphs and vice versa) were called in (Sazonov, 2006) top-down and bottom-up approaches to semistructured databases. They meet most closely in the work (Buneman et al., 2000), which is devoted to a specific graph approach to semistructured data considered up to bisimulation. The latter concept is the key one in (Lisitsa and Sazonov, 1997; Lisitsa and Sazonov, 1999) for interpreting graph nodes as a system of hypersets belonging one to another according to the graph edges. Nevertheless, (Buneman et al., 2000) is still a graph approach rather than a hyperset one, because the way graphs are treated there, although related, is not genuinely set theoretical; cf. (Sazonov, 2006).

Note that the alternative and popular XML rep-

resentation of semistructured databases besides being

based mainly on the idea of a ﬁnite tree (not arbi-

trary graph), assumes a ﬁxed order on the children

of any node, and therefore can not be considered as

a (hyper)set approach. Also, any graph representa-

tion like in (Abiteboul et al., 1997; McHugh et al.,

1997) which does not assume considering graphs up

to bisimulation (to thereby make the nodes into ab-

stract hypersets) is outside of the hyperset view.

The goal of this paper is to demonstrate how

theoretical ideas of hyperset approach to semistruc-

tured (which can be also naturally called Web-like)

databases (WDB) could be implemented and poten-

tially applied in the form of a practically working

query language Delta to such data. More precisely,

we want to demonstrate that this is working in prin-

ciple. The real practical work on efﬁciency as well

as many other questions which should be resolved for

any realistic database management system should be

inevitably postponed here because this is rather work

for a team of developers. However, we will present

here some related considerations.

Organization of the paper. Section 2 is devoted to the background of the hyperset view of SSD/WDB, with the details of how it is represented in terms of systems of set equations. An appropriately restricted XML format (XML-WDB) is also introduced to represent set equations and to make this approach easily adaptable to the already widely accepted XML. In fact, a distributed version of a hyperset based WDB is assumed, somewhat similar to, but not necessarily as huge as, the WWW. Section 3 introduces the theoretical version of the language Delta and gives a quick outline of the implemented version of Delta by examples of queries. Section 4 describes how such queries can be executed, again by examples. Section 5 outlines what should be done to make this language more practically applicable and efficient. We conclude in Section 6 by making some comparisons.

2 HYPERSET APPROACH TO WDB

The set theoretic and, more generally, hyperset approach (Sazonov, 1993; Lisitsa and Sazonov, 1997; Leontjev and Sazonov, 2001; Sazonov, 2001) to WDB is based on the concept of hereditarily finite sets or, more generally, on hyperset theory (Aczel, 1988; Barwise and Moss, 1996). Such semistructured data is represented as abstract sets of sets of sets, etc., even with the possibility of cycles. For visualisation purposes, hyperset databases are represented by graphs (as depicted in Figure 1) with directed edges representing the membership relations.

[Graph not reproduced: nodes bob, alice and sam with edges labelled name, wife, husband, pet and species, as in the set equations below.]

Figure 1: Example WDB representing a fictitious family.

However, when considering implementation (and also intuitively) it is far more appropriate to represent a WDB by the corresponding system of set equations. Each set equation consists of a set name equated to the corresponding bracket expression, where labelled elements (of the set) may be either atomic values or nested bracket expressions or set names described in other equations. For example, the system of "unnested" or "flat" set equations generated by the graph in Figure 1 (and vice versa) looks as follows:

bob = {name:"Bob", wife:alice}

alice = {name:"Alice", husband:bob, pet:sam}

sam = {name:"Sam", species:"cat"}

or, equivalently, with the nesting allowed:

bob = {name:"Bob", wife:alice}

alice = {name:"Alice", husband:bob,

pet:{name:"Sam", species:"cat"} }

In particular, this demonstrates that the specific form of the set names bob, alice, sam, however helpful intuitively, is formally not important. They can always be renamed (say, by numbered "object identities" like &23, etc.) or substituted as above. The proper information on the WDB is carried by (i) labels on WDB-graph edges (name, wife, husband, etc.), (ii) atomic

ICSOFT 2007 - International Conference on Software and Data Technologies

data on leaves ("Bob", "Alice"), and (iii) the graph structure or, respectively, set-element nesting.

Note 1 Atomic data is in fact treated as labels on additional leaf edges or, equivalently, as singleton sets consisting of a (labelled) empty set ("Bob" = {Bob:{}}).
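As an illustration only, the passage from nested to flat set equations can be sketched in Python. The list-of-pairs encoding, the function name `flatten` and the fresh names `&1`, `&2`, ... are assumptions made for this sketch and are not taken from the implementation described in this paper; atomic values are simply kept as quoted strings.

```python
def flatten(equations):
    """Turn nested set equations into an equivalent flat system.

    `equations` maps a set name to a list of (label, value) pairs, where a
    value is either a string (a set name or a quoted atomic value) or a
    nested list of (label, value) pairs (a nested bracket expression).
    """
    flat = {}
    counter = 0

    def name_of(value):
        nonlocal counter
        if isinstance(value, str):      # set name or atomic value: keep as is
            return value
        counter += 1                    # nested bracket expression: name it
        fresh = f"&{counter}"           # assumed not to clash with given names
        flat[fresh] = [(label, name_of(v)) for label, v in value]
        return fresh

    for name, members in equations.items():
        flat[name] = [(label, name_of(v)) for label, v in members]
    return flat


# The nested system for Figure 1.
nested = {
    "bob": [("name", '"Bob"'), ("wife", "alice")],
    "alice": [("name", '"Alice"'), ("husband", "bob"),
              ("pet", [("name", '"Sam"'), ("species", '"cat"')])],
}
flat = flatten(nested)
```

Here the nested pet of alice receives the generated name `&1`, which plays the role of sam in the flat system; this is one way to see that the concrete choice of set names carries no information.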

2.1 Bisimulation

In contrast to XML, the order and repetition of the elements in set equations play no role. This leads to the well-known (see e.g. (Aczel, 1988)) concept of the bisimulation relation between graph nodes or set names (n ≈ n′). This relation (and the corresponding recursive algorithm) is based on the idea that any two sets are equal (or two set names and the corresponding graph nodes are bisimilar) if for each (labelled) element of the first set there exists an equal (bisimilar) element with the same label in the second set, and vice versa. Bisimilar set names are said to denote the same abstract (hyper)set. Thus, we are considering a WDB-graph (or system of set equations) up to bisimulation, which makes this approach more than just a pure graph theoretic one. A WDB is called strongly extensional (Aczel, 1988) or non-redundant if different nodes (set names) are non-bisimilar, i.e. denote different hypersets. This also means that ≈ coincides with =.
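The definition above can be made concrete with a small Python sketch. It computes bisimilarity as a greatest fixpoint: start from the relation containing all pairs of set names and repeatedly delete pairs that violate the "for each element there is a matching element, and vice versa" condition. This naive procedure and the names `bisimulation` and `is_strongly_extensional` are assumptions of the sketch, not the algorithm of the actual implementation; atomic values are encoded as in Note 1.

```python
from itertools import product


def bisimulation(wdb):
    """Return the set of all bisimilar pairs (n, m) of set names.

    `wdb` maps each set name to a list of (label, child set name) pairs.
    """
    rel = set(product(wdb, repeat=2))   # start with every pair related

    def matched(n, m):
        # Every labelled element of n has a related element of m
        # carrying the same label.
        return all(any(l == k and (x, y) in rel for k, y in wdb[m])
                   for l, x in wdb[n])

    changed = True
    while changed:                      # delete violating pairs until stable
        changed = False
        for n, m in list(rel):
            if not (matched(n, m) and matched(m, n)):
                rel.discard((n, m))
                changed = True
    return rel


def is_strongly_extensional(wdb):
    """True if distinct set names always denote distinct hypersets."""
    return all(n == m for n, m in bisimulation(wdb))


# Figure 1 plus a redundant copy sam2 of sam (reordered, with a repetition).
family = {
    "bob": [("name", "Bob"), ("wife", "alice")],
    "alice": [("name", "Alice"), ("husband", "bob"), ("pet", "sam")],
    "sam": [("name", "Sam"), ("species", "cat")],
    "sam2": [("species", "cat"), ("name", "Sam"), ("name", "Sam")],
    # Atomic data as in Note 1, e.g. "Bob" = {Bob:{}}.
    "Bob": [("Bob", "empty")], "Alice": [("Alice", "empty")],
    "Sam": [("Sam", "empty")], "cat": [("cat", "empty")],
    "empty": [],
}
rel = bisimulation(family)
```

With this data sam and sam2 are related while bob and alice are not, so the system is redundant until sam2 is removed. The same function also handles cyclic equations such as omega = {l:omega}.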

2.2 WDB and XML

Although set equations represent WDB in the most

natural and intuitive way directly suggesting that

our data are hypersets, it makes sense to relate

this approach to the popular XML representation of

semistructured data and probably to use appropriate

existing techniques. Also, numerous existing XML data can be treated by our approach, which can make its application range considerably wider.

The main idea of the XML representation of any system of set equations consists in (recursively) replacing any labelled bracket expression label : {...} by the XML element <label>...</label>. More precisely, the nested version of the system of set equations above (for Figure 1) is represented as an XML document in the following special XML-WDB format:

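<?xml version="1.0"?>
<set:eqns xmlns:set="...">
  <set:eqn set:id="bob">
    <name>Bob</name>
    <wife set:ref="alice"/>
  </set:eqn>
  <set:eqn set:id="alice">
    <name>Alice</name>
    <husband set:ref="bob"/>
    <pet>
      <name>Sam</name>
      <species>cat</species>
    </pet>
  </set:eqn>
</set:eqns>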

Here set:id is the required attribute of the set:eqn element and should have a unique (across the whole document) value, also called the set name. Any other attributes, except set:ref, are also allowed in the element set:eqn. The attribute set:ref must refer to some existing set:id (possibly multiple, having the attribute type IDREFS). The elements set:eqn are allowed to contain arbitrary XML sub-elements with arbitrary attributes (except set:id) and text data.

Such an XML-WDB document can be treated as a system of set equations by using the following simple transformations, which can be done in arbitrary order:

• <set:eqn set:id="setname">...</set:eqn> is replaced by the equation setname = {...}

• any attribute (say, with two values) attr = "value1 value2", except the distinguished attributes set:id and set:ref, is removed and replaced by the sub-elements <attr>value1</attr><attr>value2</attr>. Thus, attributes are treated as tags.

• any text data word1 word2 ... transforms to the sequence of empty sub-elements <word1/><word2/>... As we intend to ignore any order and repetitions (in contrast with the ordinary XML approach) this, in fact, means that a sentence (any text data) is considered rather as an unordered set of words.

• <tag>...</tag> transforms to tag:{...} and <tag/> transforms to tag:{}

• <tag set:ref="setname1 setname2 ..."/> transforms to the sequence tag:setname1, tag:setname2, ... If the element <tag> is non-empty with some content "...", as in the previous clause, we just add tag:{...} to the above sequence.

• Finally, omit the <?xml> and <set:eqns> tags.

This way any XML-WDB document with arbitrary nested XML elements represents a system of set equations with accordingly nested right-hand sides. In particular, arbitrary XML documents can be incorporated into our hyperset WDB framework (by nesting inside set:eqn elements). Both transformations, from XML-WDB to a system of set equations and vice versa, have been implemented.
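The transformation rules above can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration under stated assumptions, not the implemented converter: the namespace URI `urn:example:set` is hypothetical (the paper elides the real one), the names `members_of` and `xml_wdb_to_equations` are invented for the sketch, and the `set:href` attribute introduced in Section 2.3 is not handled. A set is encoded as a frozenset of (label, value) pairs, where a value is either a set name (a string) or a nested frozenset.

```python
import xml.etree.ElementTree as ET

SET_NS = "urn:example:set"        # hypothetical namespace URI
ID = "{%s}id" % SET_NS
REF = "{%s}ref" % SET_NS
EQN = "{%s}eqn" % SET_NS


def members_of(elem):
    """Translate attributes, text and children of elem into set members."""
    members = set()
    # Attributes are treated as tags: attr="v1 v2" gives attr:{v1:{}}, attr:{v2:{}}.
    for attr, value in elem.attrib.items():
        if attr not in (ID, REF):
            for word in value.split():
                members.add((attr, frozenset({(word, frozenset())})))
    # Text data is an unordered set of words, each word giving word:{}.
    text = " ".join([elem.text or ""] + [child.tail or "" for child in elem])
    for word in text.split():
        members.add((word, frozenset()))
    for child in elem:
        for name in child.get(REF, "").split():
            members.add((child.tag, name))        # tag:setname
        content = frozenset(members_of(child))
        if content or REF not in child.attrib:
            members.add((child.tag, content))     # tag:{...} or tag:{}
    return members


def xml_wdb_to_equations(xml_text):
    """Map each set:id to the set denoted by its set:eqn element."""
    root = ET.fromstring(xml_text)
    return {eqn.get(ID): frozenset(members_of(eqn))
            for eqn in root.findall(EQN)}


DOC = """
<?xml version="1.0"?>
<set:eqns xmlns:set="urn:example:set">
  <set:eqn set:id="bob">
    <name>Bob</name>
    <wife set:ref="alice"/>
  </set:eqn>
  <set:eqn set:id="alice">
    <name>Alice</name>
    <husband set:ref="bob"/>
    <pet><name>Sam</name><species>cat</species></pet>
  </set:eqn>
</set:eqns>
""".strip()

equations = xml_wdb_to_equations(DOC)
```

Because sets are built as frozensets, order and repetition of sub-elements are discarded automatically, in line with the set theoretic reading of the document.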

2.3 Distributed WDB

A WDB system of set equations may be divided into several subsystems (XML-WDB files), with the possibility for the set names s participating in one subsystem to be defined by set equations s = {...} in some other subsystems. To this end we introduce one more attribute set:href (in the same namespace) whose value should be a full set name consisting of the URL of a possibly remote XML-WDB file concatenated with #simple-set-name, where simple-set-name is the set:id value from this file. So, strictly speaking, we should always assume full set names, even if using simple set names in set equations of a distributed WDB. This also avoids unintended clashes of simple set names in different XML-WDB files, possibly created by various users at remote sites. It is required that each full set name should refer to an existing XML-WDB file and to the corresponding equation within that file for the simple set name part (after the # symbol). We also have an appropriate XML schema (Delta-WDB Site, 2007) for the XML-WDB format. Of course, the above transformation rules from XML-WDB to systems of set equations should be amended accordingly for the set:href attribute. Anyway, this will lead to a (possibly huge) unified system of set equations: the distributed WDB.
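The naming convention for distributed WDB can be illustrated with a few lines of Python using the standard `urllib.parse.urldefrag` function. The function names and the `example.org` URLs are hypothetical and serve only this sketch; how the implementation actually resolves names is not described here.

```python
from urllib.parse import urldefrag


def full_set_name(file_url, name):
    """Expand a simple set name occurring in the file at file_url.

    A name that already has the form url#simple-set-name is returned as is.
    """
    _, simple = urldefrag(name)
    return name if simple else f"{file_url}#{name}"


def split_full_set_name(full_name):
    """Split url#simple-set-name into (url, simple-set-name)."""
    url, simple = urldefrag(full_name)
    return url, simple


# Hypothetical locations of two XML-WDB files.
URL1 = "http://example.org/bib1.xml"
URL2 = "http://example.org/bib2.xml"

local = full_set_name(URL1, "b1")
remote = full_set_name(URL1, URL2 + "#b2")
```

Two files may thus both use the simple name b1 without a clash, since the full names differ in their URL part.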

The analogy of WDB with the WWW does not

imply it is so huge. It could be distributed between

several sites and supported by specialized WDB-

servers of some branches (departments) of an orga-

nization. Another point is that WDB might be much

more structured than the WWW, however the general

approach imposes no restrictions. Nevertheless, the

concept of WDB-schema (much more ﬂexible than

for the relational case and based on the concept of

a bisimulation or “one-way” simulation) relativized

to some typing relation on labels/atomic values can

be considered for such databases ((Sazonov, 1993;

Lisitsa and Sazonov, 1997) and e.g. (Abiteboul et al.,

2000)).

As relational databases are (theoretically) a special case of this hyperset approach, it should be straightforward to incorporate them into a WDB, however heterogeneous they might be.

3 QUERY LANGUAGE

There have already been many theoretical considerations on (some versions of) the query language ∆ (Delta) for hyperset/WDB databases. The two main syntactical categories of Delta are:

• ∆-terms representing set valued operations over

hypersets (set queries); and

• ∆-formulas representing truth valued operations

(boolean queries).

Inclusion of the set theoretic operators of transitive closure (TC), recursion (Rec), and decoration (Dec) allows one to define in ∆ exactly all polynomial time computable operations over hypersets (represented as WDB), thus demonstrating and characterising theoretically its rich expressive power (assuming that a linear order on labels is given). The operators of ∆ are defined as follows:

⟨∆-term⟩ ::= ⟨set variable or constant⟩ | {l_1 : a_1, ..., l_n : a_n} | ⋃a | TC(a) | Dec(a, b)
    | {l : t(x, l) | l : x ∈ a & ϕ(x, l)}
    | Rec p.{l : x ∈ a | ϕ(x, l, p)}

⟨∆-formula⟩ ::= a = b | l_1 = l_2 | l_1 < l_2 | l_1 R l_2 | l : a ∈ b | ϕ & ψ | ϕ ∨ ψ | ¬ϕ
    | ∀l : x ∈ a.ϕ(x, l) | ∃l : x ∈ a.ϕ(x, l)

The intuitive set theoretic semantics of the majority

of the above constructs should be well-understood

by anybody with the minimal mathematical back-

ground in set theory and logic; see also (Sazonov,

2006). General note: a, b, ... denote (set valued) terms, x, y, z, ... are set variables, l, l_i are label values (just strings of symbols) or variables, depending on the context, and ϕ, ψ are (boolean valued) ∆-formulas. Additionally, the binding label and set variables l, x, p of the quantifier, collect, and recursion constructs above should not appear free in the bounding term a (denoting a finite set); otherwise these operators become unbounded and thus, in general, non-computable. Equality = and the (alphabetic) ordering < on labels are understood in the standard way. The relation R on labels is any (easily) computable relation, in particular "to be a (prefix/infix/suffix) substring", which is quite usable in queries. It could also be a typing relation; for example, we could have John::name and June::month. On the other hand, the equality between ∆-terms/hypersets a = b (or a ≈ b) is understood as the equality of abstract hypersets denoted by these terms and, eventually, is computed by the bisimulation algorithm discussed above. Moreover, bisimulation is, in fact, implicitly involved in the (computational) meaning of the membership relation according to the equivalence

l : a ∈ b ⇐⇒ ∃m : x ∈ b.(m = l & x ≈ a).

This means: find an outgoing l-labelled edge from b which leads to some node x bisimilar to a. But thinking abstractly, l : a ∈ b says just that a is an l-labelled element of b. The recursion operator Rec p.{l : x ∈ a | ϕ(x, l, p)} defines a subset π of the set denoted by a, obtained as the result of stabilizing (due to finiteness of a) the monotonic sequence of subsets of a defined iteratively as p_0 = ∅ and p_{k+1} = p_k ∪ {l : x ∈ a | ϕ(x, l, p_k)}. The transitive closure TC(a) denotes the set of (labelled) elements of elements ... of elements of a. We refer to (Aczel, 1988) for the precise definition of the decoration operator Dec and only note here that it is the only operator in ∆ which allows constructing real (cyclic) hypersets like Ω = {Ω}, or like those in Figure 1, from the ordinary "uncycled" sets (of sets of sets, ...) of finite depth. It can also reasonably be called the plan performance operator (Sazonov, 2006) because its input(s) can be considered as a plan of construction of a hyperset, and the output is the resulting abstract hyperset. Imagine that we have a plan of a Web site (i.e. of a system of hyperlinked Web pages) and that Dec is a tool (a query) which automatically creates all the required Web pages.
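To make the iterative reading of TC and Rec more tangible, here is a minimal Python sketch restricted to well-founded (acyclic) sets, encoded as frozensets of (label, set) pairs so that ordinary structural equality can stand in for bisimulation. The function names `tc` and `rec`, the encoding, and the example formula are assumptions of this sketch; TC is read literally as "elements of elements ... of elements of a".

```python
def tc(a):
    """TC(a): all labelled elements of elements ... of elements of a."""
    result = set()
    stack = list(a)
    while stack:
        label, x = stack.pop()
        if (label, x) not in result:
            result.add((label, x))
            stack.extend(x)             # continue with the elements of x
    return frozenset(result)


def rec(a, phi):
    """Rec p.{l:x in a | phi(x, l, p)} computed by iteration.

    p_0 is empty and p_{k+1} = p_k united with {l:x in a | phi(x, l, p_k)};
    the loop stops when the sequence stabilizes, which must happen because
    every p_k is a subset of the finite set a.
    """
    p = frozenset()
    while True:
        nxt = p | frozenset((l, x) for l, x in a if phi(x, l, p))
        if nxt == p:
            return p
        p = nxt


empty = frozenset()
s = frozenset({("a", frozenset({("a", empty)})),
               ("a", frozenset({("b", empty)}))})

# Example formula: the element is labelled 'a' and all of its own
# labelled elements have already been collected in p.
hereditarily_a = rec(tc(s), lambda x, l, p: l == "a" and all(e in p for e in x))
```

In this example the iteration first collects a:{} and then a:{a:{}}, while a:{b:{}} is never added because its element b:{} is never collected.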

Practically, constructs of the ∆-query language are expressed as 'English-like' statements in a block structured query language similar to SQL. Additional features have been (and even more are intended to be) added to ∆, making the language more practically convenient but not increasing its theoretical expressive power. Say, the powerset operation Pow(a), giving the set of all subsets of a, is evidently intractable (requiring exponential time and space) and is not worth adding. But everything which is polynomial time computable is already definable in the original theoretical version of ∆ presented above. Therefore, any additions we intend to make, however important practically, are just "syntactic sugaring" of the above theoretical version of ∆. In particular, let us note the following extensions:

Library functions: Creation, deletion and modiﬁca-

tion of user deﬁned/predeﬁned queries and con-

stants. These queries/constants can then be used

in any query thus saving time and effort.

Queries with declarations: Similar to the library

function but queries/constants are deﬁned for

(possibly multiple) use within a particular query.

For example, our BNF of ∆ (Delta-WDB Site, 2007)

contains the following production rules:

<set query declaration> ::= "set query "

<set query name> "("[<variables>]")"

(" be " | " = ") <delta-term>

<delta-term with declarations> ::= "let "

" endlet"

Analogously (and even simpler), set constant declarations are quite convenient. Recall that full set names are typically quite long, having the form url#simple-set-name; thus declaring a set constant, say c, for this set name will make queries much more readable and manageable.

3.1 Syntactical Correctness and the Contextual Analysis

Queries satisfying the BNF are called well-formed.

But they should also be well-typed, with all identiﬁers

properly declared or quantified. The latter is achieved

by tracking and suitably correcting the syntactical cat-

egories and types in the parse tree of a query. This

guarantees that all identiﬁers are properly declared

and typed (if this is possible at all). Occurrences of

set or label variables or constants, the type of query

declaration and corresponding query calls should mu-

tually agree, etc.

3.2 Example of Non-well-typed Query

[Graph not reproduced: root BibDB with book and paper edges to publications, which carry author, title and refers-to edges.]

Figure 2: Example of a distributed (according to the colouring) bibliography WDB (similar to one from (Abiteboul et al., 2000)) represented as two XML-WDB files with URLs URL1 and URL2; see (Delta-WDB Site, 2007).

The following is an attempt to query the bibliography

WDB from Figure 2.

> set query collect { pub-type:pub
> where pub-type:pub in BibDB
> and exists 'refers-to':ref in pub .
> ref=b2 };
Query is well-formed
Query is not well-typed
Error at character 55, occurrence of
identifier name BibDB not declared:
set query collect { pub-type:pub
where pub-type:pub in BibDB <-------
and exists 'refers-to':ref in pub .
Error at character 104, occurrence of
identifier name b2 not declared:
and exists 'refers-to':ref in pub .
ref=b2 <-------
};

Note that BibDB and b2 are identifiers which are nowhere declared in this query. See the corrected version of this example in Section 4.1, where these identifiers are appropriately related to the WDB considered.

4 QUERY EXECUTION

To execute a (set or boolean) well-formed and well-typed query q, all of whose participating set names (constants) are taken from the given WDB (a system of set equations), we should:

• extend this system by the new equation res = q, with res a fresh (i.e. unused in WDB) set name, or boolean name if q is a boolean query, and

• simplify the extended system

WDB′ = WDB + (res = q)

until it contains only (possibly nested) bracket expressions as the right-hand sides of the equations, or the truth values true or false (if the left-hand side is a boolean name).

Thus, after simplification is complete, no complex set or boolean queries will appear in the equations, and the resulting version WDB-RES of WDB will consist, alongside the old equations of the original WDB, of new set equations (new set names equated to possibly nested bracket expressions) and boolean equations (boolean names equated to the boolean values true or false). This process is quite natural. For example, if the given query contains some complex subquery, as in q = ⋃q1, then the equation res = q is split into two equations res = ⋃res1 and res1 = q1, with res1 a fresh set name. We omit further details, referring the reader to (Sazonov, 2006). The point is that at the end we will have the equation res = {...} of the required form, whose right-hand side may use set names either from the original WDB or auxiliary set names newly introduced during this process (like res1 above), with corresponding equations of the required form. Thus, execution of a query extends the original WDB to WDB-RES. This extension, with the set name res as the "entrance point", is the result of the query and can be considered a temporary one, kept for as long as this result is needed. (We could also consider update queries which would change the original WDB.)

It was demonstrated in the op. cit. that ∆ covers the

expressive power of UnQL and UnCAL from (Bune-

man et al., 2000) which also have a polynomial time

complexity, but do not exhaust the full P-time.

4.1 Example of Valid Query and the Result of its Execution

> set query

> let set constant BibDB be URL1#BibDB,

> set constant b2 be URL2#b2

> in collect { pub-type:pub

> where pub-type:pub in BibDB

> and exists 'refers-to':ref in pub .

> ref=b2 }

> endlet;

Query is well-formed

Query is well-typed

Result = {paper:URL2#p2, book:URL1#b1}

Finished in: 398 ms

The informal meaning of this query is (imprecisely) as follows: "Find all publications which refer to the book b2." But, as we see, the answer contains, besides the evident publication b1 referring to b2, also p2 which refers to p3, where the latter is formally bisimilar to b2 (same title and author elements), as required in the formal query. If we really want to include only references to the book b2, then the seemingly right solution of replacing the equality ref=b2 by the conjunction (ref=b2 and book:ref in BibDB) in the above query does not really help, because in any case p3=b2 (they are bisimilar) in the above WDB. Equality of (hyper)sets is defined by their elements, elements of elements, etc., i.e. by outgoing, not by incoming, edges. So, after removing redundancies (say, omitting one of these two bisimilar nodes) we should have one joint node with two incoming edges BibDB -book-> and BibDB -paper-> (besides two more incoming refers-to edges from b1 and p2 and the evident two outgoing edges). This is probably not what the designer(s) of this distributed WDB had in mind.
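The effect just described can be reproduced on a small example. The following Python sketch removes redundancies by partition refinement: nodes are repeatedly regrouped by the "signature" of their labelled children until the grouping is stable, so that two set names end up in the same block exactly when they are bisimilar. The database `bib` below is a hypothetical miniature in the spirit of Figure 2 (its exact edges are not those of the figure), and `collapse` is a name invented for this sketch rather than part of the implementation.

```python
def collapse(wdb):
    """Map every set name to a block id; equal ids mean bisimilar names."""
    block = {n: 0 for n in wdb}          # start with a single block
    while True:
        # Signature of a node: its labels paired with the blocks of children.
        sig = {n: frozenset((l, block[x]) for l, x in wdb[n]) for n in wdb}
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in wdb}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block             # no block was split: stable
        block = new_block


# Hypothetical miniature bibliography; atomic data encoded as in Note 1.
bib = {
    "BibDB": [("book", "b1"), ("book", "b2"), ("paper", "p2"), ("paper", "p3")],
    "b1": [("refers-to", "b2")],
    "p2": [("author", "Smith"), ("title", "Databases"), ("refers-to", "p3")],
    "b2": [("author", "Jones"), ("title", "Databases")],
    "p3": [("author", "Jones"), ("title", "Databases")],
    "Smith": [("Smith", "empty")], "Jones": [("Jones", "empty")],
    "Databases": [("Databases", "empty")], "empty": [],
}
block = collapse(bib)

# collect { pub-type:pub where pub-type:pub in BibDB
#           and exists 'refers-to':ref in pub . ref=b2 }
result = {(l, pub) for l, pub in bib["BibDB"]
          if any(m == "refers-to" and block[ref] == block["b2"]
                 for m, ref in bib[pub])}

# Incoming edges of the joint node obtained by identifying b2 and p3.
incoming = {(n, l) for n, edges in bib.items()
            for l, x in edges if block[x] == block["b2"]}
```

On this data the query returns both book:b1 and paper:p2, and the joint node has the four incoming edges mentioned above: book and paper from BibDB, and refers-to from b1 and p2.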

This example emphasizes the real meaning of hy-

perset vs. pure graph approaches to semistructured

databases and the role of removing redundancies on

the level of the design. The right approach here

should be based on a well-chosen discipline:

(i) either to reconstruct this database by replacing the labels book and paper by publication and adding outgoing edges from each publication showing its type ('book' or 'paper'; see Figure 3),

(ii) or to use some WDB-schema, e.g. requiring that there is only one, up to bisimulation, book or paper edge from BibDB to any given publication.

"Up to bisimulation" means here that if two children of BibDB are bisimilar then they are labelled by the same label. But it is not our goal here to go into the details of this kind of discipline. In any case, we should be precise and accurate in designing the WDB and in formulating both formal and intuitive versions of our queries. The mathematical ground of hyperset theory is quite solid and sufficient for that.

The main point is that if we have a formal query,

it has a unique (up to bisimulation) answer—in fact,

a hyperset or boolean value—and all the queries are

bisimulation-invariant and can be computed in poly-

nomial time (with respect to the size of WDB). Vice

versa, any P-time computable and bisimulation invari-

ant (and also “generic” (Lisitsa and Sazonov, 1997))

query is deﬁnable in ∆. In fact, this also assumes

that the language ∆ has full P-time computable power

of restructuring—not only simple retrieval of already

existing elements in the WDB. For example, the query restructuring the BibDB database as described in the previous paragraph in (i) could be written in ∆ using the plan performance operator Dec.

4.2 Example of Restructuring Query

First, extend the main library by the following useful

queries (deﬁned either formally or—for simplicity of

presentation—semi-formally):

library add
  set query Pair(set x, set y) =
    {'fst':x,'snd':y},
  set query First(set z) =
    "the first element of z if z is a Pair",
  set query Second(set z) =
    "the second element of z if z is a Pair",
  set query GraphOfPairs(set a) =
    "the set of labelled pairs L:Pair(u,v)
     where (L:v in u) holds in the
     transitive closure TC(a)";

Then the required restructuring query (in (i) above) is

deﬁned as follows:

set query let set constant BibDB = URL1#BibDB
in let set constant restructuredBibDB be
  (U collect{
     'null':if (L='Paper' or L='Book')
            then{'publication':X,
                 'type':call
                   Pair(call Second(X),{L:{}}),
                 L:call Pair({L:{}}, {})
                }
            else {L:X}
     where L:X in call GraphOfPairs(BibDB)
   }
  )
  in decorate ( restructuredBibDB, BibDB )
  endlet
endlet;

Here 'null' is a label whose value is not important, GraphOfPairs(BibDB) is essentially the original bibliography graph in Figure 2 represented in the traditional set theoretic way as the set of (ordered) pairs, and restructuredBibDB denotes the result of the required restructuring of this graph as a set of pairs. Finally, decorate(restructuredBibDB, BibDB) (the plan performance operator applied to the plan, i.e. the set of pairs restructuredBibDB) essentially transforms this set of pairs, understood as a graph, into the corresponding system of set equations (as described in Section 2) with BibDB serving as the main set name. The content of the set BibDB (the result of this query) is a rather lengthy nested bracket expression (a set of sets of sets ...) which is omitted here.

The corresponding graph is as follows:

[Graph not reproduced: nodes BibDB', b1', p1', p2' and the joint node p3'/b2'; edges labelled publication, type, author, title and refers-to.]

Figure 3: The result of the restructuring query.

The fact that one publication has the type both of book and paper is the result of the initial design of BibDB. It is not a failure of the above query.

5 FURTHER EXTENSIONS

5.1 Path Expressions

The ability to select nodes of a WDB graph to arbitrary depth can be elegantly achieved using path expressions. As shown in (Sazonov, 2006), the action of a rich class of path expressions is definable in the original ∆, itself having no path expressions at all, with the help of TC and Rec. Our next goal is to implement the extension of ∆ by path expressions, as in

set query

select {pub-type:x in BibDB

where exists <b1>refers-to*<x>refers-to<b2> .

author:"Smith" in x

};

Quantification occurs over paths from b1 having an appropriate intermediate set/node x. Due to p3 ≈ b2 the answer to this query is the non-empty set {paper:p2}. It would be empty if "*" were removed.

5.2 Supporting Bisimulation/Strong Extensionality in Background Time

One of the key points of our approach is the interpretation of WDB-graph nodes as set names, where different nodes can, in principle, denote the same (hyper)set. Whether this is so is determined by the bisimulation relation, which can be computed by an appropriate (recursive) comparison of labelled child nodes. Thus, in outline, to check bisimulation of two nodes we need to check bisimulation (or non-bisimulation) between some children, grandchildren, etc. of the given nodes, i.e. a lot of nodes could be involved. If WDB is distributed, the communication overhead of downloading the relevant XML-WDB files will be too great. (There is also the analogous problem with TC, not discussed here.) So the equality relation for hypersets seems practically intractable, although theoretically it takes polynomial time with respect to the size of WDB. Nevertheless, we consider the hyperset approach to WDB or semistructured databases worth implementing, because it suggests a very clear and mathematically well-understood view of semistructured data and their querying. Thus, the question is whether the bisimulation problem can be resolved in any reasonable and practical way. Some possible approaches and views are outlined below.
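
The recursive comparison of labelled child nodes described above can be made concrete by a naive greatest-fixed-point computation. The following Python sketch is an illustration only, not the algorithm used in the implementation: the function name `bisimulation`, the dictionary representation of flat set equations, and the sample set names are assumptions. It starts from the relation containing all pairs of set names and repeatedly discards pairs whose members cannot be matched, which takes polynomial time in the number of nodes and edges.

```python
def bisimulation(equations):
    """Return the set of ordered pairs (n, m) of bisimilar set names.

    `equations` maps each set name to a set of (label, child) pairs.
    """
    names = list(equations)
    related = {(n, m) for n in names for m in names}

    def simulated(n, m):
        # every labelled member of n needs a partner in m with the same
        # label whose child is still considered bisimilar
        return all(
            any(l == k and (c, d) in related for (k, d) in equations[m])
            for (l, c) in equations[n]
        )

    changed = True
    while changed:
        changed = False
        for n, m in list(related):
            if not (simulated(n, m) and simulated(m, n)):
                related.discard((n, m))
                changed = True
    return related

# Hypothetical example with cycles: a = {l:a}, b = {l:c}, c = {l:b}, e = {}.
equations = {
    "a": {("l", "a")},
    "b": {("l", "c")},
    "c": {("l", "b")},
    "e": set(),
}
related = bisimulation(equations)
print(("a", "b") in related)  # True
print(("a", "e") in related)  # False
```

Note that every right-hand side may be consulted, which is the source of the communication overhead when the equations are spread over several sites.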

Firstly, should we expect that in reality bisimilar nodes in WDB (i.e. redundancies in WDB) will appear frequently? We could rather assume that WDB is permanently maintained in a strongly extensional state, i.e. with any rare redundancies eliminated.

Redundancies arising during query execution. As we described above, execution of queries leads to a (temporary) extension of the original WDB, potentially leading to new redundancies (so that equality subqueries applied to newly generated nodes can be non-trivial). But these redundancies can also be eliminated locally at the server executing the query. Moreover, the algorithm of query execution could be amended in such a way that as soon as new (auxiliary) nodes/set names are generated (like res, res1 in Section 4), any possible redundancies are eliminated immediately.

Let, in general, WDB′ be an extension of the given strongly extensional (non-redundant) WDB by a set N of new set names and by new set equations n = {. . .} with the right-hand sides involving both old and new set names. (The original WDB-set equations remain the same in WDB′.) Also, without loss of generality and for the sake of the argument, we may assume that all old and new set equations are flat, i.e. involve no nesting. This is the situation which can arise during computation of a query when new set names and set equations are generated. The question is how to compute the new bisimulation relation ≈′ on WDB′ using the trivial bisimulation relation coinciding with identity = on the original WDB. Evidently, ≈′ restricted to WDB nodes/set names coincides with = on WDB, because set names participating in both parts of set equations of WDB are also from WDB. (WDB remains “self-contained”.)

Now, we can conclude from the definition of the bisimulation relation that the only nodes from the original WDB needed to compute ≈′ (between new nodes and also new vs. old nodes) are the nodes/set names appearing in the new set equations and their children. This restricts the number of downloaded WDB files, and thus elimination of redundancies can be done almost locally.
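
The use of the identity on the old WDB as a starting point for ≈′ can be sketched in Python as follows. This is a hedged illustration rather than the procedure of the implementation: the function names `extend_bisimulation` and `eliminate_redundancies`, the dictionary representation of flat set equations, and the sample data are assumptions. For simplicity the sketch is given all old equations and does not attempt to minimise which of them have to be fetched.

```python
def extend_bisimulation(old, new):
    """Compute the bisimulation on old + new flat set equations.

    Assumes `old` is strongly extensional and self-contained, so two old
    names are bisimilar only if they are the same name.
    """
    equations = {**old, **new}
    related = {(o, o) for o in old}       # identity on the old WDB
    for n in new:                         # every pair involving a new name
        for x in equations:
            related.add((n, x))
            related.add((x, n))

    def simulated(n, m):
        return all(
            any(l == k and (c, d) in related for (k, d) in equations[m])
            for (l, c) in equations[n]
        )

    changed = True
    while changed:
        changed = False
        for n, m in list(related):
            if not (simulated(n, m) and simulated(m, n)):
                related.discard((n, m))
                changed = True
    return related

def eliminate_redundancies(old, new):
    """Map every new name to a representative, preferring an old name."""
    related = extend_bisimulation(old, new)
    representative = {}
    for n in new:
        olds = sorted(o for o in old if (n, o) in related)
        news = sorted(m for m in new if (n, m) in related)
        representative[n] = olds[0] if olds else news[0]
    return representative

# Hypothetical data: res duplicates the old set a; res1 and res2 coincide.
old = {"a": {("l", "e")}, "e": set()}
new = {"res": {("l", "e")}, "res1": {("m", "res")}, "res2": {("m", "a")}}
print(eliminate_redundancies(old, new))
# {'res': 'a', 'res1': 'res1', 'res2': 'res1'}
```

Pairs of distinct old names are never examined, so the work grows with the number of new names rather than with the square of the size of the whole WDB.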

Redundancies which can appear during a local update of a WDB file are more problematic, because old non-bisimilar nodes outside this file might become bisimilar due to possible links (or paths) to the local nodes with changed/added meaning. The appropriate strategy for removing all such redundancies is not as straightforward as above and has yet to be developed. However, taking into account the locality of changes, this task does not seem to be unrealistic.

Local bisimulation. Assume that all WDB nodes are divided into classes L according to their sites (WDB servers) or even files. There is a quite natural definition of local (i.e. computed locally; see (Delta-WDB Site, 2007)) lower and upper “best” approximations

$n \approx^{L}_{-} m \;\Rightarrow\; n \approx m \;\Rightarrow\; n \approx^{L}_{+} m$

to the global bisimulation relation ≈ on the whole WDB. All these local approximations can help to compute and to permanently maintain global bisimulation in a distributed way in background time. Moreover, we could require “local independency” ($\approx^{L}_{-} \,=\, \approx^{L}_{+}$, and hence both coincide with ≈↾L) and additionally “local non-redundancy” ($\approx^{L}_{-} \,=\, \approx^{L}_{+}$, both coinciding with the identity = on L).

Deliberate redundancies can also appear in WDB; on the WWW this is called mirroring. But if we require that mirroring in WDB be “officially” registered, then such deliberate redundancy can most plausibly be dealt with in a quite feasible way.

In general, WDB should not be assumed to be just another version of WWW, freely extensible by anybody in the world. That is, appropriate discipline and restrictions in working with WDB could make the problem of bisimulation practically resolvable.

ICSOFT 2007 - International Conference on Software and Data Technologies

6 COMPARISONS

The crucial feature of our approach to semistructured databases, distinguishing it from others such as Lorel (Abiteboul et al., 1997) and UnQL (Buneman et al., 2000), the latter being the closest to our approach, is its (hyper)set theoretical character. Also, the query language ∆ has a mathematically precise description of its expressive power (as P-time), which makes it sufficiently complete from the theoretical point of view. In this sense our approach is top-down: from theory to practice.

Some aspects that are important from the practical point of view, such as path expressions, are currently not included in the language, unlike in the approaches from op. cit.; they are, however, formally definable in the original ∆ and pending implementation. As to the UnQL language and the related language UnCAL, they were shown to be embeddable in ∆, but not vice versa; see the technical details in (Sazonov, 2006). In a sense the same holds for Lorel. (Here we do not take into account that Lorel was later migrated to a query language for XML, where order plays an essential role.) The original Lorel deals with graphs, like ∆, but it considers literal equality between graph nodes (oids) instead of using the bisimulation relation. Lorel also uses equality between sets of oids, which, however, is not the “deep” set equality assumed by bisimulation. Therefore, Lorel would treat our example with BibDB differently, and only a very informal comparison is possible. Thus, Lorel can be said to be also strictly embeddable in ∆ (ignoring path variables, which potentially lead to exponential complexity, and, for simplicity, some aspects like typing and coercion). There is no recursion operator (except Kleene’s star in path expressions) and nothing similar to the decoration operator (important for deep restructuring) in Lorel. (However, StruQL can do something resembling decoration; see e.g. (Abiteboul et al., 2000).) In a sense the hyperset (∆) and graph (UnQL and Lorel) approaches complement each other. Finally, our implementation assumes a distributed WDB, like the WWW.

7 CONCLUSION

Although ∆ does not yet include path expressions and some other useful explicit constructs of the languages UnQL and Lorel, we already have a working and quite expressive (in a sense more expressive) query language, and this was our current goal. Of course, there is a lot to do to make this a full-fledged and efficient (hyper)Set-based WDB Management System (WDBMS or SDBMS).

REFERENCES

Abiteboul, S., Buneman, P., and Suciu, D. (2000). Data on the Web - From Relations to Semi-structured Data and XML. Morgan Kaufmann Publishers, San Francisco, California.

Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. L. (1997). The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68–88.

Aczel, P. (1988). Non-Well-Founded Sets. CSLI, Stanford, CA.

Barwise, J. (1975). Admissible Sets and Structures. Springer, Berlin.

Barwise, J. and Moss, L. (1996). Vicious circles: on the mathematics of non-wellfounded phenomena. Center for the Study of Language and Information.

Buneman, P., Fernandez, M., and Suciu, D. (2000). UnQL: a query language and algebra for semistructured data based on structural recursion. The VLDB Journal, 9(1):76–110.

Codd, E. F. (1983). A relational model of data for large shared data banks. Communications of the ACM, 26(1):64–69.

Dahlhaus, E. and Makowsky, J. A. (1986). The choice of programming primitives for SETL-like programming languages. In ESOP’86, LNCS 213, pages 160–172.

Dahlhaus, E. and Makowsky, J. A. (1992). Query languages for hierarchic databases. Information and Computation, 101:1–32.

Delta-WDB Site (2007).

http://www.csc.liv.ac.uk/

molyneux/ICSOFT2007appendix/

Leontjev, A. and Sazonov, V. (2001). ∆: Set-theoretic query language capturing logspace. Annals of Mathematics and Artificial Intelligence, 33:309–345.

Lisitsa, A. and Sazonov, V. (1999). Linear ordering on graphs, anti-founded sets and polynomial time computability. Theoretical Computer Science, 224(1–2):173–213.

Lisitsa, A. and Sazonov, V. Y. (1997). Bounded hyperset theory and web-like data bases. In Proceedings of the Kurt Goedel Colloquium (KGC 1997), volume 1234, pages 178–188.

McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J. (1997). Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54–66.

Sazonov, V. Y. (1987). Bounded set theory, polynomial computability and ∆-programming. In Lect. Not. Comput. Sci., volume 278, pages 391–397.

Sazonov, V. Y. (1993). Hereditarily-finite sets, data bases and polynomial-time computability. Theoretical Computer Science, 119(1):187–214.

Sazonov, V. Y. (2001). Using agents for concurrent querying of web-like databases via a hyperset-theoretic approach. In PSI ’02: 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics, pages 378–394, London, UK. Springer-Verlag.

Sazonov, V. Y. (2006). Querying hyperset / Web-like databases. Logic Journal of the IGPL, 14(5):785–814.
