IMPLEMENTATION OF ALGEBRA FOR

QUERYING WEB DATA SOURCES

Iztok Savnik

Faculty of Mathematics, Science and Information Systems, University of Primorska, Koper, Slovenia

Keywords:

Database algebras, query languages, distributed databases, query optimization, Web query languages.

Abstract:

The paper presents the implementation of query execution system Qios. It serves as a lightware system for

the manipulation of XML data. Qios employs the relational technology for query processing. The main aim

in the implementation is to provide a querying system that is easy to use and does not require any additional

knowledge about the internal representation of data. The system provides robust and simple solutions for

many design problems. We aimed to simplify the internal structures of query processors rooted in the de-

sign of relational and object-relational query processors. We propose efﬁcient internal data structures for the

representation of queries during all phases of query execution. The query optimization is based on dynamic

programming and uses beam search to reduce the time complexity. The data structure for storing queries pro-

vides efﬁcient representation of queries during the optimization process and the simple means to explore plan

caching. Finally, main memory indices can be created on-the-ﬂy to support the evaluation of queries.

1 INTRODUCTION

The Internet contains large amount of different data

sources accessible through ftp ﬁles, XML or HTML

documents, and the wrappers around relational and

object-relational database systems. The data avail-

able via such data sources may vary from simple lists

of records, catalogs containing large amounts of ﬂat

tables, to complex data repositories including large

conceptual schemata and tables composed of complex

objects (Abiteboul and Beeri, 1993; Roth et al., 1988).

Querying Web data sources poses several new prob-

lems such as uniform access to the data sources, inte-

gration of information provided by the data sources,

and efﬁcient manipulation of large data sets.

Our work focuses on the design of robust and

ﬂexible query execution system for querying and in-

tegration of data from Web data sources. The de-

sign of query execution system Qios(Savnik, 2007) is

based on the existing work on relational and object-

relational query execution systems (Graefe, 1993;

Jarke and Koch, 1984; D. Daniels, 1982). The kernel

of algebra comprises relational operations extended

for the manipulation of complex objects. Further, the

algebra includes a set of operations for querying con-

ceptual schemata. These operations allow for brows-

ing the conceptual representation of data sources and

using the elements of conceptual schemata for query-

ing extensional data (Savnik et al., 1999).

We investigated the efﬁcient and robust design of

the internal structures of the query execution engine.

In order to simplify the internal structure of system a

single data structure is used for the representation of

queries during all phases of query execution. The al-

gorithm used for the optimization is based on graph

representation of query trees and query transforma-

tion rules. Query transformation is seen as matching

rule input tree against a query tree and than duplicat-

ing the rule output tree. Similarly, query evaluation

module uses the same structure for the implementa-

tion of scans. Finally, to provide efﬁcient implemen-

tation of query optimization and evaluation queries

are stored in a data structure called Mesh (Graefe and

DeWitt, 1987). The data structure is reﬁned to pro-

vide efﬁcient access to the stored queries.

The query optimization algorithm is based on a

version of dynamic programming called memoisa-

tion. The most promising results were obtained with

optimization algorithm which uses beam search for

the exploration of the hypothesis space of equivalent

queries. Further, we explore plan caching (Graefe,

2005; Marathe, 2006) which speeds up signiﬁcantly

the subsequent execution of queries issued on the

same domain.

The query evaluation is based on dynamic selec-

tion of the query evaluation plans. We use a sim-

Savnik I. (2008).

IMPLEMENTATION OF ALGEBRA FOR QUERYING WEB DATA SOURCES.

In Proceedings of the Tenth International Conference on Enterprise Information Systems - DISI, pages 91-96

DOI: 10.5220/0001673700910096

 SciTePress

ple strategy that exploits the large quantity of main

memory which is lately provided by almost any per-

sonal computer. The main memory indices can be

constructed on-the-ﬂy to support query evaluation.

The construction of indices is based on standard in-

dex selection rules available from any database text-

book. For instance, hash-based index is used when-

ever a larger tables has to be joined. In this way we

achieve fast performance of query processor for rela-

tively large quantities of data. Furthermore, the user

does not need to concern about the details of query

evaluation process.

The paper is organized as follows. The following

section presents the data model used for the represen-

tation of data from Web data sources. Section 3 in-

troduces the basic operations of algebra and presents

the examples of queries. The main portion of pa-

per presents the implementation of algebra. Section

4 describes storage manager, parser, query represen-

tation, query optimization and evaluation. Section 5

overviews the empirical results of query execution in

Qios. Finally, concluding remarks are stated in Sec-

tion 6.

2 ALGEBRA

The data model used for the representation of data

stored at different types of data sources must meet the

following requirements. First, the data sources pro-

vide various types of data including semi-structured

data, XML data, (ﬂat) relational tables, and ob-

jects represented by object-relational database mod-

els. Third, we expect that besides the extensional

data, the data sources will include large amounts of

intensional data describing the structure and the con-

tents of the extensional databases.

The F-Logic data model was used as the formal

basis of the system (Kifer et al., 1995): it was shown

(Savnik et al., 1999) that it can serve as the basis for

the representation of the semi-structured data, it pro-

vides a convenient representation of the relational and

the object-relational database models, it can be used

for the representation of complex objects, and, it can

be used very naturally to represent intensional data.

The operations for inquiring about the basic prop-

erties of objects which relate to the representation of

objects are called model operations. Besides the stan-

dard comparison operations =, >, >=, <, <=, the set

operations ∈ and ⊆, and the component selector op-

erator ”.”, which are deﬁned in relational algebra, the

algebra includes the following model operations. The

operations ext and exts map class objects to the sets

of theirs members, or the set of their instances, respec-

tively. Next, the operation class of allows for the

mapping of the ground objects to their parent classes

Further, the poset comparison operations ≺

, 

, 

and 

are used to relate objects with respect to the

partial ordering relationship deﬁned among objects.

The operations subcl and supcl map class objects

to a set of their subclasses or super-classes, respec-

tively. Finally, the operation =˜ is deﬁned for search-

ing the text using regular expressions. The operation

=˜ is deﬁned as in the Perl programming language.

The declarative operations of the algebra are used for

the manipulation of the sets of objects. The detailed

presentation of the model and declarative operations

can be found in (Savnik et al., 1999). The following

groups of declarative operations are deﬁned in the al-

gebra.

Relational Operations. These operations include

standard relational operations select, project,

union, differ and join which are extended for the

manipulation of objects.

Nested-relational Operations. The operation

group(s,a,b) is deﬁned similarly to the SQL

group-by construct. It groups objects from s by the

values of attributes from the set a. The values of the

attributes which are not in a are stored as the relation

which is the value of the attribute b. The operation

unnest(s,a) is used to unnest a set valued attribute

a of objects from the argument set s.

Object-restructuring Operations. The operation

collapse(s,a) collapses the tuple structured at-

tribute a nested in the objects from the set s. The

operation flatten(s) is used for collapsing the set

of sets s.

Operation apply (Buneman and Frankel, 1979). The

functional operation apply(s, f ) is used for the ap-

plication of a query expression f to a set of objects s.

This operation is useful for the application of a query

to the sets of objects that can be located at different

sites.

3 QUERY EXECUTION SYSTEM

Qios (v0.9) is a system for the manipulation of data

from internet data sources. The system is intended to

serve as the lightware kernel of a data manipulation

server. The main aims in the design of Qios are to

provide: capabilities to manipulate collections of data

in a fast manner, various data manipulation functions

from classical querying to data restructuring, and, the

A ground object can have a single parent class.

ICEIS 2008 - International Conference on Enterprise Information Systems

Object manager

Berkeley DB

Record manager with cache

Access methods

Figure 1: Storage manager.

capabilities to organize, store and browse the data col-

lections obtained from the Internet data sources on the

local host. The system currently provides the inter-

face for XML. The treatment of other data formats

requires the addition of the interface routines for the

conversion of data into internal database format.

The query execution system Qios is composed of

the following components: the storage manager, the

parser, the subsystem for query optimization, the sub-

system for Web access, and the query evaluation sub-

system. In this section we overview the implementa-

tion of the Qios components. Not all operations of

the presented object algebra (Savnik et al., 1999) are

implemented in the current version of system. The

operations that are included are: model-based opera-

tions, the operations select, project, and join.

Record Manager. The record storage manager is

based on the Berkeley DB storage system providing

access to different storage structures such as for in-

stance hash-based index or B+ tree. Record manager

implements a data store for records representing indi-

vidual and class objects. Records are treated as arrays

of bytes, the structure of which is known at the object

level. Each record has a record identiﬁer (abbr. rid)

implemented as system generated identiﬁer which is

used as the key for the hash-based index in Berkeley

DB. Therefore, records are stored as oid/value pairs

where the values are packed in the sequences of bytes.

Record manager includes the methods for the work

with object identiﬁers, routines for reading and writ-

ing records, methods for the realization of the hier-

archical database model, and access to main memory

indices that are associated to the class object.

Object Manager. The object manager serves as a

cache of objects loaded from Web as well as for

storing intermediate results during query processing.

Qios persistent objects are objects that are tied to the

database via object manager. Each object has an iden-

tiﬁer which is implemented by means of record iden-

tiﬁer from the subordinate level. External identiﬁers

which are unique within the dataﬁle can be assigned

to objects when they are created. Object cache is real-

Parser

Optimizer

Execution

Type chk

- --

Figure 2: Query processing.

ized using LRU (least recently used) strategy for the

selection of objects to be removed from cache. The

size of object cache can be set as the system parame-

ter via the conﬁguration record of a dataﬁle as well as

at run-time.

Parser. This module includes the implementation of

a parser for the algebra expressions. The algebra ex-

pressions are checked for syntactical errors and then

translated into query trees.

The basic skeleton of the query tree is constructed

during the parsing process. The variables and the

names of the data sources and spans are stored in sym-

bol table symtab. The query nodes represent the oper-

ations, spans linking rules, and access methods. The

data for the different phases of query processing is

stored in the same tree nodes. Each particular module

(e.g. query optimization) manipulates its own view of

query nodes.

Type-checking is implemented by procedure com-

puting the types of query nodes bottom-up, that is,

from access methods toward the root of query tree.

The class objects related to the access methods are

retrieved during the parsing process. The type of a

query node is stored by constructing a new class ob-

ject.

Query Optimizer. The query optimization subsystem

is composed of the following main modules. The

query tree manipulation module includes routines for

manipulation of query trees, application of rules on

the nodes of query trees, and property functions by

which the physical and logical properties of the query

tree nodes are determined.

The cost function is realized by a separate module.

The extensions of the cost functions used for the rela-

tional query optimizer are deﬁned. The optimization

module comprises the algorithms for the optimization

of query expressions. The algorithms which are stud-

ied are the exhaustive search and an algorithm based

on dynamic programming.

The core of the module is the data structure mesh

for the representation of sets of queries. Common

sub-expressions of queries are shared ie. each query

is represented in mesh only once. Queries are orga-

nized into equivalence classes. Let us ﬁrst present the

data structure Mesh.

mesh stores query expressions as query trees orga-

nized as a directed acyclic graph (dag). The query

IMPLEMENTATION OF ALGEBRA FOR QUERYING WEB DATA SOURCES

EquClass

Norm. query exp

Ids

MESH





Figure 3: Mash with access paths.

trees share common parts hence there is only one rep-

resentation of a query expression in mesh. Further,

the query trees are organized into equivalence classes

including logically equivalent queries. mesh has three

entry points.

The algorithm for query optimization is based on

dynamic programming. We use the variant of dy-

namic programming algorithm called memoisation

which stores the optimal results of the sub-queries

and uses them in the computation of the composed

queries. A simple brute force enumeration of all

reasonable compositions of operations from a given

query to provide the basis for classical dynamic pro-

gramming would be computationally too inefﬁcient.

The enumeration of equivalent queries of a given

query have to be guided by some form - in our case,

the compositional structure of query.

Memoisation is realized in a very simple manner.

Every time a query is to be optimized we ﬁrst check

Mesh if the optimal query already exists for a given

equivalence class of input query. At this point we can

see the use of Mesh for plan caching. If we do not

clean Mesh after the execution of a given query that

the optimization of the subsequent queries can make

use of the existing optimized queries in Mesh. The

optimization time is reduced signiﬁcantly.

The query transformation rules are used for the

transformation of query trees into logically equiva-

lent query trees that have different structure and po-

tentially a faster evaluation method. The logical opti-

mization rules are in Qios speciﬁed in a language that

follows strictly the syntax and the semantics of the

Qios query expressions.

Finally, the cost function module is used to esti-

mate the number of bytes processed by the physical

query tree including the cost of materialization of the

intermediate query evaluation results. The computa-

tion of the cost estimation for the operations select,

project and join is realized using standard relational

formulas based on the selectivity of attributes and pre-

supposing the normal distribution of attributes values.

The computation of statistics for classes is gener-

ated dynamically, triggered by query compilation pro-

cedure when classes are loaded into the main memory.

The following data is gathered for a given class: the

number of class objects, the size of objects and the

cardinality of all attributes.

Query Evaluation System. The query evaluation mod-

ule is based on the iterator-tree representation of the

query evaluation plans. The physical query execution

plan is computed from the optimized query trees by

adding to the existing query nodes information about

physical operation that will implement given logical

operation (query node). The query nodes already con-

tain information about the statistics, index selection,

and cost estimation.

The main strategy which was used in the imple-

mentation of query execution is to select reasonably

fast access methods on-the-ﬂy without considering al-

ternatives. Simple rules are used for index selection.

Firstly, if selection or join is based on equality of at-

tributes than hash-based index is generated. Secondly,

if selection or join involves range predicate we use B-

tree index. The selection of query execution plan is

implemented by the procedure which computes phys-

ical operations for all logical operations (query nodes)

forming the physical query tree in a bottom-up man-

ner.

XML Loader. XML loader provides the interface to

the Web. It can read XML data from local ﬁles as well

as the internet data sources. The XML ﬁles are parsed

and translated into the internal representation using a

parser SAX. Module includes the procedures for load-

ing XML data, conversion of the DTD schemata into

internal database schemata, and discovery of database

schemata from XML data.

4 EMPIRICAL RESULTS

This section presents the initial experimental results.

A simple artiﬁcially generated database is used for the

experiments. We constructed eight tables that include

three attributes: the ﬁrst two attributes are of type in-

teger, and the last one is a string. Each table includes

10000 randomly generated records: the ﬁrst two at-

tributes are set to random integer number in the range

of 0 to 5000. String includes 300 characters.

The queries used in the experiments form a chain of

joins of length N. Each query is restricted by a selec-

tion to force small number of query output records.

The following example presents test query that forms

a chain of 5 classes.

Example 4.1 The query in this example joins classes

using the attributes p1 and p2. Note that the expres-

ICEIS 2008 - International Conference on Enterprise Information Systems

 R Comp S Opt Eval Inx Mem

1 1 12768 2 31 7440 1 25M

1 2 11 0 35 6358 0 25M

2 1 19439 3 129 8627 2 34M

2 2 17 0 149 6397 0 34M

3 1 26650 4 441 1682 3 42M

3 2 23 0 491 6991 0 42M

4 1 37857 5 739 13121 4 52M

4 2 30 0 731 7315 0 52M

5 1 42780 6 957 13077 5 62M

5 2 33 0 925 7295 0 62M

6 1 51541 7 1277 14296 6 72M

6 2 40 0 1194 7631 0 72M

7 1 64389 8 1830 16539 7 87M

7 2 46 0 1982 7473 0 87M

Figure 4: Query execution.

sions of the form r :: p represents the name of attribute

p deﬁned for a relation r. The condition u.r1 :: p1 < 5

restricts the output to small number of records.

select( join(r1.ext,

join(r2.ext,

join(r3.ext,

join(r4.ext,r5.ext,

i,j: i.r4::p2 == j.r5::p1),

x,y: x.r3::p2 == y.r4::p1 ),

s,d: s.r2::p2 == d.r3::p1 ),

z,w: z.r1::p2 == w.r2::p1 ),

u: u.r1::p1 < 5 )

The empirical results are presented in Figure 4. We

observe two separate executions of the same query.

The ﬁrst is executed after the system is started; in

this case the statistics and the indices are not cre-

ated initially. The second execution is performed af-

ter the ﬁrst one is completed so most indices are al-

ready created. We observe the following parameters:

number of joins (), the execution of query after sys-

tem startup (R=1) or later (R=2), time needed for the

compilation of query including computation of statis-

tics (Comp), number of classes for which the statis-

tics was computed (S), optimization (Opt), evaluation

of query including the creation of indices and con-

sole output (Eval), number of created indices (Inx),

and, ﬁnally, the amount of memory used by the sys-

tem (Mem). Time is speciﬁed in milliseconds. The

experiments were done on Pentium 4 computer with

500MB RAM running FreeBSD.

Let us now give some comments on empirical re-

sults presented in Figure 4. Firstly, it is obvious that

the creation of indices for larger tables can signiﬁ-

cantly slow down the overall performance. This pays

off through the efﬁcient evaluation of queries. Indices

are usually constructed gradually: the overall compu-

tation is performed for sequence of queries where the

previous queries may trigger the construction of in-

dices used by the subsequent queries. Least recently

used indices are dropped if there is no more room in

the main memory for a new index. This did not hap-

pen during the experiments presented in Figure 4. The

amount of memory used during query execution never

exceeds 87M which is acceptable for practically all

recent personal computers.

The optimization phase is relatively fast compar-

ing to computation of statistics and index creation.

This is achieved primarily by using beam search algo-

rithm that restricts the number of choices during the

search to N queries for each equivalence class. In the

case of exhaustive search the exponential time curve

turns very steep after 5 joins. In the presented exper-

iments we have used the window (beam) of 7 queries

in each equivalence class.

5 RELATED WORK

All recent database algebras have evolved from the re-

lational database algebra proposed by Codd in (Codd,

1970). Although some algebras do not include the

operations of relational algebra directly, each of them

is relationally complete i.e. it includes the ability to

compute selection, standard set operations, projection

and Cartesian product.

Our work on object algebra has been inﬂuenced by

the early functional query language FQL proposed by

Buneman and Frankel (Buneman and Frankel, 1979)

and by some of its descendants, for instance, the func-

tional database programming language FAD (Dan-

forth and Valduriez, 1992). The algebra is by its na-

ture a functional language, where the operations can

be combined by the use of functional composition and

higher-order functions to form the language expres-

sions. The presented object algebra can be treated as a

generalization of FQL for the manipulation of objects.

It subsumes the operations of FQL, i.e. operations ex-

tension, restriction, selection and composition.

Let us now present the work related to the imple-

mentation of the presented query execution system.

Firstly, the implementation is closely related to the

implementation of Query Algebra originally proposed

by Shaw and Zdonik in (Shaw and Zdonik, 1990) and

implemented by Mitchell (Mitchell, 1993). In partic-

ular, we have used a similar representation of query

expressions by means of query trees. Furthermore,

the representation of query expressions in Qios is op-

timized by using single operation nodes and query

trees during all phases of query processing.

The design of the query execution system was

based on the design of Exodus optimizer generator

(Graefe and DeWitt, 1987) and its descendant Vol-

cano (Graefe and McKenna, 1993). The data struc-

ture MESH used in Exodus query optimizator gener-

IMPLEMENTATION OF ALGEBRA FOR QUERYING WEB DATA SOURCES

ator is improved by adding additional access paths.

The data structure can be accessed through: unique

identiﬁer, normalized query expression, and equiva-

lence class. The algorithm for query optimization is

rooted in Graefe’s work on Volcano optimizer algo-

rithm (Graefe and McKenna, 1993). This algorithm

uses top-down search guided by the possible ”moves”

that are associated to a query node. The algorithm

uses memoisation to avoid repeated optimization of

the same query. The search is restricted by the cost

limit which is a parameter in optimization.

We intended to implement a lightware system able

to manipulate middle size XML databases including

up to some 100.000 records. This size is reasonable

for most of collections appearing on the Web as well

as in our local data environments. One of the aims in

the design of Qios was to exploit this advantage. The

design of Qios has the following salient features. A

single data structure is used during the complete op-

timization and evaluation process. The central data

structure of query optimization Mesh is robust and

simple to use. Rules are treated as query trees and

rule matching and application procedures are based

on graph (tree) algorithms.

The computation of query execution plan that uses

indices created during query evaluation has similar

objectives: it is expected that XML database will be

of the above stated size and that the user is not in-

formed about the existence of indices and the selected

access paths during the execution. All decisions about

the creation of main memory indices are done by the

system.

6 CONCLUSIONS

The paper presents the implementation of an object

algebra. The data model based on F-Logic provides

a convenient environment for the representation of

semi-structured as well as structured data. Algebra

includes standard operations on sets which evolved

from the relational and nested-relational algebras and

the operations for querying the conceptual schemata.

Object algebra is implemented in the query execution

system Qios which is rooted in the architecture of the

relational and object-relational query processors.

REFERENCES

Abiteboul, S. and Beeri, C. (1993). On the power of the

languages for the manipulation of complex objects. In

Verso Report No.4. INRIA.

Buneman, P. and Frankel, R. (1979). Fql- a functional query

language. In Proc. of the ACM Conf. on Management

of Data. ACM SIGMOD.

Codd, E. F. (1970). A relational model of data for large

shared data banks. In Comm. of the ACM, Vol. 13 ,

Issue 6. ACM.

D. Daniels, E. (1982). An introduction to distributed query

compilation in r*. In IBM Research Report RJ3497

(41354). IBM.

Danforth, S. and Valduriez, P. (1992). A fad for data inten-

sive applications. In Trans. on Know. and Data Eng.,

Vol.4, No.1. IEEE.

Graefe, G. (1993). Query evaluation techniques for large

databases. In Comp. Surveys, Vol.25, No.2. ACM.

Graefe, G. (2005). Query processing. In Slides, ICDE In-

ﬂuential Paper Award. ICDE.

Graefe, G. and DeWitt, D. (1987). The exodus optimizer

generator. In Proceedings of ACM SIGMOD interna-

tional conference on Management of data. ACM.

Graefe, G. and McKenna, W. (1993). The volcano optimizer

generator: extensibility and efﬁcient search. In Proc.

of IEEE Conf. on Data Engineering. IEEE.

Jarke, M. and Koch, J. (1984). Query optimization in

database systems. In Comp. Surveys, Vol.16, No.2.

ACM.

Kifer, M., Lausen, G., and Wu, J. (1995). Logical founda-

tions of object-oriented and frame-based languages. In

Journal of ACM, Vol.42, No.4. ACM.

Marathe, A. (2006). Batch compilation, recompilation, and

plan caching issues in sql server 2005. In Microsoft,

White paper. Microsoft.

Mitchell, G. A. (1993). Extensible query processing in an

object-oriented database. In Ph.D. thesis. Brown Uni-

versity.

Roth, M. A., Korth, H. F., and Silberschatz, A. (1988).

Extended algebra and calculus for non 1nf relational

databases. In Trans. Database Systems, Vol.13, No.4.

ACM.

Savnik, I. (2007). Qios: Querying and integration of in-

ternet data. In http://www.famnit.upr.si/˜savnik/qios/.

FAMNIT.

Savnik, I., Tari, Z., and Mohoric, T. (1999). Qal: A query

algebra of complex objects. In Data & Knowledge

Eng. Journal, Vol.30, No.1. North-Holland.

Shaw, G. M. and Zdonik, S. B. (1990). A query algebra for

object oriented databases. In Proc. of IEEE Conf. on

Data Engineering. IEEE.

ICEIS 2008 - International Conference on Enterprise Information Systems