IMPLEMENTATION OF ALGEBRA FOR
QUERYING WEB DATA SOURCES
Iztok Savnik
Faculty of Mathematics, Science and Information Systems, University of Primorska, Koper, Slovenia
Keywords:
Database algebras, query languages, distributed databases, query optimization, Web query languages.
Abstract:
The paper presents the implementation of query execution system Qios. It serves as a lightware system for
the manipulation of XML data. Qios employs the relational technology for query processing. The main aim
in the implementation is to provide a querying system that is easy to use and does not require any additional
knowledge about the internal representation of data. The system provides robust and simple solutions for
many design problems. We aimed to simplify the internal structures of query processors rooted in the de-
sign of relational and object-relational query processors. We propose efficient internal data structures for the
representation of queries during all phases of query execution. The query optimization is based on dynamic
programming and uses beam search to reduce the time complexity. The data structure for storing queries pro-
vides efficient representation of queries during the optimization process and the simple means to explore plan
caching. Finally, main memory indices can be created on-the-fly to support the evaluation of queries.
1 INTRODUCTION
The Internet contains large amount of different data
sources accessible through ftp files, XML or HTML
documents, and the wrappers around relational and
object-relational database systems. The data avail-
able via such data sources may vary from simple lists
of records, catalogs containing large amounts of flat
tables, to complex data repositories including large
conceptual schemata and tables composed of complex
objects (Abiteboul and Beeri, 1993; Roth et al., 1988).
Querying Web data sources poses several new prob-
lems such as uniform access to the data sources, inte-
gration of information provided by the data sources,
and efficient manipulation of large data sets.
Our work focuses on the design of robust and
flexible query execution system for querying and in-
tegration of data from Web data sources. The de-
sign of query execution system Qios(Savnik, 2007) is
based on the existing work on relational and object-
relational query execution systems (Graefe, 1993;
Jarke and Koch, 1984; D. Daniels, 1982). The kernel
of algebra comprises relational operations extended
for the manipulation of complex objects. Further, the
algebra includes a set of operations for querying con-
ceptual schemata. These operations allow for brows-
ing the conceptual representation of data sources and
using the elements of conceptual schemata for query-
ing extensional data (Savnik et al., 1999).
We investigated the efficient and robust design of
the internal structures of the query execution engine.
In order to simplify the internal structure of system a
single data structure is used for the representation of
queries during all phases of query execution. The al-
gorithm used for the optimization is based on graph
representation of query trees and query transforma-
tion rules. Query transformation is seen as matching
rule input tree against a query tree and than duplicat-
ing the rule output tree. Similarly, query evaluation
module uses the same structure for the implementa-
tion of scans. Finally, to provide efficient implemen-
tation of query optimization and evaluation queries
are stored in a data structure called Mesh (Graefe and
DeWitt, 1987). The data structure is refined to pro-
vide efficient access to the stored queries.
The query optimization algorithm is based on a
version of dynamic programming called memoisa-
tion. The most promising results were obtained with
optimization algorithm which uses beam search for
the exploration of the hypothesis space of equivalent
queries. Further, we explore plan caching (Graefe,
2005; Marathe, 2006) which speeds up significantly
the subsequent execution of queries issued on the
same domain.
The query evaluation is based on dynamic selec-
tion of the query evaluation plans. We use a sim-
91
Savnik I. (2008).
IMPLEMENTATION OF ALGEBRA FOR QUERYING WEB DATA SOURCES.
In Proceedings of the Tenth International Conference on Enterprise Information Systems - DISI, pages 91-96
DOI: 10.5220/0001673700910096
Copyright
c
SciTePress
ple strategy that exploits the large quantity of main
memory which is lately provided by almost any per-
sonal computer. The main memory indices can be
constructed on-the-fly to support query evaluation.
The construction of indices is based on standard in-
dex selection rules available from any database text-
book. For instance, hash-based index is used when-
ever a larger tables has to be joined. In this way we
achieve fast performance of query processor for rela-
tively large quantities of data. Furthermore, the user
does not need to concern about the details of query
evaluation process.
The paper is organized as follows. The following
section presents the data model used for the represen-
tation of data from Web data sources. Section 3 in-
troduces the basic operations of algebra and presents
the examples of queries. The main portion of pa-
per presents the implementation of algebra. Section
4 describes storage manager, parser, query represen-
tation, query optimization and evaluation. Section 5
overviews the empirical results of query execution in
Qios. Finally, concluding remarks are stated in Sec-
tion 6.
2 ALGEBRA
The data model used for the representation of data
stored at different types of data sources must meet the
following requirements. First, the data sources pro-
vide various types of data including semi-structured
data, XML data, (flat) relational tables, and ob-
jects represented by object-relational database mod-
els. Third, we expect that besides the extensional
data, the data sources will include large amounts of
intensional data describing the structure and the con-
tents of the extensional databases.
The F-Logic data model was used as the formal
basis of the system (Kifer et al., 1995): it was shown
(Savnik et al., 1999) that it can serve as the basis for
the representation of the semi-structured data, it pro-
vides a convenient representation of the relational and
the object-relational database models, it can be used
for the representation of complex objects, and, it can
be used very naturally to represent intensional data.
The operations for inquiring about the basic prop-
erties of objects which relate to the representation of
objects are called model operations. Besides the stan-
dard comparison operations =, >, >=, <, <=, the set
operations and , and the component selector op-
erator ”.”, which are defined in relational algebra, the
algebra includes the following model operations. The
operations ext and exts map class objects to the sets
of theirs members, or the set of their instances, respec-
tively. Next, the operation class of allows for the
mapping of the ground objects to their parent classes
1
.
Further, the poset comparison operations
o
,
o
,
o
and
o
are used to relate objects with respect to the
partial ordering relationship defined among objects.
The operations subcl and supcl map class objects
to a set of their subclasses or super-classes, respec-
tively. Finally, the operation is defined for search-
ing the text using regular expressions. The operation
is defined as in the Perl programming language.
The declarative operations of the algebra are used for
the manipulation of the sets of objects. The detailed
presentation of the model and declarative operations
can be found in (Savnik et al., 1999). The following
groups of declarative operations are defined in the al-
gebra.
Relational Operations. These operations include
standard relational operations select, project,
union, differ and join which are extended for the
manipulation of objects.
Nested-relational Operations. The operation
group(s,a,b) is defined similarly to the SQL
group-by construct. It groups objects from s by the
values of attributes from the set a. The values of the
attributes which are not in a are stored as the relation
which is the value of the attribute b. The operation
unnest(s,a) is used to unnest a set valued attribute
a of objects from the argument set s.
Object-restructuring Operations. The operation
collapse(s,a) collapses the tuple structured at-
tribute a nested in the objects from the set s. The
operation flatten(s) is used for collapsing the set
of sets s.
Operation apply (Buneman and Frankel, 1979). The
functional operation apply(s, f ) is used for the ap-
plication of a query expression f to a set of objects s.
This operation is useful for the application of a query
to the sets of objects that can be located at different
sites.
3 QUERY EXECUTION SYSTEM
Qios (v0.9) is a system for the manipulation of data
from internet data sources. The system is intended to
serve as the lightware kernel of a data manipulation
server. The main aims in the design of Qios are to
provide: capabilities to manipulate collections of data
in a fast manner, various data manipulation functions
from classical querying to data restructuring, and, the
1
A ground object can have a single parent class.
ICEIS 2008 - International Conference on Enterprise Information Systems
92
Object manager
Berkeley DB
Record manager with cache
Access methods
Figure 1: Storage manager.
capabilities to organize, store and browse the data col-
lections obtained from the Internet data sources on the
local host. The system currently provides the inter-
face for XML. The treatment of other data formats
requires the addition of the interface routines for the
conversion of data into internal database format.
The query execution system Qios is composed of
the following components: the storage manager, the
parser, the subsystem for query optimization, the sub-
system for Web access, and the query evaluation sub-
system. In this section we overview the implementa-
tion of the Qios components. Not all operations of
the presented object algebra (Savnik et al., 1999) are
implemented in the current version of system. The
operations that are included are: model-based opera-
tions, the operations select, project, and join.
Record Manager. The record storage manager is
based on the Berkeley DB storage system providing
access to different storage structures such as for in-
stance hash-based index or B+ tree. Record manager
implements a data store for records representing indi-
vidual and class objects. Records are treated as arrays
of bytes, the structure of which is known at the object
level. Each record has a record identifier (abbr. rid)
implemented as system generated identifier which is
used as the key for the hash-based index in Berkeley
DB. Therefore, records are stored as oid/value pairs
where the values are packed in the sequences of bytes.
Record manager includes the methods for the work
with object identifiers, routines for reading and writ-
ing records, methods for the realization of the hier-
archical database model, and access to main memory
indices that are associated to the class object.
Object Manager. The object manager serves as a
cache of objects loaded from Web as well as for
storing intermediate results during query processing.
Qios persistent objects are objects that are tied to the
database via object manager. Each object has an iden-
tifier which is implemented by means of record iden-
tifier from the subordinate level. External identifiers
which are unique within the datafile can be assigned
to objects when they are created. Object cache is real-
Parser
Optimizer
Execution
--
Type chk
- --
Figure 2: Query processing.
ized using LRU (least recently used) strategy for the
selection of objects to be removed from cache. The
size of object cache can be set as the system parame-
ter via the configuration record of a datafile as well as
at run-time.
Parser. This module includes the implementation of
a parser for the algebra expressions. The algebra ex-
pressions are checked for syntactical errors and then
translated into query trees.
The basic skeleton of the query tree is constructed
during the parsing process. The variables and the
names of the data sources and spans are stored in sym-
bol table symtab. The query nodes represent the oper-
ations, spans linking rules, and access methods. The
data for the different phases of query processing is
stored in the same tree nodes. Each particular module
(e.g. query optimization) manipulates its own view of
query nodes.
Type-checking is implemented by procedure com-
puting the types of query nodes bottom-up, that is,
from access methods toward the root of query tree.
The class objects related to the access methods are
retrieved during the parsing process. The type of a
query node is stored by constructing a new class ob-
ject.
Query Optimizer. The query optimization subsystem
is composed of the following main modules. The
query tree manipulation module includes routines for
manipulation of query trees, application of rules on
the nodes of query trees, and property functions by
which the physical and logical properties of the query
tree nodes are determined.
The cost function is realized by a separate module.
The extensions of the cost functions used for the rela-
tional query optimizer are defined. The optimization
module comprises the algorithms for the optimization
of query expressions. The algorithms which are stud-
ied are the exhaustive search and an algorithm based
on dynamic programming.
The core of the module is the data structure mesh
for the representation of sets of queries. Common
sub-expressions of queries are shared ie. each query
is represented in mesh only once. Queries are orga-
nized into equivalence classes. Let us first present the
data structure Mesh.
mesh stores query expressions as query trees orga-
nized as a directed acyclic graph (dag). The query
IMPLEMENTATION OF ALGEBRA FOR QUERYING WEB DATA SOURCES
93
EquClass
Norm. query exp
Ids
MESH
'
&
$
%
c
c
c
c
c
c
?
z
c
U
s
=
s
6
-
R
-
Figure 3: Mash with access paths.
trees share common parts hence there is only one rep-
resentation of a query expression in mesh. Further,
the query trees are organized into equivalence classes
including logically equivalent queries. mesh has three
entry points.
The algorithm for query optimization is based on
dynamic programming. We use the variant of dy-
namic programming algorithm called memoisation
which stores the optimal results of the sub-queries
and uses them in the computation of the composed
queries. A simple brute force enumeration of all
reasonable compositions of operations from a given
query to provide the basis for classical dynamic pro-
gramming would be computationally too inefficient.
The enumeration of equivalent queries of a given
query have to be guided by some form - in our case,
the compositional structure of query.
Memoisation is realized in a very simple manner.
Every time a query is to be optimized we first check
Mesh if the optimal query already exists for a given
equivalence class of input query. At this point we can
see the use of Mesh for plan caching. If we do not
clean Mesh after the execution of a given query that
the optimization of the subsequent queries can make
use of the existing optimized queries in Mesh. The
optimization time is reduced significantly.
The query transformation rules are used for the
transformation of query trees into logically equiva-
lent query trees that have different structure and po-
tentially a faster evaluation method. The logical opti-
mization rules are in Qios specified in a language that
follows strictly the syntax and the semantics of the
Qios query expressions.
Finally, the cost function module is used to esti-
mate the number of bytes processed by the physical
query tree including the cost of materialization of the
intermediate query evaluation results. The computa-
tion of the cost estimation for the operations select,
project and join is realized using standard relational
formulas based on the selectivity of attributes and pre-
supposing the normal distribution of attributes values.
The computation of statistics for classes is gener-
ated dynamically, triggered by query compilation pro-
cedure when classes are loaded into the main memory.
The following data is gathered for a given class: the
number of class objects, the size of objects and the
cardinality of all attributes.
Query Evaluation System. The query evaluation mod-
ule is based on the iterator-tree representation of the
query evaluation plans. The physical query execution
plan is computed from the optimized query trees by
adding to the existing query nodes information about
physical operation that will implement given logical
operation (query node). The query nodes already con-
tain information about the statistics, index selection,
and cost estimation.
The main strategy which was used in the imple-
mentation of query execution is to select reasonably
fast access methods on-the-fly without considering al-
ternatives. Simple rules are used for index selection.
Firstly, if selection or join is based on equality of at-
tributes than hash-based index is generated. Secondly,
if selection or join involves range predicate we use B-
tree index. The selection of query execution plan is
implemented by the procedure which computes phys-
ical operations for all logical operations (query nodes)
forming the physical query tree in a bottom-up man-
ner.
XML Loader. XML loader provides the interface to
the Web. It can read XML data from local files as well
as the internet data sources. The XML files are parsed
and translated into the internal representation using a
parser SAX. Module includes the procedures for load-
ing XML data, conversion of the DTD schemata into
internal database schemata, and discovery of database
schemata from XML data.
4 EMPIRICAL RESULTS
This section presents the initial experimental results.
A simple artificially generated database is used for the
experiments. We constructed eight tables that include
three attributes: the first two attributes are of type in-
teger, and the last one is a string. Each table includes
10000 randomly generated records: the first two at-
tributes are set to random integer number in the range
of 0 to 5000. String includes 300 characters.
The queries used in the experiments form a chain of
joins of length N. Each query is restricted by a selec-
tion to force small number of query output records.
The following example presents test query that forms
a chain of 5 classes.
Example 4.1 The query in this example joins classes
using the attributes p1 and p2. Note that the expres-
ICEIS 2008 - International Conference on Enterprise Information Systems
94
R Comp S Opt Eval Inx Mem
1 1 12768 2 31 7440 1 25M
1 2 11 0 35 6358 0 25M
2 1 19439 3 129 8627 2 34M
2 2 17 0 149 6397 0 34M
3 1 26650 4 441 1682 3 42M
3 2 23 0 491 6991 0 42M
4 1 37857 5 739 13121 4 52M
4 2 30 0 731 7315 0 52M
5 1 42780 6 957 13077 5 62M
5 2 33 0 925 7295 0 62M
6 1 51541 7 1277 14296 6 72M
6 2 40 0 1194 7631 0 72M
7 1 64389 8 1830 16539 7 87M
7 2 46 0 1982 7473 0 87M
Figure 4: Query execution.
sions of the form r :: p represents the name of attribute
p defined for a relation r. The condition u.r1 :: p1 < 5
restricts the output to small number of records.
select( join(r1.ext,
join(r2.ext,
join(r3.ext,
join(r4.ext,r5.ext,
i,j: i.r4::p2 == j.r5::p1),
x,y: x.r3::p2 == y.r4::p1 ),
s,d: s.r2::p2 == d.r3::p1 ),
z,w: z.r1::p2 == w.r2::p1 ),
u: u.r1::p1 < 5 )
The empirical results are presented in Figure 4. We
observe two separate executions of the same query.
The first is executed after the system is started; in
this case the statistics and the indices are not cre-
ated initially. The second execution is performed af-
ter the first one is completed so most indices are al-
ready created. We observe the following parameters:
number of joins (), the execution of query after sys-
tem startup (R=1) or later (R=2), time needed for the
compilation of query including computation of statis-
tics (Comp), number of classes for which the statis-
tics was computed (S), optimization (Opt), evaluation
of query including the creation of indices and con-
sole output (Eval), number of created indices (Inx),
and, finally, the amount of memory used by the sys-
tem (Mem). Time is specified in milliseconds. The
experiments were done on Pentium 4 computer with
500MB RAM running FreeBSD.
Let us now give some comments on empirical re-
sults presented in Figure 4. Firstly, it is obvious that
the creation of indices for larger tables can signifi-
cantly slow down the overall performance. This pays
off through the efficient evaluation of queries. Indices
are usually constructed gradually: the overall compu-
tation is performed for sequence of queries where the
previous queries may trigger the construction of in-
dices used by the subsequent queries. Least recently
used indices are dropped if there is no more room in
the main memory for a new index. This did not hap-
pen during the experiments presented in Figure 4. The
amount of memory used during query execution never
exceeds 87M which is acceptable for practically all
recent personal computers.
The optimization phase is relatively fast compar-
ing to computation of statistics and index creation.
This is achieved primarily by using beam search algo-
rithm that restricts the number of choices during the
search to N queries for each equivalence class. In the
case of exhaustive search the exponential time curve
turns very steep after 5 joins. In the presented exper-
iments we have used the window (beam) of 7 queries
in each equivalence class.
5 RELATED WORK
All recent database algebras have evolved from the re-
lational database algebra proposed by Codd in (Codd,
1970). Although some algebras do not include the
operations of relational algebra directly, each of them
is relationally complete i.e. it includes the ability to
compute selection, standard set operations, projection
and Cartesian product.
Our work on object algebra has been influenced by
the early functional query language FQL proposed by
Buneman and Frankel (Buneman and Frankel, 1979)
and by some of its descendants, for instance, the func-
tional database programming language FAD (Dan-
forth and Valduriez, 1992). The algebra is by its na-
ture a functional language, where the operations can
be combined by the use of functional composition and
higher-order functions to form the language expres-
sions. The presented object algebra can be treated as a
generalization of FQL for the manipulation of objects.
It subsumes the operations of FQL, i.e. operations ex-
tension, restriction, selection and composition.
Let us now present the work related to the imple-
mentation of the presented query execution system.
Firstly, the implementation is closely related to the
implementation of Query Algebra originally proposed
by Shaw and Zdonik in (Shaw and Zdonik, 1990) and
implemented by Mitchell (Mitchell, 1993). In partic-
ular, we have used a similar representation of query
expressions by means of query trees. Furthermore,
the representation of query expressions in Qios is op-
timized by using single operation nodes and query
trees during all phases of query processing.
The design of the query execution system was
based on the design of Exodus optimizer generator
(Graefe and DeWitt, 1987) and its descendant Vol-
cano (Graefe and McKenna, 1993). The data struc-
ture MESH used in Exodus query optimizator gener-
IMPLEMENTATION OF ALGEBRA FOR QUERYING WEB DATA SOURCES
95
ator is improved by adding additional access paths.
The data structure can be accessed through: unique
identifier, normalized query expression, and equiva-
lence class. The algorithm for query optimization is
rooted in Graefe’s work on Volcano optimizer algo-
rithm (Graefe and McKenna, 1993). This algorithm
uses top-down search guided by the possible ”moves”
that are associated to a query node. The algorithm
uses memoisation to avoid repeated optimization of
the same query. The search is restricted by the cost
limit which is a parameter in optimization.
We intended to implement a lightware system able
to manipulate middle size XML databases including
up to some 100.000 records. This size is reasonable
for most of collections appearing on the Web as well
as in our local data environments. One of the aims in
the design of Qios was to exploit this advantage. The
design of Qios has the following salient features. A
single data structure is used during the complete op-
timization and evaluation process. The central data
structure of query optimization Mesh is robust and
simple to use. Rules are treated as query trees and
rule matching and application procedures are based
on graph (tree) algorithms.
The computation of query execution plan that uses
indices created during query evaluation has similar
objectives: it is expected that XML database will be
of the above stated size and that the user is not in-
formed about the existence of indices and the selected
access paths during the execution. All decisions about
the creation of main memory indices are done by the
system.
6 CONCLUSIONS
The paper presents the implementation of an object
algebra. The data model based on F-Logic provides
a convenient environment for the representation of
semi-structured as well as structured data. Algebra
includes standard operations on sets which evolved
from the relational and nested-relational algebras and
the operations for querying the conceptual schemata.
Object algebra is implemented in the query execution
system Qios which is rooted in the architecture of the
relational and object-relational query processors.
REFERENCES
Abiteboul, S. and Beeri, C. (1993). On the power of the
languages for the manipulation of complex objects. In
Verso Report No.4. INRIA.
Buneman, P. and Frankel, R. (1979). Fql- a functional query
language. In Proc. of the ACM Conf. on Management
of Data. ACM SIGMOD.
Codd, E. F. (1970). A relational model of data for large
shared data banks. In Comm. of the ACM, Vol. 13 ,
Issue 6. ACM.
D. Daniels, E. (1982). An introduction to distributed query
compilation in r*. In IBM Research Report RJ3497
(41354). IBM.
Danforth, S. and Valduriez, P. (1992). A fad for data inten-
sive applications. In Trans. on Know. and Data Eng.,
Vol.4, No.1. IEEE.
Graefe, G. (1993). Query evaluation techniques for large
databases. In Comp. Surveys, Vol.25, No.2. ACM.
Graefe, G. (2005). Query processing. In Slides, ICDE In-
fluential Paper Award. ICDE.
Graefe, G. and DeWitt, D. (1987). The exodus optimizer
generator. In Proceedings of ACM SIGMOD interna-
tional conference on Management of data. ACM.
Graefe, G. and McKenna, W. (1993). The volcano optimizer
generator: extensibility and efficient search. In Proc.
of IEEE Conf. on Data Engineering. IEEE.
Jarke, M. and Koch, J. (1984). Query optimization in
database systems. In Comp. Surveys, Vol.16, No.2.
ACM.
Kifer, M., Lausen, G., and Wu, J. (1995). Logical founda-
tions of object-oriented and frame-based languages. In
Journal of ACM, Vol.42, No.4. ACM.
Marathe, A. (2006). Batch compilation, recompilation, and
plan caching issues in sql server 2005. In Microsoft,
White paper. Microsoft.
Mitchell, G. A. (1993). Extensible query processing in an
object-oriented database. In Ph.D. thesis. Brown Uni-
versity.
Roth, M. A., Korth, H. F., and Silberschatz, A. (1988).
Extended algebra and calculus for non 1nf relational
databases. In Trans. Database Systems, Vol.13, No.4.
ACM.
Savnik, I. (2007). Qios: Querying and integration of in-
ternet data. In http://www.famnit.upr.si/˜savnik/qios/.
FAMNIT.
Savnik, I., Tari, Z., and Mohoric, T. (1999). Qal: A query
algebra of complex objects. In Data & Knowledge
Eng. Journal, Vol.30, No.1. North-Holland.
Shaw, G. M. and Zdonik, S. B. (1990). A query algebra for
object oriented databases. In Proc. of IEEE Conf. on
Data Engineering. IEEE.
ICEIS 2008 - International Conference on Enterprise Information Systems
96