STATIC OPTIMIZATION OF DATA INTEGRATION PLANS IN GLOBAL INFORMATION SYSTEMS

Janusz R. Getta

School of Computer Science and Software Engineering, University of Wollongong, Wollongong, Australia

Keywords:

Data integration, Global information system, Multidatabase system, Online data integration, Integration plan,

Static optimization.

Abstract:

Global information systems provide their users with a centralized and transparent view of many heterogeneous and distributed sources of data. The requests to access data at a central site are decomposed and processed at the remote sites, and the results are returned to the central site. A data integration component of the system processes data retrieved and transmitted from the remote sites according to previously prepared data integration plans.

This work addresses the problem of static optimization of data integration plans in a global information system. Static optimization means that a data integration plan is transformed into a more optimal form before it is used for data integration. We adopt an online approach to data integration where the packets of data transmitted over a wide area network are integrated into the final result as soon as they arrive at a central site. We show how a data integration expression obtained from a user request can be transformed into a collection of data integration plans, one for each argument of the data integration expression. This work proposes a number of static optimization techniques that change the order of operations and eliminate materializations and constant arguments from data integration plans implemented as relational algebra expressions.

1 INTRODUCTION

Efficient integration of data retrieved and transmitted from remote sources is one of the central problems in the development of global information systems that provide users with a centralized and transparent view of many heterogeneous and distributed sources of data. A data integration component of a global information system processes data retrieved from the remote sites and transmitted to a central site. A typical architecture of a global information system decomposes the user requests into requests related to the remote sources of data and submits these requests for processing at the remote sites. The results of processing at the remote sites are transmitted back to a central site and integrated with data already available there. The process of data integration follows a data integration plan, which is prepared when a user's request is decomposed into the requests related to the remote sources. A data integration plan determines the order in which the individual requests are issued and the way in which the results of these requests are combined into the final result. The individual requests can be issued according to an entirely sequential, an entirely parallel, or a mixed sequential and parallel strategy. According to an entirely sequential strategy, a request $q_i$ can be submitted for processing at a remote site only when all results of the requests $q_1, \ldots, q_{i-1}$ are available at a central site. An entirely sequential strategy is appropriate when the results received so far can be used to reduce the complexity of the remaining requests $q_i, \ldots, q_{i+k}$. According to an entirely parallel strategy, all requests $q_i, \ldots, q_{i+k}$ are submitted simultaneously for parallel processing at the remote sites. An entirely parallel strategy is beneficial when the computational complexity and the amounts of data transmitted are more or less the same for all requests. According to a mixed sequential and parallel strategy, some requests are submitted sequentially while the others are submitted in parallel. Optimization of data integration plans is either static, when the plans are optimized before the stage of data integration, or dynamic, when the plans are changed during the processing of the requests.

The problem of static optimization of data integration plans can be formulated in the following way. Given a global information system that integrates a number of remote and independent sources of data.


Getta, J. R.

STATIC OPTIMIZATION OF DATA INTEGRATION PLANS IN GLOBAL INFORMATION SYSTEMS.

DOI: 10.5220/0003423901410150

In Proceedings of the 13th International Conference on Enterprise Information Systems (ICEIS-2011), pages 141-150

ISBN: 978-989-8425-53-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Consider a user request $q$ to a global information system and its decomposition into the individual requests $q_1, \ldots, q_n$ simultaneously submitted for processing at the remote sites. Let the request $q$ be equivalent to an expression $E(q_1, \ldots, q_n)$ over the individual requests. If the remote sites return the results $r_1, \ldots, r_n$ in response to the individual requests $q_1, \ldots, q_n$, then the final result of the user request $q$ is equal to the result of the expression $E(r_1, \ldots, r_n)$. Static optimization of a data integration plan is then equivalent to the optimization of the expression $E(r_1, \ldots, r_n)$.

A naive and quite ineffective approach would be to postpone data integration until all the results $r_1, \ldots, r_n$ returned from the remote sites are available at a central site. A more effective solution is to consider an individual reply $r_i$ as a sequence of data packets $r_{i,1}, \ldots, r_{i,k-1}$ and to perform data integration each time a new packet of data is received at a central site. Such an approach to data integration is more efficient because there is no need to wait for the complete results before the data integration expression $E(r_1, \ldots, r_n)$ is evaluated according to a given order of operations. Instead, whenever a new packet of data is received at a central site it is immediately integrated into the intermediate result, no matter which partial result it comes from. Static optimization of a data integration plan then finds the best processing strategy for the sequences of packets of data $r_{i,1}, \ldots, r_{i,k-1}$ where $i = 1, \ldots, n$. This objective requires the transformation of the data integration expression $E(r_1, \ldots, r_n)$ into the individual data integration plans for the sequences of packets $r_{i,1}, \ldots, r_{i,k-1}$ where $i = 1, \ldots, n$.

A starting point for the optimization is a data integration expression $E(r_1, \ldots, r_n)$ obtained from the decomposition of a user request to a global information system. Processing of the individual packets means that an expression $E(r_1, \ldots, r_i \oplus \delta_i, \ldots, r_n)$ must be recomputed each time a data packet $\delta_i$ is appended to an argument $r_i$. Of course, reprocessing of the entire data integration expression is too time consuming, and a better idea is to process the expression incrementally, i.e. to find how the previous result of the expression $E(r_1, \ldots, r_n)$ must be changed after $\delta_i$ is appended to the argument $r_i$. A data integration expression is transformed into a set of data integration plans where each plan represents an integration procedure for the increments of one argument of the original expression. In our approach a data integration plan is a sequence of so-called id-operations on the increments or decrements of data containers and other fixed size containers. In order to reduce the size of the arguments, static optimization of data integration plans moves the unary operations towards the beginning of a plan. Additionally, the frequently updated materializations are eliminated from the plan, and constant arguments and subexpressions are replaced with pre-computed values.

The paper is organized in the following way. First, we overview the related work in the area of optimization of data integration in distributed global information systems. Next, we derive the incremental processing of modifications and we find a system of operations on modifications of data items for the system of operations included in the relational algebra. Transformation of data integration expressions into sets of individual data integration plans is discussed in Section 4, and it is followed by a presentation of static optimization of data integration plans in Section 5. Finally, Section 6 concludes the paper.

2 RELATED WORK

Optimization of data integration in global information

systems can be traced back to optimization of query

processing in multidatabase and federated database

systems (Ozcan et al., 1997).

The external factors affecting the performance of query processing in multidatabase systems promote reactive query processing techniques. The early works on reactive query processing techniques are based on partitioning (Kabra and DeWitt, 1998) and dynamic modification of query processing plans (Getta, 2000). A dynamic modification technique finds a plan equivalent to the original one such that it is possible to continue the integration of the available data sets. Similar approaches dynamically change the order in which the join operations are executed depending on the arguments available at a central site. These techniques include query scrambling (Amsaleg et al., 1998), dynamic scheduling of operators (Urhan and Franklin, 2001), and Eddies (Avnur and Hellerstein, 2000).

Optimization of relational algebra operations used

for the data integration includes new versions of join

operation customised to online query processing, e.g.

pipelined join operator XJoin (Urhan and Franklin,

2000), ripple join (Haas and Hellerstein, 1999), dou-

ble pipelined join (Ives et al., 1999), and hash-merge

join (Mokbel et al., 2002).

A technique of redundant computations simultaneously processes a number of data integration plans, leaving the plan that provides the most advanced results (Antoshenkov and Ziauddin, 2000).

A concept of state modules described in (Raman et al., 2003) allows for concurrent processing of the tuples through the dynamic division of data integration tasks. The adaptive data partitioning technique (Ives et al., 2004) processes different partitions of the same argument using different data integration plans. The works on adaptive data partitioning (Ives et al., 2004) and optimization of data stream processing (Getta and Vossough, 2004) were the first attempts to use the associativity of the join operation to integrate separate partitions of the same arguments with different integration plans. An adaptive and online processing of data integration plans proposed in (Getta, 2005) and later in (Getta, 2006) considers the sets of elementary operations for data integration and the best integration plan for recently transmitted data.

Many of the techniques developed for the efﬁ-

cient processing of data streams (Getta and Vossough,

2004) can be applied to data integration. The reviews

of the most important data integration techniques pro-

posed so far are included in (Gounaris et al., 2002).

3 INCREMENTAL PROCESSING OF MODIFICATIONS

Initially we consider the process of data integration separately from any particular model of data. We adopt a general view of a database where a set of generic unary and binary operations processes a collection of data containers. We do not assume any particular internal structure of the data containers or any particular system of operations on the data containers. Data containers may represent relational tables, XML documents, sets of persistent objects, files of records, etc.

3.1 Id-operations

Let $r_1, \ldots, r_n$ be data containers whose structure is consistent with a given model of data. A generic operation of the model is an operation $P$ whose arguments are the data containers $r$ and $s$ and whose result is another data container.

A modification of a data container $r$ is denoted by $\delta$ and it is defined as a pair of disjoint data containers $\langle \delta^{-}, \delta^{+} \rangle$ such that $r \cap \delta^{-} = \delta^{-}$ and $r \cap \delta^{+} = \emptyset$.

A data integration operation that applies a modification $\delta$ to a data container $r$ is denoted by $r \oplus \delta$. For example, in the relational model the integration of a modification $\delta$ into a relational table $r$ is defined as the expression $(r - \delta^{-}) \cup \delta^{+}$.

Consider a generic operation $P(r,s)$ on the data containers $r$ and $s$. An incremental/decremental operation, later called an id-operation, of an argument $r$ of a generic operation $P(r,s)$ is denoted by $\alpha_P(\delta,s)$ and it is defined as the smallest modification that should be integrated with the result of $P(r,s)$ to obtain the result of $P(r \oplus \delta, s)$, i.e.

$P(r,s) \oplus \alpha_P(\delta,s) = P(r \oplus \delta, s)$ (1)

An incremental/decremental operation of an argument $s$ of a generic operation $P(r,s)$ is denoted by $\beta_P(r,\delta)$ and it is defined as the smallest modification that should be integrated with the result of $P(r,s)$ to obtain the result of $P(r, s \oplus \delta)$, i.e.

$P(r,s) \oplus \beta_P(r,\delta) = P(r, s \oplus \delta)$ (2)

If a generic operation $P(r,s)$ is a component of an integration expression, then id-operations allow for faster re-computation of $P(r,s)$ when one of its arguments is integrated with data transmitted from an external data site. An ineffective approach would be to integrate the transmitted data with an argument and to re-compute the entire generic operation. It is represented by the right hand sides of the equations (1) and (2).

A better idea is to apply an id-operation to the transmitted data and the other argument of the base operation to get a modification that can be integrated with the previous result of the generic operation. It is represented by the left hand sides of the equations (1) and (2). The application of id-operations speeds up data integration because it is possible to immediately process data received at a central site. Id-operations allow for the incremental processing of data integration expressions such that an increment of one of the arguments triggers the computation of a sequence of id-operations that return a modification, which should be applied to the final result of integration.
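The definitions of a modification and of the integration operation can be illustrated with a minimal Python sketch. It assumes that data containers are modelled as Python sets; the names `Modification` and `integrate` are hypothetical and are introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modification:
    """A modification delta = <delta^-, delta^+> of a data container (hypothetical helper)."""
    minus: frozenset  # delta^-: items to remove, must already be contained in r
    plus: frozenset   # delta^+: items to add, must be disjoint from r


def integrate(r, delta):
    """Return r (+) delta = (r - delta^-) | delta^+ after checking the constraints."""
    if delta.minus & delta.plus:
        raise ValueError("delta^- and delta^+ must be disjoint")
    if not delta.minus <= r:
        raise ValueError("delta^- must be contained in r")
    if delta.plus & r:
        raise ValueError("delta^+ must be disjoint from r")
    return (r - delta.minus) | delta.plus


r = frozenset({1, 2, 3})
delta = Modification(minus=frozenset({2}), plus=frozenset({7}))
updated = integrate(r, delta)  # removes 2 from r and adds 7
```

The three checks mirror the conditions placed on a modification: the two components are disjoint, the negative component lies inside the container, and the positive component lies outside it.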

3.2 Relational Algebra based Id-operations

Let $x$ be a nonempty set of attribute names, later called a schema, and let $dom(a)$ denote the domain of an attribute $a \in x$. A tuple $t$ defined over a schema $x$ is a full mapping $t : x \rightarrow \bigcup_{a \in x} dom(a)$ such that $\forall a \in x,\ t(a) \in dom(a)$. A relational table created on a schema $x$ is a set of tuples over the schema $x$.

Let $r$, $s$ be relational tables such that $schema(r) = x$ and $schema(s) = y$, and let $z \subseteq x$, $v \subseteq (x \cap y)$, and $v \neq \emptyset$. The symbols $\sigma_{\phi}$, $\pi_z$, $\bowtie_v$, $\sim_v$, $\ltimes_v$, $\cup$, $\cap$, $-$ denote the relational algebra operations of selection, projection, join, antijoin, semijoin, and the set algebra operations of union, intersection, and difference. All join operations are considered to be equijoin operations over a set of attributes $v$.

To find the analytical solutions of the equations (1) and (2) we assume that a data integration operation is computed in the relational model in the following way.

$r \oplus \delta = (r - \delta^{-}) \cup \delta^{+}$ (3)


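Then, the id-operations $\alpha_P$ and $\beta_P$ can be decomposed into pairs of operations, each one acting on either the negative ($\delta^{-}$) or the positive ($\delta^{+}$) component of a modification $\delta$.

$\alpha_P(\delta,s) = \langle \alpha_P^{-}(\delta^{-},s),\ \alpha_P^{+}(\delta^{+},s) \rangle$ (4)

$\beta_P(r,\delta) = \langle \beta_P^{-}(r,\delta^{-}),\ \beta_P^{+}(r,\delta^{+}) \rangle$ (5)

If we separately consider the negative and positive components of a modification $\delta$ and we replace the data integration operation with its relational definition, then we get the following equations.

$P(r,s) - \alpha_P^{-}(\delta^{-},s) = P(r - \delta^{-}, s)$ (6)

$P(r,s) \cup \alpha_P^{+}(\delta^{+},s) = P(r \cup \delta^{+}, s)$ (7)

$P(r,s) - \beta_P^{-}(r,\delta^{-}) = P(r, s - \delta^{-})$ (8)

$P(r,s) \cup \beta_P^{+}(r,\delta^{+}) = P(r, s \cup \delta^{+})$ (9)

To find the id-operations we solve the equations above for the generic operations of union ($\cup$), join ($\bowtie$), and antijoin ($\sim$), and we assume that a selection operation is always directly applied to the arguments of binary operations and that projection is applied only once, to the final result of query processing.

The analytical solutions of the equations (6) to (9) for the operation of union provide the following results.

$\alpha_{\cup}(\delta,s) = \langle \delta^{-} - s,\ \delta^{+} - s \rangle$ (10)

$\beta_{\cup}(r,\delta) = \langle \delta^{-} - r,\ \delta^{+} - r \rangle$ (11)

The id-operations for the generic operations of join and antijoin can be derived in the same way.

$\alpha_{\bowtie}(\delta,s) = \langle \delta^{-} \bowtie_v s,\ \delta^{+} \bowtie_v s \rangle$ (12)

$\beta_{\bowtie}(r,\delta) = \langle \delta^{-} \bowtie_v r,\ \delta^{+} \bowtie_v r \rangle$ (13)

$\alpha_{\sim}(\delta,s) = \langle \delta^{-} \sim_v s,\ \delta^{+} \sim_v s \rangle$ (14)

$\beta_{\sim}(r,\delta) = \langle r \ltimes_v \delta^{+},\ r \ltimes_v \delta^{-} \rangle$ (15)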
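The id-operations for union and join can be checked against full recomputation with a small Python sketch. It assumes that a relational table is a set of tuples and that each tuple is a frozenset of (attribute, value) pairs; the helper names `row`, `key`, `join`, `integrate`, `alpha_union`, and `alpha_join` are hypothetical and not part of the original formulation.

```python
def row(**kv):
    """One tuple of a relational table as a hashable set of (attribute, value) pairs."""
    return frozenset(kv.items())


def key(t, on):
    """Values of the attributes `on` in tuple t."""
    values = dict(t)
    return tuple(values[a] for a in on)


def join(r, s, on):
    """Equijoin over the attributes `on`; assumes no other attribute is shared."""
    return {a | b for a in r for b in s if key(a, on) == key(b, on)}


def integrate(rel, minus, plus):
    """Equation (3): rel (+) delta = (rel - delta^-) | delta^+."""
    return (rel - minus) | plus


def alpha_union(minus, plus, s):
    """Equation (10): id-operation of the left argument of union."""
    return minus - s, plus - s


def alpha_join(minus, plus, s, on):
    """Equation (12): id-operation of the left argument of join."""
    return join(minus, s, on), join(plus, s, on)


# Join: r(a, b) joined with s(b, c) over b.
r = {row(a=1, b=1), row(a=2, b=2)}
s = {row(b=1, c=10), row(b=2, c=20), row(b=3, c=30)}
minus, plus = {row(a=2, b=2)}, {row(a=3, b=3)}
join_incremental = integrate(join(r, s, ["b"]), *alpha_join(minus, plus, s, ["b"]))
join_recomputed = join(integrate(r, minus, plus), s, ["b"])

# Union: plain sets of values stand in for union-compatible tables.
u, w = {1, 2}, {2, 3}
u_minus, u_plus = {1, 2}, {3, 4}
union_incremental = integrate(u | w, *alpha_union(u_minus, u_plus, w))
union_recomputed = integrate(u, u_minus, u_plus) | w
```

In both cases the incremental side touches only the modification and the other argument, while the recomputed side evaluates the whole operation again; the two results coincide, as equation (1) requires.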

3.3 Application of Id-operations to Data Integration

Consider a data integration expression $E(r,s,t) = t \sim_v (r \bowtie_z s)$ where $r$, $s$, $t$ are the remote data sources, $v = schema(r) \cap schema(t)$, and $z = schema(r) \cap schema(s)$. Assume that a new increment $\delta_s = \langle \emptyset, \delta_s^{+} \rangle$ of the argument $s$ has just been transmitted to a central site. We would like to recompute the data integration expression $E(r, s \oplus \delta_s, t)$ immediately after the integration of the increment $\delta_s$ with the argument $s$.

To avoid re-computation of the entire data integration expression we first find a modification $\delta_{rs}$ that should be applied to the result of $r \bowtie_z s$ after the extension of the argument $s$ with $\delta_s$. Next, we find a modification $\delta_{rst}$ that should be applied to the result of $t \sim_v (r \bowtie_z s)$ after the modification $\delta_{rs}$ is applied to the result of $r \bowtie_z s$. From equation (13) we get $\delta_{rs} = \langle \emptyset,\ \delta_s^{+} \bowtie_z r \rangle$. Next, to find $\delta_{rst}$ we find $\beta_{\sim}(t, \delta_{rs})$. From equation (15) we get $\delta_{rst} = \langle t \ltimes_v \delta_{rs}^{+},\ t \ltimes_v \emptyset \rangle$. Finally, $\delta_{rst} = \langle t \ltimes_v (\delta_s^{+} \bowtie_z r),\ \emptyset \rangle$. Hence, in order to get the result of $E(r, s \oplus \delta_s, t)$ after the extension of $s$ with $\delta_s$ we have to compute $E(r,s,t) - (t \ltimes_v (\delta_s^{+} \bowtie_z r))$.
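The derivation above can be replayed on a toy instance. The following Python sketch assumes schemas r(a, b), s(b, c), and t(a, d), so that v = {a} and z = {b}; the data values and the helper names are hypothetical and serve only as an illustration.

```python
def row(**kv):
    """One tuple of a relational table as a hashable set of (attribute, value) pairs."""
    return frozenset(kv.items())


def key(t, on):
    values = dict(t)
    return tuple(values[a] for a in on)


def join(r, s, on):
    """Equijoin over `on`; assumes no other attribute is shared."""
    return {a | b for a in r for b in s if key(a, on) == key(b, on)}


def semijoin(r, s, on):
    """Tuples of r that have a match in s over `on`."""
    keys = {key(b, on) for b in s}
    return {a for a in r if key(a, on) in keys}


def antijoin(r, s, on):
    """Tuples of r that have no match in s over `on`."""
    return r - semijoin(r, s, on)


r = {row(a=1, b=1), row(a=2, b=2)}
s = {row(b=1, c=1)}
t = {row(a=1, d=1), row(a=2, d=2), row(a=3, d=3)}

# Previous result of E(r, s, t) = t antijoin_a (r join_b s).
e_old = antijoin(t, join(r, s, ["b"]), ["a"])

# A new increment of s arrives; only its positive component is non-empty.
delta_s_plus = {row(b=2, c=2)}

# Equation (13): modification of r join s, then equation (15): rows to remove from E.
delta_rs_plus = join(delta_s_plus, r, ["b"])
to_remove = semijoin(t, delta_rs_plus, ["a"])
e_incremental = e_old - to_remove

# Full recomputation, for comparison only.
e_recomputed = antijoin(t, join(r, s | delta_s_plus, ["b"]), ["a"])
```

The incremental path joins only the small increment with r and probes t once, instead of re-evaluating the join of r with the complete s.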

Next, we consider the same data integration expression $t \sim_v (r \bowtie_z s)$ and a new increment $\delta_t$ of the remote data source $t$. Now, processing of the increment $\delta_t$ needs either a materialization of the intermediate result of the subexpression $(r \bowtie_z s)$ or a transformation of the data integration expression into an equivalent one with a left- (right-)deep syntax tree and with the argument $t$ in the leftmost (rightmost) position of the tree. Materialization of intermediate results decreases the overall performance because whenever one of the arguments of a materialized subexpression is extended, the materialization must be updated. On the other hand, it is not always possible to transform an integration expression into a left- or right-deep syntax tree such that the modification is located at the lowest leaf level of the tree.

If the materialization $m_{rs} = r \bowtie_z s$ is available then from equation (14) we get $\delta_{rst} = \langle \emptyset,\ \delta_t^{+} \sim_v m_{rs} \rangle$. Hence, in order to get the result of $E(r,s,t \oplus \delta_t)$ after the extension of $t$ we have to compute $E(r,s,t) \cup (\delta_t^{+} \sim_v m_{rs})$.

If the materialization $m_{rs}$ is not available then an interesting option is to transform the expression $\delta_t^{+} \sim_v (r \bowtie_z s)$ into $\delta_t^{+} \sim_v ((r \ltimes_v \delta_t^{+}) \bowtie_z s)$. Such a transformation is correct because we do not need the entire result of $r \bowtie_z s$ to be computed; we only need the rows from $r$ that can be joined with $\delta_t^{+}$ over the attributes in $v$. The subexpression $r \ltimes_v \delta_t^{+}$ reduces the size of the argument $r$ before the join with $s$, and it can be computed quickly because $\delta_t^{+}$ is small.
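The semijoin reduction of r can be checked on a small example. This Python sketch again assumes schemas r(a, b), s(b, c), and t(a, d) with v = {a} and z = {b}; all names and values are hypothetical.

```python
def row(**kv):
    """One tuple of a relational table as a hashable set of (attribute, value) pairs."""
    return frozenset(kv.items())


def key(t, on):
    values = dict(t)
    return tuple(values[a] for a in on)


def join(r, s, on):
    """Equijoin over `on`; assumes no other attribute is shared."""
    return {a | b for a in r for b in s if key(a, on) == key(b, on)}


def semijoin(r, s, on):
    keys = {key(b, on) for b in s}
    return {a for a in r if key(a, on) in keys}


def antijoin(r, s, on):
    return r - semijoin(r, s, on)


r = {row(a=1, b=1), row(a=2, b=2)}
s = {row(b=1, c=1), row(b=2, c=2)}
delta_t_plus = {row(a=2, d=9), row(a=5, d=5)}

# Without reduction: the complete join of r and s is needed.
full = antijoin(delta_t_plus, join(r, s, ["b"]), ["a"])

# With reduction: r is first restricted to the rows that can match the increment.
reduced_r = semijoin(r, delta_t_plus, ["a"])
reduced = antijoin(delta_t_plus, join(reduced_r, s, ["b"]), ["a"])
```

Both expressions return the same rows of the increment, but the second one joins s with only the part of r that shares a value of a with the increment.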

4 DATA INTEGRATION PLANS

In this section we introduce the concept of a data integration plan and we show how to transform a data integration expression into a set of data integration plans. Let $E(r_1, \ldots, r_n)$ be a data integration expression built over the generic operations and data containers $r_1, \ldots, r_n$. A syntax tree $T_E$ of an expression $E$ is a binary tree such that:
(i) for each instance of an argument $r_1, \ldots, r_n$ there is a leaf node labelled with the name of the argument,
(ii) for each subexpression $P(e', e'')$ of $E$, where $P$ is a generic operation and $e'$ and $e''$ are subexpressions of $E$, there is a node labelled with $P$ that has two subtrees $T_{e'}$ and $T_{e''}$.

A data integration plan is a sequence of assignment statements $s_1, \ldots, s_m$ where the right hand side of each statement is either an application of a modification to a data container ($m_j := m_j \oplus \delta_j$) or an application of a left or right id-operation ($\delta_j := \alpha_{operation}(\delta_i, m_k)$).

Consider an argument $r_i$ of a data integration expression $E$. An implementation $p_{r_i}$ of the data integration expression for the argument $r_i$ is constructed in the following steps.
1: Assign unique numbers to each node of the syntax tree $T_E$.
2: Make the implementation $p_{r_i}$ empty.
3: Start from the leaf node of $T_E$ labelled with $r_i$.
4: While not in the root node of $T_E$, move to the ancestor node of the current node and execute a procedure moveToAncestor(i, j) where i is an identifier of the current node and j is an identifier of the ancestor node.
5: When in the root node i, append a statement $result := result \oplus \delta_i$;

A procedure moveToAncestor(i, j) consists of the following steps.
1: If i is a leaf node and it is the left descendant of a node j, then append to $p_{r_i}$ a statement that computes an id-operation $\delta_j := \alpha_{operation}(\delta_i, m_k)$; where $m_k$ is either a data container or a materialization in the right descendant of node j.
2: If i is a leaf node and it is the right descendant of a node j, then append to $p_{r_i}$ a statement that computes an id-operation $\delta_j := \beta_{operation}(m_k, \delta_i)$; where $m_k$ is either a data container or a materialization in the left descendant of node j.
3: If i is not a leaf node of $T_E$, then append a statement $m_i := m_i \oplus \delta_i$; where $m_i$ is a materialization in the node i.
4: If i is not a leaf node and it is the left descendant of a node j, then append a statement that computes an id-operation $\delta_j := \alpha_{operation}(\delta_i, m_k)$; where $m_k$ is either a data container or a materialization in the right descendant of node j.
5: If i is not a leaf node and it is the right descendant of a node j, then append a statement that computes an id-operation $\delta_j := \beta_{operation}(m_k, \delta_i)$; where $m_k$ is either a data container or a materialization in the left descendant of node j.

As a simple example consider a data integration expression $P(r,s,t) = t \bowtie (r \sim s)$ and its syntax tree with the numbered nodes given in Figure 1.

[Figure 1: A syntax tree of an expression $t \bowtie (r \sim s)$ with numbered nodes.]

Starting from node 4 we get a data integration plan for the modifications of the argument $r$:

$p_r$: $\delta_3 := \alpha_{\sim}(\delta_4, s)$;
$m_3 := m_3 \oplus \delta_3$;
$\delta_1 := \beta_{\bowtie}(t, \delta_3)$;
$result := result \oplus \delta_1$;

An equivalent relational data integration plan is obtained by the substitution of the id-operations with the equivalent relational algebra expressions.

$p_r$: $\delta_3 := \delta_4 \sim s$;
$m_3 := m_3 \oplus \delta_3$;
$\delta_1 := t \bowtie \delta_3$;
$result := result \oplus \delta_1$;

The relational algebra operations on the relational tables and modifications process both the negative and the positive components of the modifications. For example, $\delta_3 := \delta_4 \sim s$; is equivalent to $\delta_3^{-} := \delta_4^{-} \sim s$; $\delta_3^{+} := \delta_4^{+} \sim s$;

The data integration plans for the arguments $s$ and $t$ are the following.

$p_s$: $\delta_3 := \beta_{\sim}(r, \delta_5)$;
$m_3 := m_3 \oplus \delta_3$;
$\delta_1 := \beta_{\bowtie}(t, \delta_3)$;
$result := result \oplus \delta_1$;

$p_t$: $\delta_1 := \alpha_{\bowtie}(\delta_2, m_3)$;
$result := result \oplus \delta_1$;
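The construction of the plans from a syntax tree can be sketched in Python. The sketch assumes that the nodes of the example tree are numbered 1 (join), 2 (t), 3 (antijoin), 4 (r), and 5 (s); the class `Node` and the functions `op`, `operand`, and `plan_for` are hypothetical names, and the plans are produced as plain strings with `(+)` standing for the integration operation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    ident: int
    label: str                       # operation name or argument name
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    parent: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None


def op(ident, label, left, right):
    """Build an inner node and set the parent links of its children."""
    node = Node(ident, label, left, right)
    left.parent = right.parent = node
    return node


def operand(node):
    """A leaf is read as a data container, an inner node through its materialization."""
    return node.label if node.is_leaf() else f"m{node.ident}"


def plan_for(leaf):
    """Walk from a leaf to the root and emit one statement per step."""
    plan, current = [], leaf
    while current.parent is not None:
        parent = current.parent
        if not current.is_leaf():
            # Keep the materialization of an inner node up to date.
            plan.append(f"m{current.ident} := m{current.ident} (+) d{current.ident}")
        if parent.left is current:
            plan.append(f"d{parent.ident} := alpha_{parent.label}"
                        f"(d{current.ident}, {operand(parent.right)})")
        else:
            plan.append(f"d{parent.ident} := beta_{parent.label}"
                        f"({operand(parent.left)}, d{current.ident})")
        current = parent
    plan.append(f"result := result (+) d{current.ident}")
    return plan


t, r, s = Node(2, "t"), Node(4, "r"), Node(5, "s")
root = op(1, "join", t, op(3, "antijoin", r, s))
plan_r, plan_s, plan_t = plan_for(r), plan_for(s), plan_for(t)
```

Under these assumptions `plan_r` and `plan_s` contain four statements each, including the update of the materialization in node 3, while `plan_t` contains only the id-operation against that materialization and the final integration.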

5 STATIC OPTIMIZATION OF DATA INTEGRATION PLANS

Static optimization of data integration plans means that the transformation of the plans is performed before the stage of data integration, while dynamic optimization of data integration plans changes the plans during the stage of data integration. In order to get more sound results we take the relational algebra as the language of data integration expressions and we consider data integration expressions built from the operations of join, antijoin, and union only. We also assume that the operation of selection is always processed together with an adjacent binary operation and that projection is computed at the very end of data integration.


5.1 Preliminary Optimizations

The preliminary optimizations are performed before the transformation of a data integration expression into data integration plans. They include the standard transformations of relational algebra expressions where the selections and projections are "pushed" down the syntax tree of the data integration expression and the join operations are reordered such that the joins on the small arguments are performed first. Next, the selection operations are associated with the binary operations such that the rows that satisfy a selection condition are directly "piped" to the first stage of the computation of the binary operation. For example, if a join operation implemented as a hash-based join follows a selection, then a row that satisfies the selection condition is not saved in a temporary result of the selection and instead is hashed in the first stage of the computation of the hash-based join. The same technique is applied to the selection operations that cannot be "pushed" down below the binary operations; for example, $\sigma_{a>c}(r(ab) \bowtie_b s(bc))$ is computed by "piping" the rows obtained as the results of the binary operation $r(ab) \bowtie_b s(bc)$ directly to the selection operation. It is also possible to implement a selection operation as an additional comparison when the rows from the arguments are matched during the processing of the binary operation; in the example above, testing of the equality condition r.b = s.b can be followed by testing of the condition r.a > s.c. After the preliminary stage of optimizations, data integration expressions are transformed into data integration plans.
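The association of selections with a hash-based join can be sketched as a Python generator. The function name `hash_join_with_selection` and its parameters `pre_filter` and `residual` are hypothetical; rows are assumed to be dictionaries, and the sketch is not tied to any particular database engine.

```python
from collections import defaultdict


def hash_join_with_selection(r, s, key, pre_filter=None, residual=None):
    """Equijoin r and s on `key`.

    pre_filter: selection on rows of r, piped straight into the build phase.
    residual:   condition over a matched pair, tested right after the equality match.
    """
    buckets = defaultdict(list)
    for left in r:
        # A selected row is hashed immediately; no temporary selection result is stored.
        if pre_filter is None or pre_filter(left):
            buckets[left[key]].append(left)
    for right in s:
        for left in buckets.get(right[key], []):
            # The equality r.b = s.b holds here; the extra comparison follows it.
            if residual is None or residual(left, right):
                yield {**left, **right}


r = [{"a": 5, "b": 1}, {"a": 1, "b": 1}, {"a": 9, "b": 2}]
s = [{"b": 1, "c": 3}, {"b": 2, "c": 10}]

# sigma_{a>c}(r(ab) join_b s(bc)) with the condition a > c tested during matching.
selected = list(hash_join_with_selection(r, s, "b", residual=lambda x, y: x["a"] > y["c"]))

# A selection on r alone, piped into the build phase of the join.
piped = list(hash_join_with_selection(r, s, "b", pre_filter=lambda x: x["a"] > 4))
```

Neither variant stores an intermediate selection result: the pushed-down condition is evaluated while the hash table is built, and the condition that spans both arguments is evaluated on each matched pair.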

5.2 Optimization through Reordering of Operations

The following example shows how further optimization of data integration plans can be achieved through reordering of the operations. Consider a fragment of a data integration plan $p_t$: $\delta_{tr} := \delta_t \bowtie r$; $m_{tr} := m_{tr} \oplus \delta_{tr}$; $\delta_{trs} := \delta_{tr} \sim s$; The following two observations lead to transformations that may have a positive impact on the performance of data integration plans. First, if the data container $r$ is significantly larger than the data container $s$, then it would be more efficient to start from the computations on the data container $s$ because the results would be smaller. Second, some of the data items in $\delta_t$ may not contribute to the results of $\delta_{trs} := \delta_{tr} \sim s$ and can be removed from $\delta_t$ before the computation of the antijoin operation. We compute an operation $\delta_t \div s$ to partition both $s$ and $\delta_t$ into pairs $\langle s^{(\delta_t+)}, s^{(\delta_t-)} \rangle$ and $\langle \delta_t^{(s+)}, \delta_t^{(s-)} \rangle$ such that only $s^{(\delta_t+)}$ and $\delta_t^{(s+)}$ have an impact on the result of $\delta_{trs} := \delta_{tr} \sim s$; The partitioning is performed such that $\delta_t^{(s+)} := \delta_t \ltimes s$, $\delta_t^{(s-)} := \delta_t \sim s$, $s^{(\delta_t+)} := s \ltimes \delta_t$, and $s^{(\delta_t-)} := s \sim \delta_t$. Then, we compute $\langle \delta_t^{(s+)}, \delta_t^{(s-)} \rangle \bowtie r$, and later on the result of $\delta_t^{(s-)} \bowtie r$ is directly passed to $\delta_{trs}$ and it is unioned with the result of $(\delta_t^{(s+)} \bowtie r) \sim s^{(\delta_t+)}$.

The complexity of the partitioning $\delta_t \div s$ is comparable with the complexity of an ordinary join operation. The complexity of the computation of $\langle \delta_t^{(s+)}, \delta_t^{(s-)} \rangle \bowtie r$ is the same as the complexity of $\delta_t \bowtie r$. It is expected that the additional computation of $\delta_t \div s$ will take less time than the difference between the computation of $(\delta_t \bowtie r) \sim s$ and the computation of $(\delta_t^{(s+)} \bowtie r) \sim s^{(\delta_t+)}$. The benefits depend on how far the partitioning reduces the sizes of $s^{(\delta_t+)}$ and $(\delta_t^{(s+)} \bowtie r)$.
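The effect of the partitioning can be illustrated with a Python sketch under explicit assumptions: the modification has schema (a, b), the container r has schema (b, c), the container s has schema (a, c), the join is over b, the antijoin is over {a, c}, and the partitioning is over the attribute a shared by the modification and s. These schemas, the data, and all helper names are hypothetical.

```python
def row(**kv):
    """One tuple of a relational table as a hashable set of (attribute, value) pairs."""
    return frozenset(kv.items())


def key(t, on):
    values = dict(t)
    return tuple(values[a] for a in on)


def join(r, s, on):
    """Equijoin over `on`; assumes no other attribute is shared."""
    return {a | b for a in r for b in s if key(a, on) == key(b, on)}


def semijoin(r, s, on):
    keys = {key(b, on) for b in s}
    return {a for a in r if key(a, on) in keys}


def antijoin(r, s, on):
    return r - semijoin(r, s, on)


delta = {row(a=1, b=1), row(a=2, b=2), row(a=3, b=1)}
r = {row(b=1, c=1), row(b=2, c=2)}
s = {row(a=1, c=1), row(a=3, c=9), row(a=7, c=7)}

# Original order of operations: join first, then antijoin with the complete s.
original = antijoin(join(delta, r, ["b"]), s, ["a", "c"])

# Partitioning of the modification and of s over the shared attribute a.
delta_s_plus = semijoin(delta, s, ["a"])    # may still be removed by the antijoin
delta_s_minus = antijoin(delta, s, ["a"])   # can never match s, passed through
s_delta_plus = semijoin(s, delta, ["a"])    # the only part of s that matters

passed_through = join(delta_s_minus, r, ["b"])
checked = antijoin(join(delta_s_plus, r, ["b"]), s_delta_plus, ["a", "c"])
reordered = passed_through | checked
```

In this instance the antijoin is evaluated against two rows of s instead of three and on two joined rows instead of three, while the result stays the same.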

5.3 Elimination of Materializations

An algorithm that transforms a syntax tree of a data integration expression into the data integration plans creates references to so-called materializations. A materialization is a relational table that contains the intermediate results of processing of one of the subexpressions of a data integration plan. Materializations are needed when an online processing plan is created for an argument which is not at the bottom level of a syntax tree, or when the syntax tree is not a left-/right-deep syntax tree. Materializations require additional integration operations in online processing plans, and because of that the frequently performed integrations of the partial results with materializations may consume a lot of additional time. To avoid this problem we look for ways to remove materializations from data integration plans.

Consider an integration expression $t \bowtie (r \bowtie s)$. An integration plan for the argument $t$, i.e. $p_t$: $\delta_{trs} := \delta_t \bowtie m_{rs}$; $result := result \oplus \delta_{trs}$; uses a materialization $m_{rs}$ which contains the intermediate results of the subexpression $r \bowtie s$. The plans for the arguments $r$ and $s$ integrate the intermediate results of the same subexpression with the materialization $m_{rs}$, for example $p_r$: $\delta_{rs} := \delta_r \bowtie s$; $m_{rs} := m_{rs} \oplus \delta_{rs}$; $\delta_{rst} := \delta_{rs} \bowtie t$; $result := result \oplus \delta_{rst}$; A simple solution to eliminate the materialization $m_{rs}$ is to apply the associativity of the join operation and to transform the expression into an expression which has a left-/right-deep syntax tree with the argument $t$ located at one of the bottom leaf level nodes of the tree. To do so, we transform the integration expression into $(t \bowtie r) \bowtie s$ and we create a new integration plan $p'_t$: $result := result \oplus (\delta_t \bowtie r) \bowtie s$; Additionally, we eliminate the integration of partial results with the materialization $m_{rs}$ from the integration plans for $r$ and $s$.

The next example shows that associativity of the operations involved in an expression is not a necessary condition for the elimination of a materialization. We consider a data integration expression $(r \sim_x s) \bowtie_y t$. The objective is to eliminate the materialization that contains the intermediate results of $r \sim_x s$ from the integration plan for the argument $t$, i.e. $p_t$: $\delta_{rst} := m_{rs} \bowtie_y \delta_t$; $result := result \oplus \delta_{rst}$. We transform the expression $m_{rs} \bowtie_y \delta_t$ into a form where the modification $\delta_t$ is located at the bottom level of a left-/right-deep syntax tree in the following way. First we substitute $m_{rs}$ with $r \sim_x s$. Next, we replace the argument $r$ with an equivalent expression $(r \ltimes_y \delta_t) + (r \sim_y \delta_t)$ where the operation $+$ denotes a concatenation of the disjoint results of the semijoin and antijoin operations. The distributivity of the concatenation operation over the antijoin and join operations allows us to transform the expression $(((r \ltimes_y \delta_t) + (r \sim_y \delta_t)) \sim_x s) \bowtie_y \delta_t$ into a concatenation of two subexpressions $((r \ltimes_y \delta_t) \sim_x s) \bowtie_y \delta_t + ((r \sim_y \delta_t) \sim_x s) \bowtie_y \delta_t$. The results of processing the subexpression $((r \sim_y \delta_t) \sim_x s) \bowtie_y \delta_t$ are always empty because the rows kept by $(r \sim_y \delta_t)$ have no y-values in common with $\delta_t$ and return no results when the join with $\delta_t$ is performed later on. We get an expression $((r \ltimes_y \delta_t) \sim_x s) \bowtie_y \delta_t$ equivalent to the expression $m_{rs} \bowtie_y \delta_t$ in the original integration plan $p_t$. Because the transformations above eliminated the materialization $m_{rs}$ from $p_t$, it is possible to eliminate it from the remaining plans $p_r$ and $p_s$. Hence, the data integration plans for the integration expression $(r \sim_x s) \bowtie_y t$ are as follows.

$p_r$: $result := result \oplus (\delta_r \sim_x s) \bowtie_y t$;
$p_s$: $result := result \oplus (r \ltimes_x \delta_s) \bowtie_y t$;
$p_t$: $result := result \oplus (((r \ltimes_y \delta_t) \sim_x s) \bowtie_y \delta_t)$;
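The equivalence used to remove the materialization from the plan for t can be checked on a small instance. The Python sketch below assumes schemas r(x, y), s(x, u), and t(y, w), with the antijoin over x and the join over y; these schemas, the data, and the helper names are hypothetical.

```python
def row(**kv):
    """One tuple of a relational table as a hashable set of (attribute, value) pairs."""
    return frozenset(kv.items())


def key(t, on):
    values = dict(t)
    return tuple(values[a] for a in on)


def join(r, s, on):
    """Equijoin over `on`; assumes no other attribute is shared."""
    return {a | b for a in r for b in s if key(a, on) == key(b, on)}


def semijoin(r, s, on):
    keys = {key(b, on) for b in s}
    return {a for a in r if key(a, on) in keys}


def antijoin(r, s, on):
    return r - semijoin(r, s, on)


r = {row(x=1, y=1), row(x=2, y=2), row(x=3, y=3)}
s = {row(x=1, u=0)}
delta_t = {row(y=2, w=2), row(y=9, w=9)}

# Plan that relies on the materialization m_rs = r antijoin_x s.
m_rs = antijoin(r, s, ["x"])
with_materialization = join(m_rs, delta_t, ["y"])

# Plan without the materialization: r is first reduced by a semijoin with delta_t.
without_materialization = join(
    antijoin(semijoin(r, delta_t, ["y"]), s, ["x"]), delta_t, ["y"])

# The branch built on r antijoin_y delta_t never contributes to the result.
dropped_branch = join(
    antijoin(antijoin(r, delta_t, ["y"]), s, ["x"]), delta_t, ["y"])
```

The plan without the materialization reads only the rows of r that share a y-value with the increment, so no stored copy of the antijoin of r and s has to be kept up to date.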

Next, we discuss how to eliminate materializa-

tion in a more general case. Consider an argument

r whose integration plan uses a materialization. If r

has at least one common attribute with a modiﬁca-

tion of δ

of another argument s of integration ex-

pression than it is always possible to replace r with

(r⋉

) + (r ∼

). Then it is possible to apply dis-

tributivity of concatenation operation and to eliminate

one of the components of the expression later on like

in the example above. A problem is how to ﬁnd when

such transformation is possible. Consider an imple-

mentation of online processing plan where an oper-

ation p

(δ(x),m(y)) acts on a modiﬁcation δ(x) and

materialization m(y) such that x ∩ y = z and z 6=

It is possible to eliminate materialization m(y) from

the online processing plan when there exists an ar-

gument s(v) of subexpression of materialization m(y)

(see Figure 2) such that:

δ (x)

m(y)

... ...

Figure 2: Elimination of materialization m(y).

(i) v ∩ x ≠ ∅ and

(ii) (v ∩ x) ⊆ z.

Elimination of the materialization m(y) is performed by substituting s(v) in the subexpression of the materialization with (s(v) ⋉_{v∩x} δ(x)) + (s(v) ∼_{v∩x} δ(x)). Processing of a modification δ(x) then triggers the computations along a path that leads from the semijoin and antijoin of s(v) with δ(x) to the materialization m(y). Unfortunately, this does not solve the problem from a performance point of view. The substitution of s(v) with a concatenation of the semijoin and antijoin of s(v) with a modification δ(x) still provides a complete s(v) and requires reprocessing of the entire materialization m(y). In fact, when the modification δ(x) is small, only a fraction of the materialization m(y) affects the result of p(δ(x), m(y)). A solution is then to recompute only those components of the materialization that affect the result of the operation p. If it is possible to eliminate either the semijoin of s(v) with δ(x) or the antijoin of s(v) with δ(x), then only a subset of the argument s(v) is involved in the processing. Next, we show a formal method that finds when a materialization can be removed and what transformations of the arguments of a relational implementation of a data integration plan are required to do so.

Let T_e be a syntax tree of a relational algebra expression e(r_1, ..., r_n) built over the operations of set difference, join, semijoin, and antijoin. Let a node n in T_e represent a binary operation p(r(x), s(y)) such that v = x ∩ y. Labelling of T_e is performed in the following way.

(i) An edge between a leaf node that represents an argument r(x) and its parent node can be labeled with z_r, where z ⊆ x and z ≠ ∅.

(ii) If a node n in T_e represents an operation p that produces a result r(x) and a "child" edge of the node is labeled with one of the symbols z, z−, −z, z∗, then the "parent" edge of n can be labeled with the symbol located in Table 1 in the row indicated by the label of the "child" edge and the column indicated by the operation p.

Labelling of syntax tree is performed to discover the

types of coincidences between the z-values of one or

more arguments of relational algebra expression. The


Table 1: The labelling rules for syntax trees of relational algebra expressions.

child label   ⋈_v    (left)∼_v   ∼_v(right)   (left)⋉_v   ⋉_v(right)   (left)−   −(right)

z             z−     z−          −(z∩v)       z−          (z∩v)−       z−        −z

z−            z−     z−          (z∩v)∗       z−          (z∩v)−       z−        z∗

−z            −z     −z          (z∩v)∗       −z          (z∩v)∗       −z        z∗

z∗            z∗     z∗          (z∩v)∗       z∗          z∗           z∗        z∗

Figure 3: A labelled syntax tree of the online processing plan p_t: result := result ⊕ ((r(xy) ∼_y s(yz)) ⋈_y δ_t(yz)).

coincidences and their types are needed to find out whether it is possible to remove the materializations and whether their elimination is beneficial.
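The labelling rules can be read as a lookup table. The following Python sketch encodes Table 1 as it is read from the text above; the encoding is an assumption made for illustration. The kind names "z", "z-", "-z" and "z*", the column names, and the function parent_label are hypothetical, and the Boolean flag records whether the attribute set of the label is narrowed from z to z ∩ v.

```python
# Columns of Table 1: the operation at the node and the side on which
# the labelled "child" edge enters it.
COLUMNS = ["join", "antijoin_left", "antijoin_right",
           "semijoin_left", "semijoin_right", "diff_left", "diff_right"]

# Rows of Table 1: for each kind of "child" label, the kind of the
# "parent" label per column and whether z is narrowed to z ∩ v.
ROWS = {
    "z":  [("z-", False), ("z-", False), ("-z", True), ("z-", False),
           ("z-", True), ("z-", False), ("-z", False)],
    "z-": [("z-", False), ("z-", False), ("z*", True), ("z-", False),
           ("z-", True), ("z-", False), ("z*", False)],
    "-z": [("-z", False), ("-z", False), ("z*", True), ("-z", False),
           ("z*", True), ("-z", False), ("z*", False)],
    "z*": [("z*", False), ("z*", False), ("z*", True), ("z*", False),
           ("z*", False), ("z*", False), ("z*", False)],
}

def parent_label(child_kind, attrs, op, v):
    """Return (kind, attribute set) of the "parent" edge label."""
    kind, narrow = ROWS[child_kind][COLUMNS.index(op)]
    attrs = frozenset(attrs)
    return kind, (attrs & frozenset(v) if narrow else attrs)

# Antijoin r(xy) ∼ s(yz) over the common attributes v = {y}.
v = {"y"}
label_from_r = parent_label("z", {"y"}, "antijoin_left", v)
label_from_s = parent_label("z", {"y"}, "antijoin_right", v)
print(label_from_r, label_from_s)
```

Under this encoding the left argument of the antijoin contributes a label of kind z− and the right argument a label of kind −z, both over the attribute y.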

As an example, consider an integration expression (r(xy) ∼_y s(yz)) ⋈_y t(yz) and an integration plan p_t: result := result ⊕ (m(xy) ⋈_y δ_t(yz)); for processing the increments of the argument t. The materialization is computed as m(xy) = r(xy) ∼_y s(yz). A syntax tree of the plan with the materialization m replaced with r(xy) ∼_y s(yz) is given in Figure 3. To eliminate the materialization we try to find the coincidences between the y-values of r(xy) and δ_t(yz), and we label the syntax tree in the way described above. The "parent" edges of the nodes r(xy), s(yz), and δ_t(yz) obtain the labels y_r, y_s, and y_t. The left "child" edge of the root node obtains a label y_r− indicated by the location in the first row and the second column of Table 1. Moreover, the same edge obtains a label −y_s indicated by the location in the first row and the third column of Table 1. The final labelling of the syntax tree is given in Figure 3.

The interpretations of the labels are the following. The label y_t attached to a "child" edge of the join operation at the root node of the tree indicates that all y-values of the argument δ_t(yz) are processed by the operation. The label y_r− attached to a "child" edge of the same operation indicates that only a subset of the y-values of the argument r(xy), and no other y-values, are processed by the operation. The label −y_s attached to the same edge indicates that none of the y-values in s(yz) is included in the result of r(xy) ∼_y s(yz). The above interpretation of the labels y_r− and y_t in the context of a join operation over a set of attributes y means that y-values that do not occur in both r(xy) and δ_t(yz) have no impact on the result of the join operation. It means that r(xy) can be replaced with r(xy) ⋉_y δ_t(yz) and δ_t(yz) can be replaced with δ_t(yz) ⋉_y r(xy) without changing the result of the expression. The interpretation of the labels −y_s and y_t in the context of the join operation allows for the elimination from δ_t(yz) of all y-values that are included in s(yz), because these values have no impact on the join operation. It means that δ_t(yz) ⋉_y r(xy) can be replaced with (δ_t(yz) ⋉_y r(xy)) ∼_y s(yz) without changing the result of the expression. It is also possible to replace s(yz) with s(yz) ⋉_y δ_t(yz), because all y-values included in s(yz) and not included in δ_t(yz) have no impact on the result of the join operation. However, the last modification is questionable from a performance point of view. It definitely speeds up the antijoin operation, but it also delays the join operation because the results of the antijoin operation are larger after the reduction of s(yz).
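The replacements discussed in this example can be checked on sample data. The sketch below is a hedged illustration under the assumption that relations are lists of dictionaries and that every operation is over the attribute y. The helper names key, semijoin, antijoin, join, canon and plan, and the sample tuples, are hypothetical.

```python
def key(t, attrs):
    """Project a tuple t (a dict) onto the attributes attrs."""
    return tuple(t[a] for a in attrs)

def semijoin(r, s, attrs):
    """r ⋉ s: tuples of r that have a match in s on attrs."""
    keys = {key(u, attrs) for u in s}
    return [t for t in r if key(t, attrs) in keys]

def antijoin(r, s, attrs):
    """r ∼ s: tuples of r that have no match in s on attrs."""
    keys = {key(u, attrs) for u in s}
    return [t for t in r if key(t, attrs) not in keys]

def join(r, s, attrs):
    """r ⋈ s on attrs; matching tuples are merged into one dict."""
    return [{**t, **u} for t in r for u in s
            if key(t, attrs) == key(u, attrs)]

def canon(rel):
    """Order-independent form of a relation, used for comparison."""
    return {tuple(sorted(t.items())) for t in rel}

Y = ["y"]
r = [{"x": 1, "y": 10}, {"x": 2, "y": 20},
     {"x": 3, "y": 30}, {"x": 4, "y": 40}]
s = [{"y": 20, "z": "a"}, {"y": 50, "z": "b"}]
dt = [{"y": 10, "z": "p"}, {"y": 20, "z": "q"}, {"y": 60, "z": "w"}]

def plan(r_arg, s_arg, dt_arg):
    """(r ∼ s) ⋈ δ_t with the given arguments."""
    return canon(join(antijoin(r_arg, s_arg, Y), dt_arg, Y))

base = plan(r, s, dt)
variants = {
    "r replaced with r ⋉ δ_t": plan(semijoin(r, dt, Y), s, dt),
    "δ_t replaced with δ_t ⋉ r": plan(r, s, semijoin(dt, r, Y)),
    "δ_t replaced with (δ_t ⋉ r) ∼ s":
        plan(r, s, antijoin(semijoin(dt, r, Y), s, Y)),
    "s replaced with s ⋉ δ_t": plan(r, semijoin(s, dt, Y), dt),
}
for name, result in variants.items():
    print(name, result == base)
```

A check on one instance does not prove the equivalences; it only shows that the replacements behave as described for this data.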

The labelling and the possible replacements of arguments are summarized in Tables 2 and 3. The interpretation of the tables is the following. Consider a relational algebra expression e(r, r_1, ..., r_n, s) such that an operation p is included in the root node of its syntax tree T_e. If the operation p is either a join or a semijoin operation, then the possible replacements of the arguments r and s are included in Table 2. If the operation p is either an antijoin or a set difference operation, then the possible replacements are included in Table 3. The replacements of the arguments r and s over a common set of attributes z ⊆ x can be found after the labelling of both paths from the leaf nodes representing the arguments r and s towards the root node of T_e labeled with p. The replacements of the arguments r and s are located at the intersection of the row labeled with the label of the left "child" edge and the column labeled with the label of the right "child" edge of the root node. For instance, consider a subtree of the arguments s and r such that an operation ⋈_z is in the root node of the subtree. If the left "child" edge of the root node is labeled with −z_r and the right "child" edge of the root node is labeled with z_s∗, then Table 2 indicates that it is possible to replace the contents of the argument s with the expression s ∼_z r. A sample justification of the replacement included in Table 2 at the intersection of the row labeled with −z_r and the column labeled with z_s∗ is the following. Let T_e be a syntax tree of a relational algebra expression e(r, r_1, ..., r_n, s) built of the operations of join, semijoin, antijoin, and set difference, and such that the root node of the tree is labeled with either ⋈_z or ⋉_z and

ICEIS 2011 - 13th International Conference on Enterprise Information Systems


Table 2: The replacements of arguments in integration

plans.

⋊⋉

, ⋉

− −z

∗

n/a r⋉

s r ∼

s s⋉

s⋉

r s⋉

− r⋉

s r⋉

s r ∼

s s⋉

s⋉

r s⋉

r s ∼

−z

s ∼

r s ∼

r either s ∼

r s ∼

r⋉

s r ∼

s or r ∼

∗ r⋉

s r⋉

s r ∼

s none

Table 3: The replacements of arguments in integration

plans.

∼

,− z

− −z

∗

n/a s⋉

r s⋉

− s⋉

r s⋉

−z

either s ∼

r s ∼

r either s ∼

r s ∼

or r ∼ s or r ∼ s

∗ r ∼

s r ∼

s none none

its left "child" edge is labeled with −z_r and its right "child" edge is labeled with z_s∗, z ⊆ x. Then, for any values of the arguments r, r_1, ..., r_n, s, we have e(r, r_1, ..., r_n, s) = e(r, r_1, ..., r_n, (s ∼_z r)). The label −z_r attached to the left "child" edge of a join ⋈_z or semijoin ⋉_z operation means that none of the z-values of the argument r is included in the corresponding argument of the join or semijoin. These z-values can therefore be removed from the argument s, because they will never participate in the join or semijoin operation. On the other hand, we cannot replace the argument r, because the label z_s∗ means that some new z-values can be added to the original set of z-values in s.
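The justification above can be tested exhaustively on a small domain. The sketch below uses a hypothetical expression, (a ∼ r) ⋈ (b ∼ (s ∼ c)), chosen so that the labelling rules of Table 1 give the left "child" edge of the root the label −z_r and the right "child" edge the label z_s∗. As a simplifying assumption, every relation has the single attribute z, so a relation is a Python set of z-values, the antijoin is a set difference, and the join is an intersection. The names UNIVERSE, subsets, e and e_replaced are illustrative only.

```python
from itertools import combinations, product

UNIVERSE = (0, 1, 2)  # small domain of z-values

def subsets(values):
    """All subsets of values, as frozensets."""
    return [frozenset(c) for n in range(len(values) + 1)
            for c in combinations(values, n)]

def e(r, a, b, s, c):
    """(a ∼ r) ⋈ (b ∼ (s ∼ c)) over the single attribute z."""
    return (a - r) & (b - (s - c))

def e_replaced(r, a, b, s, c):
    """The same expression with s replaced by s ∼ r."""
    return (a - r) & (b - ((s - r) - c))

# Try every combination of the five relations over UNIVERSE.
counterexamples = [args for args in product(subsets(UNIVERSE), repeat=5)
                   if e(*args) != e_replaced(*args)]
print(len(counterexamples))
```

The check covers only this expression and this domain; it illustrates the argument rather than replacing it.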

Let e(r_1, ..., r_n) be an expression that defines a materialization m in an integration plan for the increments δ_s of an argument s. Elimination of the materialization m from the integration plan for δ_s is possible when some of the arguments r_1, ..., r_n can be replaced with subexpressions involving δ_s such that the syntax tree of e′(r_1, ..., r_n, δ_s) does not contain a subexpression that does not involve δ_s. In other words, we try to replace some of the arguments in the expression that defines the materialization such that the entire expression can be recomputed with the argument δ_s and no subexpression exists that does not involve δ_s.

An interesting problem is whether any materialization can be removed using the replacements described above. The analysis of Tables 2 and 3 and of the structural properties of relational implementations of integration plans reveals three cases when materializations cannot be removed through the replacements.

(1) The operation at the root node of the syntax tree of an online processing plan is a join operation and its left and right "child" edges are labeled with z_r∗ and z_s∗ respectively; see the location in the lower right corner of Table 2.

(2) The operation at the root node of the syntax tree of an online processing plan is either a join operation or a semijoin operation, a materialization is the first argument of the operation, and the right "child" edge is labeled with z_s∗. This is because all reductions in the last column of Table 2 are applicable to the second argument of the operation, which is obtained from the processing of a modification and not of a materialization.

(3) The operation at the root node of the syntax tree of an online processing plan is either an antijoin operation or a difference operation, a materialization is the second argument of the operation, and the left "child" edge of the node is labeled with z_r∗. This is because all replacements in the last row of Table 3 are applicable to the first argument of the operation, which is obtained from the processing of a modification and not of a materialization.

6 SUMMARY, CONCLUSIONS, AND FUTURE WORK

This work addresses the problem of static optimization of data integration plans in global information systems. The users' requests submitted at a central site are decomposed into individual requests and simultaneously submitted for processing at the remote sites. We show how data integration plans for the increments of the individual arguments can be derived from a data integration expression, and we propose a number of static optimization techniques for data integration plans implemented as relational algebra expressions.

Immediate processing of data packets as they are received from the remote sites allows for better utilization of the data processing resources available at a central site. The continuous processing of small portions of data transmitted from the remote sites eliminates the idle time when a data integration system has to wait for the transmission of an entire argument. Decomposition of a data integration expression into individual plans allows for more precise optimization of data integration, and it also allows for better scheduling of data processing on multiprocessor systems. Identification of coincidences between the arguments of a data integration expression leads to the elimination of materializations from data integration plans and to a reduction of the processing load when materializations change frequently.


A number of problems remain to be solved. Elimination of materializations from data integration plans depends on the parameters of transmission of the arguments, and a problem is how to predict these parameters at the static optimization phase. Another interesting problem is the identification of all materializations that can be eliminated at a given moment of time and the scheduling of the replacements in a process of online data integration. The other problems include the derivation of more sophisticated systems of id-operations from systems of binary operations different from the relational algebra, e.g. a system including aggregation operations, further investigation of the properties of data integration plans, and more advanced data integration algorithms where the application of a particular online plan depends on what increments of data are available at the moment.

REFERENCES

Amsaleg, L., Franklin, J., and Tomasic, A. (1998). Dynamic query operator scheduling for wide-area remote access. Journal of Distributed and Parallel Databases, 6:217–246.

Antoshenkov, G. and Ziauddin, M. (2000). Query processing and optimization in Oracle Rdb. VLDB Journal, 5(4):229–237.

Avnur, R. and Hellerstein, J. M. (2000). Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 261–272.

Getta, J. R. (2000). Query scrambling in distributed multidatabase systems. In 11th Intl. Workshop on Database and Expert Systems Applications, DEXA 2000.

Getta, J. R. (2005). On adaptive and online data integration. In Intl. Workshop on Self-Managing Database Systems, 21st Intl. Conf. on Data Engineering, ICDE’05, pages 1212–1220.

Getta, J. R. (2006). Optimization of online data integration. In Seventh International Conference on Databases and Information Systems, pages 91–97.

Getta, J. R. and Vossough, E. (2004). Optimization of data stream processing. SIGMOD Record, 33(3):34–39.

Gounaris, A., Paton, N. W., Fernandes, A. A., and Sakellariou, R. (2002). Adaptive query processing: A survey. In Proceedings of 19th British National Conference on Databases, pages 11–25.

Haas, P. J. and Hellerstein, J. M. (1999). Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD Intl. Conf. on Management of Data, pages 287–298.

Ives, Z. G., Florescu, D., Friedman, M., Levy, A. Y., and Weld, D. S. (1999). An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 299–310.

Ives, Z. G., Halevy, A. Y., and Weld, D. S. (2004). Adapting to source properties in processing data integration queries. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data.

Kabra, N. and DeWitt, D. J. (1998). Efficient mid-query re-optimization of sub-optimal query execution plans. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data.

Mokbel, M. F., Lu, M., and Aref, W. G. (2002). Hash-merge join: A non-blocking join algorithm for producing fast and early join results.

Ozcan, F., Nural, S., Koksal, P., Evrendilek, C., and Dogac, A. (1997). Dynamic query optimization in multidatabases. Bulletin of the Technical Committee on Data Engineering, 20:38–45.

Raman, V., Deshpande, A., and Hellerstein, J. M. (2003). Using state modules for adaptive query processing. In Proceedings of the 19th International Conference on Data Engineering, pages 353–.

Urhan, T. and Franklin, M. J. (2000). XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin 23(2), pages 27–33.

Urhan, T. and Franklin, M. J. (2001). Dynamic pipeline scheduling for improving interactive performance of online queries. In Proceedings of International Conference on Very Large Databases, VLDB 2001.
