Consider a user request q to a global information system and its decomposition into the individual requests q_1, ..., q_n simultaneously submitted for processing at the remote sites. Let a request q be equivalent to an expression E(q_1, ..., q_n) over the individual requests. If the remote sites return the results r_1, ..., r_n in response to the individual requests q_1, ..., q_n, then the final result of a user request q is equal to the result of an expression E(r_1, ..., r_n). Then, static optimization of a data integration plan is equivalent to the optimization of an expression E(r_1, ..., r_n).
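For illustration (a hypothetical example), assume a request q that joins data stored at two remote sites and filters one of the arguments. Then q decomposes into the individual requests q_1 and q_2, and the corresponding data integration expression is

    E(r_1, r_2) = r_1 ⋈ σ_C(r_2),

so static optimization of the plan amounts to deciding how and in what order the selection σ_C and the join are applied to the returned results r_1 and r_2.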
A naive and quite ineffective approach would be to postpone data integration until the results r_1, ..., r_n returned from the remote sites are available at a central site. A more effective solution is to consider an individual reply r_i as a sequence of data packets r_{i,1}, r_{i,2}, ..., r_{i,k-1}, r_{i,k} and to perform data integration each time a new packet of data is received at a central site. Such an approach to data integration is more efficient because there is no need to wait for the complete results when a data integration expression E(r_1, ..., r_n) is evaluated according to a given order of operations. Instead, whenever a new packet of data is received at a central site it is immediately integrated into the intermediate result, no matter which partial result it comes from. Then, static optimization of a data integration plan finds the best processing strategy for the sequences of packets of data r_{i,1}, r_{i,2}, ..., r_{i,k-1}, r_{i,k} where i = 1, ..., n. Such an objective requires the transformation of a data integration expression E(r_1, ..., r_i, ..., r_n) into the individual data integration plans for the sequences of packets r_{i,1}, r_{i,2}, ..., r_{i,k-1}, r_{i,k} where i = 1, ..., n.
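The sketch below is a minimal illustration of this packet-wise integration (it does not use the notation developed later in the paper) for the simplest case E(r_1, r_2) = r_1 ⋈ r_2 over a common key: packets from the two remote sites arrive in an arbitrary interleaved order, and every packet is joined against whatever the other site has delivered so far, so the result grows without waiting for either argument to be complete.

    # Minimal sketch: packet-wise integration of two remote results at a central site.
    # Assumes E(r_1, r_2) = r_1 JOIN r_2 on a common key; tuples are (key, value) pairs.

    def integrate_packets(packet_stream):
        """packet_stream yields (site, packet) pairs; a packet is a list of
        (key, value) tuples. The joined result is built incrementally."""
        accumulated = {1: [], 2: []}   # tuples received so far from each site
        result = []
        for site, packet in packet_stream:
            other = 2 if site == 1 else 1
            # join the new packet only against what the other site has delivered so far
            for key, value in packet:
                for other_key, other_value in accumulated[other]:
                    if key == other_key:
                        row = (key, value, other_value) if site == 1 else (key, other_value, value)
                        result.append(row)
            accumulated[site].extend(packet)
        return result

    # Packets of r_1 and r_2 arriving in an arbitrary interleaved order.
    stream = [
        (1, [("a", 10)]),
        (2, [("a", "x"), ("b", "y")]),
        (1, [("b", 20), ("c", 30)]),
        (2, [("c", "z")]),
    ]
    print(integrate_packets(stream))
    # [('a', 10, 'x'), ('b', 20, 'y'), ('c', 30, 'z')]

This behaviour is essentially that of the pipelined join operators reviewed in the related work section.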
A starting point for the optimization is a data integration expression E(r_1, ..., r_i, ..., r_n) obtained from the decomposition of a user request to a global information system. Processing of the individual packets means that an expression E(r_1, ..., r_i ⊕ δ_i, ..., r_n) must be recomputed each time a data packet δ_i is appended to an argument r_i. Of course, reprocessing of the entire data integration expression is too time consuming, and a better idea is to perform incremental processing of the expression, i.e. to find how the previous result of an expression E(r_1, ..., r_i, ..., r_n) must be changed after δ_i is appended to an argument r_i. A data integration expression is transformed into a set of data integration plans where each plan represents an integration procedure for the increments of one argument of the original expression. In our approach a data integration plan is a sequence of so-called id-operations on the increments or decrements of data containers and other fixed-size containers. In order to reduce the size of the arguments, static optimization of data integration plans moves the unary operations towards the beginning of a plan. Additionally, frequently updated materializations are eliminated from the plan, and constant arguments and subexpressions are replaced with pre-computed values.
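As an illustration of such a plan (a hypothetical example with assumed helper functions select and join, not the id-operations defined later in the paper), consider E(r_1, r_2) = σ_C(r_1) ⋈ r_2. The plan for the increments of r_1 applies the unary selection to the small packet δ_1 first, joins the filtered increment with the materialized r_2, and appends the increment of the result to the materialized result, so the full expression is never recomputed.

    # Minimal sketch of an integration plan for the increments of one argument,
    # assuming E(r_1, r_2) = select_C(r_1) JOIN r_2.

    def select(predicate, rows):
        return [row for row in rows if predicate(row)]

    def join(left, right):
        # join on the first attribute of every tuple
        return [l + r[1:] for l in left for r in right if l[0] == r[0]]

    def plan_for_delta_r1(delta_1, r_2_materialized, result_materialized, predicate):
        """Plan for increments of r_1: unary selection first (on the small packet),
        join with the materialized r_2 second, then append the increment of the
        result to the materialized result."""
        delta_result = join(select(predicate, delta_1), r_2_materialized)
        result_materialized.extend(delta_result)
        return delta_result

    # r_2 is already materialized at the central site; two packets of r_1 arrive.
    r_2 = [("a", "x"), ("b", "y"), ("c", "z")]
    result = []
    plan_for_delta_r1([("a", 5), ("b", 50)], r_2, result, lambda row: row[1] >= 10)
    plan_for_delta_r1([("c", 70)], r_2, result, lambda row: row[1] >= 10)
    print(result)   # [('b', 50, 'y'), ('c', 70, 'z')]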
The paper is organized in the following way. First, we overview the related work in the area of optimization of data integration in distributed global information systems. Next, we derive the incremental processing of modifications and we find a system of operations on modifications of data items for the system of operations included in the relational algebra. Transformation of data integration expressions into the sets of individual data integration plans is discussed in Section 4, and it is followed by a presentation of static optimization of data integration plans in the next section. Finally, Section 6 concludes the paper.
2 RELATED WORK
Optimization of data integration in global information
systems can be traced back to optimization of query
processing in multidatabase and federated database
systems (Ozcan et al., 1997).
The external factors affecting the performance of query processing in multidatabase systems promote reactive query processing techniques. The early works on reactive query processing are based on partitioning (Kabra and DeWitt, 1998) and dynamic modification of query processing plans (Getta, 2000). A dynamic modification technique finds a plan equivalent to the original one and such that it is possible to continue integration of the available data sets. Similar approaches dynamically change the order in which the join operations are executed depending on the arguments available at a central site. These techniques include query scrambling (Amsaleg et al., 1998), dynamic scheduling of operators (Urhan and Franklin, 2001), and Eddies (Avnur and Hellerstein, 2000).
Optimization of the relational algebra operations used for data integration includes new versions of the join operation customised to online query processing, e.g. the pipelined join operator XJoin (Urhan and Franklin, 2000), the ripple join (Haas and Hellerstein, 1999), the double pipelined join (Ives et al., 1999), and the hash-merge join (Mokbel et al., 2002).
A technique of redundant computations simultaneously processes a number of data integration plans, keeping the plan that provides the most advanced results (Antoshenkov and Ziauddin, 2000).
A concept of state modules described in (Raman
et al., 2003) allows for concurrent processing of the
tuples through the dynamic division of data integra-