ONE-TO-MANY DATA TRANSFORMATION OPERATIONS

Optimization and Execution on an RDBMS

Paulo Carreira

Faculty of Sciences, University of Lisbon, C6 - Piso 3, 1749-016 Lisboa, Portugal

Helena Galhardas, Jo

ao Pereira and Andrzej Wichert

INESC-ID, Avenida Prof. Cavaco Silva, Tagus Park, 2780-990, Porto-Salvo, Portugal

Keywords:

Data Warehousing, Data Cleaning, Data Integration, ETL, Query optimization.

Abstract:

The optimization capabilities of RDBMSs make them attractive for executing data transformations that support

ETL, data cleaning and integration activities. However, despite the fact that many useful data transformations

can be expressed as relational queries, an important class of data transformations that produces several output

tuples for a single input tuple cannot be expressed in that way.

To address this limitation a new operator, named data mapper, has been proposed as an extension of Re-

lational Algebra for expressing one-to-many data transformations. In this paper we study the feasibility of

implementing the mapper operator as a primitive operator on an RDBMS. Data transformations expressed as

combinations of standard relational operators and mappers can be optimized resulting in interesting perfor-

mance gains.

1 INTRODUCTION

The setup of modern information systems comprises

a number of activities that rely, to a great extent, in

the use of data transformations (Lomet and Runden-

steiner, 1999). Well known cases are the migration

of legacy data, the ETL processes that support data

warehousing, data cleaning processes and the integra-

tion of data from multiple sources.

One natural way of expressing data transforma-

tions is to use a declarative query language and to

specify the data transformations as queries (or views)

over the source data. Because of the broad adop-

tion of RDBMSs, such language is often SQL, a lan-

guage based on Relational Algebra (RA). Unfortu-

nately, due to its limited expressive power (Aho and

Ullman, 1979), RA alone cannot be used to specify

many important classes of data transformations.

An important class of data transformations that

may not be expressible in RA are the so called

one-to-many data transformations (Carreira et al.,

2006), that are characterized by producing several

output tuples for each input tuple. One-to-many data

transformations occur normally due to the existence

of data heterogeneities, i.e., due to the use of different

Relation

LOANS

Relation

PAYMENTS

ACCT AM

12 20.00

3456 140.00

901 250.00

ACCTNO AMOUNT SEQNO

0012 20.00 1

3456 100.00 1

3456 40.00 2

0901 100.00 1

0901 100.00 2

0901 50.00 3

Figure 1: Illustration of an unbounded data-transformation.

(a). The source relation

LOANS

on the left, and (b) the target

relation

PAYMENTS

on the right.

representations, of the same data of source and target

schemas (Rahm and Do, 2000). For instance, source

data may consist of salaries aggregated by year, while

the target data consists of salaries aggregated by

month. Hence, each input row has to be converted

into multiple output rows, one for each month. In

this case, each input row corresponds to at most

twelve output rows. However, expressing such data

transformations as RA expressions is hampered by

the fact such bound cannot always be established

a-priori. Consider the following example:

Carreira P., Galhardas H., Pereira J. and Wichert A. (2007).

ONE-TO-MANY DATA TRANSFORMATION OPERATIONS - Optimization and Execution on an RDBMS.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 21-27

DOI: 10.5220/0002371900210027

 SciTePress

EXAMPLE 1.1 Consider the source relation

LOANS

[

ACCT

] (represented in Figure 1) that

stores the details of loans per account. Sup-

pose that

LOANS

data must be transformed into

PAYMENTS

[

ACCTNO

AMOUNT

SEQNO

], the target

relation, according to the following requirements:

1. In the target relation, all the account numbers

are left padded with zeroes. Thus, the attribute

ACCTNO

is obtained by (left) concatenating zeroes

to the value of

ACCT

2. The target system does not support payment

amounts greater than 100. The attribute

AMOUNT

is obtained by breaking down the value of

into

multiple parcels with a maximum value of 100, in

such a way that the sum of amounts for the same

ACCTNO

is equal to the source amount for the same

account. Furthermore, the target ﬁeld

SEQNO

is a

sequence number for the parcel. This sequence

number starts at 1 for each sequence of parcels of

a given account.

The implementation of data transformations sim-

ilar to those requested for producing the target rela-

tion

PAYMENTS

of Example 1.1 is challenging, since

the number of output rows, for each input row,

is determined by the value of the attribute

. In

this case, the upper bound on the number of out-

put rows cannot be determined by analyzing the

data transformation speciﬁcation. We designate these

data transformations as unbounded one-to-many data

transformations. Other sources of unbounded data

transformations exist like, for example, converting

collection-valued attributes of SQL:1999 (Melton and

Simon, 2002), where each element of the collection is

mapped to a new row in the target table. In the context

of data-cleaning, one commonplace transformation is

converting a list of elements encoded as a string at-

tribute.

Currently, one has to resort, either to a general

purpose programming language, to some ﬂavor of

proprietary scripting of an ETL tool, or to an RDBMS

using recursive queries of SQL:1999 (Melton and Si-

mon, 2002), or some sort of Persistent Stored Mod-

ules (PSMs) (Garcia-Molina et al., 2002, Section 8.2)

like stored procedures or table functions (Eisenberg

et al., 2004).

To address the problem of expressing one-to-many

data transformations in a declarative and optimizeable

fashion, specialized relational operator named map-

per was recently proposed as an extension to RA with

a (Carreira et al., 2006). Informally, a mapper is ap-

plied to an input relation and produces an output re-

lation. It iterates over each input tuple and gener-

ates one or more output tuples, by applying a set of

domain-speciﬁc functions. This way, it supports the

dynamic creation of tuples based on a source tuple

contents.

Although mappers appear implicitly in systems

supporting schema and data transformations underly-

ing ETL, data cleaning and data warehousing (Gal-

hardas et al., 2000; Raman and Hellerstein, 2001; Cui

and Widom, 2001; Amer-Yahia and Cluet, 2004), as

far as we know, their execution and optimization has

never been, properly studied.

This paper studies the feasibility of extending

RDBMSs with the mapper operator. There are sev-

eral reasons to do so: First, implementing the map-

per operator as a relational operator opens interest-

ing optimization opportunities since expressions that

combine the mapper operator with standard RA oper-

ators can be optimized. Second, many data transfor-

mations are naturally expressed as relational expres-

sions, leveraging the optimization strategies already

implemented by RDBMSs (Chaudhuri, 1998). Third,

such extension further equips RDBMSs for data trans-

formation activities, broadening their applicability in

a wider range of data management activities. We re-

mark that our idea of using RDBMSs as data transfor-

mation engines is not revolutionary, see (Haas et al.,

1999). Furthermore, several RDBMSs like Microsoft

SQL Server and Oracle already include additional

software packages speciﬁc for ETL tasks.

Our contributions of are the following: (i) an

SQL-like concrete syntax for the mapper operator ac-

complished by extending the select statement, (ii) the

study of several query rewriting possibilities to be in-

corporated in the query optimizer and (iii) an exper-

imental validation of the usefulness of implementing

the mapper operator by comparing its physical imple-

mentation with alternative RDBMS solutions.

The rest of the paper is organized as follows. Sec-

tion 2 introduces the mapper operator and exposes its

concrete syntax by example. Then, in Section 3 we

discuss how to extend the query optimizer to handle

mappers. In Section 4, we report on a series of exper-

iments to ascertain the feasibility of implementing the

mapper operator and ﬁnally Section 5 concludes.

2 THE MAPPER OPERATOR

The mapper operator is formalized as a unary operator

that takes a relation instance of the source relation

schema as input and produces a relation instance of

the target relation schema as output. The operator is

parameterized by a set F of functions, which we des-

ignate as mapper functions. The intuition is that each

mapper function f

expresses a part of the envisaged

ICEIS 2007 - International Conference on Enterprise Information Systems

data transformation, focused on a non-empty set A

of attributes of the target schema. A key insight is

that, when applied to a tuple, a mapper function can

produce a set of values in the domain of its target at-

tributes Dom(A

), rather than a single value. Further

details can be found in (Carreira et al., 2005b).

The mapper operator is formally deﬁned as

follows: Given a set of mapper functions F =

{ f

,..., f

}, the mapper of a relation s with respect

to F, denoted by µ

(s), is the relation instance of the

target relation schema deﬁned by

(s)

def

= {t ∈ Dom(Y) | ∃u ∈ s s.t.

t[A

] ∈ f

(u), ∀1 ≤ i ≤ k}

(1)

We can express the data transformation of Ex-

ample 1.1 by means of a mapper µ

acct,amt

, compris-

ing two mapper functions. The function acct is the

mapper function that returns a singleton with the ac-

count number

ACCT

properly left padded with zeroes,

while amt is a mapper function that produces the at-

tributes

[AMOUNT,SEQNO]

, s.t., amt(am) is given by

{(100,i) | 1 ≤ i ≤ (am/100)} ∪ {(am%100, (am/100) +

1) | am%100 6= 0}, where % represents the modu-

lus operation. For instance, if v is the source tuple

(

901

250.00

), the result of evaluating amt(v) is the

set {(

100

),(

100

),(

)}. Given a source relation

s including v, the result of the expression µ

acct,amt

(s)

is another relation that contains the set of tuples

’0901’

100

i, h

’0901’

100

i, h

’0901’

i}.

2.1 Concrete Syntax

The mapper operator can be easily embedded into

the SQL syntax by incorporating mapper functions as

expressions into the select block. The main change

consists of replacing the the standard list of columns

and expressions that follow the select keyword with

a list of mapper functions. A mapper function can

be identiﬁed by an expression or, alternatively, by an

inline speciﬁcation of a mapper function. A particu-

lar kind of expression is a mapper function call. In

this case, the function is speciﬁed outside the select

statement using a more appropriate programming lan-

guage. We note that a similar assumption occurs w.r.t.

the computation of aggregates in SQL, in the sense

that aggregate functions like e.g.,

COUNT

SUM

, are

implemented elsewhere and then embedded in the se-

lect statement as parameters of the aggregation oper-

ator. An inline mapper function is speciﬁed by a map

clause followed by a list of attributes. These attributes

are the mapper function’s attributes that contribute to

the schema of the output relation. The logic of the

mapper function in this case is enclosed within a be-

gin end block.

2.2 Filters

Filters are speciﬁed through the boolean expression of

the where clause. Two kinds of ﬁlters can be speci-

ﬁed, (i) a-priori ﬁlters, that apply to each tuple of the

input relation, which are evaluated before the mapper

and (ii) a-posteriori ﬁlters that are evaluated on the

output of the mapper, which are used to limit the map-

per results. They are identiﬁed by sub-expressions de-

ﬁned over particular sets of columns. Sub-expressions

that are deﬁned only over the columns of the input

relation deﬁne a-priori ﬁlters, while sub-expressions

that are deﬁned over columns generated by the map-

per functions deﬁne a-posteriori ﬁlters.

3 OPTIMIZATION

While parsing a select block, as soon as mapper func-

tion is found, the parser knows that a mapper op-

erator is present. In this case, upon parsing the

query successfully, the parser identiﬁes all the map-

per functions being used and the computes the out-

put schema of the mapper. The input schema of the

mapper is determined by the schema of input rela-

tions. Once the input schema is known, the input

columns speciﬁed for each mapper function can be

validated. The following step is to rewrite the ﬁl-

ter condition into the conjunctive normal form and

validate it considering the input and output schemas.

Then, each conjunct is analyzed to decide whether

it constitutes a candidate to an a-priori or to an a-

posteriori ﬁlter. In the query presented in Figure 2, the

sub-expression

ACCOUNTS.STATUS

’O’

deﬁnes an a-

priori ﬁlter while the sub-expression

AMOUNT

de-

ﬁnes an a-posteriori ﬁlter.

1: select map acct(ACCT) as ACCTNO,

2: map amt(AM) as AMOUNT

3: from LOANS, ACCOUNTS

4: where ACCOUNTS.ACCTN = LOANS.ACCT

5: and ACCOUNTS.STATUS = ’O’

5: and AMOUNT < 50

Figure 2: A query that selects small payments of open ac-

counts by implementing a mapper together with a-priori and

a-posteri ﬁlters.

We note also that in some situations it is not pos-

sible to clearly separate these two kinds of ﬁlters.

For example, if the condition is dependent on both

input and output columns of the mapper like e.g.,

AMOUNT < ACCOUNTS.WDRAWLIMIT

, where

AMOUNT

is an

output attribute produced by a mapper function and

ACCOUNTS.WDRAWLIMIT

is an attribute of the input rela-

ONE-TO-MANY DATA TRANSFORMATION OPERATIONS - Optimization and Execution on an RDBMS

AMOUNT < 50

acct,amt

ACCOUNTS.STATUS = ’O’

⋊⋉

ACCOUNTS.ACCTN=LOANS.ACCT

ACCOUNTS LOANS

Figure 3: Query plan representation for the query presented

in Figure 2.

tion. In these cases, the predicate can only be evalu-

ated after all the mapper functions, i.e., a-posteriori.

The speciﬁcation of a-posteriori ﬁlters in the

where clause opens an interesting possibility of deﬁn-

ing the condition using mapper functions. The sets

of values returned by mapper functions can be tested

with set operators like in or exists.

Moreover, whenever the input relation is deﬁned

through join operations, some of the conjuncts can be

immediately pushed down into the appropriate join

operators. Generically, the query plan that results

from this process applies an a-posteriori ﬁlter to a

mapper operator. This mapper operator, in turn, is

evaluated over the input relation resulting from ap-

plying an a-priori ﬁlter to a query sub-plan that rep-

resents the input relation. This concept is illustrated

in Figure 3. Therein, the ﬁlter σ

AMOUNT < 50

is applied

to the mapper µ

acct,amt

which takes as input the tu-

ples of

ACCOUNTS

⋊⋉

ACCOUNTS.ACCTN=LOANS.ACCT

LOANS

that

are not ﬁltered by σ

ACCOUNT.STATUS = ’O’

. The plans

so obtained are then handed to the query optimizer

where they undergo a series of rewritings that turn

them into equivalent ones that are more efﬁcient to

evaluate. Besides the usual rewritings implemented

by RDBMSs, others, speciﬁc to mappers can be in-

troduced. Some of these rewritings are interesting

because they take direct advantage of the mapper se-

mantics. Herein, due to space limitations, we brieﬂy

sketch the main ideas. Please refer to (Carreira et al.,

2005b; Carreira et al., 2006) for details about rewrit-

ing rules and their corresponding proofs of correct-

ness.

Projections A projection applied to a mapper is an

expression of the form π

(µ

(s)). Those map-

per functions whose output attributes are not con-

tained in the list Z can be dropped from the map-

per since the values that they produce will not be

available for subsequent operations. Thus,

(µ

(s)) = π

(µ

′

(s)) (2)

where F

′

= { f

∈ F | A

contains at least one

attribute in Z}.

Selections When applying a selection to a mapper,

we can take advantage of the fact that many at-

tributes are mapped by arithmetic expressions.

Some are even simple identity functions. A selec-

tion σ

where C

is a condition on the attributes

produced by some mapper function f

∈ F, can

be pushed through a mapper. Hence,

(µ

(s)) = µ

(σ

′

(s)) (3)

where C

′

is a rewritten condition that uses the

attributes of the input relation schema.

Joins It is often the case that mappers are applied to

joins as an expression of the form µ

(r ⋊⋉ s). De-

pending of the type of join being performed the

output of the relation r ⋊⋉ s can be very large. In

these cases, whenever the join is being performed

in attributes mapped by identity mapper functions,

it is possible to use the rule

(r ⋊⋉ s) = µ

(r) ⋊⋉ µ

(s) (4)

where the mapper functions in F do not produces

duplicate values.

3.1 Plan Selection

The choice of a particular plans is governed by the

minimization of a metric of cost. The cost of a map-

per operator depends fundamentally on the costs of

evaluation each mapper function and on the cardinal-

ity of the input relation. For more details about the

cost model for the mapper operator, please refer to

(Carreira et al., 2006).

In order to estimate the cost of operators whose

input is produced by mappers, the cardinality of the

output relation produced by a mapper also needs to

be estimated. This estimation, is an interesting prob-

lem in itself, because mappers can generate variably

multiple output tuples for each input tuple. One way

to approach this issue consists of estimating the aver-

age mapper fanout factor

. If a mapper is being exe-

cuted for the ﬁrst time, an initial estimate for its fanout

needs to be computed. This can be done by combining

the estimated the fanout factors of the mapper func-

tions involved in the mapper operator. Another inter-

esting observation is that when mapper functions re-

turn empty sets, no output tuples are produced. Thus,

the mapper in some situations may act as a ﬁlter,

which turns the selectivity of the mapper into another

Similarly to (Chaudhuri and Shim, 1993), we designate

the average cardinality of the output produced for each input

tuple by mappers and mapper functions as fanout.

ICEIS 2007 - International Conference on Enterprise Information Systems

relevant factor. Like fanout, the initial mapper selec-

tivity can also be estimated from the selectivities of

the mapper functions.

4 EXPERIMENTS

In this section we analyze the performance of the

mapper operator and consider the gains obtained with

the proposed logical optimizations. Our results in-

dicate that one-to-many data transformations can be

evaluated substantially faster than traditional database

solutions like table functions or recursive queries.

Moreover, we shall see that the optimizations deﬁned

for mappers impart performance gains that are not

matched by traditional RDBMS solutions.

To that aim, we contrast alternative implementa-

tions of the data transformation proposed in Exam-

ple 1.1 using the mapper operator with alternative im-

plementations developed as table functions and recur-

sive queries using two leading commercial RDBMSs.

For more details on how to implement one-to-many

transformations using RDBMSs, Please refer to (Car-

reira et al., 2005a). The mapper operator was im-

plemented top of the XXL DBMS library (van den

Bercken et al., 2001) which provides database query

processing and optimization functionalities.

The database implementations were tested on two

systems henceforth designated as DBX and OEX

The parameters of both RDBMSs were carefully

aligned and the same I/O conditions where enforced

by through the usage of the same raw devices. The

hardware used was a single CPU machine (running at

3.4 GHz), with 1GB main memory RAM, and Linux

(kernel version 2.4.2) installed. Concerning work-

load, a synthetic version the input relation

LOANS

used

in Example 1.1 was employed. To equalize the record

length on XXL, DBX and OEX, a dummy column

was added to the input table.

4.1 Results

We compare throughput, i.e., the amount of work

done per second, of the distinct implementations of

one-to-many data transformations. Throughput is ex-

pressed as the ratio of source records transformed per

second and it is computed by measuring the response

time of data transformation that consists of reading

the input table, transforming it and materializing the

output table. All the timings reported were obtained

with logging disabled.

Due to the restrictions imposed by the license agree-

ments, the true names of the systems under test cannot be

revealed.

10000

20000

30000

40000

50000

60000

70000

80000

90000

Mapper/XXL TF/DBX TF/OEX SP/DBX SP/OEX RQ/DBX

Throughput [in records/sec]

Figure 4: Average throughput of the different implementa-

tions of Example 1.1 testes with input relation sizes varying

from 100K to 5M rows graphed with standard deviation. No

query results are reported for recursive queries on the OEX

system since the subset of recursive queries supported by

OEX is not powerful enough for expressing one-to-many

transformations.

The ﬁrst experiment, we intended to test the raw

performance of the mapper operator for the three dis-

tinct implementations. The results depicted on Fig-

ure 4 shows that one-to-many data transformations

implemented with the mapper operator is more than 2

times better than table functions over DBX, which is

the best alternative using RDBMSs. Since the amount

I/O incurred by all the systems is similar, even con-

sidering the overhead of the RDBMSs by comparison

with XXL, we conjecture that one-to-many data trans-

formations implemented as mappers running inside

the RDBMS are very efﬁcient. We also considered

the implementations using stored procedures. How-

ever, it turns out that the performance is quite poor

because logging cannot be disabled during their exe-

cution.

In the second set of experiments, we ana-

lyze the potential gains of logical optimizations

like those suggested in Section 3. To that aim,

we considered the evaluation of the expression

ACCTNO

(µ

acct,amt

(s)) together with its optimized

equivalent µ

acct,amt

(σ

ACCT

(s)) obtained by pushing

down the selection. The constant p is used to in-

duce different selectivities. Moreover, we consider

that the function acct performs a direct mapping, i.e.,

is an identity function. In Figure 5, we depict the per-

formance of both the original and the optimized ex-

pressions with varying selectivities. We observe that

smaller selectivities correspond to the highest gains

of the optimized expression over the original. For

comparison, we draw the evolution of the selection

applied to one-to-many transformations implemented

using table functions on the OEX system, represented

ONE-TO-MANY DATA TRANSFORMATION OPERATIONS - Optimization and Execution on an RDBMS

0.1

100

0.01 0.1 1 10 100

Response Time [in seconds]

Predicate selectivity factor [in %]

mapper original

mapper optimized

table function original

table function optimized

Figure 5: Evolution of response time of applying selections

to one-to-many transformations for increasing selectivity

factors plotted in logarithmic scale. The mapper original

expression refers to σ

ACCTNO

(µ

acct,amt

(s)), the mapper

optimized expression refers to µ

acct,amt

(σ

ACCT

(s)),

the table function original expression refers to

ACCTNO

(TFacct,amt(s)) and the table function op-

timized expression refers to TFacct,amt(σ

ACCT

(s)). The

size of the input relation s is ﬁxed to 1M tuples.

as σ

ACCTNO

(TFacct,amt(s)). The performance of

this solution is similar the unoptimized version of

mapper and only for small selectivities. This is due

to the fact that the whole relation is read but few tu-

ples output tuples are generated.

Surprisingly, by analyzing the query plans gen-

erated by RDBMSs, we came across the fact that,

whenever table functions or recursive queries are used

to encode one-to-many data transformations, neither

DBX nor OEX are capable of pushing down a se-

lection on a directly mapped attribute. Hence, for

comparison, we also tested the corresponding opti-

mized expression TFacct, amt(σ

ACCT

(s)) obtained

manually. We observed that the manually optimized

expression of the table functions bring higher gains

specially on relatively high selectivities. For high

selectivities, the response time of the original (non-

optimized) RDBMS solution increases sharply. We

conjecture that this behavior has to do with idiosyn-

crasies of OEX related with the pipelining of the tu-

ples resulting from the table function into the selec-

tion operator.

5 CONCLUSION

In this paper we focused on the feasibility of incor-

porating a specialized operator for handling one-to-

many data transformations on an RDBMSs. This ex-

tension is attractive, not only because one-to-many

data transformations cannot be expressed using rela-

tional algebra but also because data usually resides

in an RDBMS. We outlined the concrete syntax for

this operator and then examined how a query op-

timizer can be extended to consider more advanta-

geous execution plans in the presence of mappers.

To test our ideas we analyzed experimentally differ-

ent implementations of one-to-many data transforma-

tions using mappers and contrasted them with tradi-

tional implementations using table functions and re-

cursive queries using two industry-leading RDBMSs.

To best of our knowledge, this is the ﬁrst experimen-

tal assessment of one-to-many data transformations

on RDBMSs.

The experiments showed that a native implemen-

tation of the mapper operator outperformed the best

RDBMS solution by almost 3 times. We have also

observed that RDBMSs do not in general perform

even very simple but highly valuable optimizations

when table functions and recursive queries are used.

Thus, we posit that one-to-many data transformations

expressed by combining standard relational operators

and mappers constitute a valid alternative.

The simple iterator-based semantics of the map-

per operator enables efﬁcient executions of one-to-

many data transformations and favors an easy inte-

gration into the query processor of a database sys-

tem. Towards physical optimization, we are devel-

oping different algorithms for the mapper operator

to take advantage of duplicate values by employing

caching techniques and hybrid-hashing proposed by

(Hellerstein, 1998). Additionally, we consider incor-

porating the mapper operator in Apache Derby open

source RDBMS (Apache, 2005).

One limitation of our work is that, despite the ef-

fort to conﬁgure the different systems so that they run

in similar conditions, the alignment of these conﬁgu-

rations lacks quantiﬁcation. To address this shortcom-

ing we consider running TPC-H (TPC, 1999) loads on

the different systems in order to obtain a metric for

comparing their respective conﬁgurations.

REFERENCES

Aho, A. V. and Ullman, J. D. (1979). Universality of data

retrieval languages. In Proc. of the 6th ACM SIGACT-

SIGPLAN Symposium on Principles of Programming

Lang., pages 110–119. ACM Press.

Amer-Yahia, S. and Cluet, S. (2004). A declarative ap-

proach to optimize bulk loading into databases. ACM

Transactions of Database Systems, 29(2):233–281.

Apache (2005). Derby homepage.

http://db.apache.org/derby.

Carreira, P., Galhardas, H., Lopes, A., and Pereira, J.

(2005a). Extending the relational algebra with the

ICEIS 2007 - International Conference on Enterprise Information Systems

Mapper operator. DI/FCUL TR 05–2, Depart-

ment of Informatics, University of Lisbon. URL

http://www.di.fc.ul.pt/tech-reports

Carreira, P., Galhardas, H., Lopes, A., and Pereira, J.

(2006). One-to-many transformation through data

mappers. Data and Knowledge Engineering Journal

(DKE), Elsevier Science.

Carreira, P., Galhardas, H., Pereira, J., and Lopes, A.

(2005b). Data mapper: An operator for expres-

siong one-to-many data transformations. In 7th Int’l

Conf. on Data Warehousing and Knowledge Discov-

ery, DaWaK ’05, volume 3589 of LNCS. Springer-

Verlag.

Chaudhuri, S. (1998). An overview of query optimization

in relational systems. In PODS ’98: Proc. of the ACM

Symp. on Principles of Database Systems, pages 34–

43. ACM Press.

Chaudhuri, S. and Shim, K. (1993). Query optimization in

the presence of foreign functions. In Proc. of the Int’l

Conf. on Very Large Data Bases (VLDB’93), pages

529–542.

Cui, Y. and Widom, J. (2001). Lineage tracing for general

data warehouse transformations. In Proc. of the Int’l

Conf. on Very Large Data Bases (VLDB’01).

Eisenberg, A., Melton, J., Michels, K. K. J.-E., and Zemke,

F. (2004). SQL:2003 has been published. ACM SIG-

MOD Record, 33(1):119–126.

Galhardas, H., Florescu, D., Shasha, D., and Simon, E.

(2000). Ajax: An extensible data cleaning tool. ACM

SIGMOD Int’l Conf. on Management of Data, 2(29).

Garcia-Molina, H., Ullman, J. D., and Widom, J. (2002).

Database Systems – The Complete Book. Prentice-

Hall.

Haas, L. M., Miller, R. J., Niswonger, B., Roth, M. T.,

Schwarz, P. M., and Wimmers, E. L. (1999). Trans-

forming heterogeneous data with database middle-

ware: Beyond integration. IEEE Data Engineering

Bulletin, 22(1):31–36.

Hellerstein, J. M. (1998). Optimization techniques for

queries with expensive methods. ACM Transactions

on Database Systems, 22(2):113–157.

Lomet, D. and Rundensteiner, E. A., editors (1999). Spe-

cial Issue on Data Transformations, volume 22. IEEE

Data Engineering Bulletin.

Melton, J. and Simon, A. R. (2002). SQL:1999 Understand-

ing Relational Language Components. Morgan Kauf-

mann Publishers, Inc.

Rahm, E. and Do, H.-H. (2000). Data Cleaning: Problems

and current approaches. IEEE Bulletin of the Techni-

cal Comittee on Data Engineering, 24(4).

Raman, V. and Hellerstein, J. M. (2001). Potter’s Wheel:

An Interactive Data Cleaning System. In Proc. of the

Int’l Conf. on Very Large Data Bases (VLDB’01).

TPC (1999). Benchmark H standard speciﬁcation.

http://www.tpc.org.

van den Bercken, J., Dittrich, J. P., Kr

aamer, J., Sch

aafer,

T., Schneider, M., and Seeger, B. (2001). XXL A

library approach to supporting efﬁcient implementa-

tions of advanced database queries. In Proc. of the

Int’l Conf. on Very Large Data Bases (VLDB’01).

ONE-TO-MANY DATA TRANSFORMATION OPERATIONS - Optimization and Execution on an RDBMS