EXAMPLE 1.1 Consider the source relation
LOANS
[
ACCT
,
AM
] (represented in Figure 1) that
stores the details of loans per account. Sup-
pose that
LOANS
data must be transformed into
PAYMENTS
[
ACCTNO
,
AMOUNT
,
SEQNO
], the target
relation, according to the following requirements:
1. In the target relation, all the account numbers
are left padded with zeroes. Thus, the attribute
ACCTNO
is obtained by (left) concatenating zeroes
to the value of
ACCT
.
2. The target system does not support payment
amounts greater than 100. The attribute
AMOUNT
is obtained by breaking down the value of
AM
into
multiple parcels with a maximum value of 100, in
such a way that the sum of amounts for the same
ACCTNO
is equal to the source amount for the same
account. Furthermore, the target field
SEQNO
is a
sequence number for the parcel. This sequence
number starts at 1 for each sequence of parcels of
a given account.
The implementation of data transformations sim-
ilar to those requested for producing the target rela-
tion
PAYMENTS
of Example 1.1 is challenging, since
the number of output rows, for each input row,
is determined by the value of the attribute
AM
. In
this case, the upper bound on the number of out-
put rows cannot be determined by analyzing the
data transformation specification. We designate these
data transformations as unbounded one-to-many data
transformations. Other sources of unbounded data
transformations exist like, for example, converting
collection-valued attributes of SQL:1999 (Melton and
Simon, 2002), where each element of the collection is
mapped to a new row in the target table. In the context
of data-cleaning, one commonplace transformation is
converting a list of elements encoded as a string at-
tribute.
Currently, one has to resort, either to a general
purpose programming language, to some flavor of
proprietary scripting of an ETL tool, or to an RDBMS
using recursive queries of SQL:1999 (Melton and Si-
mon, 2002), or some sort of Persistent Stored Mod-
ules (PSMs) (Garcia-Molina et al., 2002, Section 8.2)
like stored procedures or table functions (Eisenberg
et al., 2004).
To address the problem of expressing one-to-many
data transformations in a declarative and optimizeable
fashion, specialized relational operator named map-
per was recently proposed as an extension to RA with
a (Carreira et al., 2006). Informally, a mapper is ap-
plied to an input relation and produces an output re-
lation. It iterates over each input tuple and gener-
ates one or more output tuples, by applying a set of
domain-specific functions. This way, it supports the
dynamic creation of tuples based on a source tuple
contents.
Although mappers appear implicitly in systems
supporting schema and data transformations underly-
ing ETL, data cleaning and data warehousing (Gal-
hardas et al., 2000; Raman and Hellerstein, 2001; Cui
and Widom, 2001; Amer-Yahia and Cluet, 2004), as
far as we know, their execution and optimization has
never been, properly studied.
This paper studies the feasibility of extending
RDBMSs with the mapper operator. There are sev-
eral reasons to do so: First, implementing the map-
per operator as a relational operator opens interest-
ing optimization opportunities since expressions that
combine the mapper operator with standard RA oper-
ators can be optimized. Second, many data transfor-
mations are naturally expressed as relational expres-
sions, leveraging the optimization strategies already
implemented by RDBMSs (Chaudhuri, 1998). Third,
such extension further equips RDBMSs for data trans-
formation activities, broadening their applicability in
a wider range of data management activities. We re-
mark that our idea of using RDBMSs as data transfor-
mation engines is not revolutionary, see (Haas et al.,
1999). Furthermore, several RDBMSs like Microsoft
SQL Server and Oracle already include additional
software packages specific for ETL tasks.
Our contributions of are the following: (i) an
SQL-like concrete syntax for the mapper operator ac-
complished by extending the select statement, (ii) the
study of several query rewriting possibilities to be in-
corporated in the query optimizer and (iii) an exper-
imental validation of the usefulness of implementing
the mapper operator by comparing its physical imple-
mentation with alternative RDBMS solutions.
The rest of the paper is organized as follows. Sec-
tion 2 introduces the mapper operator and exposes its
concrete syntax by example. Then, in Section 3 we
discuss how to extend the query optimizer to handle
mappers. In Section 4, we report on a series of exper-
iments to ascertain the feasibility of implementing the
mapper operator and finally Section 5 concludes.
2 THE MAPPER OPERATOR
The mapper operator is formalized as a unary operator
µ
F
that takes a relation instance of the source relation
schema as input and produces a relation instance of
the target relation schema as output. The operator is
parameterized by a set F of functions, which we des-
ignate as mapper functions. The intuition is that each
mapper function f
A
i
expresses a part of the envisaged
ICEIS 2007 - International Conference on Enterprise Information Systems
22