This simple model favors the integration of our
mapper operator in the query execution mechanisms
of an RDBMS. However, it turns out that in the
presence of expensive functions, such as string
matching or check-digit computations, this naïve
execution of the mapper operator can be very inefficient.
The total cost of evaluating a mapper can be min-
imized by avoiding superfluous function evaluations.
First, columns often have duplicate values, which
suggests caching function results. In the presence
of potentially many functions and tables with
multi-million tuples, the choice of which functions to
cache is an optimization problem in itself. Second,
some functions return empty sets. When a function
returns an empty set for an input tuple, no output
tuples are produced for that tuple, so there is no need
to evaluate the remaining functions. This observation
suggests a strategy that consists of evaluating the
more selective functions first.
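
The effect of evaluation order can be illustrated with a minimal sketch. It assumes a simplified mapper in which each function returns a list of values for one output attribute and the output tuples for an input tuple are the combinations of those values. The names `evaluate_mapper`, `keep_id`, and `installments`, as well as the sample data, are hypothetical and are not taken from the prototype.

```python
from itertools import product


def evaluate_mapper(rows, functions):
    """Evaluate mapper functions in the given order, row by row.

    `functions` is a list of (output_name, fn) pairs; fn(row) returns a
    list of values. Evaluation of a row stops at the first empty result,
    because that row cannot produce any output tuple.
    Returns (output_rows, number_of_function_calls).
    """
    output, calls = [], 0
    for row in rows:
        results = []
        for name, fn in functions:
            calls += 1
            values = fn(row)
            if not values:
                results = None  # empty set: skip the remaining functions
                break
            results.append((name, values))
        if results is None:
            continue
        names = [name for name, _ in results]
        for combo in product(*(values for _, values in results)):
            output.append(dict(zip(names, combo)))
    return output, calls


def keep_id(row):
    # Never selective: always returns exactly one value.
    return [row["id"]]


def installments(row):
    # Selective: returns no values when there is nothing to pay.
    amount, parts = row["amount"], []
    while amount > 0:
        part = min(20, amount)
        parts.append(part)
        amount -= part
    return parts


rows = [
    {"id": 1, "amount": 30},
    {"id": 2, "amount": 0},
    {"id": 3, "amount": 0},
    {"id": 4, "amount": 25},
]

out_a, calls_a = evaluate_mapper(rows, [("id", keep_id), ("installment", installments)])
out_b, calls_b = evaluate_mapper(rows, [("installment", installments), ("id", keep_id)])
```

Both orders produce the same four output tuples, but placing the selective function first needs 6 function calls instead of 8 on this input, since two rows are discarded after a single call.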
4 CURRENT STATUS
Defining a new operator is a significant research effort
as it requires both theoretical and practical insight. In
such an effort, two issues must be addressed up front.
The usefulness of the operator needs to be validated
and the class of problems being solved has to be for-
mally defined.
To address the first issue, we pursued a commer-
cial venture that resulted in the inclusion of native
support for one-to-many data transformations in a
commercial tool (Carreira and Galhardas, 2004). The
tool is being used in several real-world legacy-data
migration projects that corroborate the need for sup-
porting one-to-many data transformations.
Up to this moment, we have been able to put for-
ward a formal semantics for the new operator that
enabled us to perform a formal study of the expres-
siveness of the operator. We developed the formal
demonstration that the mapper-extended RA (MRA)
is strictly more expressive than standard RA.
A formal definition of the class of one-to-many
data transformations is underway. We conjecture that
two sub-classes of one-to-many data transformations
exist: one comprising data transformations expressible
through RA and another comprising those
expressible only through MRA.
A set of algebraic rewriting rules for generating
logical query plans involving mappers and some
standard relational operators has been developed,
together with formal proofs of their correctness
(Carreira et al., 2005a). A first set of rewriting rules
for expressions involving mappers and joins has also
emerged. Currently, a set of experiments is being con-
ducted to determine the factors that influence the ef-
fectiveness of the proposed rewritings (Carreira et al.,
2005b). Prototypical implementations for physical
mapper operator algorithms are being developed in
Java using the XXL framework (van den Bercken
et al., 2000). These algorithms adapt the ideas of
memoization and hybrid hashing proposed by Hellerstein
and Naughton (1996) to multiple functions.
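
As a rough illustration of the memoization idea only (hybrid hashing is not shown), the sketch below caches the result of one mapper function per distinct combination of the input columns it reads. It is written in Python for brevity rather than in Java with XXL, and the helper `memoize_on_columns`, the function `valid_account`, and the check-digit rule are assumptions made for the example, not the algorithms of the prototype.

```python
def memoize_on_columns(fn, columns):
    """Wrap `fn` so its result is cached per distinct value of `columns`.

    `columns` must list every attribute of the row that `fn` reads;
    this is an assumption the caller has to guarantee.
    """
    cache = {}
    stats = {"hits": 0, "misses": 0}

    def wrapper(row):
        key = tuple(row[c] for c in columns)
        if key in cache:
            stats["hits"] += 1
        else:
            stats["misses"] += 1
            # Store an immutable copy so callers cannot alter cached results.
            cache[key] = tuple(fn(row))
        return cache[key]

    wrapper.stats = stats
    return wrapper


def valid_account(row):
    # Hypothetical check-digit rule: the last digit must equal the sum of
    # the preceding digits modulo 10. Returns no values for invalid accounts.
    digits = [int(d) for d in row["account"]]
    ok = sum(digits[:-1]) % 10 == digits[-1]
    return [row["account"]] if ok else []


rows = [
    {"account": "1236"},
    {"account": "1236"},
    {"account": "4519"},
    {"account": "1236"},
    {"account": "4519"},
]

cached_check = memoize_on_columns(valid_account, ["account"])
results = [cached_check(row) for row in rows]
```

With multiple functions, each function would get its own wrapper, so that a column with many duplicates benefits from caching independently of the others. Deciding which functions are worth wrapping depends on their cost and on the number of distinct input values.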
5 CONCLUSIONS
In this work, we address the problem of specify-
ing one-to-many data transformations that are fre-
quently required in data integration, data cleaning,
legacy-data migration, and ETL scenarios. Since
one-to-many data transformations are not expressible
through standard RA queries, we proposed the mapper
operator. The new operator allows one-to-many data
transformations to be expressed naturally, while at the
same time extending the expressive power of RA.
Several operators have already been proposed
for expressing one-to-many data transformations
(Cunningham et al., 2004; Galhardas et al., 2001;
Raman and Hellerstein, 2001;
Amer-Yahia and Cluet, 2004). Although these opera-
tors show similarities with mappers, most of them are
only capable of expressing a subset of one-to-many
transformations.
As data often resides in RDBMSs, data transfor-
mations specified as relational expressions can take
direct advantage of their optimization capabilities.
Following this trend, several RDBMSs, e.g., Microsoft SQL
Server, already include additional software packages
specifically for ETL tasks. However, as far as we know,
none of these extensions is backed by a corresponding
theoretical foundation in existing database theory.
Therefore, the optimization capabilities of relational
engines are not fully exploited in activities involving
data transformations, such as ETL or data cleaning.
REFERENCES
Amer-Yahia, S. and Cluet, S. (2004). A declarative
approach to optimize bulk loading into databases. ACM
Transactions on Database Systems, 29(2):233–281.
Carreira, P. and Galhardas, H. (2004). Efficient
development of data migration transformations. In ACM
SIGMOD Int’l Conf. on the Managt. of Data.
Carreira, P., Galhardas, H., Lopes, A., and Pereira, J.
(2005a). Extending relational algebra to express one-
ICEIS 2007 - International Conference on Enterprise Information Systems