approximately 54% of the difference in the average
runtimes is due to the DBMS optimizer choosing to
use index with the star schema where table scan is
faster. We let the DBMS re-analyze the star schema
using a large sample (20%) of the data but, the
behaviour of the optimizer remained unaffected. We
don’t have explanation for it, but even discounting
these 54% as a result of some inconsistency within
the Oracle 10g optimizer (e.g. wrong selectivity
estimates), the remaining advantage of the
fragmented schema over the star schema is still
impressive.
Comparing the update performance of the Star
schema and the Fragmented Schema
We updated 10% of the tuples in the LOPS table of
the star schema and the L1, L2 and L3 fragments of
the fragmented schema. Again, the fragmented
schema offered better performance, however the
difference was negligible – only 8% faster than the
star scheme.
4 RELATED LITERATURE
Related literature includes works on data warehouse
modelling - (Golfarelli et al., 1998; Kimball et al.,
1998; Bizarro et al., 2002), vertical partitioning -
(Papadomano-lakis et al., 2004; Agrawal et al.,
2004) and new storage models – (Stonebraker et al.,
2005). Our work differs from them through one or
more of the following:
• we use query model as input;
• we use overlapping partitioning instead of disjoint
partitioning;
• we focus on reducing the overhead I/O, and not
on minimizing the joins;
• our method allows for storage constraint;
• we focus on the read performance;
• our method is accommodated within the existing
database technology.
We will treat the relevant literature in more detail in
a full text variant of this paper.
5 CONCLUSIONS AND
OUTLOOK
We think that the current paradigm of data
warehouse modelling commits the mistake of
ignoring important information about the future
workload. In this way many opportunities for
performance improvement are wasted. We propose a
simple, yet effective algorithm to derive a more
query-responsive data warehouse schema. The
schema created in this way was found to offer more
than 3 times better read performance using the same
storage as a star scheme.
We are at the start of our research and many
questions are open – “Are there algorithms which
construct even more effective schema (e.g. merge
not just two fragments at a time but more or merge
unrelated fragments)?”, “Can any performance
guarantees be established using reasonable
assumptions (e.g. self-similarity of the attribute
network)?” and others. We hope that other
researchers will find these questions as interesting as
we do.
ACKNOWLEDGEMENTS
I would like to thank Peter Stoehr for his support.
REFERENCES
Agrawal, S., et al., 2004: Integrating Vertical and Hori-
zontal Partitioning into Automated Physical Database
Design. Proc. 2004 SIGMOD Int. Conf. on Manag. of
Data.
Bizarro, P., Madeira, H., 2002: Adding a Performance-
Oriented Perspective to Data Warehouse Design.
Proc. of 4
th
Int. Conf. on Data Warehousing and
Knowledge Discovery (DaWaK).
Golfarelli, M. et al., 1998 Conceptual Design of Data
Warehouses from E/R Schemes. In Proc. 32th HICSS.
Inmon, W., 1996. Building the data warehouse, John
Wiley & Sons, Inc. New York, NY, USA.
Kimball, R., 1996. The data warehouse toolkit: practical
techniques for building dimensional data warehouses,
John Wiley & Sons, Inc. New York, NY, USA.
Kimball, R. et al., 1998. The data warehouse lifecycle
toolkit, John Wiley & Sons, Inc. New York, NY, USA.
Martello, S., Toth, P., 1990. Knapsack problems:
algorithms and computer implementations, John
Wiley & Sons, Inc. New York, NY, USA.
Papadomanolakis E., Ailamaki, A., 2004: AutoPart:
Automating Schema Design for Large Scientific
Databases Using Data Partitioning. Proc. 16
th
Int.
Conf. on Scient. and Stat. Datab. Manag. (SSDBM).
Stonebraker, M. et al., 2005. C-Store: A Column-oriented
DBMS. Proc of the 31st Int. Conf. on Very Large
Databases (VLDB).
TPC-H Standard Specification Revision 2.1.0, 2002.
http://www.tpc.org
A NEW LOOK INTO DATA WAREHOUSE MODELLING
543