records satisfying the query predicates.
Joshi and Jermaine (2008) introduced the ACE
Tree which is a binary tree index structure for
efficiently drawing samples for processing database
queries. They demonstrated the effectiveness of this
structure for single and two attribute database
queries, but did not deal with multi-attribute
aggregate queries. For extending the ACE Tree to k
key attributes, Joshi and Jermaine proposed binary
splitting of one attribute range after another at
consecutive levels of the binary tree starting from
the root; from level k+1, the process is repeated with
each attribute in the same sequence as before. This
process could lead to an index tree of very large
height for a data warehouse even if only a relatively
small number of attributes are considered.
Li et al. (2008) proposed a sampling cube
framework for answering analytical queries on a
data warehouse which calculates confidence
intervals for any multidimensional query. The
sampling cube is constructed from a random sample
of the data warehouse. After building the sampling
cube, there is no further access to the original data
records should a query require a different sample
from the one already drawn. If a query has too few
sample records in the sampling cube, they expand
the query to gather more sample records from the
sampling cube itself in an attempt to improve the
quality of the query result.
In this paper, we propose the k-MDI Tree which
extends the ACE Tree structure to deal with multi-
dimensional data warehouse queries. Unlike the
ACE Tree, the k-MDI tree allows non-binary splits
of data ranges for key values that do not split evenly
into 2
n
distinct ranges. The number of levels in the k-
MDI tree can be limited to the number of key
attributes. The shallow tree structure resulting from
multi-way branching also facilitates quicker retrieval
of leaf nodes from disk storage. Unlike the sampling
cube of Li et al. (2008), new samples that contain
relevant records are drawn for each query. These
records can be considered as drawn from a subset of
the data warehouse that satisfies the query
predicates. In estimating the query results, we take
into account the proportion of relevant records for
the query in the whole data warehouse. The
sampling and estimation methods are evaluated
experimentally using a real life data set.
The rest of the paper is organized as follows: In
Section 2, we define some relevant terms and briefly
describe the ACE Tree structure. In Section 3, our k-
way multi-dimensional (k-MDI) indexing structure
is described in detail. We also introduce the concept
of relevancy ratios, both for the database and a
specific sample. Section 4 reports the experimental
results that evaluate the efficacy of our scheme.
Section 5 is the conclusion of the paper.
2 TERMS, DEFINITIONS AND
ACE TREE STRUCTURE
In this section, we define some terms pertaining to
data warehousing, define confidence interval and
then review briefly the ACE Tree structure (Joshi
and Jermaine, 2008) that has preceded the k-MDI
tree we propose in Section 3.
2.1 Dimensions and Measure
To support decision support queries, data is usually
structured in large databases called data warehouses.
Typically, data warehouses are relational databases
with a large table in the middle called the fact table
connected to other tables called dimensions. For
example, consider the fact table Sales shown as
Table 1. A dimension table Store linked to StoreNo
in this fact table will contain more information on
each of the stores such as store name, location, state,
and country (Kimball and Moss, 2002). Other
dimension tables could exist for items and date. The
remaining attributes like quantity and amount are
typically, but not necessarily, numerical and are
termed measures. A typical decision support query
aggregates a measure using functions such as Sum(),
Avg() or Count(). The fact table Sales along with all
its dimension tables form a star schema.
Table 1: Fact table Sales.
SALES
Store
No
Date Item Quantity Amount
21 12-Jan-11 iPad 223 123,455
21 12-Jan-11 PC 20 24,800
24 11-Jan-11 iMac 11 9,990
77 25-Jan-11 PC 10 12,600
In decision support queries a measure is of
interest for calculation of averages, totals and
counts. For example, a sales manager may like to
know the total sales quantity and amount for certain
item(s) in a certain period of time for a particular
store or even all (or some) stores in a region. This
may then allow her to make decisions to order more
or less stocks as appropriate at a point in time.
AnEfficientSamplingSchemeforApproximateProcessingofDecisionSupportQueries
17