that satisfy the conditions of a given query) could be
drawn more efficiently using the k-MDI tree.
The sampling scheme using the k-MDI tree index
(Rudra et al., 2012) facilitates picking rich samples
for queries with highly specific selection conditions.
If the sample contains an adequate number of
records that satisfy the query conditions, the average
values can be estimated from these records.
However, to estimate a sum (= avg × count) we need
to estimate both the average and the count of the
records that satisfy the query in the whole database.
Therefore, from the sample we need to project the
number of records in the entire database that satisfy
the given query conditions. Chaudhuri and Mukerjee
(1985) proposed an unbiased estimator based on
inverse simple random sampling without replacement
(SRSWOR), in which random sampling is carried out
on a finite population until a predefined number of
domain members are observed. In this paper, we
propose adapting inverse SRSWOR to estimate
adequate sample sizes for queries using the k-MDI
tree index. The method is empirically evaluated on a
large real-world data warehouse.
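As a concrete illustration (a sketch, not the authors' implementation), the following Python fragment draws records in SRSWOR order until m matching records are observed and then applies the classical unbiased domain-size estimator N(m − 1)/(n − 1), where N is the population size and n the number of draws needed to reach the m-th match; the function name, the predicate/measure interface, and the synthetic rows are hypothetical:

import random

def inverse_srswor_estimate(population, predicate, measure, m):
    # Draw records in SRSWOR order until m matches are observed.
    # Assumes m >= 2 so that the (n - 1) denominator is positive.
    N = len(population)
    order = random.sample(range(N), N)   # random permutation = SRSWOR order
    matches, n = [], 0
    for idx in order:
        n += 1
        record = population[idx]
        if predicate(record):
            matches.append(record)
            if len(matches) == m:
                break
    if len(matches) < m:
        raise ValueError("population has fewer than m matching records")
    count_hat = N * (m - 1) / (n - 1)    # projected matches in the database
    avg_hat = sum(measure(r) for r in matches) / m
    return avg_hat, count_hat, avg_hat * count_hat   # sum = avg x count

# Example: estimate the total amount over 'PC' rows of a synthetic table.
rows = [("PC", 1240.0)] * 800 + [("iMac", 910.0)] * 9200
avg_hat, count_hat, sum_hat = inverse_srswor_estimate(
    rows, lambda r: r[0] == "PC", lambda r: r[1], m=50)

In the setting of this paper, the draws would come from the leaf sections of the k-MDI tree that are relevant to the query, rather than from a scan of the whole fact table.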
The rest of the paper is organized as follows: in
Section 2, we briefly describe the k-ary multi-
dimensional index tree (k-MDI tree) and the storage
structure of data records. Section 3 discusses how to
pick adequate samples using inverse SRSWOR. In
Section 4, we discuss the results of our experiments.
Finally, Section 5 concludes the paper.
2 TERMS, DEFINITIONS AND
k-MDI TREE INDEX
In this section, we define some terms pertaining to
data warehousing, define the confidence interval, and
then review the k-MDI tree index for retrieving
relevant samples from a data warehouse.
2.1 Dimension and Measures
To execute decision support queries, data is usually
structured in large databases called data warehouses.
Typically, data warehouses are relational databases
with a large central table, called the fact table,
connected to surrounding tables called dimensions. For
example, consider the fact table Sales shown as
Table 1. A dimension table Store, linked to StoreNo
in this fact table, will contain more information on
each of the stores, such as store name, location, state,
and country (Hobbs et al., 2003). Other dimension tables
could exist for items and date. The remaining
attributes, such as quantity and amount, are typically,
but not necessarily, numerical and are termed measures.
A typical decision support query aggregates a
measure using functions such as Sum(), Avg() or
Count(). The fact table Sales along with all its
dimension tables forms a star schema.
Table 1: Fact table SALES.

    StoreNo  Date       Item  Quantity   Amount
    21       12-Jan-11  iPad       223  123,455
    21       12-Jan-11  PC          20   24,800
    24       11-Jan-11  iMac        11    9,990
    77       25-Jan-11  PC          10   12,600
In decision support queries, a measure is of
interest for calculating averages, totals and counts.
For example, a sales manager may want to know the
total sales quantity and amount for certain item(s) in
a certain period of time for a particular store, or even
all (or some) stores in a region. This may then allow
her to decide to order more or less stock, as
appropriate, at a point in time.
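As a toy illustration, the following Python fragment evaluates such an aggregate exactly over the four rows of Table 1 (an approximate query would instead estimate the same quantities from a sample):

# Rows of Table 1 as (StoreNo, Date, Item, Quantity, Amount) tuples.
SALES = [
    (21, "12-Jan-11", "iPad", 223, 123455),
    (21, "12-Jan-11", "PC",    20,  24800),
    (24, "11-Jan-11", "iMac",  11,   9990),
    (77, "25-Jan-11", "PC",    10,  12600),
]

# Total quantity and amount for item 'PC' across all stores.
pc_rows = [r for r in SALES if r[2] == "PC"]
total_quantity = sum(r[3] for r in pc_rows)   # 20 + 10 = 30
total_amount = sum(r[4] for r in pc_rows)     # 24,800 + 12,600 = 37,400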
2.2 Multidimensional Indexing
The k-ary multi-dimensional index tree (k-MDI tree)
proposed by Rudra et al. (2012) extends the ACE
Tree index (Joshi and Jermaine, 2008) to multiple
dimensions. The height of the k-MDI tree is limited
to the number of key attributes. As a multi-way tree
index, it is relatively shallow even for a large
number of key value ranges and so requires only a
small number of disk accesses to traverse from the
root to the leaf nodes.
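For illustration, with an assumed fan-out of k = 10 over h = 4 indexed attributes, the tree can distinguish up to 10^4 = 10,000 combinations of key-value ranges, yet a root-to-leaf traversal visits only five nodes (the root, three further index levels, and the leaf).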
The k-MDI tree is a k-ary balanced tree (Bentley,
1975), as described below:
1. The root node of a k-MDI tree corresponds to the
first attribute (dimension) in the index.
2. The root points to k₁ (k₁ ≤ k) index nodes at level
2, with each node corresponding to one of the k₁
splits of the ranges for attribute a₁.
3. Each of the nodes at level 2, in turn, points to up
to k₂ (k₂ ≤ k) index nodes at level 3,
corresponding to k₂ splits of the ranges of values
of attribute a₂; similarly for nodes at levels 3 to
h, corresponding to attributes a₃, ..., aₕ.
4. At level h, each of up to kₕ₋₁ nodes points to up to
kₕ (kₕ ≤ k) leaf nodes that store data records.
5. Each leaf node has h+1 sections; for sections 1 to
h, each section i contains a random subset of the
records in the key range of the node at level i on
the path from the root to that leaf.
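To summarize this layout, the following is a minimal, hypothetical Python sketch of the node structures and the root-to-leaf descent described above; the class and function names are ours, and the leaf-section semantics follow the ACE Tree design that the k-MDI tree extends:

from dataclasses import dataclass
from typing import Any, List, Tuple, Union

@dataclass
class LeafNode:
    # Leaf with h+1 sections; section i holds a random subset of the
    # records in the key range of the level-i node on the root-to-leaf
    # path (per the ACE Tree layout the k-MDI tree extends).
    sections: List[List[Any]]

@dataclass
class IndexNode:
    # Index node at level i: splits attribute a_i into up to k ranges,
    # with one child (index node or leaf) per range.
    level: int
    ranges: List[Tuple[Any, Any]]                 # [low, high) bounds for a_i
    children: List[Union["IndexNode", LeafNode]]  # len(children) <= k

def descend(root: IndexNode, key: Tuple) -> LeafNode:
    # Follow the key values a_1..a_h from the root down to the leaf
    # whose ranges enclose the key; at most h index nodes are visited.
    node: Union[IndexNode, LeafNode] = root
    while isinstance(node, IndexNode):
        value = key[node.level - 1]               # attribute indexed here
        for (low, high), child in zip(node.ranges, node.children):
            if low <= value < high:
                node = child
                break
        else:
            raise KeyError("key value outside the indexed ranges")
    return node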