
improve the total performance of evaluating a group
of range queries issued on a relational table.
Meanwhile, we also point out that there often exist
many very slender leaf nodes when the R*-tree is
used to index business data, which causes some
problems both with the R*-tree construction and
with queries, where Slender nodes means those
having a very narrow side (even the side length is
zero) in some dimension. We also propose an
approach to solve the problems of slender nodes.
The rest of the paper is organized as follows.
Section 2 introduces the R*-tree and describe how to
use multidimensional indices for relational table.
Section 3 presents our new observations when the
R*-tree is used to business data. Section 4
describes our proposal. Section 5 gives experimental
comparison using synthetic data as well as realistic
data, and Section 6 concludes the paper.
2 R*-TREE IN BUSINESS DATA
This section gives a brief review of the R*-tree and
describe how to use it to business data.
2.1 R*-tree
Let us briefly recall the R*-tree.
The R*-tree (Beckmann and Kriegel, 1990) is a
hierarchy of nested d-dimensional MBRs (Minimum
Bounding Rectangles). Each non-leaf node of the
R*-tree contains an array of entries, each of which
consists of a pointer and an MBR. The pointer
refers to one child node of this non-leaf node and the
MBR is the minimum bounding rectangle of the
child node referred to by the pointer. Each leaf node
of the R-tree contains an array of entries, each of
which consists of an object identifier and its
corresponding point (for point-object databases) or
its MBR (for extended-object databases). The
R*-tree is built by inserting the objects (tuples for
relational tables) one by one. Throughout the
remainder of this paper, no distinction is made
between R*-tree nodes and their corresponding
MBRs in the corresponding multidimensional space
when the meaning is clear in the context. Also, the
terms of tuple and object are also used
interchangeably.
The R*-tree is one of the most successful
variants of the well-known R-tree family. It uses
sophisticated insertion and node splitting algorithms
with the forced reinsertion mechanism.
2.2 Indexing Business Data Using
R*-tree
Now, we briefly recall how the R*-tree index
business data stored in a relational table and give
some terms. Let T be a relational table with n
attributes, denoted by T(A
1
, A
2
, …, A
n
). Attribute A
i
(1≤i≤n) has domain D(A
i
), a set of possible values
for A
i
. The attributes often have types such as
Boolean, integer, floating point, character string,
date and so on. Each tuple t in T is denoted by <a
1
,a
2
,
…,a
n
>, where a
i
(1≤i≤ n) is an element of D(A
i
).
When the R*-tree is used in relational tables,
some of the attributes are usually chosen as index
attributes, which are used to build the R*-tree. For
simplification of description, it is supposed without
loss of generality that the first k (1≤k≤n) attributes of
T, <A
1
,A
2
, … ,A
k
>, are chosen as index attributes.
Since the R*-tree can only deal with numeric data,
an order-preserving transformation is necessary for
each non-numeric index attributes.
After necessary transformations, the k index
attributes form an k-dimensional space, called index
space, where each tuple of T corresponds to one
point.
It is not difficult to find such a mapping
function for Boolean attributes and date attributes.
For Boolean data, “True” and “False” can be
mapped onto 1 and 0, respectively, if “True” >
“False” is assumed forcedly. This ordering has no
practical problems, because the predicate of
“equality” such as “A = True” or “A = False” is the
only predicate pattern for the Boolean attribute.
Although implementation of “date” depends on
DBMS, typical example of “date” in TPC-H
benchmark consists of three integers representing
year, month, and day. A simple function to get a
numeric value for a “date” is to use the number of
days from some reference date to this ``date''. In
this paper, the day of Jan. 1, 1900 is used as the
reference day, that is, the number of days from Jan. 1,
1900 to Apr. 5, 1998 is used to represent the date of
Apr. 5, 1998.
It is not easy to map an arbitrary character
string to a unique numeric data. The work (H. V.
Jagadish and Srivastava, 2000) proposes an efficient
approach that maps character strings to real
numeric values within [0,1], where the mapping
preserves the lexicographic order. This approach is
also used in this study to deal with attributes of
character string.
We call the value range of A
i
, [l
i
,u
i
] (1 ≤i
≤k), data range of A
i
attribute (in this paper,
“dimension” and “index attribute” are used
interchangeably). The length of the data range of A
i
,
|u
i
-l
i
|, is denoted by R(A
i
). The k-dimensional
IMPROVING QUERY PERFORMANCE ON OLAP-DATA USING ENHANCED MULTIDIMENSIONAL INDICES
283