of choosing subtrees for newly arriving tuples,
determines the clustering pattern of tuples among the
leaf nodes. We then show that the clustering criterion
currently used in the insert algorithm of R*-tree is not
suitable when R*-tree is applied to business
data. Instead, a hybrid clustering criterion is
proposed. Our discussion and experiments indicate
that the query performance of R*-tree on business data
is improved considerably by the new clustering criterion.
The rest of the paper is organized as follows.
Section 2 describes how to use multidimensional
indices for relational data. Section 3 presents our
observations when R*-tree is applied to business data
and discusses the reasons for these observations in
detail. Section 4 presents our proposal: a hybrid clustering
criterion for R*-tree. Section 5 gives experimental
results, and Section 6 concludes the paper.
2 INDEXING BUSINESS DATA
USING R*-TREE
This section describes how R*-tree is applied to
business data and introduces some terms. Because of
page limitations, R*-tree itself is not introduced in this
paper. Readers can refer to (Beckmann and
Kriegel, 1990; Y. Feng, A. Makinouchi and H. Ryu,
2004).
Let T be a relational table with n attributes,
denoted by T(A1, A2, …, An). Attribute Ai (1 ≤ i ≤
n) has domain D(Ai), a set of possible values for Ai.
The attributes typically have types such as Boolean,
integer, floating-point, character string, date, and so on.
Each tuple t in T is denoted by <a1,a2, … ,an>,
where ai (1 ≤ i ≤ n) is an element of D(Ai).
When R*-tree is used on relational tables, some
of the attributes are usually chosen as index
attributes, which are used to build the R*-tree. To
simplify the description, we assume without
loss of generality that the first k (1 ≤ k ≤ n) attributes
of T, <A1,A2, … ,Ak>, are chosen as index
attributes. Since R*-tree can only deal with numeric
data, an order-preserving transformation is necessary
for each non-numeric index attribute. After the
necessary transformations, the k index attributes
form a k-dimensional space, called the index space,
in which each tuple of T corresponds to one point.
It is not difficult to find such a mapping
function for Boolean attributes and date attributes (Y.
Feng, A. Makinouchi and H. Ryu, 2004). The work
(H. V. Jagadish and Srivastava, 2000) proposes an
efficient approach that maps character strings to real
values within [0,1], where the mapping
preserves the lexicographic order. This approach is
also used in this study to handle character-string
attributes.
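As a minimal sketch of such order-preserving transformations, the following Python code maps one tuple to a point in the index space. All function names are hypothetical. In particular, string_to_unit_interval is a simple stand-in that reads a fixed-length byte prefix as a base-257 fraction; it is an assumption made for illustration and is not the approach of (H. V. Jagadish and Srivastava, 2000). Strings that share the same prefix of prefix_len bytes map to the same value, so the order is preserved only in the non-strict sense.

```python
from datetime import date


def bool_to_number(value):
    # False < True is preserved as 0.0 < 1.0.
    return 1.0 if value else 0.0


def date_to_number(value):
    # Proleptic Gregorian ordinal: later dates get larger numbers.
    return float(value.toordinal())


def string_to_unit_interval(text, prefix_len=6):
    # Hypothetical stand-in mapping: read the first prefix_len bytes as
    # digits of a base-257 fraction. Each byte b becomes digit b + 1 and
    # missing positions become digit 0, so a prefix sorts before its
    # extensions. The result lies in [0, 1).
    base = 257
    data = text.encode("utf-8")[:prefix_len]
    number = 0
    for i in range(prefix_len):
        digit = data[i] + 1 if i < len(data) else 0
        number = number * base + digit
    return number / base ** prefix_len


def tuple_to_point(row, transforms):
    # Apply one transformation per index attribute (the first k attributes).
    return tuple(f(v) for f, v in zip(transforms, row))


# Example with k = 4 index attributes: string, date, Boolean, numeric.
transforms = (string_to_unit_interval, date_to_number, bool_to_number, float)
row = ("apple", date(2004, 4, 14), True, 19.5)
point = tuple_to_point(row, transforms)
```

The fixed prefix length keeps every intermediate integer below 2^53, so the final division does not reorder distinct prefixes.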
For an index attribute Ai (1 ≤ i ≤ k), we call its value
range [li, ui] the data range of Ai (in this paper,
“dimension” and “index attribute” are used
interchangeably). The length of the data range of Ai,
|ui-li|, is denoted by R(Ai). The k-dimensional
hyper-rectangle, [l1,u1]× [l2,u2]×…×[lk, uk], forms
the index space. Attributes specified in a range
query condition are called query attributes.
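The definitions of data range and R(Ai) can be illustrated with a short Python sketch. The function names and the sample points are assumptions made for illustration only; the points stand for tuples that have already been transformed into the index space.

```python
def data_ranges(points):
    # Return [(l1, u1), ..., (lk, uk)]: the data range of each index attribute.
    k = len(points[0])
    return [
        (min(p[i] for p in points), max(p[i] for p in points))
        for i in range(k)
    ]


def range_lengths(ranges):
    # R(Ai) = |ui - li| for each index attribute Ai.
    return [abs(u - l) for l, u in ranges]


# Three tuples in a 2-dimensional index space (illustrative values).
points = [(0.2, 10.0), (0.9, 40.0), (0.5, 25.0)]
ranges = data_ranges(points)     # the index space is the product of these
lengths = range_lengths(ranges)  # R(A1), R(A2)
```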
If R*-tree is used to index business data stored
in a relational table, all the tuples are clustered in
R*-tree leaf nodes. See Figure 1.
Figure 1: Leaf nodes and query range (labels: leaf nodes, query range, tuple).
Figure 1 shows an example of leaf nodes and a query
range. The query range, given by the user, is the
region in which the user wants to find results.
Figure 1 makes clear that if the tuples are properly
clustered among the leaf nodes, the number of leaf
nodes to be accessed for this range query will drop.
Thus, the clustering pattern is a decisive factor in
query performance. What, then, decides
the clustering pattern? The answer is the “clustering
criterion” in the insert algorithm of R*-tree.
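The effect of the clustering pattern on a range query can be sketched in a few lines of Python. The example below is an illustration under assumed data, not an experiment from this paper: the same eight points are grouped into two leaf nodes in two different ways, and the number of leaf nodes whose bounding rectangle intersects the query range is counted. All names are hypothetical.

```python
def mbr(points):
    # Minimum bounding rectangle of a leaf node, as (lows, highs).
    k = len(points[0])
    lows = tuple(min(p[i] for p in points) for i in range(k))
    highs = tuple(max(p[i] for p in points) for i in range(k))
    return lows, highs


def intersects(rect, query):
    # Two rectangles intersect if their intervals overlap in every dimension.
    (r_low, r_high), (q_low, q_high) = rect, query
    return all(
        r_low[i] <= q_high[i] and q_low[i] <= r_high[i]
        for i in range(len(r_low))
    )


def leaves_accessed(leaves, query):
    # Number of leaf nodes a range query has to read.
    return sum(1 for leaf in leaves if intersects(mbr(leaf), query))


# Eight tuples: x in {1, 2, 3, 4}, y in {1, 9}.
low_y = [(1, 1), (2, 1), (3, 1), (4, 1)]
high_y = [(1, 9), (2, 9), (3, 9), (4, 9)]

clustered_by_y = [low_y, high_y]
clustered_by_x = [low_y[:2] + high_y[:2], low_y[2:] + high_y[2:]]

# Range query that restricts only y: 0 <= x <= 5 and 0 <= y <= 2.
query = ((0, 0), (5, 2))

accessed_by_y = leaves_accessed(clustered_by_y, query)
accessed_by_x = leaves_accessed(clustered_by_x, query)
```

With these assumed points, the clustering by y lets the query skip one leaf node, whereas the clustering by x forces it to read both.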
An R*-tree is constructed by inserting the objects
one by one. During construction, the insert
algorithm has to choose a proper subtree to contain
each newly arriving tuple. The criterion that decides
which subtree should be chosen is called the insert
criterion or clustering criterion in this paper. For a
given dataset, this criterion thus decides the
final clustering pattern of the tuples among the leaf
nodes. In this paper, we point out that the
present clustering criterion of R*-tree cannot lead to
a proper clustering pattern when R*-tree is applied to
business data, and we propose a novel clustering
criterion.
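To make the role of a clustering criterion concrete, the following Python sketch chooses a subtree for a new point by least area enlargement, with ties broken by smaller area. This is a criterion commonly used in R-tree variants and is shown only as an illustration; it is neither the complete insert criterion of R*-tree (which also considers overlap at the leaf level) nor the hybrid criterion proposed in this paper. All names are hypothetical.

```python
def area(rect):
    # Area (volume) of a rectangle given as (lows, highs).
    lows, highs = rect
    result = 1.0
    for low, high in zip(lows, highs):
        result *= high - low
    return result


def enlarge(rect, point):
    # Smallest rectangle that contains both rect and the new point.
    lows, highs = rect
    new_lows = tuple(min(low, x) for low, x in zip(lows, point))
    new_highs = tuple(max(high, x) for high, x in zip(highs, point))
    return new_lows, new_highs


def choose_subtree(rects, point):
    # Index of the subtree whose rectangle needs the least area enlargement
    # to include the point; ties are broken by the smaller current area.
    def cost(i):
        rect = rects[i]
        return (area(enlarge(rect, point)) - area(rect), area(rect))

    return min(range(len(rects)), key=cost)


# Bounding rectangles of two candidate subtrees (illustrative values).
subtrees = [((0.0, 0.0), (2.0, 2.0)), ((5.0, 5.0), (9.0, 9.0))]
chosen_near_first = choose_subtree(subtrees, (3.0, 1.0))
chosen_inside_second = choose_subtree(subtrees, (6.0, 6.0))
```

Because every insertion applies the same rule, the rule used in choose_subtree fixes which tuples end up sharing a leaf node.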
3 OBSERVATIONS AND OUR
EXPLANATION
This section presents our observations on R*-tree used
for business data and explains these
observations.