form is A ≤ v. For a categorical attribute, if its
cardinality is small, all subsets of its domain can be
candidate splits; otherwise, we can use a greedy
strategy to create candidate splits.
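For illustration, the following sketch enumerates candidate splits of both kinds; the midpoint convention for numerical attributes and the cutoff for exhaustive subset enumeration are our own simplifying assumptions, not part of any particular classifier:

```python
from itertools import combinations

def numeric_candidate_splits(values):
    """Candidate splits of the form A <= v: midpoints between
    consecutive distinct sorted values (one common convention)."""
    vs = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(vs, vs[1:])]

def categorical_candidate_splits(domain, max_exhaustive=8):
    """If the domain is small, enumerate all non-empty proper subsets;
    otherwise a greedy strategy would be used instead (not shown)."""
    values = sorted(domain)
    if len(values) > max_exhaustive:
        raise NotImplementedError("use a greedy strategy for large domains")
    return [set(s) for r in range(1, len(values))
            for s in combinations(values, r)]
```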
SLIQ (Mehta, Agrawal et al. 1996) and SPRINT (Shafer, Agrawal et al. 1996) are more recent decision-tree classifiers that address scalability issues for large data sets. Both use the Gini index as the impurity function, presorting (for numerical attributes), and breadth-first search to avoid re-sorting at each node. Both SLIQ and SPRINT are still multi-pass algorithms for large data sets because they rely on external sorting and on out-of-core structures such as attribute lists.
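For reference, the Gini impurity of a node with class proportions p_1, ..., p_m is 1 - sum_j p_j^2, and a candidate split is scored by the size-weighted impurity of the partitions it induces. A minimal sketch of this computation (our own illustration, not code from SLIQ or SPRINT):

```python
def gini(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_of_split(partitions):
    """Size-weighted Gini impurity of the partitions induced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Example: a binary split of 10 records, class counts per side.
# gini_of_split([[4, 1], [1, 4]]) -> 0.32
```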
Chaudhuri et al. (Chaudhuri, Fayyad et al. 1999) present a scalable classifier over a SQL database backend. They develop middleware that batches query executions and stages data in memory or in local files to improve performance. At its core is a data structure called the count table (CC table), a four-column table (attribute-name, attribute-value, class-value, count). Gehrke et al. propose RainForest, a uniform framework based on the AVC-group (a data structure similar to CC tables, developed independently) that provides scalable versions of most decision-tree classifiers without changing the quality of the resulting trees (Gehrke, Ramakrishnan et al. 1998).
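Both structures essentially record, per attribute, the counts of (attribute value, class) pairs, which suffice to evaluate split criteria without revisiting the records. A rough sketch of how such counts could be accumulated (the record layout is assumed purely for illustration):

```python
from collections import Counter

def build_cc_table(records, attributes, class_field="class"):
    """Accumulate (attribute-name, attribute-value, class-value) -> count,
    i.e. the information held in a CC table / AVC-group."""
    counts = Counter()
    for rec in records:
        for attr in attributes:
            counts[(attr, rec[attr], rec[class_field])] += 1
    return counts

# Usage: one pass over a node's records yields counts small enough to
# evaluate every candidate split without touching the data again.
table = build_cc_table(
    [{"age": "young", "income": "low", "class": "no"},
     {"age": "young", "income": "high", "class": "yes"}],
    attributes=["age", "income"])
```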
Because CC tables and AVC-groups are usually much smaller than the original data or the attribute lists in SPRINT, these two algorithms generally improve mining performance. However, they, together with all other classification algorithms we are aware of, including SLIQ and SPRINT, still need to physically access the original data set (sometimes in multiple scans) to compute the best splits and to partition the data in each node according to the splitting criteria. Unlike these algorithms, our cube-based decision-tree construction neither computes and stores the F-sets (all the records belonging to an internal node) to find the best splits nor partitions the data set physically. Instead, we compute the splits from the data cubes, as described in more detail in Sec. 5.
The BOAT algorithm (Gehrke, Ganti et al. 1999) constructs a decision tree and coarse split criteria from a large sample of the original data using a statistical technique called bootstrapping. Other classification methods include Bayesian classification (Cheeseman and Stutz 1996), backpropagation (Lu, Setiono et al. 1995), association rule mining (Lent, Swami et al. 1997), k-nearest-neighbor classification (Duda and Hart 1973), etc. Recently, a statistics-based classifier has been built on top of the data cube (Fu 2003).
Since cubeDT is built on top of OLAP and data cube technology, the performance of cube computation has a direct influence on it. Next, we briefly introduce some of the cube systems and cube computation algorithms. To compute data cubes, various ROLAP (relational OLAP), MOLAP (multidimensional OLAP), and HOLAP (hybrid OLAP) systems have been proposed (Chaudhuri and Dayal 1997). Materialized views and indexing are often used to speed up the evaluation of data cubes and OLAP queries.
Materializing all the aggregate GROUP BY
views may incur excessive storage requirements and
maintenance overhead for these views. A view
selection algorithm proposed by Harinarayan et al.
(Harinarayan, Rajaraman et al. 1996) uses a greedy strategy to choose a set of views over the lattice structure under a constraint on available space or on the number of views to materialize. Agarwal et al. (Agarwal, Agrawal et al. 1996) overlap or pipeline the computation of the views so that the cost of the processing tree is minimized. For sparse data, Zhao et al. propose a chunking method and a sparse data structure for sparse chunks (Zhao, Deshpande et al. 1997).
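To illustrate the greedy view-selection strategy mentioned above, the sketch below materializes, in each round, the view with the largest estimated benefit (the total query-cost reduction over the views it can answer); the cost model and input encoding are simplified assumptions, not the exact formulation of (Harinarayan, Rajaraman et al. 1996):

```python
def greedy_view_selection(views, sizes, answers, base_size, k):
    """Greedy view selection: in each of k rounds, materialize the view v
    with the largest benefit, i.e. the total reduction in query cost over
    the views w it can answer (answers[v]).  Answering w costs the size of
    the smallest materialized view covering it (the base cuboid at first)."""
    cost = {w: base_size for w in views}
    chosen = set()
    for _ in range(k):
        candidates = [v for v in views if v not in chosen]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda v: sum(max(cost[w] - sizes[v], 0)
                                     for w in answers[v]))
        chosen.add(best)
        for w in answers[best]:
            cost[w] = min(cost[w], sizes[best])
    return chosen
```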
For dimensions with small cardinalities, bitmap
indexing is very effective (O'Neil 1987). It is
suitable for ad-hoc OLAP queries and has good
performance due to quick bitwise logical operations.
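For instance, a bitmap index keeps one bit vector per distinct dimension value, so a conjunctive selection reduces to bitwise ANDs; a minimal illustration using Python integers as bit vectors (not taken from any of the cited systems):

```python
def build_bitmap_index(column):
    """One bitmap (stored as a Python int) per distinct value of the column."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

# Rows where region == 'east' AND quarter == 'Q1' via a single bitwise AND.
region = build_bitmap_index(["east", "west", "east", "east"])
quarter = build_bitmap_index(["Q1", "Q1", "Q2", "Q1"])
matches = region["east"] & quarter["Q1"]          # bits 0 and 3 set
row_ids = [i for i in range(4) if matches >> i & 1]
```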
However, bitmap indexing is inefficient for large domains, where encoded bitmaps (Chan and Ioannidis 1998) or B-trees (Comer 1979) can be used instead. Other work related to indexing includes variant indexes (O'Neil and Quass 1997), join indexes, etc. Beyer and Ramakrishnan develop the BUC (bottom-up cubing) algorithm for computing only the group-bys whose aggregate counts are above some threshold (Beyer and Ramakrishnan 1999). Johnson and Shasha (Johnson and Shasha 1997) propose cube trees and cube forests for cubing. To improve the performance of ROLAP algorithms, which often require multiple passes over large data sets, a multidimensional data structure called the Statistics Tree (ST) (Fu and Hammer 2000) has been developed. The computation of data cubes that combine arbitrary hierarchy levels is optimized in (Hammer and Fu 2001). Other important recent work includes Dwarf (Sismanis, Deligiannakis et al. 2002) and QC-trees (Lakshmanan, Pei et al. 2003).
3 SPARSE STATISTICS TREES
An ST tree is a multi-way, balanced tree in which each level (except the leaf level) corresponds to an attribute. Leaf nodes contain the aggregates and are linked to facilitate storage and retrieval. An internal node has one pointer for each domain value, and an additional "star" pointer