The test protocol we adopted (see Algorithm 5)
was executed for each estimation technique
(LOGLOG, probabilistic counting and GIBBONS-TIRTHAPURA),
GROUP BY query, random seed and memory size.
For each combination of these parameter values,
we record the estimated size of the GROUP BY
and the time required to compute it. For the
multifractal estimation technique, we recorded
in the same way the time and estimated size
for each GROUP BY, sampling ratio and random seed.
Algorithm 5 Test protocol.
for GROUP BY query q ∈ Q do
  for memory budget m ∈ M do
    for random seed value r ∈ R do
      Estimate the size of GROUP BY q with m memory budget and r random seed value
      Save estimation results (time and estimated size) in a log file
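The loop structure of Algorithm 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `run_protocol` and `kmv_estimate` are hypothetical, the log is kept in memory rather than written to a file, and the stand-in estimator is a simple k-minimum-values sketch, not one of the techniques compared in this paper.

```python
import hashlib
import itertools
import time


def kmv_estimate(rows, memory_budget, seed):
    """Stand-in estimator (k-minimum-values); NOT one of the paper's techniques.

    For brevity it stores every hash value; a real implementation would
    keep only the memory_budget smallest ones.
    """
    key = seed.to_bytes(8, "little")  # the random seed selects the hash function
    hashes = set()
    for row in rows:
        digest = hashlib.blake2b(repr(row).encode(), key=key, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "little"))
    if len(hashes) <= memory_budget:
        return len(hashes)  # everything fits in the budget: the count is exact
    kth = sorted(hashes)[memory_budget - 1]
    return int((memory_budget - 1) * 2**64 / kth)


def run_protocol(table, queries, memory_budgets, seeds, estimate=kmv_estimate):
    """For every (q, m, r), estimate the size of GROUP BY q and log the result."""
    log = []
    # product() iterates q in the outer loop, then m, then r, as in Algorithm 5.
    for q, m, r in itertools.product(queries, memory_budgets, seeds):
        start = time.perf_counter()
        projected = (tuple(row[d] for d in q) for row in table)
        size = estimate(projected, m, r)
        elapsed = time.perf_counter() - start
        log.append({"query": q, "memory": m, "seed": r,
                    "estimated_size": size, "seconds": elapsed})
    return log
```

Here a query is represented as a tuple of column indexes, so `run_protocol(table, [(0,), (0, 1)], [16, 64], range(20))` would produce one log entry per combination of view, memory budget and seed.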
US Census 1990. Figure 2 plots the largest 95th-percentile
error observed over 20 test estimations for various
memory sizes M ∈ {16, 64, 256, 2048}. For the
multifractal estimation technique, we represent the
error for each sampling ratio p ∈ {0.1%, 0.3%, 0.5%, 0.7%}.
The X axis represents the exact size of each GROUP BY.
This 95th-percentile error can be related to the
theoretical bound for ε with 19/20 reliability for
GIBBONS-TIRTHAPURA (see Corollary 1): we see that this
upper bound is verified experimentally. However, the
error on “small” view sizes can exceed 100% for
probabilistic counting and LOGLOG.
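As an illustration of how such a 19/20 error can be obtained from repeated runs, the following Python sketch computes, for one view, the smallest relative error bound satisfied by at least 19 out of every 20 estimates. The function names and the exact percentile convention (discarding the worst twentieth of the runs) are our assumptions for this example.

```python
def relative_error(estimate, exact):
    """Relative error of one estimate with respect to the exact view size."""
    return abs(estimate - exact) / exact


def error_19_in_20(estimates, exact):
    """Smallest error bound met by at least 19/20 of the estimates."""
    errors = sorted(relative_error(e, exact) for e in estimates)
    n = len(errors)
    rank = (19 * n + 19) // 20  # integer ceiling of 19n/20; 19 when n = 20
    return errors[rank - 1]
```

With 20 estimates per view, this returns the 19th smallest relative error, so a single outlying run does not affect the reported value.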
Synthetic data set. Similarly, we computed the
19/20 error for each technique on the
DDBGEN data set. We observed that the four techniques
exhibit the same behaviour as on the US Census
data set. This time, however, the theoretical bound
for the 19/20 error is larger because the synthetic data
set has many views with fewer than 2 dimensions.
Speed. We have also computed the time needed for
each technique to estimate view sizes. We do not
represent this time because it is similar for each
technique except for the multifractal technique, which
is the fastest. In addition, we observed that the time
does not depend on the memory budget because most of
the time is spent streaming and hashing the data. For
the multifractal technique, the processing time
increases with the sampling ratio.
The time needed to estimate the size of all
the views by GIBBONS-TIRTHAPURA, probabilistic
counting and LOGLOG is about 5 minutes for the
US Census 1990 data set and 7 minutes for the
synthetic data set. For the multifractal technique, all
the estimates are done in roughly 2 seconds. This
time does not include the time needed for sampling
the data, which can be significant: it takes 1 minute
(resp. 4 minutes) to sample 0.5% of the US Census data
set (resp. the synthetic TPC-H data set) because the
data is not stored in a flat file.
6 DISCUSSION
Our results show that probabilistic counting and
LOGLOG do not entirely live up to their theoretical
promise. For small view sizes, the relative accuracy
can be very low.
When comparing the memory usage of the various
techniques, we have to keep in mind that the
memory parameter M can translate into different
memory usage. The memory usage also depends on
the number of dimensions of each view. Generally,
GIBBONS-TIRTHAPURA will use more memory for
the same value of M than either probabilistic counting
or LOGLOG, though all of these can be small compared
to the memory usage of the lookup tables T_i
used for k-wise independent hashing. In this paper,
the memory usage was always of the order of a few
MiB, which is negligible in a data warehousing context.
View-size estimation by sampling can take minutes
when the data is not laid out in a flat file or
indexed, but the time required for an unassuming
estimation is even higher. Streaming and hashing the
tuples accounts for most of the processing time, so for
faster estimates, we could store all hashed values in a
bitmap (one per dimension).
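One possible reading of this idea is sketched below in Python: the hashed value of every row is precomputed once per dimension, and the hash of a row for a given view is then obtained by combining the cached columns instead of streaming and hashing the tuples again. Everything in this sketch is an assumption made for illustration: the function names, the use of per-dimension lookup tables filled with random 64-bit integers, and the combination of per-dimension hashes by XOR.

```python
import random


def build_hashed_columns(table, num_dims, seed=0, bits=64):
    """Precompute, for each dimension, the hashed value of every row."""
    rng = random.Random(seed)
    lookup = [dict() for _ in range(num_dims)]  # one lookup table per dimension
    columns = [[] for _ in range(num_dims)]     # one cached hash column per dimension
    for row in table:
        for i in range(num_dims):
            if row[i] not in lookup[i]:
                lookup[i][row[i]] = rng.getrandbits(bits)
            columns[i].append(lookup[i][row[i]])
    return columns


def hashed_view(columns, dims):
    """Hash of each row for the view on dims (non-empty): XOR of cached columns."""
    hashed = []
    for parts in zip(*(columns[i] for i in dims)):
        h = 0
        for p in parts:
            h ^= p
        hashed.append(h)
    return hashed
```

The output of `hashed_view` could then be fed to any of the hashing-based estimators, so that the original tuples are read only once regardless of the number of views.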
7 CONCLUSION AND FUTURE WORK
In this paper, we have provided unassuming tech-
niques for view-size estimation in a data warehousing
context. We adapted an estimator due to Gibbons and
Tirthapura. We compared this technique experimen-
tally with stochastic probabilistic counting, LOGLOG,
and multifractal statistical models. We have demon-
strated that among these techniques, only GIBBONS-
TIRTHAPURA provides stable estimates irrespective
of the size of views. Otherwise, (stochastic) proba-
bilistic counting has a small edge in accuracy for rela-
tively large views, whereas the competitive sampling-
based technique (multifractal) is an order of mag-
nitude faster but can provide crude estimates. Ac-
cording to our experiments, LOGLOG was not faster
UNASSSUMING VIEW-SIZE ESTIMATION TECHNIQUES IN OLAP - An Experimental Comparison