18.04.5 OS, powered by Intel Core i5 CPU, 4GB of
RAM, and a 20GB disk. The EVACAR program was
developed using the C language and compiled with
the GCC compiler, resulting in a program size of 40
KB.
The primary dataset used for testing the Method
was the STATS dataset (Han et al., 2021), specifically
designed for evaluating Cardinality Estimation
(CardEst) methods. This dataset exhibits intricate
characteristics, including a high number of attributes,
substantial distributed skewness, strong attribute
correlation, and complex table join structures.
Notably, some table key values can be linked to zero
or one record in one table, while hundreds of records
are associated with another table. We deployed the
STATS database within the PostgreSQL 15 Database
Management System (DBMS) and employed the
libpq library to interface with the EVACAR prototype
(Matthew et al., 2005).
One of the most challenging queries in the STATS
dataset is Query Q57 (Han et al., 2021), involving six
table joins and several filtering conditions (we will
call them subqueries). The query, represented below,
served as a focal point for our assessments.
SELECT COUNT(*) // Q57
FROM posts as p, postLinks as pl,
postHistory as ph, votes as v,
badges as b, users as u
WHERE p.Id = pl. RelatedPostId AND
u.Id = p. OwnerUserId AND u.Id = b.
UserId AND u.Id = ph. UserId AND
u.Id = v.UserId
AND p.CommentCount>=0 AND
p.CommentCount<=13
AND ph.PostHistoryTypeId=5 AND
ph.CreationDate<='2014-08-13
09:20:10'::timestamp
AND v.CreationDate>='2010-07-19
00:00:00'::timestamp
AND b.Date<='2014-09-09
10:24:35'::timestamp
AND u.Views>=0 AND u.DownVotes>=0
AND u.CreationDate>='2010-08-04
16:59:53'::timestamp AND
u.CreationDate<='2014-07-22 15:15
:22'::timestamp;
(10)
Running Query Q57 on the virtual machine within
the PostgreSQL environment took approximately 17
minutes (measured considering interruptions by the
host machine). In terms of pure virtual machine time,
the query execution ran roughly 7 minutes, ultimately
yielding 17,849,233,970 records.
Table 1 below presents the characteristics of the
tables involved in Query Q57.
In the assessments, the 'users' table (No. 1 in Table
1) was designated as the root table for evaluating the
cardinality of query (10) and its subplans using the
Method, as defined in Algorithm 1 (Section 3).
Across all experiments, the product of the number of
blocks was maintained at N = Π
j =1...6
N
j
= 10
5
(as
indicated in Table 1), and the number of samples (g)
was set at n = 10. The experiments aimed to
investigate the impact of the number of blocks (1, 2,
4) in the root subquery on the accuracy and time
required for cardinality estimation, as denoted by
options 1, 2, and 3 in Table 1.
To evaluate accuracy, we employed the q-error
estimate, as defined in (Leis et al., 2015):
𝑞− 𝑒𝑟𝑟𝑜𝑟
=
𝑐
𝑐
,𝐼𝐹𝑐
≥𝑐
,
−
𝑐
𝑐
,𝐼𝐹𝑐
>𝑐
,
(11)
where c
true
represents the true cardinality value, and
c
estimate
signifies the estimated cardinality value.
Table 2 outlines the results of the experiments and
presents confidence intervals for the q-errors of the
EVACAR method. In Table 2, the second column
signifies the subplan tables, each denoted by their
respective numbers as presented in Table 1. For
instance, subplan {1,5} represents a join of
subqueries from the 'users' and 'posts' tables, while
subplan {1,2,3,4,5,6} comprises all subqueries of the
original query (10). The third column provides the
actual cardinality values of these subplans. Columns
4 to 6 depict the q-errors, representing the
discrepancies between the true and estimated
cardinality values when using the BayesCard (Wu et
al., 2020), DeepDB (Hilprecht et al., 2020), and
FLAT (Zhu et al., 2021) methods.
Columns 7 to 12 present the medians (M) and the
boundaries of the 95% confidence intervals for q-
errors across the examined options 1, 2, and 3. These
boundaries are derived from the results of 50 runs of
Algorithm 1 (see the Method in Section 2) for each
option. A specific notation system is employed for
these options, formatted as "n10-uX-vY," where n =
10 signifies the number of samples g, while X and Y
represent the number of blocks associated with the
'users' and 'votes' subquery tables, respectively.
An intriguing observation in Table 2 is that the
lengths of the confidence intervals tend to increase
with the number of blocks in the root subquery,
ranging from 1 to 4.
This phenomenon can be attributed to the
substantial distributed skewness within the STATS
dataset. This is explained by the fact that due to the
strong distributed asymmetry of the STATS data set,
the number of combinations g with non-empty block
joins decreases, the N/K ratio in (8) increases, and the
error Δ also increases.