partitioning step are those where the inputs all share common hash attributes or where indirect partitioning is possible. One partitioning step is sufficient because the cleanup can be performed by loading all partitions at the same index into memory and then probing.
Star joins can be processed using multi-dimensional partitioning. Multi-dimensional partitioning requires that the build relations be read multiple times during the cleanup phase, but this may still be more efficient than the corresponding binary plans. The rest of this section provides background on existing multi-way join implementations.
2.2.1 Hash Teams
Hash teams (Graefe et al., 1998a), used in Microsoft SQL Server 7.0, perform a multi-way hash join where the inputs share common hash attributes. A hash team consists of hash operators and a team manager that coordinates the memory management for all inputs. Performance gains of up to 40% were reported. The probe ordering is the same as in the original binary plan: the operators themselves remain binary and are coordinated externally by the team manager, which is separate from the regular plan operators.
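As a rough sketch of this design (names such as TeamManager and hash_part are hypothetical, not from the paper), the essential point is that the manager fixes a single partitioning function shared by every input in the team, so the same partition index of all inputs can later be joined in memory:

    # Minimal sketch of a hash team: one shared partitioning function,
    # so tuples agreeing on the common hash attributes land at the same
    # partition index in every input of the team.

    def hash_part(key, num_partitions):
        return hash(key) % num_partitions

    class TeamManager:
        """Coordinates partitioning (and, in a real system, the shared
        memory budget) for all inputs of the multi-way hash join."""

        def __init__(self, num_partitions):
            self.num_partitions = num_partitions

        def partition(self, relation, key_of):
            parts = [[] for _ in range(self.num_partitions)]
            for tup in relation:
                parts[hash_part(key_of(tup), self.num_partitions)].append(tup)
            return parts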
2.2.2 Generalized Hash Teams
Hash teams were extended to generalized hash teams (Kemper et al., 1999), which allow tables to be joined using indirect partitioning. Indirect partitioning partitions a relation on an attribute that functionally determines the partitioning attribute. For example, a TPC-H (TPC, 2013) query joining the relations Customer, Orders, and LineItem can be executed using a generalized hash team that partitions the first two relations on custkey. LineItem, which has no custkey attribute, probes by using its orderkey attribute to indirectly determine a partition number via a mapping constructed while partitioning Orders. This mapping returns a partition number given an orderkey.
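As an illustration, the exact (memory-intensive) form of this mapping can be sketched as a dictionary filled while partitioning Orders; the names and partition count below are assumptions for the example, and the bitmap approximation described next replaces the dictionary:

    # Sketch of indirect partitioning with an exact mapping: while Orders
    # is partitioned on custkey, remember each order's partition so that
    # LineItem (which has no custkey) can be routed via its orderkey.
    NUM_PARTITIONS = 4                      # assumed partition count

    def part_of(custkey):
        return hash(custkey) % NUM_PARTITIONS

    order_to_partition = {}                 # mapping: orderkey -> partition

    def partition_orders(orders):           # orders: (orderkey, custkey) pairs
        parts = [[] for _ in range(NUM_PARTITIONS)]
        for orderkey, custkey in orders:
            p = part_of(custkey)
            order_to_partition[orderkey] = p      # orderkey determines custkey,
            parts[p].append((orderkey, custkey))  # so p is well-defined
        return parts

    def route_lineitem(orderkey):
        # A LineItem tuple goes to the partition of its order.
        return order_to_partition[orderkey]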
The major issue with indirect partitioning is that
storing an exact representation of the mapping func-
tion is memory-intensive. In (Kemper et al., 1999),
bitmap approximations are used that consume less
space but introduce the possibility of mapping errors.
These errors do not affect algorithm correctness but
do affect performance. The bitmap approximation works by associating a bitmap of a chosen size with each partition. A key to be stored in or queried against the mapping function is hashed modulo the bitmap size to obtain an index I in the bitmap. The bit at index I is set to 1 in the bitmap of the partition where the tuple belongs. Due to collisions when hashing keys to the bitmap size, the bit at index I may be set in multiple partition bitmaps, which results in false drops. A false drop occurs when a tuple is placed into a partition where it does not belong.
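A minimal sketch of the scheme (sizes, counts, and names are illustrative): each partition keeps one bitmap, recording a key sets a single bit in the owning partition's bitmap, and a lookup returns every partition whose bit is set, so a collision yields more than one candidate partition, i.e., a false drop:

    # Sketch of the bitmap-approximated mapping: one fixed-size bitmap
    # per partition. Collisions on the bitmap index can make several
    # partitions claim a key, producing false drops (extra copies of a
    # tuple, never lost tuples).
    BITMAP_SIZE = 1024
    NUM_PARTITIONS = 4
    bitmaps = [[0] * BITMAP_SIZE for _ in range(NUM_PARTITIONS)]

    def bit_index(key):
        return hash(key) % BITMAP_SIZE           # index I in the bitmap

    def record(key, partition):
        bitmaps[partition][bit_index(key)] = 1   # key's tuple lives here

    def candidate_partitions(key):
        i = bit_index(key)
        # Every partition with bit I set is a candidate; more than one
        # match means the probe tuple is replicated (a false drop).
        return [p for p in range(NUM_PARTITIONS) if bitmaps[p][i]]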
The generalized hash team algorithm does not have a “hybrid step” where it uses additional memory to buffer tuples beyond what is required for partitioning. Further, the bitmaps must be relatively large (a multiple of the input relation size) to keep the number of false drops low. Consequently, even the bitmap approximation becomes memory-intensive as the number of partitions increases.
2.2.3 SHARP
SHARP (Bizarro and DeWitt, 2006) is a multi-way
join algorithm for star joins that performs multi-
dimensional partitioning. An example TPC-H star query joins Part, Orders, and LineItem. In multi-dimensional partitioning, Part and Orders are the build tables and are partitioned on partkey and orderkey, respectively. LineItem is the probe relation and is partitioned simultaneously on (partkey, orderkey), that is, in two dimensions. The number of partitions of the probe table is the product of the number of partitions of each build input. For example, if Part were partitioned into 3 partitions and Orders into 5 partitions, then LineItem would be partitioned into 5 × 3 = 15 partitions.
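Concretely, using the partition counts from the example (hash functions and names are assumed), the two one-dimensional partition numbers combine into a single probe-partition index:

    # Sketch of two-dimensional partitioning of the probe relation:
    # 3 Part partitions x 5 Orders partitions = 15 probe partitions.
    PART_PARTS, ORDER_PARTS = 3, 5

    def probe_partition(partkey, orderkey):
        i = hash(partkey) % PART_PARTS       # dimension 1: Part's partition
        j = hash(orderkey) % ORDER_PARTS     # dimension 2: Orders' partition
        return j * PART_PARTS + i            # one of 5 * 3 = 15 partitions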
For a result tuple to be generated during the memory phase, a LineItem tuple must have both its matching Part and Orders partitions in memory. Otherwise, the probe tuple is written to disk. The cleanup pass iterates through all partition combinations.
The algorithm loads each on-disk partition of the probe relation once and the on-disk partitions of build relation $i$ a number of times equal to $\prod_{j=1}^{i-1} X_j$, where $X_i$ is the number of partitions for build relation $i$. Reading build partitions multiple times may still be faster than materializing intermediate results, and the operator benefits from memory sharing during partitioning and the ability to adapt during its execution.
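To make the read counts concrete, the following sketch (loop order and names are assumptions) iterates over all partition combinations and reloads a build partition only when its index changes; each partition of build relation $i$ is then loaded $\prod_{j=1}^{i-1} X_j$ times:

    # Sketch of cleanup-pass read counting with X = [3, 5]: each of the
    # 3 Part partitions is loaded once, each of the 5 Orders partitions
    # 3 times, and each probe partition exactly once.
    from itertools import product

    X = [3, 5]                    # partitions per build relation
    loads = [0] * len(X)          # total loads per build relation
    current = [None] * len(X)     # partition currently held in memory

    for combo in product(*(range(x) for x in X)):
        for i, p in enumerate(combo):
            if current[i] != p:   # partition changed: (re)load it
                loads[i] += 1
                current[i] = p
        # ...load the single matching probe partition and join here...

    print(loads)                  # [3, 15]: 1 and 3 loads per partition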
2.3 Other Join Algorithms
There are two other related but distinct areas of re-
search on join algorithms. There is work on paral-
lelizing main-memory joins based on hashing (Blanas
et al., 2011) and sorting (Albutiu et al., 2012). The assumption in these algorithms is that there is sufficient memory for the join inputs, so the dominant cost factor is CPU time rather than I/O time. In this work, we examine joins where I/O is still a major factor. The main-memory and cache-aware optimization techniques employed in those works could also be applied to I/O-bound joins.