zontal (Sadalage and Fowler, 2012) partitioning. To
do so, the relation needs to be split, for instance by
range or hash partitioning. Hashing algorithms split
a given domain – usually identified by a primary key
in database notation – into a set of buckets whose
cardinality equals the number of computing nodes. A
hash algorithm is defined by a hash function (H) that
makes each computing node accountable for a set of
keys, making it possible to determine which node is
responsible for a given key.
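As a minimal sketch of this mapping (the helper name node_for_key and the use of MD5 are illustrative assumptions, not tied to any particular engine), the destination node for a key can be computed as follows:

    import hashlib

    def node_for_key(key, num_nodes):
        # Reduce a stable hash of the key modulo the number of computing
        # nodes, so each node becomes accountable for a fixed subset of keys.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return int(digest, 16) % num_nodes

    # Example: key 42 is always assigned to the same one of 3 nodes.
    print(node_for_key(42, 3))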
The distributed execution of queries leverages
data partitioning as a way to attain the gains associated
with parallel execution. Nevertheless, partitioning
strategies typically rely on a primary table key to
govern the partitioning, which only benefits queries
whose partitioning attribute matches that same key.
When a query has to partition data according to a
different attribute of the relation, it becomes likely
that the members of each partition will not all reside
in the same node.
[Figure 1: Data partitioning among 3 workers. Each of node #1, node #2 and node #3 holds a hash partition of a relation with attributes (PK, A, B); the example query is: select rank() OVER (partition by A) from table.]
Figure 1 presents the result of hash partitioning
a relation across 3 workers according to the primary
key (PK). The query shown contains a window oper-
ator that should produce a derived view induced by
partitioning attribute A.
Non-cumulative aggregations such as rank require
all members of a given partition to be collocated, in
order not to incur the cost of reordering and recom-
puting the aggregate after the reconciliation of results
among nodes. Different partitions, however, do not
share this requirement, so they can be processed in
different locations. To fulfill this data locality
requirement, rows need to be forwarded in order to
reunite partitions.
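As an illustrative sketch of this requirement (the row layout and the choice of an ordering attribute are assumptions made for the example), rank can only be computed locally once every row of a partition sits on the same node:

    from collections import defaultdict

    def rank_by_partition(rows, part_attr, order_attr):
        # Assumes all rows of each partition are collocated; on fragments,
        # ranks would have to be reordered and recomputed after nodes
        # reconcile their partial results.
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[part_attr]].append(row)
        ranked = []
        for members in partitions.values():
            members.sort(key=lambda r: r[order_attr])
            for position, row in enumerate(members, start=1):
                ranked.append({**row, "rank": position})
        return ranked

    rows = [{"PK": 1, "A": 2, "B": 7}, {"PK": 2, "A": 2, "B": 3}, {"PK": 3, "A": 1, "B": 5}]
    print(rank_by_partition(rows, "A", "B"))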
The shuffle operator arises as a way to reunite parti-
tions and to reconcile partial computations originated
by different computing nodes. The shuffler in each
computing node has to know the destination to which
each row should be sent, or whether it should be sent
at all. Typically, this is achieved by hashing the
partition key and taking the result modulo the number
of computing nodes. However, this strategy is
oblivious to the volume of data each node holds for
each partition. In the worst case, it might need to
relocate all partitions to different nodes, incurring
unnecessary use of bandwidth and processing power.
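A minimal sketch of this conventional rule (the function names are ours), which decides destinations purely from the hash of the partition key and therefore ignores where the bulk of each partition already lives:

    import zlib

    def shuffle_destination(row, part_attr, num_nodes):
        # Conventional rule: destination = hash(partition key) mod N.
        return zlib.crc32(str(row[part_attr]).encode()) % num_nodes

    def rows_to_forward(local_rows, part_attr, num_nodes, this_node):
        # Every locally held row whose destination differs from this node
        # has to travel over the network, even if this node already holds
        # most of that partition.
        return [r for r in local_rows
                if shuffle_destination(r, part_attr, num_nodes) != this_node]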
In this position paper we show that if the data distri-
bution of each column in a relation is approximately
known beforehand, the system is able to adapt, saving
network and processing resources by forwarding data
to the right nodes. The knowledge required is the
cardinality and size (in bytes) of each partition,
rather than the actual tuple values, as seen in the
common use of database indexes. This knowledge
would then be used by a Holistic shuffler that,
according to the partitioning considered by the
ongoing window function, would instruct workers to
handle specific partitions, minimizing data transfer
among workers.
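Purely as a sketch of the intended behaviour (the statistics layout and the greedy choice of owner are our assumptions, not a finalized design), a holistic shuffler could assign each partition of the window's partitioning attribute to the worker that already holds the largest volume of it, so that only the smaller remainders are forwarded:

    def assign_partition_owners(stats):
        # stats[partition][node] = (row_count, size_in_bytes) held locally.
        # Each partition goes to the node that already stores most of its
        # bytes, minimizing the data that has to be transferred.
        owners = {}
        for partition, per_node in stats.items():
            owners[partition] = max(per_node, key=lambda node: per_node[node][1])
        return owners

    stats = {
        "A=1": {0: (4, 400), 1: (1, 100), 2: (1, 80)},
        "A=2": {0: (1, 90), 1: (3, 600), 2: (2, 150)},
    }
    print(assign_partition_owners(stats))  # {'A=1': 0, 'A=2': 1}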
2 STATISTICS
Histograms are commonly used by query optimizers
as they provide a fairly accurate estimate of the data
distribution, which is crucial for the query planner.
A histogram is a structure that maps keys to their
observed frequencies. Database systems use these
structures to measure the cardinality of keys or
key ranges. Relying on statistics such as histograms
is especially relevant in workloads where data is
skewed (Poosala et al., 1996), a common character-
istic of non-synthetic data, as their absence would
lead the query optimizer to assume uniformity
across partitions.
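In its simplest form (the snippet below is only an illustration of the idea, not any engine's implementation), such a histogram maps each observed key to its frequency, making skew immediately visible:

    from collections import Counter

    def build_histogram(values):
        # A skewed column yields visibly uneven counts, exactly the signal
        # a planner loses when it assumes uniformity across partitions.
        return Counter(values)

    print(build_histogram([1, 1, 1, 1, 2, 3, 3]))  # Counter({1: 4, 3: 2, 2: 1})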
While most database engines use approaches derived
from the previous technique, these only provide
insight into the cardinality of given attributes in a
relation. When considering a query engine that has to
generate parallel query execution plans to be
dispatched to distinct workers, each one holding a
partition of the data, such histograms do not by
themselves offer a technique to improve how parallel
workers share preliminary and final results. This is
because they only capture the cardinality of each
partition key. In order to minimize bandwidth usage,
thus reducing the amount of traded information, the
histogram also needs to reflect the volume of data
existing in each node. To understand the relevance of
having an intuition regarding row size, consider 2
partitions with exactly the same cardinality of rows
that need to be shuffled among workers. From this
point of view the cost of shuffling each row is the
same. However, if the rows of the first partition have
an average size of 10 bytes, and those of the second
1000 bytes, then shuffling the second implies
transferring 100 times more data over the network.
This will be exacerbated as