zontal (Sadalage and Fowler, 2012) partitioning. To
do so, the relation needs to be split, for instance by
range or hash partitioning. Hashing algorithms split
a given domain – usually identified by a primary key
in database notation – into a set of buckets whose
cardinality equals the number of computing nodes. A
hash algorithm is defined by a hash function (H) that
makes each computing node accountable for a set of
keys, making it possible to determine which node is
responsible for a given key.
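As a minimal sketch of this mapping (the helper name node_for_key and the use of MD5 are illustrative assumptions, not tied to any particular engine), the destination node for a key can be computed as follows:

    import hashlib

    def node_for_key(key, num_nodes):
        # Reduce a stable hash of the key modulo the number of computing
        # nodes, so each node becomes accountable for a fixed subset of keys.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return int(digest, 16) % num_nodes

    # Example: key 42 is always assigned to the same one of 3 nodes.
    print(node_for_key(42, 3))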
The distributed execution of queries leverages
data partitioning as a way to attain the gains associated
with parallel execution. Nevertheless, partitioning
strategies typically rely on a primary table key to
govern the partitioning, which only benefits queries
whose partitioning attribute matches that same key.
When a query has to partition data according to a
different attribute of the relation, it becomes likely
that the members of each partition will not all reside
in the same node.
[Figure 1: Data partitioning among 3 workers. Each of node #1, node #2 and node #3 holds a hash partition of a relation with attributes (PK, A, B); the example query is: select rank() OVER (partition by A) from table.]
Figure 1 presents the result of hash partitioning
a relation across 3 workers according to the primary
key (PK). The query shown contains a window oper-
ator that should produce a derived view induced by
partitioning attribute A.
Non-cumulative aggregations such as rank require
all members of a given partition to be collocated, in
order not to incur the cost of reordering and recom-
puting the aggregate after the reconciliation of results
among nodes. Different partitions, however, do not
share this requirement, so they can be processed in
different locations. To fulfill this data locality
requirement, rows need to be forwarded in order to
reunite partitions.
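As an illustrative sketch of this requirement (the row layout and the choice of an ordering attribute are assumptions made for the example), rank can only be computed locally once every row of a partition sits on the same node:

    from collections import defaultdict

    def rank_by_partition(rows, part_attr, order_attr):
        # Assumes all rows of each partition are collocated; on fragments,
        # ranks would have to be reordered and recomputed after nodes
        # reconcile their partial results.
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[part_attr]].append(row)
        ranked = []
        for members in partitions.values():
            members.sort(key=lambda r: r[order_attr])
            for position, row in enumerate(members, start=1):
                ranked.append({**row, "rank": position})
        return ranked

    rows = [{"PK": 1, "A": 2, "B": 7}, {"PK": 2, "A": 2, "B": 3}, {"PK": 3, "A": 1, "B": 5}]
    print(rank_by_partition(rows, "A", "B"))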
The shuffle operator arises as a way to reunite parti-
tions and to reconcile partial computations originated
by different computing nodes. The shuffler in each
computing node has to know the destination to which
each row should be sent, or whether it should be sent
at all. Typically, this is achieved by hashing the
partition key and taking the result modulo the number
of computing nodes. However, this strategy is
oblivious to the volume of data each node holds for
each partition. In the worst case, it might need to
relocate all partitions to different nodes, incurring
unnecessary use of bandwidth and processing power.
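A minimal sketch of this conventional rule (the function names are ours), which decides destinations purely from the hash of the partition key and therefore ignores where the bulk of each partition already lives:

    import zlib

    def shuffle_destination(row, part_attr, num_nodes):
        # Conventional rule: destination = hash(partition key) mod N.
        return zlib.crc32(str(row[part_attr]).encode()) % num_nodes

    def rows_to_forward(local_rows, part_attr, num_nodes, this_node):
        # Every locally held row whose destination differs from this node
        # has to travel over the network, even if this node already holds
        # most of that partition.
        return [r for r in local_rows
                if shuffle_destination(r, part_attr, num_nodes) != this_node]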
In this position paper we show that if the data distri-
bution of each column in a relation is approximately
known beforehand, the system is able to adapt, saving
network and processing resources by forwarding data
to the right nodes. The knowledge required is the
cardinality and size (in bytes) of each partition,
rather than the actual tuple values, as seen in the
common use of database indexes. This knowledge
would then be used by a Holistic shuffler that,
according to the partitioning considered by the
ongoing window function, would instruct workers to
handle specific partitions, minimizing data transfer
among workers.
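Purely as a sketch of the intended behaviour (the statistics layout and the greedy choice of owner are our assumptions, not a finalized design), a holistic shuffler could assign each partition of the window's partitioning attribute to the worker that already holds the largest volume of it, so that only the smaller remainders are forwarded:

    def assign_partition_owners(stats):
        # stats[partition][node] = (row_count, size_in_bytes) held locally.
        # Each partition goes to the node that already stores most of its
        # bytes, minimizing the data that has to be transferred.
        owners = {}
        for partition, per_node in stats.items():
            owners[partition] = max(per_node, key=lambda node: per_node[node][1])
        return owners

    stats = {
        "A=1": {0: (4, 400), 1: (1, 100), 2: (1, 80)},
        "A=2": {0: (1, 90), 1: (3, 600), 2: (2, 150)},
    }
    print(assign_partition_owners(stats))  # {'A=1': 0, 'A=2': 1}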
2 STATISTICS
Histograms are commonly used by query optimizers
as they provide a fairly accurate estimate of the data
distribution, which is crucial for the query planner.
A histogram is a structure that maps keys to their
observed frequencies. Database systems use these
structures to measure the cardinality of keys or
key ranges. Relying on statistics such as histograms
is especially relevant in workloads where data is
skewed (Poosala et al., 1996), a common character-
istic of non-synthetic data, as their absence would
lead the query optimizer to assume uniformity
across partitions.
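In its simplest form (the snippet below is only an illustration of the idea, not any engine's implementation), such a histogram maps each observed key to its frequency, making skew immediately visible:

    from collections import Counter

    def build_histogram(values):
        # A skewed column yields visibly uneven counts, exactly the signal
        # a planner loses when it assumes uniformity across partitions.
        return Counter(values)

    print(build_histogram([1, 1, 1, 1, 2, 3, 3]))  # Counter({1: 4, 3: 2, 2: 1})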
While most database engines use approaches derived
from the previous technique, these only provide
insight into the cardinality of given attributes in a
relation. When considering a query engine that has to
generate parallel query execution plans to be
dispatched to distinct workers, each one holding a
partition of the data, such histograms do not by
themselves offer a technique to improve how parallel
workers share preliminary and final results. This is
because they only capture the cardinality of each
partition key. In order to minimize bandwidth usage,
thus reducing the amount of traded information, the
histogram also needs to reflect the volume of data
existing in each node. To understand the relevance of
having an intuition regarding row size, consider 2
partitions with exactly the same cardinality of rows
that need to be shuffled among workers. From this
point of view the cost of shuffling each row is the
same. However, if the rows of the first partition have
an average size of 10 bytes, and those of the second
1000 bytes, then shuffling the second implies
transferring 100 times more data over the network.
This will be exacerbated as