PIPELINED PARALLELISM IN MULTI-JOIN QUERIES ON
HETEROGENEOUS SHARED NOTHING ARCHITECTURES
Mohamad Al Hajj Hassan and Mostafa Bamha
LIFO, University of Orléans, BP 6759, 45067 Orléans cedex 2, France
Keywords:
PDBMS (Parallel Database Management Systems), Intra-transaction parallelism, Parallel joins, Multi-joins,
Data skew, Dynamic load balancing.
Abstract:
Pipelined parallelism has been widely studied and successfully implemented, on shared nothing machines, in several join algorithms under ideal load balancing conditions and in the absence of data skew. The aim of pipelining is to allow flexible resource allocation while avoiding unnecessary disk input/output for intermediate join results in the treatment of multi-join queries.
The main drawback of pipelining in existing algorithms is that communication and load balancing remain limited to static approaches (generated during the query optimization phase) based on hashing to redistribute data over the network; they therefore cannot solve the data skew problem or the load imbalance between processors on heterogeneous multi-processor architectures, where the load of each processor may vary in a dynamic and unpredictable way.
In this paper, we present a new parallel join algorithm that solves the problem of data skew while guaranteeing perfect balancing properties on heterogeneous multi-processor Shared Nothing architectures. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model.
1 INTRODUCTION
The appeal of parallel processing is strongest in applications that require ever higher performance, particularly data warehousing, decision support, On-Line Analytical Processing (OLAP) and, more generally, DBMS (Liu and Rundensteiner, 2005; Datta et al., 1998).
However, parallelism can only maintain acceptable performance through efficient algorithms realizing complex queries on dynamic, irregular and distributed data. Such algorithms must be designed to fully exploit the processing power of multi-processor machines and to divide the load evenly among processors while minimizing local computation and communication costs.
Research has shown that the join operation is par-
allelizable with near-linear speed-up on Shared Noth-
ing machines only under ideal balancing conditions.
Data skew can have a disastrous effect on perfor-
mance (Mourad et al., 1994; Hua and Lee, 1991; De-
Witt et al., 1992; Bamha and Hains, 2000; Bamha and
Hains, 1999) due to the high costs of communications
and synchronizations in this architecture.
Many algorithms have been proposed to handle
data skew for a simple join operation, but little is
known for the case of complex queries leading to
multi-joins (Lu et al., 1994; DeWitt et al., 1992; Hua
and Lee, 1991).
However, these algorithms cannot solve the load imbalance problem as they base their routing decisions on incomplete or statistical information.
By contrast, the algorithms we presented in (Bamha and Hains, 1999; Bamha and Hains, 2000; Bamha, 2005) for queries involving a single join operation use complete data-distribution information in the form of histograms. The parallel cost model we apply allows us to guarantee that histogram management has a negligible cost compared to the efficiency it gains in reducing the communication cost and avoiding load imbalance between processors.
However the problem of data skew is more acute with
multi-joins because the imbalance of intermediate re-
sults is unknown during static query optimization (Lu
et al., 1994).
To this end, we introduced in (Bamha and
Exbrayat, 2003) a pipelined version of OSFA-Join
based on a dynamic data redistribution approach that solves the problems of Attribute Value Skew (AVS) and Join Product Skew (JPS) while guaranteeing perfect load balancing on homogeneous distributed Shared Nothing architectures.
In homogeneous multi-processor architectures,
these algorithms are insensitive to data skew and guar-
antee perfect load balancing between processors dur-
ing all the stages of join computation because data
redistribution is carried out jointly by all processors
(and not by a coordinator processor). Each processor
deals with the redistribution of the data associated to
a subset of the join attribute values, not necessarily its
“own” values. However the performance of these al-
gorithms degrades on heterogeneous multi-processor
architectures where the load of each processor may
vary in a dynamic and unpredictable way.
To this end, we introduced in (Hassan and Bamha, 2008) a parallel join algorithm called DFA-Join (Dynamic Frequency Adaptive parallel join algorithm) to handle join queries on heterogeneous Shared Nothing architectures, solving the problem of data skew while guaranteeing perfect balancing properties.
In this paper, we present a pipelined version of the DFA-Join algorithm called PDFA-Join (Pipelined Dynamic Frequency Adaptive join algorithm). The aim of pipelining in PDFA-Join is to offer flexible resource allocation and to avoid unnecessary disk input/output for intermediate join results in multi-join queries. We show that the PDFA-Join algorithm can be applied efficiently in various parallel execution strategies, making it possible to exploit not only intra-operator parallelism but also inter-operator parallelism. These algorithms aim to avoid the effect of load imbalance due to data skew and to reduce the communication costs due to the redistribution of intermediate results, which can otherwise lead to a significant degradation of performance.
2 LIMITATIONS OF PARALLEL
EXECUTION STRATEGIES
FOR MULTI-JOIN QUERIES
Parallel execution of a multi-join query depends on the execution plan of the simple joins that compose it. The main difference between these strategies lies in the manner of allocating the simple joins to different processors and in the choice of an appropriate degree of parallelism (i.e. the number of processors) used to compute each simple join.
Several strategies were proposed to evaluate
multi-join queries (Liu and Rundensteiner, 2005;
Wilschut et al., 1995). In these strategies intra-
operator, inter-operator and pipelined parallelisms
can be used. These strategies are divided into four
principal categories presented thereafter.
Sequential Parallel Execution is the simplest strat-
egy to evaluate, in parallel, a multi-join query. It does
not induce inter-operator parallelism. Simple joins
are evaluated one after the other in a parallel way.
Thus, at a given moment, one and only one simple
join is computed in parallel by all the available pro-
cessors.
This strategy is very restrictive and does not pro-
vide efficient resource allocation owing to the fact that
a simple join cannot be started until all its operands
are entirely available, and whenever a join operation
is executed on a subset of processors, all the other
processors remain idle until the next join operation.
Moreover this strategy induces unnecessary disk In-
put/Output because intermediate results are written to
disk and not immediately used for the next operations.
To reach acceptable performance, join algorithms
used in this strategy should reduce the load imbalance
between all the processors and the number of idle pro-
cessors must be as small as possible.
Parallel Synchronous Execution uses in addition to
intra-operator parallelism, inter-operator parallelism
(Chen et al., 1992b). In this strategy several simple
join operations can be computed simultaneously on
disjoint sets of processors.
The parallel execution time of an operator depends on its degree of parallelism. The execution time decreases as the number of processors increases, up to a saturation point (called the optimal degree of parallelism) beyond which adding processors increases the parallel execution time (Rahm, 1996; Chen et al., 1992b). The main difficulty in this strategy lies in the manner of allocating the simple joins to the available processors and in the choice of an appropriate degree of parallelism for each join.
In this strategy, the objective of the allocation is to reduce latency: the global execution times of all operators should be of the same order, and within each group of processors the local computation of each operator must be balanced.
This strategy combines only intra- and inter-operator parallelism in the execution of multi-join queries and does not introduce pipelined parallelism; a large number of processors may remain idle if they are not used in inter-operator parallelism. This constitutes the main limitation of this strategy for flexible resource allocation, in addition to unnecessary disk input/output for intermediate join results.
Segmented Right-Deep Execution. Contrary to a
parallel synchronous strategy, a Segmented Right-
Deep execution (Chen et al., 1992a; Liu and Run-
densteiner, 2005) employs, in addition to intra-
operator parallelism, pipelined inter-operator paral-
lelism which is used in the evaluation of the right-
branches of the query tree.
This strategy offers more flexible resource allocation
than parallel synchronous execution strategy: many
joins can be computed on disjoint sets of processors to
prepare hash tables for pipelined joins. Its main lim-
itation remains the fact that pipelined parallelism cannot be started until all the hash tables are computed. Moreover, no load balancing between processors can be performed once pipelined parallelism begins.
Full Parallel Execution. (Wilschut et al., 1995; Liu
and Rundensteiner, 2005) uses inter-operator paral-
lelism and pipelined inter-operator parallelism in ad-
dition to intra-operator parallelism. In this strategy,
all the simple joins, associated to the multi-join query,
are computed simultaneously in parallel using dis-
joint sets of processors. Inter-operator parallelism and
pipelined inter-operator parallelism are exploited ac-
cording to the type of the query tree.
The effectiveness of such strategy depends on the
quality of the execution plans generated during the
query optimization phase and on the ability to evenly
divide load between processors in the presence of
skewed data.
All existing algorithms using this strategy are based on static hashing to redistribute data over the network, which makes them very sensitive to data skew. Moreover, pipelined parallelism cannot start until the hash tables of the build relations have been created. We recall that all join algorithms used in these strategies require redistribution of all intermediate join results (and not only of the tuples participating in the join result), which may induce a high communication cost. In addition, no load balancing between processors can be performed once pipelined parallelism begins, which can lead to a significant degradation of performance.
In the following section we present PDFA-Join (Pipelined Dynamic Frequency Adaptive Join), a new join algorithm which can be used in different execution strategies to exploit not only intra-operator but also inter-operator and pipelined parallelism. This algorithm is proved to induce a minimal communication cost (only relevant tuples are redistributed over the network), while guaranteeing perfect load balancing properties on a heterogeneous multi-processor machine even for highly skewed data.
3 PARALLELISM IN MULTI-JOIN
QUERIES USING PDFA-JOIN
ALGORITHM
Pipelining has been widely studied and successfully implemented in many classical join algorithms on Shared Nothing (SN) multi-processor machines under ideal load balancing conditions and in the absence of data skew (Liu and Rundensteiner, 2005). Nevertheless, these algorithms are generally based on static hash join techniques and are thus very sensitive to AVS and JPS.
The pipelined algorithm we introduced in (Bamha and Exbrayat, 2003) solves this problem and guarantees perfect load balancing on homogeneous SN machines. However, its performance degrades on heterogeneous multi-processor architectures.
In this paper, we propose to adapt DFA-Join to pipelined multi-join queries in order to solve the problem of data skew and load imbalance between processors on heterogeneous multi-processor architectures.
DFA-Join is based on a combination of two steps: a static step where data buckets are assigned to each processor according to its actual capacity, then a dynamic step executed throughout the join computation phase to balance the load between processors. When a processor finishes the join processing of its assigned buckets, it asks a local head node for untreated buckets of another processor. This combination of static and dynamic steps allows us to reduce the join processing time because, in parallel systems, the total execution time is the time taken by the slowest processor to finish its tasks.
To ensure the extensibility of the algorithm, processors are partitioned into disjoint sets. Each set has a designated local head node. Load is first balanced inside each set of processors, and whenever a set of processors finishes its assigned tasks, it asks the head node of another set of processors for additional tasks.
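To make this combination of static and dynamic steps concrete, here is a minimal sketch of a worker processor's control loop; the helpers process_bucket and request_work are hypothetical names introduced for illustration and are not part of the published DFA-Join code.

def worker_loop(my_buckets, request_work, process_bucket):
    """Sketch of a processor's side of the static + dynamic steps.

    my_buckets:     buckets statically assigned to this processor
                    according to its actual capacity (static step).
    request_work:   callable asking the local head node for untreated
                    buckets of another processor; returns [] when the
                    whole set (and every other set) has finished.
    process_bucket: callable that joins one bucket and stores the result.
    """
    # static step: process the buckets assigned to this processor
    for bucket in my_buckets:
        process_bucket(bucket)
    # dynamic step: keep asking the local head node for leftover work
    while True:
        stolen = request_work()
        if not stolen:
            break                     # no remaining tasks anywhere
        for bucket in stolen:
            process_bucket(bucket)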
3.1 Detailed Algorithm
In this section, we present a parallel pipelined execution strategy for the multi-join query Q = (R ⋈_a1 S) ⋈_b1 (U ⋈_a2 V), shown in figure 1 (this strategy can be easily generalized to any bushy multi-join query), where R, S, U and V are source relations and a1, a2 and b1 are join attributes.
We give in detail the execution steps to evaluate the join query Q1 = R ⋈_a1 S (the same technique is used to evaluate Q2 = U ⋈_a2 V).
[Figure 1: Parallel execution of a multi-join query using the PDFA-Join algorithm (a bushy query tree over source relations R, S, U, V, ..., with join operators numbered 1-4 on join attributes a1, a2, b1, c1, ...).]
We first assume that each relation T ∈ {R, S, U, V} is horizontally fragmented among the p processors and that:
- T_i is the fragment of T placed on processor i,
- Hist^x(T) denotes the histogram¹ of T with respect to the join attribute x, i.e. a list of pairs (v, n_v) where n_v ≠ 0 is the number of tuples of T having the value v for x. The histogram is often much smaller and never larger than the relation it describes,
- Hist^x(T_i) denotes the histogram of fragment T_i placed on processor i,
- Hist^x_i(T) is processor i's fragment of the global histogram of relation T,
- Hist^x(T)(v) is the frequency (n_v) of value v in T,
- Hist^x(T_i)(v) is the frequency of value v in T_i,
- ||T|| denotes the number of tuples of T, and
- |T| is the size (in bytes or number of pages) of T.

¹ Histograms are implemented as balanced trees (B-trees): a data structure that maintains an ordered set of data, allowing efficient search and insert operations.
Our algorithm (Algorithm 1) can be divided into
the following five phases. We give for each phase an
upper bound of the execution time using BSP (Bulk
Synchronous Parallel) cost model (Skillicorn et al.,
1997; Valiant, 1990). Notation O(...) only hides small constant factors: they depend only on the implementation, but neither on the data nor on the BSP machine parameters.
Algorithm 1. Parallel PDFA-Join computation steps to evaluate the join of relations R and S on attribute a1 while preparing the next join on attribute b1.

In Parallel (on each processor) i ∈ [1, p] do
1. Create the local histogram Hist^a1(R_i) of R_i and, on the fly, hash the tuples of relation R_i into different buckets according to the values of join attribute a1,
   Create the local histogram Hist^a1(S_i) of S_i and, on the fly, hash the tuples of relation S_i into different buckets,
2. Create the global histogram fragment Hist^a1_i(R) of R,
   Create the global histogram fragment Hist^a1_i(S) of S,
   Merge Hist^a1_i(R) and Hist^a1_i(S) to create Hist^a1_i(R ⋈ S),
3. Create communication templates for only the relevant tuples,
   Filter the generated buckets to create tasks to execute on each processor according to its capacity,
   Create the local histogram Hist^b1(R_i ⋈ S_i) of the join result on attribute b1 of the next join, using the histograms and the communication templates (see Algorithm 2),
4. Exchange data according to the communication templates,
5. Execute join tasks and store the join result on local disk.
   Loop until no task to execute
     Ask a local head node for jobs from overloaded processors,
     Steal a job from a designated processor and execute it,
     Store the join result on local disk.
   End Loop
End Par

Phase 1. Creating Local Histograms
In this phase, we create in parallel, on each processor i (i = 1, ..., p), the local histogram Hist^a1(R_i) (resp. Hist^a1(S_i)) of block R_i (resp. S_i) by a linear traversal of R_i (resp. S_i) in time max_{i=1,...,p}(c^i_r/w * |R_i|) (resp. max_{i=1,...,p}(c^i_r/w * |S_i|)), where c^i_r/w is the cost of reading/writing a page of data from disk on processor i. While creating the histograms, tuples of R_i (resp. S_i) are partitioned on the fly into N buckets using a hash function in order to facilitate the redistribution phase.
The cost of this phase is:

Time_phase1 = O( max_{i=1,...,p} c^i_r/w * (|R_i| + |S_i|) ).
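As an illustration of phase 1, the following sketch builds the local histogram of one fragment while hashing its tuples into N buckets on the fly; a plain Python dictionary stands in for the B-tree histogram, and the tuple layout (dicts keyed by attribute name) is an assumption made for the example.

def local_histogram_and_buckets(fragment, attr, n_buckets):
    """Phase 1 sketch: one pass over a fragment R_i (or S_i).

    fragment:  iterable of tuples, each a dict containing the join attribute.
    attr:      name of the join attribute (e.g. 'a1').
    n_buckets: number N of buckets used to ease later redistribution.
    Returns (histogram, buckets): value -> frequency, and the N buckets.
    """
    histogram = {}                       # stands in for the B-tree histogram
    buckets = [[] for _ in range(n_buckets)]
    for t in fragment:
        v = t[attr]
        histogram[v] = histogram.get(v, 0) + 1   # update Hist_a1(R_i)(v)
        buckets[hash(v) % n_buckets].append(t)   # hash the tuple on the fly
    return histogram, buckets

# Example: a tiny fragment of R_i hashed into 4 buckets.
hist, bks = local_histogram_and_buckets(
    [{'a1': 1, 'b1': 'x'}, {'a1': 1, 'b1': 'y'}, {'a1': 3, 'b1': 'z'}], 'a1', 4)
print(hist)   # {1: 2, 3: 1}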
Phase 2. Computing the Histogram of R ⋈ S
In this phase, we compute Hist^a1_i(R ⋈ S) on each processor i. This helps in specifying the values of the join attribute that participate in the join result, so that only the tuples of R and S related to these values are redistributed in a later phase, which allows us to minimize the communication cost. The histogram of R ⋈ S is simply the intersection of Hist^a1(R) and Hist^a1(S), so we must first compute the global histograms Hist^a1_i(R) and Hist^a1_i(S) by redistributing the tuples of the local histograms using a hash function that distributes the values of the join attribute in a manner that respects the processing capacity of each processor.
The cost of this step is:

Time_phase2.a = O( min( max_{i=1,...,p} ω_i * p * (g*|Hist^a1(R)| + γ_i*||Hist^a1(R)||), max_{i=1,...,p} ω_i * (g*|R| + γ_i*||R||) )
              + min( max_{i=1,...,p} ω_i * p * (g*|Hist^a1(S)| + γ_i*||Hist^a1(S)||), max_{i=1,...,p} ω_i * (g*|S| + γ_i*||S||) ) + l ),

where ω_i is the fraction of the total volume of data assigned to processor i such that ω_i = (1/γ_i) / (Σ_{j=1}^{p} 1/γ_j), γ_i is the execution time of one instruction on processor i, g is the BSP communication parameter and l is the cost of synchronization (Skillicorn et al., 1997; Valiant, 1990) (see proposition 1 of Appendix A in (Hassan and Bamha, 2008) for the proof of this cost).
Now we can easily create Hist^a1_i(R ⋈ S) by computing in parallel, on each processor i, the intersection of Hist^a1_i(R) and Hist^a1_i(S) in time of order:

Time_phase2.b = O( max_{i=1,...,p} γ_i * min(||Hist^a1_i(R)||, ||Hist^a1_i(S)||) ).
While creating Hist^a1_i(R ⋈ S), we also store for each value v ∈ Hist^a1_i(R ⋈ S) an extra piece of information index(v) ∈ {1, 2} such that:
- index(v) = 1 if Hist^a1(R)(v) ≥ f_o or Hist^a1(S)(v) ≥ f_o (i.e. values associated to high frequencies),
- index(v) = 2 otherwise (i.e. values of low frequencies).
The threshold frequency used is f_o = p * log(p). This information will be useful in the phase of creating the communication templates.
The total cost of this phase is the sum of Time_phase2.a and Time_phase2.b.
We recall that the size of a histogram is, in general, very small compared to the size of the base relations.
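For illustration, the sketch below intersects two histogram fragments and tags each matching value with index(v); plain dictionaries replace the B-tree histograms and the function name is an assumption, not the paper's implementation.

import math

def intersect_histograms(hist_r, hist_s, p):
    """Phase 2 sketch: compute Hist_i(R |><| S) from Hist_i(R) and Hist_i(S).

    hist_r, hist_s: dicts mapping join-attribute values to frequencies.
    p:              number of processors; the threshold is f_o = p * log(p).
    Returns a dict value -> (join frequency, index(v)).
    """
    f_o = p * math.log(p)
    # traverse the smaller histogram, matching the cost Time_phase2.b
    small, large = (hist_r, hist_s) if len(hist_r) <= len(hist_s) else (hist_s, hist_r)
    result = {}
    for v in small:
        if v in large:
            freq_r, freq_s = hist_r[v], hist_s[v]
            join_freq = freq_r * freq_s          # tuples of R |><| S with value v
            index = 1 if freq_r >= f_o or freq_s >= f_o else 2
            result[v] = (join_freq, index)
    return result

# Example with p = 4: value 7 is frequent in R, so index(7) == 1.
print(intersect_histograms({7: 40, 3: 2}, {7: 5, 9: 1}, p=4))   # {7: (200, 1)}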
Phase 3. Creating the Communication Template
In homogeneous systems, workload imbalance may be due to an uneven distribution of the data to be joined among the processors, while in heterogeneous systems it may be the result of allocating to processors an amount of tuples that is not proportional to the actual capabilities of the machines used (Gounaris, 2005).
So, in order to achieve acceptable performance, the actual capacity of each machine must be taken into account while assigning data or tasks to each processor. Another difficulty in such systems lies in the fact that the available capacities of machines in multi-user systems may change rapidly after load assignment: an overloaded processor may quickly become underloaded while computing the join operation, and vice-versa. Thus, to benefit from the processing power of such systems, we must not have idle processors while others are overloaded at any point of the join computation phase.
To this end, we use, as in the DFA-Join algorithm, a two-step (static then dynamic) load assignment approach which allows us to reduce the join processing time.
3.a. Static Load Assignment Step. In this step, we compute, in parallel on each processor i, the size of the join which may result from joining all tuples related to values v ∈ Hist^a1_i(R ⋈ S). This is simply the sum of the frequencies Hist^a1_i(R ⋈ S)(v) for all values v of the join attribute in Hist^a1_i(R ⋈ S). This value is computed by a linear traversal of Hist^a1_i(R ⋈ S) in time: O( max_{i=1,...,p} γ_i * ||Hist^a1_i(R ⋈ S)|| ).
After that, all processors send the computed value to a designated head node which, in its turn, calculates the total number of tuples in R ⋈ S (i.e. ||R ⋈ S||) by computing the sum of all received values in time of order: O(p*g + l).
Now the head node uses the value of ||R ⋈ S|| and the information received earlier to assign to each processor i a join volume (vol_i * ||R ⋈ S||) proportional to its resources, where the value of vol_i is determined by the head node depending on the actual capacity of each processor i such that Σ_{i=1}^{p} vol_i = 1.
Finally, the head node sends to each processor i the value of vol_i * ||R ⋈ S|| in time of order: O(g*p + l).
The cost of this step is:

Time_phase3.a = O( max_{i=1,...,p} γ_i * ||Hist^a1_i(R ⋈ S)|| + p*g + l ).
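The head node's side of step 3.a can be sketched as follows; the capacity vector and function names are illustrative assumptions, not part of the paper.

def assign_join_volumes(local_join_sizes, capacities):
    """Step 3.a on the head node (illustrative sketch).

    local_join_sizes: list where entry i is the join size reported by
                      processor i, i.e. the sum of Hist_i(R |><| S)(v) over v.
    capacities: list of positive weights describing each processor's
                actual capacity; they are normalized into vol_i.
    Returns the target join volume vol_i * ||R |><| S|| for each processor.
    """
    total_join_size = sum(local_join_sizes)          # ||R |><| S||
    total_capacity = sum(capacities)
    vol = [c / total_capacity for c in capacities]   # sum(vol) == 1
    return [v * total_join_size for v in vol]

# Example: 3 processors report partial join sizes; the fastest machine gets most work.
targets = assign_join_volumes([10_000, 8_000, 12_000], capacities=[4, 2, 1])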
3.b. Communication Templates Creation Step. Communication templates are lists of messages that constitute the relation redistribution. Owing to the fact that the values which could lead to AVS (those having high frequencies) are also those which may cause join product skew in "standard" hash join algorithms, we create communication templates only for values v having high frequencies (i.e. index(v) = 1). Tuples associated to low frequencies (i.e. index(v) = 2) have no effect on either AVS or JPS, so these tuples are simply hashed into buckets on their source processors using a hash function and their treatment is postponed to the dynamic phase.
First of all, Hist^a1_i(R ⋈ S) is partitioned on each processor i into two sub-histograms: Hist^(1)_i(R ⋈ S) and Hist^(2)_i(R ⋈ S) such that v ∈ Hist^(1)_i(R ⋈ S) if index(v) = 1 and v ∈ Hist^(2)_i(R ⋈ S) if index(v) = 2. This partitioning is performed while computing Σ_v Hist^a1_i(R ⋈ S)(v) in step 3.a, in order to avoid reading the histogram twice.
We can now create the communication templates, which are computed, at first, on each processor i only for values v in Hist^a1_i(R ⋈ S) such that the total join size related to these values is less than or equal to vol_i * ||R ⋈ S||, starting from the value that generates the largest join result and so on.
It is important to mention here that only the tuples that effectively participate in the join result will be redistributed. These are the tuples of the semi-joins R_i ⋉ S and S_i ⋉ R. These semi-joins are implicitly evaluated owing to the fact that the communication templates are only created for values v that appear in the join result (i.e. v ∈ Hist^a1(R ⋈ S)).
In addition, the number of received tuples of R (resp. S) must not exceed vol_i * ||R|| (resp. vol_i * ||S||). For each value v in Hist^a1_i(R ⋈ S), processor i creates messages order_to_send(j, i, v) asking each processor j holding tuples of R or S with value v for the join attribute to send them to it.
If the processing capacity of a processor i does not allow it to compute the join result associated to all the values v of the join attribute in Hist^a1_i(R ⋈ S)², then it will not ask the source processors j holding the remaining values v to redistribute their associated tuples, but rather to partition them into buckets using a hash function and save them locally for a later join processing step in the dynamic phase. Hence it sends an order_to_save(j, v) message to each processor j holding tuples having such values v of the join attribute.
The maximal complexity of creating the communication templates is: O( max_i ω_i * p * γ_i * ||Hist^(1)(R ⋈ S)|| ), because each processor i is responsible for creating the communication templates for approximately ω_i * ||Hist^(1)(R ⋈ S)|| values of the join attribute and, for a given value v, at most (p − 1) processors can send data.
After creating the communication templates, on each processor i, the order_to_send(j, i, .) and order_to_save(j, .) messages are sent to their destination processors j whenever j ≠ i, in time: O( max_i g * ω_i * p * |Hist^(1)(R ⋈ S)| + l ).
The total cost of this step is the sum of the above two costs.

² This is the case if vol_i * ||R ⋈ S|| < Σ_v Hist^a1_i(R ⋈ S)(v) on processor i.
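A possible rendering of step 3.b is sketched below; the message tuples, helper names and data layout are illustrative assumptions, and the per-value join sizes are those recorded in the high-frequency fragment Hist^(1)_i(R ⋈ S).

def build_communication_templates(hist1_i, owners, my_id, join_budget):
    """Create order_to_send / order_to_save messages for processor my_id (sketch).

    hist1_i: dict value -> join size (tuples of R |><| S produced for v),
             restricted to high-frequency values (index(v) == 1) owned by my_id.
    owners:  dict value -> list of processor ids holding R or S tuples with that value.
    join_budget: vol_i * ||R |><| S||, the join volume assigned to my_id.
    Returns two lists of messages to be sent to the source processors.
    """
    order_to_send, order_to_save = [], []
    assigned = 0
    # handle the largest join contributions first, as in the paper
    for v, size in sorted(hist1_i.items(), key=lambda kv: kv[1], reverse=True):
        if assigned + size <= join_budget:
            assigned += size
            for j in owners[v]:
                order_to_send.append(("order_to_send", j, my_id, v))
        else:
            # capacity exceeded: keep these buckets on their source processors
            for j in owners[v]:
                order_to_save.append(("order_to_save", j, v))
    return order_to_send, order_to_save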
3.c. Task Generation Step. After the communication templates creation, each processor i obeys the order_to_send(.,.,.) messages that it has just received to generate the tasks to be executed on each processor. It therefore partitions the tuples that must be sent to each processor into a number of buckets greater than p using a hash function. This partitioning facilitates task reallocation in the join phase (phase 5) from overloaded to idle processors if a processor cannot finish its assigned load before the others. After this partitioning step, each bucket is sent to its destination processor. The cost of this step is: O( max_{i=1,...,p} γ_i * (||R_i|| + ||S_i||) ).
In addition, each processor i partitions the tuples whose join attribute value is indicated in order_to_save() messages into buckets using the same hash function on all the processors. However, these buckets are kept for the moment on their source processors, where their redistribution and join processing will be postponed until the dynamic phase.
During this step, the local histogram of the join result on attribute b1 is created directly from Hist^a1(R_i) and Hist^a1(S_i) using Algorithm 2. Owing to the fact that access to the histogram (equivalent to a search in a B-tree) is performed in constant time, the cost of the creation of the histogram of the join result is: O( max_{i=1,...,p} γ_i * ||R_i|| ).
The global cost of this step is:

Time_phase3.c = O( max_{i=1,...,p} γ_i * (||R_i|| + ||S_i||) ).
Algorithm 2. Join result histogram's creation algorithm on attribute b1.

Par (on each node) i ∈ [1, p] do
  Hist^b1(R_i ⋈ S_i) = NULL (Create an empty B-tree)
  For each tuple t of each bucket of relation R_i do
    freq1 = Hist^a1(S)(t.a1)   (i.e. frequency of value t.a1 of tuple t)
    If (freq1 > 0) (i.e. tuple t will be present in R ⋈ S) Then
      freq2 = Hist^b1(R_i ⋈ S_i)(t.b1)
      If (freq2 > 0) (i.e. value t.b1 ∈ Hist^b1(R_i ⋈ S_i)) Then
        Update Hist^b1(R_i ⋈ S_i)(t.b1) = freq1 + freq2
      Else
        Insert a new couple (t.b1, freq1) into Hist^b1(R_i ⋈ S_i)
      End If
    End If
  End For
End Par
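For concreteness, the following sketch transcribes Algorithm 2 into Python, using dictionaries in place of B-trees; the tuple layout (dicts with fields a1 and b1) is an assumption made for the example.

def join_result_histogram_b1(buckets_r_i, hist_a1_S):
    """Algorithm 2 sketch: build Hist_b1(R_i |><| S_i) without computing the join.

    buckets_r_i: iterable of buckets, each an iterable of tuples of R_i,
                 where a tuple is a dict with keys 'a1' and 'b1'.
    hist_a1_S:   dict mapping a1-values to their frequency in S.
    Returns a dict mapping b1-values to the number of join-result tuples
    that will carry that value.
    """
    hist_b1 = {}                               # plays the role of the B-tree
    for bucket in buckets_r_i:
        for t in bucket:
            freq1 = hist_a1_S.get(t['a1'], 0)  # frequency of t.a1 in S
            if freq1 > 0:                      # t participates in R |><| S
                hist_b1[t['b1']] = hist_b1.get(t['b1'], 0) + freq1
    return hist_b1

# Example: two buckets of R_i and a global histogram of S on a1.
buckets = [[{'a1': 5, 'b1': 'x'}, {'a1': 7, 'b1': 'y'}],
           [{'a1': 5, 'b1': 'x'}]]
print(join_result_histogram_b1(buckets, {5: 3, 9: 1}))   # {'x': 6}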
Phase 4. Data Redistribution
According to the communication templates, buckets are sent to their destination processors. It is important to mention here that only tuples of R and S that effectively participate in the join result are redistributed. So each processor i receives a partition of R (resp. S) whose maximal size is vol_i * ||R|| (resp. vol_i * ||S||). Therefore, in this algorithm, communication cost is reduced to a minimum and the global cost of this phase is:

Time_phase4 = O( g * max_i vol_i * (|R| + |S|) + l ).
Phase 5. Join Computation
Buckets received by each processor are arranged in a queue. Each processor executes successively the join operation of its waiting buckets. The cost of this step of local join computation is:

Time_local_join = O( max_i c^i_r/w * vol_i * (|R| + |S| + |R ⋈ S|) ).

If a processor finishes computing the join related to its local data and the overall join operation is not finished, it sends the head node a message asking for more work. The head node then assigns to this idle processor some of the buckets related to join attribute values that were not redistributed earlier in the static phase. However, if all these buckets are already treated, the head node checks the number of untreated buckets in the queues of the other processors and asks the processor that has the maximal number of untreated buckets to forward a part of them to the idle one. The number of buckets sent must respect the capacity of the idle processor.
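The head node's reassignment decision in phase 5 might be sketched as follows; the queue representation, the "steal half" share and the function name are assumptions for illustration, since the paper only requires that the number of forwarded buckets respect the idle processor's capacity.

def reassign_work(saved_buckets, pending_queues, idle_id, idle_capacity):
    """Phase 5 sketch: what the head node returns to an idle processor.

    saved_buckets:  list of buckets whose redistribution was postponed in
                    the static phase (order_to_save values), if any remain.
    pending_queues: dict processor_id -> list of buckets still waiting
                    in that processor's local queue.
    idle_id:        id of the processor asking for work.
    idle_capacity:  maximum number of buckets the idle processor can accept.
    Returns (source, buckets), where source is None for saved buckets.
    """
    if saved_buckets:
        # first serve the buckets that were never redistributed
        taken = saved_buckets[:idle_capacity]
        del saved_buckets[:idle_capacity]
        return None, taken
    # otherwise steal from the most loaded processor (excluding the idle one)
    donors = {pid: q for pid, q in pending_queues.items() if pid != idle_id and q}
    if not donors:
        return None, []                     # overall join is finished
    donor = max(donors, key=lambda pid: len(donors[pid]))
    share = min(idle_capacity, len(donors[donor]) // 2 or 1)
    stolen, pending_queues[donor] = donors[donor][:share], donors[donor][share:]
    return donor, stolen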
And thus, the global cost of the join computation of two relations R and S using the PDFA-Join algorithm is:

Time_PDFA-Join = O( max_{i=1,...,p} c^i_r/w * (|R_i| + |S_i|) + l
  + min( max_{i=1,...,p} ω_i * p * (g*|Hist^a1(R)| + γ_i*||Hist^a1(R)||), max_{i=1,...,p} ω_i * (g*|R| + γ_i*||R||) )
  + max_{i=1,...,p} γ_i * (||R_i|| + ||S_i||)
  + min( max_{i=1,...,p} ω_i * p * (g*|Hist^a1(S)| + γ_i*||Hist^a1(S)||), max_{i=1,...,p} ω_i * (g*|S| + γ_i*||S||) )
  + g * max_{i=1,...,p} vol_i * (|R| + |S|)
  + max_{i=1,...,p} ω_i * p * (g*|Hist^(1)(R ⋈ S)| + γ_i*||Hist^(1)(R ⋈ S)||)
  + max_{i=1,...,p} c^i_r/w * vol_i * (|R| + |S| + |R ⋈ S|) ).
Remark. Sequential evaluation of the join of R and S on processor i requires at least the following lower bound:

bound_inf_1 = c^i_r/w * (|R| + |S| + |R ⋈ S|) + γ_i * (||R|| + ||S|| + ||R ⋈ S||).

Therefore, parallel join processing on p heterogeneous processors requires at least:

bound_inf_p = max_i ( c^i_r/w * ω_i * (|R| + |S| + |R ⋈ S|) + γ_i * ω_i * (||R|| + ||S|| + ||R ⋈ S||) ).

The PDFA-Join algorithm has optimal asymptotic complexity when:

max_{i=1,...,p} ω_i * p * |Hist^(1)(R ⋈ S)|  ≤  max_i c^i_r/w * ω_i * max(|R|, |S|, |R ⋈ S|),      (1)

because the other terms in Time_PDFA-Join are bounded by those of bound_inf_p. Inequality (1) holds if the chosen threshold frequency f_o is greater than p (which is the case for our threshold frequency f_o = p * log(p)).
3.2 Discussion
To understand the whole mechanism of the PDFA-Join algorithm, we compare existing approaches (based on hashing) to our pipelined join algorithm using different execution strategies to evaluate the multi-join query Q = (R ⋈_a1 S) ⋈_b1 (U ⋈_a2 V).
A Full Parallel execution of the DFA-Join algorithm (i.e. a basic use of DFA-Join where we do not use pipelined parallelism) requires the evaluation of Q1 = (R ⋈_a1 S) and Q2 = (U ⋈_a2 V) on two disjoint sets of processors; the join results of Q1 and Q2 are then stored on disk. The join results of queries Q1 and Q2 are read from disk to evaluate the final join query Q.
Existing approaches allowing pipelining first start with the evaluation of the join queries Q1 and Q2; each tuple generated by query Q1 is immediately used to build the hash table, whereas the join result of query Q2 is stored on disk.
At the end of the execution of Q1, the join result of query Q2 is used to probe the hash table. This induces unnecessary disk input/output. Existing approaches also require redistribution of all intermediate join results (not only the relevant tuples), which may induce a high communication cost. Moreover, data redistribution in these algorithms is based on hashing, which makes them very sensitive to data skew.
In the PDFA-Join algorithm, we first compute in parallel the histograms of R and S on attribute a1 and, at the same time, the histograms of U and V on attribute a2. As soon as these histograms are available, we generate the communication templates for Q1 and Q2, and in the process the histograms of the join results of Q1 and Q2 on attribute b1 are also computed. The join histograms on attribute b1 are used to create the communication templates for Q, which makes it possible to immediately use the tuples generated by Q1 and Q2 to evaluate the final join query Q.
The PDFA-Join algorithm achieves several enhancements compared to the pipelined join algorithms presented in the literature. During the creation of the communication templates, we create on the fly the histograms for the next join, thereby limiting the number of accesses to the data (and to the disks). Moreover, data redistribution is limited to the tuples that effectively participate in the join result, which reduces communication costs to a minimum. Dynamic data redistribution in PDFA-Join makes it insensitive to data skew while guaranteeing perfect load balance during all the stages of join computation.
PDFA-Join can be used in various parallel strategies; however, in the parallel construction of the histograms of the source relations, we can notice that the degree of parallelism may be limited by two factors: the total number of processors available, and the original distribution of the data. A simultaneous construction of two histograms on the same processor (which occurs when two relations are distributed, at least partially, over the same processors) would not be really interesting compared to a sequential construction. This intra-processor parallelism does not bring acceleration, but should not induce a noticeable slowdown: histograms are generally small, and holding several histograms in memory would not necessitate swapping. On the other hand, as relations tend to be much bigger than the available memory, we have to access them by blocks; as a consequence, accessing one or several relations does not really matter. Our pipeline strategy will really be efficient if the different join operators are executed on disjoint (or at least partially disjoint) sets of processors. This leads us to limit the number of simultaneous builds. As a consequence, we have to segment our query trees, similarly to segmented right-deep trees, each segment (i.e. a set of successive joins) being started when the former is over. Once the histograms are produced for both tables, we can compute the communication templates, then distribute the data, and finally compute the join. Unfortunately, the computation of the communication templates is the implicit barrier within the execution flow that prohibits the use of long pipeline chains.
4 CONCLUSIONS
In this paper, we presented PDFA-Join, a pipelined parallel join algorithm based on dynamic data redistribution. We showed that it can be applied efficiently in various parallel execution strategies, offering flexible resource allocation and reducing disk input/output of intermediate join results in the evaluation of multi-join queries. This algorithm achieves several enhancements compared to solutions suggested in the literature by limiting communication to only the relevant tuples while guaranteeing perfect balancing properties on heterogeneous multi-processor Shared Nothing architectures, even for highly skewed data.
The BSP cost analysis showed that the overhead related to histogram management remains very small compared to the gain it provides in avoiding the effect of load imbalance due to data skew and in reducing the communication costs due to the redistribution of intermediate results, which can otherwise lead to a significant degradation of performance.
Our experience with the BSP cost model and the tests presented in our previous papers (Bamha and Hains, 1999; Bamha and Hains, 2000; Hassan and Bamha, 2008) prove the effectiveness of our approach compared to standard pipelined hash-join algorithms.
REFERENCES
Bamha, M. (2005). An optimal and skew-insensitive join
and multi-join algorithm for distributed architectures.
In Proc. of DEXA’2005 International Conference.
Copenhagen, Denmark, pages 616–625.
Bamha, M. and Exbrayat, M. (2003). Pipelining a skew-
insensitive parallel join algorithm. Parallel Process-
ing Letters, 13(3), pages 317–328.
Bamha, M. and Hains, G. (2000). A skew insensitive al-
gorithm for join and multi-join operation on Shared
Nothing machines. Proc. of DEXA’2000 International
Conference, pages 644–653, London, UK.
Bamha, M. and Hains, G. (1999). A frequency adaptive
join algorithm for Shared Nothing machines. PDCP
Journal, Volume 3, Number 3, pages 333-345.
Chen, M.-S., Lo, M. L., Yu, P. S., and Young, H. C. (1992a).
Using segmented right-deep trees for the execution of
pipelined hash joins. Proc. of VLDB’92 International
Conference, 1992, Vancouver, Canada, pages 15–26.
Chen, M.-S., Yu, P. S., and Wu, K.-L. (1992b). Scheduling
and processor allocation for the execution of multi-
join queries. In International Conference on Data En-
gineering, pages 58–67, Los Alamos, Ca., USA.
Datta, A., Moon, B., and Thomas, H. (1998). A case for
parallelism in datawarehousing and OLAP. In Proc.
of DEXA 98 International Workshop, IEEE Computer
Society, pages 226–231, Vienna.
DeWitt, D. J., Naughton, J. F., Schneider, D. A., and Se-
shadri, S. (1992). Practical Skew Handling in Parallel
Joins. In Proceedings of the 18th VLDB Conference,
pages 27–40, Vancouver, British Columbia, Canada.
Gounaris, A. (2005). Resource aware query processing on
the grid. Thesis report, University of Manchester, Fac-
ulty of Engineering and Physical Sciences.
Hassan, M. A. H. and Bamha, M. (2008). Dynamic data re-
distribution for join queries on heterogeneous shared
nothing architecture. Technical Report 2, LIFO, Université d'Orléans, France.
Hua, K. A. and Lee, C. (1991). Handling data skew in mul-
tiprocessor database computers using partition tun-
ing. In Proc. of VLDB 17th International Conference,
pages 525–535, Barcelona, Catalonia, Spain.
Liu, B. and Rundensteiner, E. A. (2005). Revisiting
pipelined parallelism in multi-join query processing.
In Proc. of VLDB’05 International Conference, pages
829–840.
Lu, H., Ooi, B.-C., and Tan, K.-L. (1994). Query Process-
ing in Parallel Relational Database Systems. IEEE
Computer Society Press, Los Alamos, California.
Mourad, A. N., Morris, R. J. T., Swami, A., and Young,
H. C. (1994). Limits of parallelism in hash join algo-
rithms. Performance evaluation, 20(1/3):301–316.
Rahm, E. (1996). Dynamic load balancing in parallel database systems. In Proc. of EURO-PAR'96 Conference, LNCS, Springer-Verlag, Lyon.
Schneider, D. and DeWitt, D. (1989). A performance eval-
uation of four parallel join algorithms in a shared-
nothing multiprocessor environment. Proc. of 1989
ACM SIGMOD International Conference, Portland,
Oregon, pages 110–121, New York, NY 10036, USA.
Skillicorn, D. B., Hill, J. M. D., and McColl, W. F. (1997).
Questions and Answers about BSP. Scientific Pro-
gramming, 6(3):249–274.
Valiant, L. G. (1990). A bridging model for parallel compu-
tation. Communications of the ACM, 33(8):103–111.
Wilschut, A. N., Flokstra, J., and Apers, P. M. (1995). Paral-
lel evaluation of multi-join queries. In Proc. of ACM-
SIGMOD, 24(2):115–126.