PARALLEL PROCESSING OF "GROUP-BY JOIN" QUERIES ON
SHARED NOTHING MACHINES
M. Al Hajj Hassan and M. Bamha
LIFO, Université d'Orléans
B.P. 6759, 45067 Orléans Cedex 2, France
Keywords:
PDBMS, Parallel joins, Data skew, Join product skew, GroupBy-Join queries, BSP cost model.
Abstract:
SQL queries involving join and group-by operations are frequently used in many decision support applications.
In these applications, the size of the input relations is usually very large, so the parallelization of these queries
is highly recommended in order to obtain a desirable response time. The main drawbacks of the parallel
algorithms presented in the literature to treat this kind of queries are that they are very sensitive to data skew
and involve expensive communication and Input/Output costs in the evaluation of the join operation. In this paper, we
present an algorithm that minimizes the communication cost by performing the group-by operation before
redistribution, where only tuples that will be present in the join result are redistributed. In addition, it evaluates
the query without the need to materialize the result of the join operation, thus reducing the Input/Output
cost of intermediate join results. The performance of this algorithm is analyzed using the scalable and portable
BSP (Bulk Synchronous Parallel) cost model which predicts a near-linear speed-up even for highly skewed
data.
1 INTRODUCTION
Data warehousing, On-Line Analytical Processing
(OLAP) and other multidimensional analysis tech-
nologies have been employed by data analysts to ex-
tract interesting information from large database sys-
tems in order to improve the business performance
and help the organisations in decision making. In
these applications, aggregate queries are widely used
to summarize large volumes of data which may be the
result of the join of several tables containing billions
of records (Datta et al., 1998; Chaudhuri and Shim,
1994). The main difficulty in such applications is
that the result of these analytical queries must be ob-
tained interactively (Datta et al., 1998; Tsois and Sel-
lis, 2003) despite the huge volume of data in ware-
houses and their rapid growth especially in OLAP
systems (Datta et al., 1998). For this reason, paral-
lel processing of these queries is highly recommended
in order to obtain acceptable response time (Bamha,
2005). Research has shown that join, which is one
of the most expensive operations in DBMS, is paral-
lelizable with near-linear speed-up only in ideal cases
(Bamha and Hains, 2000). However, data skew de-
grades the performance of parallel systems (Bamha
and Hains, 1999; Bamha and Hains, 2000; Seetha and
Yu, 1990; Hua and Lee, 1991; Wolf et al., 1994; De-
Witt et al., 1992). Thus, effective parallel algorithms
that evenly distribute the load among processors and
minimize the inter-site communication must be em-
ployed in parallel and distributed systems in order to
obtain acceptable performance.
In traditional algorithms that treat "GroupBy-Join"
queries (i.e. queries involving both group-by and join
operations), join operations are performed in the first
step and then the group-by operation (Chaudhuri
and Shim, 1994; Yan and Larson, 1994). But the
response time of these queries is significantly reduced
if the group-by operation is performed before the
join (Chaudhuri and Shim, 1994), because group-by
reduces the size of the relations, thus minimizing the
join and data redistribution costs. Several algorithms
that perform the group-by operation before the join
operation were presented in the literature (Shatdal
and Naughton, 1995; Taniar et al., 2000; Taniar and
Rahayu, 2001; Yan and Larson, 1994).
In the ”Early Distribution Schema” algorithm pre-
sented in (Taniar and Rahayu, 2001), all the tuples of
the tables are redistributed before applying the join
or the group-by operations, thus the communication
cost in this algorithm is very high. However, the cost
of its join operation is reduced because the group-by
is performed before the expensive join operation.
In the second algorithm, ”Early GroupBy Scheme”
(Taniar and Rahayu, 2001), the group-by operation
is performed before the distribution and the join
operations thus reducing the volume of data. But in
this algorithm, all the tuples of the group-by results
are redistributed even if they do not contribute to
the join result. This is a drawback, because in some
cases only a few tuples of relations containing millions
of tuples contribute to the join operation, so the
redistribution of all these tuples is useless.
These algorithms fully materialize the intermediate
results of the join operations. This is a significant
drawback because the size of the result of this
operation is generally large with respect to the size of
the input relations. In addition, the Input/Output cost
in these algorithms is very high, since it is reasonable
to assume that the output relation cannot fit in the
main memory of each processor, so it must be reread
in order to evaluate the aggregate function.
In this paper, we present a new parallel algorithm
used to evaluate "GroupBy-Join" queries on Shared
Nothing machines (a multiprocessor machine where
each processor has its own memory and disks (DeWitt
and Gray, 1992)). In this algorithm, we do not materi-
alize the join operation as in the traditional algorithms
where the join operation is evaluated first and then the
group-by and aggregate functions (Yan and Larson,
1994). So the Input/Output cost is minimal because
we do not need to save the huge volume of data that
results from the join operation.
We also use the histograms of both relations in order
to find the tuples which will be present in the join re-
sult. After finding these tuples, we apply the grouping
and aggregate function to them, on each processor,
before performing the join. Using our approach, we
reduce the size of the data and the communication costs
to a minimum. It is proved in (Bamha and Hains, 2000;
Bamha and Hains, 1999), using the BSP model, that
histogram management has a negligible cost when
compared to the gain it provides in reducing the com-
munication cost. In addition, our algorithm avoids
the problem of data skew because the hashing func-
tions are only applied on histograms and not on input
relations.
The performance of this algorithm is analyzed using
the scalable and portable BSP cost model (Skillicorn
et al., 1997) which predicts for our algorithm a near-
linear speed-up even for highly skewed data.
The rest of the paper is organized as follows. In
section 2, we present the BSP cost model used to eval-
uate the processing time of the different phases of the
algorithm. In section 3, we give an overview of differ-
ent computation methods of "GroupBy-Join" queries.
In section 4, we describe our algorithm. We then con-
clude in section 5.
2 THE BSP COST MODEL
Bulk-Synchronous Parallel (BSP) cost model is a pro-
gramming model introduced by L. Valiant (Valiant,
1990) to offer a high degree of abstraction like PRAM
models and yet allow portable and predictable perfor-
mance on a wide variety of multi-processor architec-
tures (Bisseling, 2004; Skillicorn et al., 1997). A BSP
computer contains a set of processor-memory pairs, a
communication network allowing inter-processor de-
livery of messages and a global synchronization unit
which executes collective requests for a synchroniza-
tion barrier. Its performance is characterized by 3 pa-
rameters expressed as multiples of the local process-
ing speed:
- the number of processor-memory pairs p,
- the time l required for a global synchronization,
- the time g for collectively delivering a 1-relation
(communication phase where each processor re-
ceives/sends at most one word). The network is
assumed to deliver an h-relation in time g * h for
any arity h.
Figure 1: A BSP superstep.
A BSP program is executed as a sequence of super-
steps, each one divided into (at most) three successive
and logically disjoint phases. In the first phase each
processor uses only its local data to perform sequen-
tial computations and to request data transfers to/from
other nodes. In the second phase the network deliv-
ers the requested data transfers and in the third phase
a global synchronization barrier occurs, making the
transferred data available for the next superstep. The
execution time of a superstep s is thus the sum of the
maximal local processing time, of the data delivery
time and of the global synchronization time:
$$Time(s) = \max_{i:processor} w_i^{(s)} + g * \max_{i:processor} h_i^{(s)} + l$$

where $w_i^{(s)}$ is the local processing time on processor $i$ during superstep $s$ and
$h_i^{(s)} = \max\{h_i^{(s)+}, h_i^{(s)-}\}$, where $h_i^{(s)+}$ (resp. $h_i^{(s)-}$) is the
number of words transmitted (resp. received) by processor $i$ during superstep $s$.
The execution time, $\sum_s Time(s)$, of a BSP program composed of $S$ supersteps is
therefore a sum of 3 terms: $W + H*g + S*l$ where $W = \sum_s \max_i w_i^{(s)}$ and
$H = \sum_s \max_i h_i^{(s)}$. In general $W$, $H$ and $S$ are functions of $p$ and of
the size of data $n$, or (as in the present application) of more complex parameters like
data skew and histogram sizes. To minimize the execution time of a BSP algorithm, the
design must jointly minimize the number $S$ of supersteps, the total volumes $H$
(resp. $W$), and the imbalances $h^{(s)}$ (resp. $w^{(s)}$) of communication (resp.
local computation).
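To fix ideas, the following Python sketch evaluates these cost expressions for given
per-processor work and communication volumes. It is an illustration we add here, not
part of the BSP definition; the names w, h_sent and h_received mirror $w_i^{(s)}$,
$h_i^{(s)+}$ and $h_i^{(s)-}$, and the numbers are arbitrary.

# Sketch: BSP superstep cost Time(s) = max_i w_i + g * max_i h_i + l,
# where h_i = max(words sent by i, words received by i).

def superstep_time(w, h_sent, h_received, g, l):
    """w, h_sent, h_received: per-processor lists; g, l: BSP machine parameters."""
    h = [max(s, r) for s, r in zip(h_sent, h_received)]
    return max(w) + g * max(h) + l

def program_time(supersteps, g, l):
    """Total time of a BSP program: the sum W + H*g + S*l over its supersteps."""
    return sum(superstep_time(w, hs, hr, g, l) for (w, hs, hr) in supersteps)

# Example: 4 processors, two supersteps.
steps = [
    ([100, 120, 90, 110], [10, 12, 8, 9], [9, 11, 10, 12]),
    ([80, 85, 95, 75],    [5, 6, 4, 7],   [6, 5, 7, 4]),
]
print(program_time(steps, g=2.5, l=50.0))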
3 COMPUTATION OF
"GROUP-BY JOIN" QUERIES
In DBMS, the aggregate functions can be applied on
the tuples of a single table, but in most SQL queries,
they are applied on the output of the join of multiple
relations. In the latter case, we can distinguish two
types of "GroupBy-Join" queries. We will illustrate
these two types using the following example.
In this example, we have three relations that represent
respectively Suppliers, Products and the quantity of a
product shipped by a supplier on a specific date.
SUPPLIER (Sid, Sname, City)
PRODUCT (Pid, Pname, Category)
SHIPMENT (Sid, Pid, Date, Quantity)
Query 1
Select p.Pid, SUM (Quantity)
From PRODUCT as p, SHIPMENT as s
Where p.Pid = s.Pid
Group By p.Pid
Query 2
Select p.Category, SUM (Quantity)
From PRODUCT as p, SHIPMENT as s
Where p.Pid = s.Pid
Group By p.Category
The purpose of Query1 is to find the total quantity
of every product shipped by all the suppliers, while
that of Query2 is to find the total amount of every
category of product shipped by all the suppliers.
The difference between Query1 and Query2 lies
in the group-by and join attributes. In Query1,
the join attribute (Pid) and the group-by attribute
are the same. In this case, it is preferable to carry
out the group-by operation first and then the join
operation (Taniar et al., 2000; Taniar and Rahayu,
2001), because the group-by operation reduces the
size of the relations to be joined. As a consequence,
applying the group-by operation before the join
operation in PDBMS (Parallel DataBase Management
Systems) results in a huge gain in the
communication cost and the execution time of the
"GroupBy-Join" queries.
On the contrary, this cannot be applied to Query 2,
because the join attribute (Pid) is different from the
group-by attribute (Category).
In this paper, we focus on "GroupBy-Join"
queries when the join attributes are part of the
group-by attributes. In our algorithm, we succeed in
redistributing only the tuples that will be present in
the join result, after applying the aggregate function.
Therefore, the communication cost is reduced to a
minimum.
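To see why the pushdown is valid for Query 1 (where the join attribute equals the
group-by attribute), the following Python sketch, on toy data of our own invention,
checks that pre-aggregating SHIPMENT on Pid before the join yields the same totals as
joining first and then grouping.

# Sketch: group-by before join vs. join before group-by for Query 1.
from collections import defaultdict

product  = [(1, "bolt", "hardware"), (2, "nut", "hardware")]          # (Pid, Pname, Category)
shipment = [(10, 1, "2006-01-01", 5), (11, 1, "2006-01-02", 3),
            (12, 2, "2006-01-01", 7)]                                  # (Sid, Pid, Date, Quantity)

# Join first, then group by Pid.
late = defaultdict(int)
for (pid, _, _) in product:
    for (_, spid, _, qty) in shipment:
        if spid == pid:
            late[pid] += qty

# Group SHIPMENT by Pid first, then join the (much smaller) aggregate.
pre = defaultdict(int)
for (_, spid, _, qty) in shipment:
    pre[spid] += qty
early = {pid: pre[pid] for (pid, _, _) in product if pid in pre}

assert early == dict(late)   # same result, far fewer tuples joined
print(early)                  # {1: 8, 2: 7}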
4 PRESENTED ALGORITHM
In this section, we present a detailed description of
our parallel algorithm used to evaluate "GroupBy-
Join” queries when the join attributes are part of
the group-by attributes. We assume that the relation
R (resp. S) is partitioned among processors by
horizontal fragmentation and the fragments $R_i$ for
$i = 1, ..., p$ are almost of the same size on each
processor, i.e. $|R_i| \simeq \frac{|R|}{p}$ where $p$ is the number of
processors.
For simplicity of description and without loss of
generality, we consider that the query has only one
join attribute x and that the group-by attribute set
consists of x, an attribute y of R and another attribute
z of S. We also assume that the aggregate function
f is applied on the values of the attribute u of S. So
the treated query is the following:
Select R.x, R.y, S.z, f(S.u)
From R, S
Where R.x = S.x
Group By R.x, R.y, S.z
In the rest of this paper, we use the following
notation for each relation $T \in \{R, S\}$:
- $T_i$ denotes the fragment of relation $T$ placed on
processor $i$,
- $Hist^w(T)$ denotes the histogram of relation $T$
with respect to the attribute $w$, i.e. a list of pairs
$(v, n_v)$ where $n_v \neq 0$ is the number of tuples of re-
lation $T$ having the value $v$ for the attribute $w$
(histograms are implemented as balanced trees (B-trees):
a data structure that maintains an ordered set of data
to allow efficient search and insert operations). The
histogram is often much smaller and never larger
than the relation it describes,
- $Hist^w(T_i)$ denotes the histogram of fragment $T_i$,
- $Hist_i^w(T)$ is processor $i$'s fragment of the his-
togram of $T$,
- $Hist^w(T)(v)$ is the frequency $n_v$ of value $v$ in
relation $T$,
- $Hist^w(T_i)(v)$ is the frequency of value $v$ in sub-
relation $T_i$,
- $AGGR_{f,u}^w(T)$ is the result of applying the ag-
gregate function $f$ on the values of the aggregate
attribute $u$ of every group of tuples of $T$ hav-
ing identical values of the group-by attribute $w$
(it is also implemented as a B-tree). $AGGR_{f,u}^w(T)$
is formed of a list of tuples having the form $(v, f_v)$
where $f_v$ is the result of applying the aggregate
function on the group of tuples having value $v$ for
the attribute $w$ ($w$ may be formed of more than
one attribute),
- $AGGR_{f,u}^w(T_i)$ denotes the result of applying the
aggregate function on the attribute $u$ of the frag-
ment $T_i$,
- $AGGR_{f,u,i}^w(T)$ is processor $i$'s fragment of the re-
sult of applying the aggregate function on $T$,
- $AGGR_{f,u}^w(T)(v)$ is the result $f_v$ of the aggregate
function of the group of tuples having value $v$ for
the group-by attribute $w$ in relation $T$,
- $AGGR_{f,u}^w(T_i)(v)$ is the result $f_v$ of the aggregate
function of the group of tuples having value $v$ for
the group-by attribute $w$ in sub-relation $T_i$,
- $\|T\|$ denotes the number of tuples of relation $T$, and
$|T|$ denotes the size (expressed in bytes or number
of pages) of relation $T$.
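In the illustrative sketches that follow, we stand these B-trees in with Python
dictionaries (a simplification of ours, not the paper's data structure). Building
$Hist^w(T)$ and $AGGR_{f,u}^w(T)$ then looks as follows, for a toy schema R(x, y),
S(x, z, u); the helper names hist and aggr are ours.

# Sketch: Hist^w(T) and AGGR^w_{f,u}(T) as dictionaries keyed by the value(s) of w.
from collections import defaultdict

def hist(tuples, key):
    """Hist^w(T): value of w -> number of tuples with that value (n_v != 0 only)."""
    h = defaultdict(int)
    for t in tuples:
        h[key(t)] += 1
    return dict(h)

def aggr(tuples, key, val, f):
    """AGGR^w_{f,u}(T): value of w -> f applied to the u-values of its group."""
    groups = defaultdict(list)
    for t in tuples:
        groups[key(t)].append(val(t))
    return {v: f(us) for v, us in groups.items()}

# R(x, y) and S(x, z, u); f = SUM on u grouped by (x, z).
R = [(1, "a"), (1, "b"), (2, "a")]
S = [(1, "p", 10), (1, "p", 5), (3, "q", 2)]
print(hist(R, key=lambda t: t[0]))                         # Hist^x(R)
print(aggr(S, key=lambda t: (t[0], t[1]),
           val=lambda t: t[2], f=sum))                     # AGGR^{x,z}_{SUM,u}(S)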
The algorithm proceeds in four phases. We will
give an upper bound of the execution time of each
superstep using the BSP cost model. The notation O(...)
hides only small constant factors: they depend only
on the program implementation but neither on data
nor on the BSP machine parameters.
Phase 1: Creating local histograms
In this phase, the local histograms $Hist^x(R_i)_{i=1,...,p}$
(resp. $Hist^x(S_i)_{i=1,...,p}$) of blocks $R_i$ (resp. $S_i$) are
created in parallel by a scan of the fragment $R_i$ (resp.
$S_i$), on processor $i$, in time $c_{i/o} * \max_{i=1,...,p} |R_i|$
(resp. $c_{i/o} * \max_{i=1,...,p} |S_i|$) where $c_{i/o}$ is the cost of
writing/reading a page of data from disk.
In addition, the local fragments $AGGR_{f,u}^{x,z}(S_i)_{i=1,...,p}$
of blocks $S_i$ are created on the fly while scanning re-
lation $S_i$ in parallel, on each processor $i$, by applying
the aggregate function $f$ on every group of tuples
having identical values of the couple of attributes
$(x, z)$. At the same time, the local histograms
$Hist^{x,y}(R_i)_{i=1,...,p}$ are also created.
(In this algorithm the aggregate function may be
MAX, MIN, SUM or COUNT. For the aggregate
AVG, a similar algorithm that merges the COUNT
and the SUM algorithms is applied.)
In principle, this phase costs:
$$Time_{phase1} = O\big(c_{i/o} * \max_{i=1,...,p} (|R_i| + |S_i|)\big).$$
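A minimal sketch of phase 1 on one processor $i$ follows (our illustration, with
dictionaries standing in for B-trees and SUM as the aggregate $f$): a single scan of
each fragment produces $Hist^x(R_i)$, $Hist^{x,y}(R_i)$, $Hist^x(S_i)$ and
$AGGR_{f,u}^{x,z}(S_i)$ on the fly.

# Sketch: phase 1 on processor i -- one scan per fragment, all structures on the fly.
from collections import defaultdict

def phase1(R_i, S_i):
    hist_x_R  = defaultdict(int)   # Hist^x(R_i)
    hist_xy_R = defaultdict(int)   # Hist^{x,y}(R_i)
    hist_x_S  = defaultdict(int)   # Hist^x(S_i)
    aggr_xz_S = defaultdict(int)   # AGGR^{x,z}_{SUM,u}(S_i)
    for (x, y) in R_i:             # single scan of R_i
        hist_x_R[x]     += 1
        hist_xy_R[x, y] += 1
    for (x, z, u) in S_i:          # single scan of S_i; SUM as aggregate f
        hist_x_S[x]     += 1
        aggr_xz_S[x, z] += u
    return dict(hist_x_R), dict(hist_xy_R), dict(hist_x_S), dict(aggr_xz_S)

R_i = [(1, "a"), (1, "b"), (2, "a")]
S_i = [(1, "p", 10), (1, "p", 5), (3, "q", 2)]
print(phase1(R_i, S_i))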
Phase 2: Creating the histogram of $R \bowtie S$
The first step in this phase is to create the histograms
$Hist_i^x(R)$ and $Hist_i^x(S)$ by a parallel hashing of
the histograms $Hist^x(R_i)$ and $Hist^x(S_i)$. After
hashing, each processor $i$ merges the messages it
received to constitute $Hist_i^x(R)$ and $Hist_i^x(S)$.
While merging, processor $i$ also retains a trace of the
network layout of the values $d$ of the attribute $x$ in
its $Hist_i^x(R)$ (resp. $Hist_i^x(S)$): this is nothing but
the collection of messages it has just received. This
information will help in forming the communication
templates in phase 3.
The cost of the redistribution and merging step is
(cf. proposition 1 in (Bamha and Hains, 2005)):
$$Time_{phase2.a} = O\big(\min(g*|Hist^x(R)| + \|Hist^x(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p}) + \min(g*|Hist^x(S)| + \|Hist^x(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p}) + l\big),$$
where $g$ is the BSP communication parameter and $l$
the cost of a barrier of synchronisation.
We recall that, in the above equation, for a rela-
tion $T \in \{R, S\}$, the term
$\min(g*|Hist^x(T)| + \|Hist^x(T)\|,\; g*\frac{|T|}{p} + \frac{\|T\|}{p})$
is the necessary time
to compute $Hist_{i=1,...,p}^x(T)$ starting from the local
histograms $Hist^x(T_i)_{i=1,...,p}$.
The histogram $Hist_i^x(R \bowtie S)$ is then computed
on each processor $i$ by intersecting $Hist_i^x(R)$ and
$Hist_i^x(S)$ in time:
$$Time_{phase2.b} = O\big(\max_{i=1,...,p} \min(\|Hist_i^x(R)\|, \|Hist_i^x(S)\|)\big).$$
(The size of $Hist(R \bowtie S) = Hist(R) \cap Hist(S)$ is
generally very small compared to $|Hist(R)|$ and $|Hist(S)|$,
because $Hist(R \bowtie S)$ contains only values that appear in
both relations $R$ and $S$.)
The total cost of this phase is:
$$Time_{phase2} = Time_{phase2.a} + Time_{phase2.b} = O\big(\min(g*|Hist^x(R)| + \|Hist^x(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p}) + \min(g*|Hist^x(S)| + \|Hist^x(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p}) + \max_{i=1,...,p} \min(\|Hist_i^x(R)\|, \|Hist_i^x(S)\|) + l\big).$$
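The following sketch simulates phase 2 with the $p$ processors represented as list
slots and Python's built-in hash standing in for the hashing function (both
simplifications of ours): the local histograms are hash-partitioned on $x$, merged
per processor, and intersected to obtain $Hist_i^x(R \bowtie S)$.

# Sketch: phase 2 -- hash the local histograms, merge per processor, intersect.
from collections import defaultdict

def phase2(local_hists_R, local_hists_S, p):
    """local_hists_R[i] is Hist^x(R_i) built in phase 1 (same for S)."""
    merged_R = [defaultdict(int) for _ in range(p)]   # Hist_j^x(R)
    merged_S = [defaultdict(int) for _ in range(p)]   # Hist_j^x(S)
    for h in local_hists_R:
        for x, n in h.items():
            merged_R[hash(x) % p][x] += n             # each value x is owned by one processor
    for h in local_hists_S:
        for x, n in h.items():
            merged_S[hash(x) % p][x] += n
    # Hist_j^x(R |><| S): values of x present in both relations
    # (here kept with both frequencies, for illustration).
    join_hist = [{x: (hR[x], hS[x]) for x in hR.keys() & hS.keys()}
                 for hR, hS in zip(merged_R, merged_S)]
    return merged_R, merged_S, join_hist

hR = [{1: 2, 2: 1}, {2: 3}]          # Hist^x(R_1), Hist^x(R_2)
hS = [{1: 2, 3: 1}, {1: 1}]          # Hist^x(S_1), Hist^x(S_2)
print(phase2(hR, hS, p=2)[2])        # only x = 1 survives the intersection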
Phase 3: Data redistribution
In order to reduce the communication cost, only tu-
ples of $Hist^{x,y}(R)$ and $AGGR_{f,u}^{x,z}(S)$ that will be
present in the join result will be redistributed.
To this end, we first compute on each processor $j$ the intersections
$\overline{Hist}^{(j)x}(R_i) = Hist^{(j)x}(R_i) \cap Hist_j(R \bowtie S)$ and
$\overline{Hist}^{(j)x}(S_i) = Hist^{(j)x}(S_i) \cap Hist_j(R \bowtie S)$
for $i = 1, ..., p$, where $Hist^{(j)x}(R_i)$
(resp. $Hist^{(j)x}(S_i)$) is the fragment of $Hist^x(R_i)$
(resp. $Hist^x(S_i)$) which was sent by processor $i$ to
processor $j$ in the second phase.
The cost of this step is:
$O\big(\sum_i \|\overline{Hist}^{(j)x}(R_i)\| + \sum_i \|\overline{Hist}^{(j)x}(S_i)\|\big)$.
We recall that
$\sum_i \|\overline{Hist}^{(j)x}(R_i)\| = \|\bigcup_i \overline{Hist}^{(j)x}(R_i)\| \leq \min(\|Hist^x(R)\|, \frac{\|R\|}{p})$
and
$\sum_i \|\overline{Hist}^{(j)x}(S_i)\| = \|\bigcup_i \overline{Hist}^{(j)x}(S_i)\| \leq \min(\|Hist^x(S)\|, \frac{\|S\|}{p})$,
thus the total cost of this step is:
$$Time_{phase3.a} = O\big(\min(\|Hist^x(R)\|, \frac{\|R\|}{p}) + \min(\|Hist^x(S)\|, \frac{\|S\|}{p})\big).$$
Now each processor $j$ sends each fragment
$\overline{Hist}^{(j)x}(R_i)$ (resp. $\overline{Hist}^{(j)x}(S_i)$) to pro-
cessor $i$. Thus, each processor $i$ receives
$\sum_j |\overline{Hist}^{(j)x}(R_i)| + \sum_j |\overline{Hist}^{(j)x}(S_i)|$ pages of
data from the other processors.
In fact, $Hist^x(R_i) = \bigcup_j Hist^{(j)x}(R_i)$ and
$|Hist^x(R_i)| = \sum_j |Hist^{(j)x}(R_i)| \geq \sum_j |Hist^{(j)x}(R_i) \cap Hist^x(R \bowtie S)|$,
thus $|Hist^x(R_i)| \geq \sum_j |\overline{Hist}^{(j)x}(R_i)|$ (this also ap-
plies to $Hist^x(S_i)$).
Therefore, the total cost of this stage of communica-
tion is at most:
$$Time_{phase3.b} = O\big(g * (|Hist^x(R_i)| + |Hist^x(S_i)|) + l\big).$$
Remark 1 $\bigcup_j \overline{Hist}^{(j)x}(R_i)$ is simply the intersection
of $Hist^x(R_i)$ and the histogram $Hist^x(R \bowtie S)$, which
will be noted:
$$\overline{Hist}^x(R_i) = \bigcup_j \overline{Hist}^{(j)x}(R_i) = Hist^x(R_i) \cap Hist^x(R \bowtie S).$$
Hence $\overline{Hist}^x(R_i)$ is only the restriction of the
fragment of $Hist^x(R_i)$ to values which will be present
in the join of the relations $R$ and $S$ (this also applies
to $\overline{Hist}^x(S_i)$).
Now, each processor obeys all the distribut-
ing orders it has received, so only the tuples of
$\overline{Hist}^{x,y}(R_i) = Hist^{x,y}(R_i) \ltimes \overline{Hist}^x(R_i)$ and
$\overline{AGGR}_{f,u}^{x,z}(S_i) = AGGR_{f,u}^{x,z}(S_i) \ltimes \overline{Hist}^x(S_i)$ are
redistributed.
To this end, we first evaluate $\overline{Hist}^{x,y}(R_i)$ and
$\overline{AGGR}_{f,u}^{x,z}(S_i)$. The cost of this step is of order:
$$Time_{phase3.c} = O\big(\max_{i=1,...,p} (\|Hist^{x,y}(R_i)\| + \|AGGR_{f,u}^{x,z}(S_i)\|)\big),$$
which is the necessary time to traverse all the
tuples of $Hist^{x,y}(R_i)$ and $AGGR_{f,u}^{x,z}(S_i)$ and access
$\overline{Hist}^x(R_i)$ and $\overline{Hist}^x(S_i)$ respectively on each pro-
cessor $i$.
Now, each processor $i$ distributes the tuples of
$\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}_{f,u}^{x,z}(S_i)$. After distribution,
all the tuples of $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}_{f,u}^{x,z}(S_i)$
having the same values of the join attribute $x$ are
stored on the same processor. So, each processor
$i$ merges the blocks of data received from all the
other processors in order to create $\overline{Hist}_i^{x,y}(R)$ and
$\overline{AGGR}_{f,u,i}^{x,z}(S)$.
The cost of distributing and merging the tuples is
of order (cf. proposition 1 in (Bamha and Hains,
2005)):
$$Time_{phase3.d} = O\big(\min(g*|\overline{Hist}^{x,y}(R)| + \|\overline{Hist}^{x,y}(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p}) + \min(g*|\overline{AGGR}_{f,u}^{x,z}(S)| + \|\overline{AGGR}_{f,u}^{x,z}(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p}) + l\big),$$
where the terms
$\min(g*|\overline{Hist}^{x,y}(R)| + \|\overline{Hist}^{x,y}(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p})$
and
$\min(g*|\overline{AGGR}_{f,u}^{x,z}(S)| + \|\overline{AGGR}_{f,u}^{x,z}(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p})$
represent the necessary time to compute $\overline{Hist}_i^{x,y}(R)$
and $\overline{AGGR}_{f,u,i}^{x,z}(S)$ starting from $\overline{Hist}^{x,y}(R_i)$ and
$\overline{AGGR}_{f,u}^{x,z}(S_i)$ respectively.
The total time of the redistribution phase is:
$$Time_{phase3} = O\big(\min(g*|\overline{Hist}^{x,y}(R)| + \|\overline{Hist}^{x,y}(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p}) + \min(\|Hist^x(R)\|, \frac{\|R\|}{p}) + \min(g*|\overline{AGGR}_{f,u}^{x,z}(S)| + \|\overline{AGGR}_{f,u}^{x,z}(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p}) + \min(\|Hist^x(S)\|, \frac{\|S\|}{p}) + \max_{i=1,...,p} (\|Hist^{x,y}(R_i)\| + \|AGGR_{f,u}^{x,z}(S_i)\|) + l\big).$$
We mention that we only redistribute $\overline{Hist}^{x,y}(R_i)$
and $\overline{AGGR}_{f,u}^{x,z}(S_i)$, and their sizes are generally very
small compared to $|R_i|$ and $|S_i|$ respectively. In
addition, the size of $Hist^x(R \bowtie S)$ is generally very
small compared to $|Hist^x(R)|$ and $|Hist^x(S)|$. Thus,
we reduce the communication cost to a minimum.
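A condensed sketch of phase 3 follows (again our illustration, under the same
dictionary and hashing simplifications): each processor keeps of $Hist^{x,y}(R_i)$
and $AGGR_{f,u}^{x,z}(S_i)$ only the entries whose $x$ value belongs to the join
histogram, then redistributes the survivors by hashing on $x$; merging partial SUMs
is an addition.

# Sketch: phase 3 -- semi-join filter on the join histogram, then redistribution.
from collections import defaultdict

def phase3(hist_xy_R, aggr_xz_S, join_values, p):
    """hist_xy_R[i]: Hist^{x,y}(R_i); aggr_xz_S[i]: AGGR^{x,z}_{f,u}(S_i);
    join_values: set of x values in Hist^x(R |><| S)."""
    out_R = [defaultdict(int) for _ in range(p)]   # merged Hist_j^{x,y}(R) after exchange
    out_S = [defaultdict(int) for _ in range(p)]   # merged AGGR^{x,z}_{f,u,j}(S)
    for h in hist_xy_R:
        for (x, y), n in h.items():
            if x in join_values:                   # only tuples present in the join result
                out_R[hash(x) % p][x, y] += n
    for a in aggr_xz_S:
        for (x, z), v in a.items():
            if x in join_values:
                out_S[hash(x) % p][x, z] += v      # SUM merges by addition
    return out_R, out_S

hist_xy_R = [{(1, "a"): 2, (2, "a"): 1}, {(2, "b"): 3}]
aggr_xz_S = [{(1, "p"): 15, (3, "q"): 2}, {(1, "p"): 4}]
print(phase3(hist_xy_R, aggr_xz_S, join_values={1}, p=2))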
Phase 4: Global computation of the aggregate function
In this phase, we compute the global aggregate
function on each processor. We use the following
algorithm, where $AGGR_{f,u,i}^{x,y,z}(R \bowtie S)$ holds the
final result on each processor $i$. The tuples of
$AGGR_{f,u,i}^{x,y,z}(R \bowtie S)$ have the form $(x, y, z, v)$ where $v$
is the result of the aggregate function.

Par (on each node in parallel) i = 1, ..., p
  AGGR_{f,u,i}^{x,y,z}(R |><| S) = NULL   (this creates a B-tree to store the histogram's entries)
  For every tuple t of relation Hist_i^{x,y}(R) do
    freq = Hist_i^{x,y}(R)(t.x, t.y)
    For every entry v1 = AGGR_{f,u,i}^{x,z}(S)(t.x, z) do
      Insert a new tuple (t.x, t.y, z, f(v1, freq))
        into AGGR_{f,u,i}^{x,y,z}(R |><| S);
    EndFor
  EndFor
EndPar

The time of this phase is:
$$O\big(\max_{i=1,...,p} \|AGGR_{f,u,i}^{x,y,z}(R \bowtie S)\|\big),$$
because the combination of the tuples of $Hist_i^{x,y}(R)$ and
$AGGR_{f,u,i}^{x,z}(S)$ is performed to generate all the tuples
of $AGGR_{f,u,i}^{x,y,z}(R \bowtie S)$.
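In Python, the parallel loop above becomes, on each processor $i$ (a sketch under the
previous simplifications; for $f$ = SUM, combining a partial sum $v_1$ with a group
frequency freq is the product $v_1 * freq$):

# Sketch: phase 4 on processor i -- combine Hist_i^{x,y}(R) with AGGR^{x,z}_{f,u,i}(S).
# For f = SUM, a group (x, y) occurring freq times in R multiplies the partial sum.

def phase4(hist_xy, aggr_xz):
    result = {}                                    # AGGR^{x,y,z}_{f,u,i}(R |><| S)
    # Index the S-side aggregate by x so each (x, y) group finds its z entries fast.
    by_x = {}
    for (x, z), v in aggr_xz.items():
        by_x.setdefault(x, []).append((z, v))
    for (x, y), freq in hist_xy.items():
        for (z, v) in by_x.get(x, []):
            result[x, y, z] = v * freq             # f(v, freq) for SUM
    return result

hist_xy = {(1, "a"): 2}                            # Hist_i^{x,y}(R) after phase 3
aggr_xz = {(1, "p"): 19}                           # AGGR^{x,z}_{SUM,u,i}(S) after phase 3
print(phase4(hist_xy, aggr_xz))                    # {(1, 'a', 'p'): 38}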
Remark 2 In practice, the imbalance of the data re-
lated to the use of hash functions can be due to:
- a bad choice of the hash function used. This im-
balance can be avoided by using the hashing tech-
niques presented in the literature, making it possible
to distribute evenly the values of the join attribute
with a very high probability (Carter and Wegman,
1979); a sketch of such a function follows this remark,
- an intrinsic data imbalance which appears when
some values of the join attribute appear more fre-
quently than others. By definition a hash function
maps tuples having the same join attribute values
to the same processor. There is no way for a clever
hash function to avoid the load imbalance that results
from these repeated values (DeWitt et al., 1992).
But this case cannot arise here owing to the fact
that histograms contain only distinct values of the
join attribute and the hashing functions we use are
always applied to histograms.
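For the first point, a classical choice is to draw the hash function at random from a
universal family in the sense of (Carter and Wegman, 1979). The sketch below shows the
textbook construction $h_{a,b}(x) = ((a*x + b) \bmod P) \bmod m$ (our illustration,
with a fixed prime $P$ and integer keys assumed smaller than $P$):

# Sketch: a Carter-Wegman universal hash family h_{a,b}(x) = ((a*x + b) mod P) mod m.
import random

P = (1 << 61) - 1          # a Mersenne prime larger than any key we hash

def random_universal_hash(m):
    """Draw h uniformly from the family; two distinct keys collide w.p. <= 1/m."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = random_universal_hash(m=8)     # m = number of processors
print([h(x) for x in range(10)])   # join-attribute values spread evenly w.h.p.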
The global cost of evaluating the "GroupBy-Join"
queries is of order:
$$Time_{total} = O\big(c_{i/o} * \max_{i=1,...,p} (|R_i| + |S_i|) + \max_{i=1,...,p} \|AGGR_{f,u,i}^{x,y,z}(R \bowtie S)\| + \min(g*|Hist^x(R)| + \|Hist^x(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p}) + \min(g*|Hist^x(S)| + \|Hist^x(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p}) + \min(g*|\overline{Hist}^{x,y}(R)| + \|\overline{Hist}^{x,y}(R)\|,\; g*\frac{|R|}{p} + \frac{\|R\|}{p}) + \min(g*|\overline{AGGR}_{f,u}^{x,z}(S)| + \|\overline{AGGR}_{f,u}^{x,z}(S)\|,\; g*\frac{|S|}{p} + \frac{\|S\|}{p}) + \max_{i=1,...,p} (\|Hist^{x,y}(R_i)\| + \|AGGR_{f,u}^{x,z}(S_i)\|) + l\big).$$
Remark 3 In the traditional algorithms, the aggre-
gate function is applied on the output of the join op-
eration. The sequential evaluation of the "GroupBy-
Join" queries thus requires at least the following lower
bound: $bound_{inf_1} = c_{i/o} * (|R| + |S| + |R \bowtie S|)$.
Parallel processing with $p$ processors therefore re-
quires: $bound_{inf_p} = \frac{1}{p} * bound_{inf_1}$.
Using our approach in the evaluation of the
"GroupBy-Join" queries, we only redistribute tuples
that will be effectively present in the "GroupBy-Join"
result, which reduces the communication cost to a
minimum. This algorithm has an asymptotically optimal
complexity because all the terms in $Time_{total}$ are
bounded by those of $bound_{inf_p}$.
5 CONCLUSION
The algorithm presented in this paper is used to eval-
uate the "GroupBy-Join" queries on Shared Noth-
ing machines when the join attributes are part of the
group-by attributes. Our main contribution in this
algorithm is that we do not need to materialize the
costly join operation which is necessary in all the
other algorithms presented in the literature, thus we
reduce its Input/Output cost. It also helps us to avoid
the effect of data skew which may result from com-
puting the intermediate join results and from redis-
tributing all the tuples if AVS (Attribute Value Skew)
exists in the relation. In addition, we partially eval-
uate the aggregate function before redistributing the
data between processors or evaluating the join oper-
ation, because group-by and aggregate functions re-
duce the volume of data. To reduce the communica-
tion cost to minimum, we use the histograms to dis-
tribute only the tuples of the grouping result that will
effectively be present in the output of the join oper-
ation. This algorithm is proved to have a near-linear
speed-up, using the BSP cost model, even for highly
skewed data. Our experience with the join operation
(Bamha and Hains, 2000; Bamha and Hains, 1999;
Bamha and Hains, 2005) is evidence that the above
theoretical analysis is accurate in practice.
REFERENCES
Bamha, M. (2005). An optimal and skew-insensitive
join and multi-join algorithm for distributed architec-
tures. In Proceedings of the International Confer-
ence on Database and Expert Systems Applications
(DEXA'2005). 22-26 August, Copenhagen, Den-
mark, volume 3588 of Lecture Notes in Computer Sci-
ence, pages 616–625. Springer-Verlag.
Bamha, M. and Hains, G. (2000). A skew insensitive al-
gorithm for join and multi-join operation on Shared
Nothing machines. In the 11th International Confer-
ence on Database and Expert Systems Applications
DEXA’2000, volume 1873 of Lecture Notes in Com-
puter Science, London, United Kingdom. Springer-
Verlag.
Bamha, M. and Hains, G. (2005). An efficient equi-semi-
join algorithm for distributed architectures. In Pro-
ceedings of the 5th International Conference on Com-
putational Science (ICCS’2005). 22-25 May, Atlanta,
USA, volume 3515 of Lecture Notes in Computer Sci-
ence, pages 755–763. Springer-Verlag.
Bamha, M. and Hains, G. (1999). A frequency
adaptive join algorithm for Shared Nothing machines.
Journal of Parallel and Distributed Computing Prac-
tices (PDCP), Volume 3, Number 3, pages 333-345.
Appears also in Progress in Computer Research, F.
Columbus Ed. Vol. II, Nova Science Publishers, 2001.
Bisseling, R. H. (2004). Parallel Scientific Computation :
A Structured Approach using BSP and MPI. Oxford
University Press, USA.
Carter, J. L. and Wegman, M. N. (1979). Universal
classes of hash functions. Journal of Computer and
System Sciences, 18(2):143–154.
Chaudhuri, S. and Shim, K. (1994). Including Group-By in
Query Optimization. In Proceedings of the Twentieth
International Conference on Very Large Databases,
pages 354–366, Santiago, Chile.
Datta, A., Moon, B., and Thomas, H. (1998). A case for
parallelism in datawarehousing and OLAP. In Ninth
International Workshop on Database and Expert Sys-
tems Applications, DEXA 98, IEEE Computer Society,
pages 226–231, Vienna.
DeWitt, D. J. and Gray, J. (1992). Parallel database systems
: The future of high performance database systems.
Communications of the ACM, 35(6):85–98.
DeWitt, D. J., Naughton, J. F., Schneider, D. A., and Se-
shadri, S. (1992). Practical Skew Handling in Parallel
Joins. In Proceedings of the 18th VLDB Conference,
pages 27–40, Vancouver, British Columbia, Canada.
Hua, K. A. and Lee, C. (1991). Handling data skew in mul-
tiprocessor database computers using partition tuning.
In Lohman, G. M., Sernadas, A., and Camps, R., ed-
itors, Proc. of the 17th International Conference on
Very Large Data Bases, pages 525–535, Barcelona,
Catalonia, Spain. Morgan Kaufmann.
Seetha, M. and Yu, P. S. (1990). Effectiveness of
parallel joins. IEEE Transactions on Knowledge and
Data Engineering, 2(4):410–424.
Shatdal, A. and Naughton, J. F. (1995). Adaptive paral-
lel aggregation algorithms. SIGMOD Record (ACM
Special Interest Group on Management of Data),
24(2):104–114.
Skillicorn, D. B., Hill, J. M. D., and McColl, W. F. (1997).
Questions and Answers about BSP. Scientific Pro-
gramming, 6(3):249–274.
Taniar, D., Jiang, Y., Liu, K., and Leung, C. (2000).
Aggregate-join query processing in parallel database
systems. In Proceedings of The Fourth International
Conference/Exhibition on High Performance Comput-
ing in Asia-Pacific Region HPC-Asia2000, volume 2,
pages 824–829. IEEE Computer Society Press.
Taniar, D. and Rahayu, J. W. (2001). Parallel processing of
’groupby-before-join’ queries in cluster architecture.
In Proceedings of the 1st International Symposium on
Cluster Computing and the Grid, Brisbane, Qld, Aus-
tralia, pages 178–185. IEEE Computer Society.
Tsois, A. and Sellis, T. K. (2003). The generalized pre-
grouping transformation: Aggregate-query optimiza-
tion in the presence of dependencies. In VLDB, pages
644–655.
Valiant, L. G. (1990). A bridging model for par-
allel computation. Communications of the ACM,
33(8):103–111.
Wolf, J. L., Dias, D. M., Yu, P. S., and Turek, J. (1994).
New algorithms for parallelizing relational database
joins in the presence of data skew. IEEE Transactions
on Knowledge and Data Engineering, 6(6):990–997.
Yan, W. P. and Larson, P.-Å. (1994). Performing group-by
before join. In Proceedings of the 10th IEEE Inter-
national Conference on Data Engineering, pages 89–
100. IEEE Computer Society Press.