Authors: M. Al Hajj Hassan and M. Bamha
Affiliation: LIFO, Université d’Orléans, France
Keyword(s): PDBMS, Parallel joins, Data skew, Join product skew, GroupBy-Join queries, BSP cost model.
Related Ontology Subjects/Areas/Topics: Energy and Economy; Load Balancing in Smart Grids; Smart Grids
Abstract:
SQL queries involving join and group-by operations are frequently used in decision support applications. In these applications, the input relations are usually very large, so parallelizing these queries is highly recommended in order to obtain an acceptable response time. The main drawbacks of previously presented parallel algorithms for this kind of query are that they are very sensitive to data skew and incur expensive communication and Input/Output costs when evaluating the join operation. In this paper, we present an algorithm that minimizes the communication cost by performing the group-by operation before redistribution, so that only tuples that will be present in the join result are redistributed. In addition, it evaluates the query without materializing the result of the join operation, thus reducing the Input/Output cost of join intermediate results. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a near-linear speed-up even for highly skewed data.
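
To make the idea of grouping before redistribution concrete, the following is a minimal single-process sketch, not the authors' algorithm. It assumes a hypothetical query of the form SELECT R.k, SUM(R.a * S.b) FROM R, S WHERE R.k = S.k GROUP BY R.k, for which the per-key result equals SUM(R.a) times SUM(S.b), so the join never has to be materialized. The function names, the simulation of processors as Python lists, and the hash-based destination are all assumptions made for illustration. In particular, the abstract does not say how the set of joinable keys is determined; the sketch simply computes it centrally.

```python
from collections import defaultdict


def local_group_sum(fragment):
    """Partial SUM per join key over one processor's (key, value) tuples."""
    sums = defaultdict(int)
    for key, value in fragment:
        sums[key] += value
    return dict(sums)


def groupby_join(r_fragments, s_fragments):
    """Evaluate SUM(R.a * S.b) GROUP BY k over fragmented relations R and S.

    r_fragments[i] and s_fragments[i] are the tuples held by simulated
    processor i. Returns (result, sent), where sent counts the partial
    aggregates that were redistributed.
    """
    p = len(r_fragments)

    # Phase 1: group-by BEFORE redistribution, locally on every processor.
    r_local = [local_group_sum(f) for f in r_fragments]
    s_local = [local_group_sum(f) for f in s_fragments]

    # Phase 2: keys present in both relations. Only these can appear in the
    # join result. (Assumption: computed centrally here for simplicity.)
    r_keys = set().union(*r_local)
    s_keys = set().union(*s_local)
    joinable = r_keys & s_keys

    # Phase 3: redistribute only the partial aggregates of joinable keys.
    r_inbox = [defaultdict(int) for _ in range(p)]
    s_inbox = [defaultdict(int) for _ in range(p)]
    sent = 0
    for local, inbox in ((r_local, r_inbox), (s_local, s_inbox)):
        for partial in local:
            for key, value in partial.items():
                if key in joinable:
                    inbox[hash(key) % p][key] += value
                    sent += 1

    # Phase 4: combine the merged sums per key; no join tuples are produced.
    result = {}
    for dest in range(p):
        for key, r_sum in r_inbox[dest].items():
            result[key] = r_sum * s_inbox[dest][key]
    return result, sent


def naive_groupby_join(r_tuples, s_tuples):
    """Reference: enumerate every joined pair, then aggregate."""
    out = defaultdict(int)
    for rk, a in r_tuples:
        for sk, b in s_tuples:
            if rk == sk:
                out[rk] += a * b
    return dict(out)


# Two simulated processors, each holding a fragment of R and of S.
R_FRAGMENTS = [[(1, 2), (1, 3), (2, 5)], [(1, 4), (3, 7)]]
S_FRAGMENTS = [[(1, 10), (4, 1)], [(1, 20), (2, 2)]]
RESULT, SENT = groupby_join(R_FRAGMENTS, S_FRAGMENTS)
```

In this toy input, keys 3 and 4 occur in only one relation, so their partial aggregates are never sent, and a key that is heavily repeated on one processor contributes a single partial sum from that processor rather than one message per tuple. The shortcut of multiplying the two sums is specific to the assumed aggregate and would not apply to arbitrary GroupBy-Join queries.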