Authors:
M. Al Hajj Hassan and M. Bamha
Affiliation:
LIFO, Université d’Orléans, France
Keyword(s):
Parallel DataBase Management Systems (PDBMS), Parallel joins, Data skew, Join product skew, GroupBy-Join queries, BSP cost model.
Related Ontology Subjects/Areas/Topics:
Databases and Datawarehouses; Distributed and Parallel Applications; Internet Technology; Web Information Systems and Technologies
Abstract:
SQL queries involving join and group-by operations are common in decision support applications, where the input relations are usually very large, so parallelizing these queries is highly recommended in order to obtain acceptable response times. The most significant drawbacks of the algorithms presented in the literature for treating such queries are that they are very sensitive to data skew and incur expensive communication and Input/Output costs when evaluating the join operation. In this paper, we present an algorithm that overcomes these drawbacks because it evaluates the "GroupBy-Join" query without directly evaluating the costly join operation, thus reducing its Input/Output and communication costs. Furthermore, the performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a linear speedup even for highly skewed data.
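
The abstract does not spell out the algorithm itself. As a rough illustration only, the sequential Python sketch below shows the general idea of answering a GroupBy-Join query without materializing the join: each relation is aggregated by join key first, and the partial results are then combined. The relation layout, the function names, and the choice of SUM grouped by the join attribute are all assumptions made for this example. It is not the authors' parallel algorithm and does not address skew handling or data redistribution.

```python
from collections import Counter, defaultdict

# Hypothetical layout: R holds (key, payload) rows, S holds (key, value) rows.
# Illustrative query: SELECT R.key, SUM(S.value) FROM R JOIN S
#                     ON R.key = S.key GROUP BY R.key

def groupby_join_sum_naive(r_rows, s_rows):
    """Reference version: enumerate every joined pair, then aggregate."""
    totals = defaultdict(int)
    for r_key, _ in r_rows:
        for s_key, s_val in s_rows:
            if r_key == s_key:
                totals[r_key] += s_val
    return dict(totals)

def groupby_join_sum_preaggregated(r_rows, s_rows):
    """Aggregate each relation by key first; the join is never materialized."""
    # Number of R rows per key: each one would pair with every matching S row.
    r_counts = Counter(key for key, _ in r_rows)
    # Partial sum of S.value per key.
    s_sums = defaultdict(int)
    for key, val in s_rows:
        s_sums[key] += val
    # Combine the two small per-key summaries for keys present in both.
    return {k: r_counts[k] * s_sums[k] for k in r_counts if k in s_sums}

R = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
S = [(1, 10), (1, 5), (2, 7), (4, 100)]

naive_result = groupby_join_sum_naive(R, S)
preagg_result = groupby_join_sum_preaggregated(R, S)
```

Both functions return the same per-key sums, but the second touches each input row once and only combines per-key summaries, which is why avoiding the direct join can reduce the amount of data that has to be read, written, and exchanged.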