MapReduce (MR) (Dean and Ghemawat, 2004; Duda, 2012; Hejmalíček, 2015) became the next step in Data Warehouse evolution. Many technologies based on MR allow implementation of a Data Warehouse (Li et al., 2014). Hive, a major component of the Facebook platform, is one example: it implements a Data Warehouse on top of Hadoop (Huai et al., 2014) by translating each SQL query into a series of MR tasks. Some experiments show that Hive is significantly slower than other methods (Duda, 2012; Zhou and Wang, 2013).
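To illustrate how a single warehouse query turns into several MR tasks, consider the following minimal sketch (our illustration, not Hive's actual query plan; all table, attribute, and function names are hypothetical). A join-plus-aggregation query decomposes into two chained jobs: a reduce-side join followed by an aggregation.

# Hypothetical query: SELECT d.city, SUM(f.amount)
#                     FROM fact f JOIN dim d ON f.d_id = d.id
#                     GROUP BY d.city
# Job 1 performs a reduce-side join; Job 2 performs the aggregation.

def job1_map(table, record):
    # Tag each record with its table of origin so the Reduce task
    # can separate the two sides of the join.
    key = record["d_id"] if table == "fact" else record["id"]
    yield key, (table, record)

def job1_reduce(key, tagged):
    dims = [r for t, r in tagged if t == "dim"]
    facts = [r for t, r in tagged if t == "fact"]
    for d in dims:                  # emit every fact/dimension match;
        for f in facts:             # Job 2 consumes these joined records
            yield {**d, **f}

def job2_map(joined):
    yield joined["city"], joined["amount"]

def job2_reduce(city, amounts):
    yield city, sum(amounts)        # the GROUP BY result

Each job implies a full pass over its input plus a Shuffle phase, which is one source of the overhead observed in the experiments cited above.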
To address the low performance of a Data Warehouse in MR systems, methods were developed that access the Data Warehouse directly in MR, i.e. without additional components. Four such methods (MFRJ, MRIJ, MRIJ on RCFile, MRIJ with big dimension tables) are described in (Zhou and Wang, 2013). They are based on caching the dimension tables in the RAM of each node.
Along with MR Hadoop, MapReduce-like systems such as Spark are also in use; however, their discussion is outside the scope of this paper.
This paper discusses the Multi-Fragment-Replication Join (MFRJ) method (Zhou and Wang, 2013), which, unlike the other methods, accesses an n-dimensional Data Warehouse with a single MR task and avoids extra transfers of the fact table foreign keys (section 3). It is also simple to implement; a sketch of its core idea is given below. Its effectiveness in comparison with the other methods is shown in (Zhou and Wang, 2013).
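The sketch below captures the idea of a replicated map-side star join, assuming the dimension tables are small enough to fit in each node's RAM; the table names and data are illustrative, and the method's actual details are given in (Zhou and Wang, 2013).

# Every Map task caches complete copies of the (small) dimension
# tables in RAM and joins fact records locally, so the whole
# n-dimensional join fits into a single MR job.

DIMENSIONS = {              # dimension name -> {primary key -> row}
    "customer": {1: {"city": "Berlin"}, 2: {"city": "Oslo"}},
    "product":  {10: {"brand": "Acme"}, 11: {"brand": "Zeta"}},
}

def mfrj_map(fact_record):
    """Probe each cached dimension hash table with the matching
    foreign key; emit the joined record for Reduce-side aggregation,
    or nothing if some dimension key has no match."""
    joined = dict(fact_record)
    for name, table in DIMENSIONS.items():
        row = table.get(fact_record[name + "_id"])
        if row is None:
            return                  # inner join: drop the record
        joined.update(row)
    yield joined

# Example: one fact record joined against both cached dimensions.
for out in mfrj_map({"customer_id": 1, "product_id": 10, "amount": 5}):
    print(out)

Because every Map task holds the dimension tables locally, the fact table's foreign keys never cross the network, and no second MR job is required.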
The motivation of this work is the need for Data Warehouse access time forecasts, driven by the intense growth in the number of MR applications. Examples of such applications include:
o Internet application log processing for large internet shops and social networks (Duda, 2012), for service demand analysis.
o Processing of the large data volumes collected by credit organizations, for market behavior forecasts.
o Statistics calculation for large-scale weather forecast processing.
The problem is that large-volume data processing is time-consuming, and the processing time may become unacceptable. Discovering this problem only at the operational stage makes it costly to resolve. First, there are many processing tasks. Second, if the tasks are complex, tuning does not help: the algorithms have to be changed and the Map and Reduce functions recoded, which means redoing already completed work and wasting time and resources. Thus, estimating the processing time for peak load at the design stage, i.e. before MR task implementation, is beneficial.
The importance of modeling can be demonstrated by the following example. Two RDBMSes (column- and row-based) and MR Hadoop were compared experimentally in (Pavlo et al., 2009). The conclusion was that Hadoop loses on the test tasks.
A detailed analysis in (Burdakov et al., 2014) showed that the RDBMS experiments were executed with fewer than 100 nodes, with low data selectivity in the queries, and without sorting or exchanging records of fragmented tables between the nodes. Modeling was then performed with calibrated models and different input parameters (Burdakov et al., 2014). The results showed that Hadoop outperforms the RDBMSes under high selectivity and sorting starting from 300 nodes (6 TB of stored data).
Obviously, building a test stand and running live experiments on a large number of nodes is much more expensive than developing an adequate mathematical model and applying it.
This paper discusses the MFRJ access method to a Data Warehouse (section 3), analyzes the MR workflow (section 4), develops an analytical model for evaluating Data Warehouse query execution time (section 5), and calibrates the model and evaluates its adequacy through experiments (section 6).
2 RELATED WORK
The analytical model developed here for query execution time evaluation is a cost model (Simhadri, 2013). Below we provide an overview of the existing models and point out their disadvantages.
Burdakov et al. (2014) and Palla (2009) model only a two-table join in MR. Palla (2009) evaluates the input/output cost but disregards the processing part. However, as the measurements indicate (e.g. see section 6), the processor load cannot be disregarded. The processing time is considered in (Burdakov et al., 2014); however, the Shuffle algorithm there is simplified.
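For intuition, a cost model of this kind estimates query execution time as a sum of per-phase costs computed from data volumes and calibrated hardware constants. The skeleton below is our illustration (all constants are arbitrary), not the model developed in section 5; dropping the CPU term reproduces the kind of simplification criticized above.

def query_time(bytes_read, bytes_shuffled, bytes_written, cpu_ops,
               disk_bw=100e6, net_bw=100e6, cpu_rate=1e9):
    """Toy cost model: total time as the sum of per-phase costs."""
    t_io = (bytes_read + bytes_written) / disk_bw  # disk read/write cost
    t_net = bytes_shuffled / net_bw                # Shuffle transfer cost
    t_cpu = cpu_ops / cpu_rate                     # processing (CPU) cost
    return t_io + t_net + t_cpu                    # seconds, illustrative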
Afrati and Ullman (2010) propose the following access method for n-dimensional Data Warehouses. During the Map phase each node reads the dimension and fact table records (n + 1 tables). The Map function calculates hash values h(b_i) for the attributes b_i that participate in the join. Each Reduce task is associated with n values {h(b_i)}, and each record is sent to multiple Reduce tasks according to the calculated hash values. The Reduce task joins the received records. The problem of minimizing the number of transferred records is solved for a constant number of Reduce tasks; a sketch of this record-to-reducer routing is given below.
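The sketch below is a minimal illustration of the routing, assuming the Reduce tasks form an n-dimensional grid whose per-attribute shares K are fixed; the share values shown are arbitrary, not the optimum derived by Afrati and Ullman (2010).

from itertools import product

K = (4, 2, 2)   # shares per join attribute; 4 * 2 * 2 = 16 Reduce tasks

def h(value, buckets):
    return hash(value) % buckets    # hash of one join-attribute value

def reducers_for_fact(record, join_attrs):
    # A fact record carries all n join attributes, so it maps to exactly
    # one cell of the n-dimensional grid of Reduce tasks.
    return [tuple(h(record[a], K[i]) for i, a in enumerate(join_attrs))]

def reducers_for_dimension(record, i, key_attr):
    # A dimension record fixes only coordinate i of the grid and must be
    # replicated to every Reduce task sharing that coordinate.
    axes = [range(k) for k in K]
    axes[i] = [h(record[key_attr], K[i])]
    return list(product(*axes))

A record of dimension i is thus replicated to prod(K)/K[i] Reduce tasks; minimizing the total number of such copies, subject to the product of the K[i] equalling the fixed number of Reduce tasks, is the optimization Afrati and Ullman solve.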
This method has the following disadvantages: