processing demands. Currently, many serverless computing platforms provide state-of-the-art cloud computing services, such as AWS Lambda, Google Cloud, Microsoft Azure, and Alibaba Cloud.
MapReduce is currently the most popular model for processing massive amounts of data; it mainly comprises four stages: Map, Partition, Shuffle, and Reduce. MapReduce is widely used for parallel processing and for generating large-scale datasets across distributed systems, for two main reasons. First, it is user-friendly, even for beginners, as it conceals the intricacies of parallelization, fault tolerance, locality optimization, and load balancing. Second, many complex real-world problems, such as word counting and word frequency analysis, are naturally expressible in the MapReduce programming model (Baldini et al., 2017). However, MapReduce is often constrained by how data is transferred between stages. Specifically, because each mapper must complete as quickly as possible, a mapper risks timing out while the reducers are still working, so it is not feasible to transfer data directly between mappers and reducers. In this context, combining serverless computing with the MapReduce framework shows promising application prospects.
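To make this constraint concrete, the sketch below (in Python, assuming AWS Lambda and Amazon S3; the event fields, bucket layout, and handler name are illustrative assumptions rather than the configuration used in this paper) shows a mapper that persists its intermediate key-value pairs to object storage and exits, instead of streaming them directly to a reducer.

import json
import boto3

s3 = boto3.client("s3")

def mapper_handler(event, context):
    # Hypothetical Lambda mapper for word counting. It never streams
    # results to a reducer; it stages them in object storage and exits,
    # because the reducer may not be running yet or may outlive the
    # mapper's timeout.
    bucket = event["bucket"]          # assumed event layout
    chunk_key = event["chunk_key"]    # input chunk assigned to this mapper
    mapper_id = event["mapper_id"]

    text = s3.get_object(Bucket=bucket, Key=chunk_key)["Body"].read().decode()

    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

    # Intermediate file: a reducer later reads the objects under the
    # "intermediate/" prefix, so mappers and reducers never
    # communicate directly.
    s3.put_object(
        Bucket=bucket,
        Key="intermediate/mapper-%s.json" % mapper_id,
        Body=json.dumps(counts),
    )
    return {"status": "ok", "pairs": len(counts)}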
Inspired by these two cutting-edge and mature technologies, this paper focuses on combining them to reduce the time span and increase the efficiency of a word frequency counting task. The paper uses a MapReduce programming model deployed on a serverless computing platform to determine the optimal numbers of Map and Reduce functions. Although it may seem obvious that implementing more map and reduce functions yields higher overall efficiency, this paper's goal is to characterize the trend at which the overall efficiency increases. The results indicate that, for the same workload, as the number of map and reduce functions increases, execution time decreases and the overall efficiency of the program improves, but at different rates. By identifying the optimal numbers of map and reduce functions, this paper aims to help corporations and programmers find the best configurations when applying the MapReduce programming model to their own tasks and workflows.
Focusing on the above aspects, this paper begins with a brief overview of the basic principles of the MapReduce programming model, the operating rules of serverless computing platforms and services, and the overall framework of the experiment (Section 2). The paper then describes the methodology and evaluation, and presents the results of the experiment together with in-depth analysis and conclusions based on the collected research data (Section 3). Lastly, the paper discusses the current drawbacks of the experimental framework, analyses the strengths and weaknesses of the results, and envisions possible solutions and new research directions based on the current experiments; Section 4 also provides a summary.
2 METHOD
2.1 Revisiting MapReduce and
Serverless
In this section, the paper presents a brief overview of the basic principles of the MapReduce programming model as well as the operating rules of the serverless computing platform.
MapReduce. The MapReduce programming model mainly consists of two functions, two phases, and three categories of files. The three categories of files are input files, intermediate files, and output files. The input files contain the data to be processed, the intermediate files hold data produced during MapReduce execution, and the output files hold the final results of the program. The two functions correspond to the two phases: the Map function to the Map phase and the Reduce function to the Reduce phase. The Map function reads data from the input files and processes it into key-value pairs, which are stored in intermediate files. The intermediate files forward these key-value pairs to the Reduce function, where they are sorted, partitioned, and processed into final results that are written to the output files, which are then available and accessible to the user (Jeffrey et al., 2004).
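The following minimal, self-contained Python sketch illustrates these two phases for the word frequency counting task. It is an illustrative toy rather than the distributed implementation evaluated in this paper, with an in-memory shuffle step standing in for the intermediate files.

from collections import defaultdict

def map_function(line):
    # Map phase: turn one line of an input file into (word, 1) key-value pairs.
    return [(word, 1) for word in line.split()]

def reduce_function(key, values):
    # Reduce phase: collapse all values emitted for one key into a final count.
    return key, sum(values)

def mapreduce_wordcount(lines):
    # Map: every input line yields intermediate key-value pairs.
    intermediate = []
    for line in lines:
        intermediate.extend(map_function(line))

    # Shuffle/partition: group the intermediate pairs by key, standing in
    # for the intermediate files handed to the reducers.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each key group becomes one record of the output file.
    return dict(reduce_function(k, v) for k, v in sorted(groups.items()))

if __name__ == "__main__":
    print(mapreduce_wordcount(["the quick brown fox", "the lazy dog"]))
    # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}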
Serverless. The operating rules of a serverless computing platform consist of four main stages: Event Trigger, Function Execution, Function Processing, and Response Return. In the Event Trigger stage, a client running locally on the user's device triggers an event, such as an HTTP request, a file upload, or a message queue event. The trigger then passes the event to the function on the cloud, entering the Function Execution stage. Once the cloud function is triggered, the serverless computing platform dynamically allocates and scales the computing