execution. Serial execution means that a single thread
executes multiple tasks in a fixed order. Each task must
complete before the next one begins, so no two tasks
overlap in time. Multi-threaded parallel execution
starts several threads at once so that multiple tasks
overlap in time; strictly speaking, parallel means that
the tasks run at the same instant. When the tasks are
independent of one another, parallel execution with
multiple threads is generally faster than serial
execution. Multi-threading is also a convenient and
quick way to achieve parallel task execution.
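The difference between the two modes can be illustrated with a minimal Python sketch. It is not the project's code: `simulated_task`, `run_serial`, and `run_parallel` are hypothetical names, and `time.sleep` stands in for a task that mostly waits, such as a call to a remote function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_task(task_id, delay=0.05):
    """Hypothetical stand-in for one task, e.g. a remote call that waits on I/O."""
    time.sleep(delay)
    return task_id * 2


def run_serial(task_ids):
    # One thread, one task at a time: each task finishes before the next starts.
    return [simulated_task(t) for t in task_ids]


def run_parallel(task_ids, num_threads=4):
    # Several threads at once: the waiting periods of the tasks overlap in time.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as task_ids.
        return list(pool.map(simulated_task, task_ids))


if __name__ == "__main__":
    ids = list(range(8))
    for label, runner in (("serial", run_serial), ("parallel", run_parallel)):
        start = time.perf_counter()
        result = runner(ids)
        print(label, result, f"{time.perf_counter() - start:.2f} s")
```

Both runners return the same results; only the elapsed time differs. In CPython, threads mainly overlap waiting time such as network calls, which is the typical situation when a client invokes cloud functions.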
In addition to the multi-threaded method, another
key point of this project is the dispatch function
deployed on Alibaba Cloud FC. As mentioned in
Section 2.3, the dispatch function handles requests in
HTTP mode. The files listed from the Alibaba Cloud
OSS bucket are sorted by the number of words in each
file and distributed to different threads so that the
total number of words allocated to each thread is as
close as possible. This reduces the time threads spend
idle and improves efficiency and resource utilization.
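One common way to obtain such a balanced allocation is a greedy heuristic: sort the files from largest to smallest and always give the next file to the thread with the smallest running total. The sketch below assumes this heuristic; the text does not state the exact rule the dispatch function uses. `balance_files` and the file names are hypothetical.

```python
import heapq


def balance_files(word_counts, num_threads):
    """Assign files to threads so that per-thread word totals are close.

    word_counts: dict mapping a file name to its word count.
    Returns a list of (total_words, [file names]) tuples, one per thread.
    """
    # Min-heap of (current total, thread index); the lightest thread is on top.
    heap = [(0, i) for i in range(num_threads)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_threads)]
    totals = [0] * num_threads

    # Largest files first, each assigned to the currently lightest thread.
    by_size = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in by_size:
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        totals[idx] = total + count
        heapq.heappush(heap, (totals[idx], idx))

    return list(zip(totals, groups))


if __name__ == "__main__":
    # Hypothetical word counts, for illustration only.
    counts = {"1.txt": 80, "2.txt": 120, "3.txt": 300, "4.txt": 260, "5.txt": 40}
    for total, files in balance_files(counts, num_threads=2):
        print(total, files)
```

The heuristic does not guarantee an optimal split, but it is simple and usually keeps the totals close, which matters here because the later files are much longer than the earlier ones.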
3 EXPERIMENT
3.1 Experimental Settings
This project aims to solve the problem of word
frequency statistics for multiple files using the
MapReduce method. The platform consists of the OSS
and FC services provided by Alibaba Cloud. All code is
written in Python in a Jupyter Notebook. The dataset
consists of 50 English essays, randomly collected from
the experimenter's middle school period, with a total
word count of no less than 5000 words. Note that, in
file-name order, files 1-20 contain relatively few
words, while each of the following 30 files contains
clearly more words than the first 20. The main
evaluation metric is the time taken to count word
frequencies in 10, 20, 30, 40, and 50 files in turn.
Each time is compared with that of single-threaded
serial execution of the same task, in order to study the
efficiency of the MapReduce method combined with
multi-threaded parallel operations.
3.2 Experimental Environment
The MapReduce model can be implemented in many
ways. Given the research questions and equipment
available for this project, and as mentioned above, the
project uses the FC and OSS services provided by
Alibaba Cloud for word frequency statistics, with the
experimenter's laptop serving as the commander. The
parameters of each part of the setup are listed below.
(1) The commander computer runs a 64-bit (x64)
Windows operating system on a 12th-generation Intel
i7 processor.
(2) The network connecting the computer to the
Alibaba Cloud platform and Jupyter Notebook is a
regular home LAN, with a speed of approximately
8.5 MB/s.
(3) Jupyter Notebook, formerly known as IPython
Notebook, is an interactive notebook commonly used
in areas such as data simulation, statistical modeling,
and machine learning.
3.3 Experimental Process
First, the dispatch, mapper, and reducer functions are
deployed on Alibaba Cloud, and the client function,
which serves as the main calling function, is deployed
in the Jupyter Notebook. The client function issues
instructions and calls the cloud functions to execute
the task. When the dispatch function runs, the files to
be counted are divided into several groups based on
the number of threads and the number of words in each
file. As introduced earlier, the dispatch function
decides which files are assigned to each thread that
performs word frequency statistics. Note, however,
that the dispatch function only provides an allocation
scheme for the mapper function that follows, and the
data passed to the mapper function consists only of
file numbers. After receiving the allocation scheme,
the mapper function groups the files according to the
scheme and allocates them to the threads. Each reducer
function is responsible for the word frequency
statistics of one thread. First, the word counts of
each file are output as key-value pairs, and each set
is written to a temporary file in JSON format. Next,
the key-value pairs in these JSON files are read and
aggregated into a final result, which is written to
another file.
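The two stages described above, writing per-group key-value pairs to temporary JSON files and then merging them, can be sketched locally as follows. This is an illustration under stated assumptions, not the deployed cloud functions: `map_group` and `reduce_counts` are hypothetical names, the texts are passed in directly instead of being read from OSS, and the tokenization rule is an assumption.

```python
import json
import os
import re
import tempfile
from collections import Counter


def map_group(texts, out_path):
    """Count words in one group of documents and write the pairs as JSON."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase words, optionally with an apostrophe.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(counts, f)
    return out_path


def reduce_counts(json_paths, result_path):
    """Merge the intermediate JSON files into one final word-frequency table."""
    total = Counter()
    for path in json_paths:
        with open(path, encoding="utf-8") as f:
            total.update(json.load(f))  # adds counts key by key
    with open(result_path, "w", encoding="utf-8") as f:
        json.dump(total, f)
    return total


if __name__ == "__main__":
    groups = [["the cat sat", "the dog"], ["a cat ran"]]
    with tempfile.TemporaryDirectory() as tmp:
        parts = [
            map_group(group, os.path.join(tmp, f"part_{i}.json"))
            for i, group in enumerate(groups)
        ]
        final = reduce_counts(parts, os.path.join(tmp, "result.json"))
        print(dict(final))
```

Keeping the intermediate results in JSON files means the merging step needs only the file paths, not the original texts.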
The time module of Python is imported in the Jupyter
Notebook code. It records the start time when the
three cloud functions are called and the end time
after all execution has completed; the final running
time is the difference between the two.
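The timing procedure can be sketched with a small helper. `timed` and `run_pipeline` are hypothetical names; `run_pipeline` stands in for the client code that calls the three cloud functions, which is not shown in the text.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.time()             # start time, taken just before the call
    result = func(*args, **kwargs)
    end = time.time()               # end time, taken after everything finished
    return result, end - start      # running time = end time - start time


def run_pipeline(num_files):
    """Hypothetical placeholder for calling dispatch, mapper, and reducer."""
    time.sleep(0.01)
    return num_files


if __name__ == "__main__":
    for n in (10, 20, 30, 40, 50):
        _, elapsed = timed(run_pipeline, n)
        print(f"{n} files: {elapsed:.3f} s")
```

For short runs, `time.perf_counter()` from the same module is a higher-resolution alternative to `time.time()`.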