
external services, such as databases or storage systems. Consequently, for workflows that generate intermediate state or data, the common practice is to utilize remote storage solutions such as AWS S3 (Amazon, 2006) or MinIO (MinIO, 2016) to hold this intermediate data (Klimovic et al., 2018).
We show this behavior in Figure 1. We define the data download route as the path along which data is downloaded from remote storage to the runtime, and the data upload route as the path along which data is uploaded from the runtime to remote storage. The data download/upload time is the execution time of the corresponding route.
However, this approach leads to several challenges. First, it spends a notable amount of time on data transfer in both download and upload operations, since the network serves as the transmission medium. Second, it consumes storage space, typically cloud storage in a cloud computing setting, to maintain the intermediate data; since this data may grow significantly over time, the cost of renting cloud storage is unavoidable.
In this scenario, both remote database I/O and network conditions become potential bottlenecks (Jonas et al., 2017; Singhvi et al., 2017), resulting in prolonged data transfer times and an overall increase in latency. Furthermore, as the number of pods or functions increases, the demand on the network infrastructure escalates as well.
The primary focus of our research is to optimize data transfer efficiency through data caching. To achieve this, we leverage the local disk available on each node to establish local storage, and we adopt the Least Recently Used (LRU) caching algorithm to cache the intermediate data generated throughout workflow execution in the distributed system.
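As a concrete illustration, the following minimal sketch shows how such a cache could track intermediate-data files on a node's local disk. This is our own sketch rather than the system's actual implementation; the class and method names are hypothetical.

```python
from collections import OrderedDict
import os

class LocalLRUCache:
    """Tracks intermediate-data files on a node's local disk, evicting
    the least recently used file when the capacity budget is exceeded."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # key -> (local path, file size)

    def get(self, key):
        """Return the local path for `key`, or None on a cache miss."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]

    def put(self, key, path):
        """Register a file already written to local disk, evicting as needed."""
        size = os.path.getsize(path)
        if key in self.entries:
            self.used -= self.entries.pop(key)[1]
        while self.entries and self.used + size > self.capacity:
            _, (old_path, old_size) = self.entries.popitem(last=False)  # LRU entry
            os.remove(old_path)
            self.used -= old_size
        self.entries[key] = (path, size)
        self.used += size
```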
In Figure 2, we illustrate the new data transfer route after our optimization. This route is not only significantly faster than the previous one, but it also replaces a portion of the remote network access with local disk access. Note that this route applies when the desired data is cached on the same node; for data cached on other nodes, our caching mechanism provides additional components, which we introduce in later sections.
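Continuing the sketch above, the read path prefers the node-local cache and falls back to remote storage only on a miss; `remote.download` and the `/tmp/cache` directory are assumptions for illustration, not the system's actual interface.

```python
import os

CACHE_DIR = "/tmp/cache"  # hypothetical local cache directory

def read_intermediate(key, cache, remote):
    """Return intermediate data for `key`, using local disk when possible."""
    path = cache.get(key)
    if path is not None:               # cache hit: local disk access only
        with open(path, "rb") as f:
            return f.read()
    data = remote.download(key)        # cache miss: fetch over the network
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_path = os.path.join(CACHE_DIR, key)
    with open(local_path, "wb") as f:  # keep a local copy for future reads
        f.write(data)
    cache.put(key, local_path)
    return data
```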
Figure 3: Data transfer route for current workflows.

Our experimental results demonstrate a remarkable reduction in transfer times, with a decrease of up to 82.55% in our experiments. We also show that the data transfer speedup is positively correlated with the disk/network speed ratio, meaning the optimization improves as disk performance is enhanced or as the network becomes congested.
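To see why this correlation holds, consider a first-order model (our simplification, ignoring per-request overheads) in which transfer time is data size divided by bandwidth; the size then cancels, and the speedup reduces to the disk/network bandwidth ratio.

```python
def transfer_speedup(disk_bw, net_bw, size_mb=100.0):
    """First-order speedup estimate: remote route time / local route time.

    Bandwidths are in MB/s; the data size cancels out, so the result
    equals the disk/network bandwidth ratio (disk_bw / net_bw).
    """
    t_remote = size_mb / net_bw   # transfer over the network
    t_local = size_mb / disk_bw   # read/write on local disk
    return t_remote / t_local

# Illustrative numbers: a 500 MB/s local disk vs. a 100 MB/s network link
print(transfer_speedup(disk_bw=500, net_bw=100))  # -> 5.0
```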
2 SERVERLESS WORKFLOW BACKGROUND
One characteristic behavior of serverless computing is the use of remote storage as centralized storage. We have found that for workflows requiring remote storage, existing serverless platforms spend a significant amount of time on data transfer, i.e., the total time spent downloading and uploading data. In this kind of workflow, both database I/O and the network become bottlenecks, resulting in extended data transfer times and increased overall latency.
We benchmark two real-world data-driven workflows, image processing and video processing, both of which are common in serverless benchmarks. They share a characteristic typical of serverless workflows: intermediate data is used only between functions, and only the initial and final data need to reside in remote storage. Figure 3 shows their workflows as DAGs.
2.1 Image Processing
This is a classic image-recognition workflow. It first downloads an image from remote storage, scales it to a specific size, applies a pre-trained model to obtain the recognition result, and uploads the result back to remote storage.
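As a sketch of its DAG (the node names are illustrative, not taken from the benchmark's code), the workflow is a linear chain in which only the first and last steps touch remote storage:

```python
# Image-recognition workflow as an adjacency list (illustrative names).
image_processing_dag = {
    "download":  ["scale"],      # fetch the input image from remote storage
    "scale":     ["inference"],  # resize to the model's expected input size
    "inference": ["upload"],     # run the pre-trained recognition model
    "upload":    [],             # write the result back to remote storage
}
```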
2.2 Video Processing
This use case comes from Alibaba Cloud (Alibaba, 2021). It first splits video data into small clips, transcodes them in parallel, and merges them back together. The initial data is downloaded from, and the final result is uploaded to, the remote storage.
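Its fan-out/fan-in structure can be sketched the same way (the clip count and node names are our assumptions):

```python
# Video-processing workflow: split fans out to parallel transcode
# functions, and merge joins their outputs (illustrative names).
N = 4  # number of clips; chosen arbitrarily for illustration
video_processing_dag = {
    "split": [f"transcode_{i}" for i in range(N)],
    **{f"transcode_{i}": ["merge"] for i in range(N)},
    "merge": [],
}
```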