above, we propose a layered batch inference optimization method for convolutional neural networks on the CPU, referred to as LBCI. LBCI is a fine-grained inference method that converts the traditional "end-to-end" inference scheme into a "layer-to-layer" one. The user preference delay on the server is the most significant constraint when processing inference requests. Hence, when the batch size of the being-processed inference task samples is smaller than the ideal value, LBCI schedules the to-be-processed inference task samples to be batched with the being-processed samples in a running thread, provided the user preference delay is still met. In this way, LBCI makes full use of the parallel capability of the CPU and improves the average throughput of the inference server within the user preference delay.
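To make the idea concrete, the following Python-style sketch (with our own placeholder names such as predict_remaining_time, user_delay, and ideal_batch_size; it is an illustration under stated assumptions, not LBCI's actual implementation) shows how pending samples could be admitted into a running batch at each layer boundary as long as the predicted finish time still meets the user preference delay:

import time

def lbci_schedule(layers, running_batch, pending, user_delay,
                  ideal_batch_size, predict_remaining_time):
    # Sketch only: at each layer boundary, admit pending samples into the
    # running batch while the batch is below the ideal size and the predicted
    # finish time still meets the user preference delay.
    start = time.time()
    for i, layer in enumerate(layers):
        while len(running_batch) < ideal_batch_size and pending:
            elapsed = time.time() - start
            # Admit one more sample only if the enlarged batch is still
            # predicted to finish within the user preference delay.
            if elapsed + predict_remaining_time(len(running_batch) + 1, i) <= user_delay:
                running_batch.append(pending.pop(0))
            else:
                break
        # Run one layer over the whole (possibly enlarged) batch; this sketch
        # glosses over how newly admitted samples catch up on layers 0..i-1.
        running_batch = [layer(x) for x in running_batch]
    return running_batch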
Our contributions can be summarized as follows.
• Our study indicates that the batch scheduling of CNN inference on the CPU can be optimized at the level of layers, not only at the level of the whole model. We also design a novel strategy for predicting the running time under a new batch size by using a running-time ratio lookup table of the already-computed sub-models (see the sketch after this list).
• We propose a layered batch inference optimization method for convolutional neural networks on the CPU, referred to as LBCI, which makes full use of the parallel capability of the CPU and improves the average throughput of the inference server within the user preference delay.
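As a rough sketch of the first point (the table layout and function names below are our own assumptions, not necessarily the paper's exact design), the running time of a sub-model at a new batch size can be predicted by scaling a measured running time with a pre-profiled ratio from the lookup table:

# Hypothetical ratio lookup table: ratio_table[sub_model][batch_size] holds the
# measured running-time ratio of that sub-model at batch_size relative to batch size 1.
ratio_table = {
    0: {1: 1.0, 2: 1.3, 4: 1.9, 8: 3.1},
    1: {1: 1.0, 2: 1.4, 4: 2.1, 8: 3.5},
}

def predict_runtime(measured_time_b1, sub_model, new_batch_size, ratio_table):
    # Predict the running time of an already-computed sub-model at a new batch
    # size by scaling its measured single-sample time with the profiled ratio.
    return measured_time_b1 * ratio_table[sub_model][new_batch_size]

# Example: sub-model 1 took 2.0 ms at batch size 1; predict its time at batch size 4.
print(predict_runtime(2.0, 1, 4, ratio_table))  # -> 4.2 (ms)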
2 RELATED WORK
2.1 Optimizing CNN Inference Task
Scheduling
On homogeneous devices, scheduling optimization focuses on fully utilizing the resources of the device, whereas on heterogeneous devices it usually focuses on partitioning the computing tasks and on the communication between devices. Here we only focus on research on scheduling optimization of CNN inference requests on homogeneous devices. Choi et al. (Choi, 2021) proposed LazyBatching, a batch processing system that supports SLAs (service level agreements) on an NPU simulator. It performs scheduling and batching at the level of nodes in the graph rather than at the level of the entire graph, and improves batch-processing throughput on the NPU simulator. However, this work targets an NPU simulator rather than the CPU platform.
Zhang et al. (Y. Zhang, 2022) proposed ParaX, a CNN task scheduling paradigm of "One-Instance-Per-x-Core", which improves the throughput of multi-core CPUs on DNN training and inference tasks. Since ParaX mainly targets DNN model training, they chiefly consider the impact of the per-instance batch size on training accuracy rather than on the latency and throughput of multi-core CPUs. Wu et al. (X. Wu, 2020) proposed Irina, an online scheduling optimization strategy for the inference of multiple different DNN models on the GPU platform, which reduces delays under unpredictable workloads, effectively shares GPU resources, and minimizes the average inference delay. Irina focuses on scheduling optimization across different DNN model inference tasks.
2.2 Using CPU Multithreading to
Calculate CNN Inference Tasks
The CPU computing modules in mainstream deep learning frameworks such as PyTorch already support multi-threading. The PyTorch deep learning framework can achieve multiple levels of parallelism on the CPU platform (Pytorch, 2019). Liu et al. (Liu, 2019) pointed out that high-performance kernel libraries (such as Intel MKL-DNN (INTEL, 2022) and OpenBLAS (Zhang, 2016)) are usually used to obtain high performance for CNN operations. In the convolution computation, OpenMP parallel directives (OpenMP) are used to realize multi-threaded parallel execution, making full use of hardware resources and greatly reducing computing time. Amazon (Daghaghi, 2021) pointed out that, when using the MXNet framework for CNN inference on the CPU platform, the inference time per image shows a decreasing trend as the number of input images increases, as long as the number of input images stays within a certain range.
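For illustration (our own minimal snippet, not taken from the cited works), PyTorch exposes its CPU parallelism through thread settings, and the per-image cost of a batched forward pass can be measured directly:

import time
import torch
import torch.nn as nn

# Intra-op parallelism: threads used inside a single operator (e.g., a convolution).
torch.set_num_threads(8)

# A small CNN stand-in; any real CNN model would behave similarly.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
).eval()

# Within a certain range, the per-image inference time tends to drop as the
# batch size grows, which is the trend described above.
with torch.no_grad():
    for batch_size in (1, 4, 16, 64):
        x = torch.randn(batch_size, 3, 224, 224)
        start = time.time()
        model(x)
        per_image = (time.time() - start) / batch_size
        print(f"batch={batch_size:3d}  per-image time={per_image * 1000:.2f} ms")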
2.3 Optimizing Batch Processing of
DNN Inference Tasks
Research on batching the DNN inference process began in 2018, when Gao et al. (Gao, 2018) first studied batching for RNN inference. The traditional CNN batch inference method is image-wise batch processing. Wang et al. (Wang, 2020) proposed a layer-wise scheduling method on a CPU processor without parallel optimization. With the layer-wise scheduling method, the images in one batch use the weights of one layer at the same time, reducing the memory accesses and the access delays, as illustrated in the loop-order sketch below.
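A simplified loop-order comparison (our own placeholder sketch, not Wang et al.'s implementation) illustrates the difference between the two schedules:

# Image-wise batch processing: each image traverses all layers before the next
# image starts, so every layer's weights are reloaded once per image.
def image_wise(images, layers):
    outputs = []
    for img in images:
        x = img
        for layer in layers:
            x = layer(x)
        outputs.append(x)
    return outputs

# Layer-wise scheduling: all images in the batch pass through one layer before
# moving to the next, so that layer's weights are loaded once per batch,
# reducing memory accesses and access delays.
def layer_wise(images, layers):
    xs = list(images)
    for layer in layers:
        xs = [layer(x) for x in xs]
    return xs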
In view of the different weight data and memory