The authors of (Jia et al., 2012) propose the Stargazer framework to build performance models for a simulator running on GPU, so as to correlate several GPU parameters with the simulator execution time. Given the daunting size of the design space, which considers very low-level parameters, they exploit sparse random sampling and iterative model selection, thus building step by step an accurate linear regression model. Another approach is proposed in (Liu et al., 2007), where the authors elaborate a detailed analytical model of general-purpose applications on GPUs, consisting of three general expressions that estimate the time taken by common operations, according to their dependence on data size or computational capabilities. Similar analytical modeling approaches (Baghsorkhi et al., 2010; Zhang and Owens, 2011; Hong and Kim, 2009; Song et al., 2013) rely on micro-architecture information to predict GPU performance. As GPU architectures continue to evolve, the main issue with analytical models is that even minor hardware changes may require extensive work to adapt them.
Given the complexity of GPU hardware (many cores, context switching, memory subsystem, etc.), black box approaches based on machine learning (ML) have recently been favored over analytical models. Indeed, black box techniques allow deriving performance models from data and making predictions without a priori knowledge of the internals of the target system. On the other hand, ML models (Dao et al., 2015; Barnes et al., 2008; Bitirgen et al., 2008; Kerr et al., 2010; Lu et al., 2017; Gupta et al., 2018) require an initial profiling campaign to gather training data. An overview and quantitative comparison of recent analytical and ML-based model proposals is reported in (Madougou et al., 2016).
In this research area, the authors of (Venkataraman et al., 2016) propose Ernest, a black box performance prediction framework for large-scale analytics that relies on experiment design to collect the minimum number of training points. In particular, the work predicts the performance of different business analytics workloads based on Spark MLlib on Amazon EC2 and achieves an average prediction error under 20%. The authors of (Kerr et al., 2010) profile and build models for a range of applications, run either on CPUs or GPUs. Relying on 37 performance metrics, they exploit principal component analysis and regression to highlight the features that are most likely to affect performance on heterogeneous processors. Along the same lines, the authors of (Luk et al., 2009) describe Qilin, a technique for adaptively mapping computation onto CPUs or GPUs, depending on application as well as system characteristics. With this approach, they show an improved speedup with respect to manually associating jobs and resources.
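As a purely illustrative sketch of the feature-screening strategy mentioned above for (Kerr et al., 2010), the snippet below combines principal component analysis with a linear regression to rank profiled metrics by their influence on the measured runtime; the data, the number of components, and the ranking heuristic are assumptions made for this example and are not taken from the cited work.

```python
# Illustrative sketch (not the code of (Kerr et al., 2010)): rank profiled
# metrics by combining PCA with a linear regression on measured runtimes.
# All data below are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 37))       # 200 profiled runs, 37 performance metrics
y = 3.0 * X[:, 0] + 1.5 * X[:, 5] + rng.normal(scale=0.1, size=200)  # runtimes

pca = PCA(n_components=5).fit(X)     # compress the correlated metrics
reg = LinearRegression().fit(pca.transform(X), y)

# Project the regression weights back onto the original metrics to rank them.
influence = np.abs(pca.components_.T @ reg.coef_)
print("most influential metric indices:", np.argsort(influence)[::-1][:5])
```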
Building upon the discussed comparison, the model proposed in this paper adopts ML with only high-level features, such as batch size and number of iterations. On the one hand, this avoids the issues posed by analytical approaches when the underlying hardware architecture changes; on the other, it broadens applicability since, differently from several alternatives available in the literature, there is no need to modify target applications or frameworks in order to instrument their code.
3 END-TO-END MODEL
The per-layer approach described in (Gianniti et al., 2018) uses the computational complexity of each CNN layer to estimate the execution time of the layer's forward or backward pass. This technique is quite general in its applicability; however, the prediction errors tend to increase as more complex networks are considered, since its generality entails some approximations. In the case of a working deployment, it is quite natural to trade off some generality for lower prediction errors, whence the end-to-end method laid out in the following.
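To make the per-layer idea concrete, the sketch below estimates a forward pass as the sum of per-layer costs proportional to each layer's operation count; the layer shapes and the time-per-operation coefficient are hypothetical and do not reproduce the exact formulation of (Gianniti et al., 2018).

```python
# Hypothetical per-layer estimate: approximate the forward time of a CNN as the
# sum of per-layer costs proportional to each layer's multiply-accumulate count.
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a k x k convolution on an h x w x c_in input."""
    return h * w * c_in * c_out * k * k

layers = [
    conv_macs(224, 224, 3, 64, 3),
    conv_macs(112, 112, 64, 128, 3),
    conv_macs(56, 56, 128, 256, 3),
]

seconds_per_mac = 2.5e-12  # hypothetical fitted coefficient for a given GPU
forward_time = sum(seconds_per_mac * ops for ops in layers)
print(f"estimated forward pass: {forward_time * 1e3:.2f} ms")
```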
The basic idea is to extract from historical data, in particular logs of previous runs or traces collected by a monitoring platform, the execution time of the network in its entirety, so as to build a dataset associating these timings with batch sizes and numbers of iterations. It is then possible to apply linear regression to a sample in order to obtain a model specialized for the particular CNN and deployment under consideration, yet capable of predicting performance with high accuracy.
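A minimal sketch of this step is given below, assuming the historical logs have already been parsed into (batch size, number of iterations, measured time) samples; the feature set, including the interaction term, and all numerical values are illustrative rather than the paper's exact design matrix.

```python
# Illustrative sketch: fit a linear model mapping high-level features
# (batch size, number of iterations, and their product) to measured
# end-to-end execution times. All sample values are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per logged run: [batch_size, iterations, batch_size * iterations]
runs = np.array([
    [32,  500, 32 * 500],
    [64,  500, 64 * 500],
    [32, 1000, 32 * 1000],
    [64, 1000, 64 * 1000],
])
times = np.array([41.0, 73.5, 80.2, 144.9])  # measured wall-clock seconds

model = LinearRegression().fit(runs, times)

# Predict the execution time of an unseen configuration.
print("predicted time [s]:", model.predict(np.array([[128, 800, 128 * 800]]))[0])
```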
Deep learning practice usually involves several alternating phases of CNN training and testing. The former iteratively feeds the network with labeled image batches, so that its parameters can change following the direction of the back-propagated gradient, whilst the latter evaluates the CNN's evolving quality in terms of more human-readable metrics, rather than the loss function used for training, but without contributing to the learning of weights and biases. For example, training is generally performed by minimizing a loss function that may be SVM-like or based on cross entropy, while the stopping criterion is likely expressed in terms of classification accuracy or, for unbalanced datasets, F-score. Since training involves back propagation, but testing does not, it is necessary to characterize two different models.
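One simple way to reflect this distinction, sketched below under the same illustrative assumptions as before, is to fit two independent regressions, one on logged training iterations (forward and backward pass) and one on logged test passes (forward only), and to query the model matching the phase being predicted.

```python
# Illustrative sketch: separate linear models for the training phase
# (forward + backward pass) and the test phase (forward pass only).
# All measurements below are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_phase_model(batch_sizes, iterations, measured_times):
    """Fit time ~ batch_size + iterations + batch_size * iterations for one phase."""
    X = np.column_stack([batch_sizes, iterations, batch_sizes * iterations])
    return LinearRegression().fit(X, measured_times)

b = np.array([32, 64, 32, 64])
it = np.array([500, 500, 1000, 1000])
train_model = fit_phase_model(b, it, np.array([41.0, 73.5, 80.2, 144.9]))
test_model = fit_phase_model(b, it, np.array([14.2, 25.1, 27.8, 49.6]))

query = np.array([[128, 800, 128 * 800]])
print("training phase [s]:", train_model.predict(query)[0])
print("test phase [s]:    ", test_model.predict(query)[0])
```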