width between host and device is a major concern,
especially when a larger batch size is used to transfer
more data at once. Our experiments show that this
degradation can increase elapsed time by more than 30%,
and the problem is addressed by neither containers nor
virtual machines. Hence, cloud managers must take these
resource usage requirements into account when running
multiple jobs on the same machine.
(4) Finally, the actual severity of the performance
impact varies with the characteristics of the model
being trained. If the model is more complex, the network
overhead becomes more important. If the training dataset
is larger, the I/O or memory access overhead becomes
more critical. Overall, however, it is better to use a
more lightweight virtualization or resource orchestration
approach and to avoid additional virtualization layers.
Besides offering researchers a better understanding
of the impact of resource orchestration on deep learning
applications, we identify the following research
challenges that deserve more attention in the future:
(1) improving network virtualization performance and
reducing overlay network layers in resource
orchestration; (2) providing resource sharing and control
mechanisms for a single GPU device as well as for the
I/O bandwidth between devices and hosts; (3) developing
more accurate resource usage estimation and performance
prediction mechanisms for deep learning jobs to help
cloud providers optimize their job scheduling and
placement decisions.
