width between host and device is a major concern,
especially when a larger batch size is used to transfer
more data simultaneously. Our experiments show that the
degradation can increase elapsed time by more than 30%,
and this problem is addressed by neither containers nor
virtual machines. Hence, cloud managers must take these
resource usage requirements into account when running
multiple jobs on the same machine.
(4) Finally, the actual severity of the performance
impact varies with the characteristics of the trained
model. If the model is more complex, the network
overhead becomes more important. If the training
dataset becomes larger, the I/O or memory access
overhead becomes more critical. Overall, however, it is
better to use a more lightweight virtualization or
resource orchestration approach and to avoid additional
virtualization layers.
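As an illustration of the recommendation that cloud managers account for host-device bandwidth when co-locating jobs, the following is a minimal sketch of a bandwidth-aware first-fit placement check. It is not the mechanism evaluated in this paper. The names Job, Machine, and place, as well as the bandwidth figures, are hypothetical, and the sketch assumes that each job's bandwidth demand and each machine's usable capacity are known in advance.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # Estimated host-to-device bandwidth demand in GB/s (hypothetical estimate).
    pcie_gbps: float


@dataclass
class Machine:
    name: str
    # Usable host-to-device bandwidth in GB/s (assumed to be measured beforehand).
    pcie_capacity_gbps: float
    jobs: list = field(default_factory=list)

    def pcie_in_use(self) -> float:
        # Sum of the estimated demands of all jobs already placed here.
        return sum(job.pcie_gbps for job in self.jobs)

    def can_host(self, job: Job) -> bool:
        # Admit the job only if the combined demand stays within capacity.
        return self.pcie_in_use() + job.pcie_gbps <= self.pcie_capacity_gbps


def place(job: Job, machines: list):
    """First-fit placement: return the first machine with spare bandwidth, or None."""
    for machine in machines:
        if machine.can_host(job):
            machine.jobs.append(job)
            return machine
    return None


if __name__ == "__main__":
    cluster = [Machine("node-a", 12.0), Machine("node-b", 12.0)]
    for demand in (8.0, 8.0, 8.0, 4.0):
        job = Job(f"train-{demand}", demand)
        target = place(job, cluster)
        print(job.name, "->", target.name if target else "rejected")
```

A real scheduler would also have to weigh GPU memory, CPU, and network demand, and would need the kind of resource usage estimation discussed in the research challenges of this section; the sketch isolates the bandwidth dimension only.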
Besides offering researchers a better understanding
of the impact of resource orchestration on deep learning
applications, we identify the following research
challenges that deserve more attention in the future:
(1) improving network virtualization performance and
reducing overlay network layers in resource
orchestration; (2) providing resource sharing and
control mechanisms for a single GPU device as well as
for the I/O bandwidth between devices and hosts;
(3) developing more accurate resource usage estimation
and performance prediction mechanisms for deep learning
jobs to help cloud providers optimize their job
scheduling and placement decisions.