REFERENCES
Abadi, M. et al. (2016). TensorFlow: A system for large-
scale machine learning. In Proceedings of the 12th
USENIX Conference on OSDI, pages 265–283.
Amaral, M., Polo, J., Carrera, D., Seelam, S. R., and Stein-
der, M. (2017). Topology-aware GPU scheduling for
learning workloads in cloud environments. In Proceedings of Supercomputing (SC), pages 17:1–17:12.
Bao, Y., Peng, Y., Wu, C., and Li, Z. (2018). Online job
scheduling in distributed machine learning clusters.
CoRR, abs/1801.00936.
Burns, B., Grant, B., Oppenheimer, D., Brewer, E., and
Wilkes, J. (2016). Borg, Omega, and Kubernetes. ACM
Queue, 14:70–93.
Lin, C.-Y. (2019). DRAGON: Deep Learning
with Auto-scale and Gang-schedule On Kubernetes.
https://github.com/ChanYiLin/tf-operator-Dragon/.
De Sa, C., Feldman, M., Ré, C., and Olukotun, K.
(2017). Understanding and optimizing asynchronous
low-precision stochastic gradient descent. In Proceed-
ings of the 44th Annual ISCA, pages 561–574.
Dean, J. et al. (2012). Large scale distributed deep net-
works. In Proceedings of the 25th International Con-
ference on Neural Information Processing Systems,
pages 1223–1231.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and
Gibbons, P. B. (2017). Proteus: agile ML elastic-
ity through tiered reliability in dynamic resource mar-
kets. In Proceedings of the Twelfth EuroSys Conference, pages
589–604.
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,
Joseph, A. D., Katz, R., Shenker, S., and Stoica, I.
(2011). Mesos: A Platform for Fine-grained Resource
Sharing in the Data Center. In Proceedings of the 8th
USENIX Conference on NSDI, pages 295–308.
IBM (2018). Fabric for Deep Learning (FfDL).
https://github.com/IBM/FfDL.
Jeon, M., Venkataraman, S., Phanishayee, A., Qian, J.,
Xiao, W., and Yang, F. (2018). Multi-tenant GPU
Clusters for Deep Learning Workloads: Analysis and
Implications. Microsoft Research Technical Report.
Jette, M. A., Yoo, A. B., and Grondona, M. (2002). SLURM: Simple Linux Utility for Resource Management. In Proceedings of Job Scheduling Strategies for Parallel Processing, pages 44–60. Springer-Verlag.
Krizhevsky, A. (2014). One weird trick for parallelizing
convolutional neural networks. CoRR, abs/1404.5997.
Kubeflow (2017). The machine learning toolkit for Kubernetes. https://www.kubeflow.org/.
Li, M. et al. (2014). Scaling distributed machine learning
with the parameter server. In Proceedings of the 11th
USENIX Conference on OSDI, pages 583–598.
Mayer, R., Mayer, C., and Laich, L. (2017). The TensorFlow partitioning and scheduling problem: It’s the critical path! In Proceedings of the Workshop on Distributed Infrastructures for Deep Learning, pages 1–6.
Microsoft (2016). Open Platform for AI (OpenPAI).
https://github.com/Microsoft/pai.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen,
R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and
Dean, J. (2017). Device placement optimization with
reinforcement learning. CoRR, abs/1706.04972.
Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). Hog-
wild!: A lock-free approach to parallelizing stochastic
gradient descent. In Proceedings of the 24th Interna-
tional Conference on Neural Information Processing
Systems, pages 693–701.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in PyTorch.
Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. (2018).
Optimus: an efficient dynamic resource scheduler for
deep learning clusters. In Proceedings of the Thir-
teenth EuroSys Conference, pages 3:1–3:14.
RiseML (2017). Machine learning platform for kubernetes.
https://riseml.com/.
Sergeev, A. and Balso, M. D. (2018). Horovod: fast and
easy distributed deep learning in TensorFlow. CoRR,
abs/1802.05799.
Vavilapalli, V. K. et al. (2013). Apache Hadoop YARN: Yet An-
other Resource Negotiator. In Proceedings of the 4th
Annual Symposium on Cloud Computing, pages 5:1–
5:16.
Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra,
N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q.,
Yang, F., and Zhou, L. (2018). Gandiva: Introspec-
tive Cluster Scheduling for Deep Learning. In Proceedings of the 13th USENIX Conference on OSDI, pages 595–610.
You, Y., Gitman, I., and Ginsburg, B. (2017). Scaling
SGD batch size to 32K for ImageNet training. CoRR,
abs/1708.03888.
Yu, D. et al. (2014). An introduction to computational net-
works and the computational network toolkit. Mi-
crosoft Technical Report.
Zhang, W., Gupta, S., Lian, X., and Liu, J. (2016).
Staleness-aware async-SGD for Distributed Deep
Learning. In Proceedings of the Twenty-Fifth Interna-
tional Joint Conference on Artificial Intelligence, IJ-
CAI’16, pages 2350–2356.