ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y.,
Atallah, A., Herbrich, R., Bowers, S., and Candela, J. Q.
(2014). Practical lessons from predicting clicks
on ads at Facebook. In Proceedings of the Eighth In-
ternational Workshop on Data Mining for Online Ad-
vertising, pages 1–9.
Jalaparti, V., Bodik, P., Menache, I., Rao, S., Makarychev,
K., and Caesar, M. (2015). Network-aware scheduling
for data-parallel jobs: Plan when you can. In Proceed-
ings of the 2015 ACM Conference on Special Interest
Group on Data Communication, pages 407–420.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017).
ImageNet classification with deep convolutional neural
networks. Commun. ACM, 60(6):84–90.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P.,
and Soricut, R. (2019). ALBERT: A lite BERT for
self-supervised learning of language representations.
CoRR, abs/1909.11942.
Levine, S., Pastor, P., Krizhevsky, A., and Quillen, D.
(2016). Learning hand-eye coordination for robotic
grasping with deep learning and large-scale data col-
lection. CoRR, abs/1603.02199.
Lin, C.-Y., Yeh, T.-A., and Chou, J. (2019). DRAGON: A
dynamic scheduling and scaling controller for manag-
ing distributed deep learning jobs in kubernetes clus-
ter. In International Conference on Cloud Computing
and Services Science (CLOSER), pages 569–577.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). RoBERTa: A robustly optimized BERT pre-
training approach. CoRR, abs/1907.11692.
Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. (2018).
Optimus: An efficient dynamic resource scheduler for
deep learning clusters. In EuroSys, pages 1–14.
Peng, Y., Bao, Y., Chen, Y., Wu, C., Meng, C., and Lin, W.
(2021). DL2: A deep learning-driven scheduler for
deep learning clusters. IEEE Transactions on Parallel
and Distributed Systems, 32(8):1947–1960.
Redmon, J., Divvala, S. K., Girshick, R. B., and Farhadi, A.
(2015). You only look once: Unified, real-time object
detection. CoRR, abs/1506.02640.
Tannenbaum, T., Wright, D., Miller, K., and Livny, M.
(2001). Condor – a distributed job scheduler.
Tian, Y., Pei, K., Jana, S., and Ray, B. (2017). DeepTest:
Automated testing of deep-neural-network-driven au-
tonomous cars. CoRR, abs/1708.08559.
Tumanov, A., Zhu, T., Park, J. W., Kozuch, M. A., Harchol-
Balter, M., and Ganger, G. R. (2016). Tetrisched:
Global rescheduling with adaptive plan-ahead in dy-
namic heterogeneous clusters. In EuroSys, pages 1–
16.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin,
I. (2017). Attention is all you need. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30, pages 5998–6008. Curran Associates, Inc.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S.,
Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H.,
Seth, S., Saha, B., Curino, C., O’Malley, O., Radia,
S., Reed, B., and Baldeschwieler, E. (2013). Apache
hadoop yarn: Yet another resource negotiator. In Pro-
ceedings of Symposium on Cloud Computing.
Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D.,
Tune, E., and Wilkes, J. (2015). Large-scale cluster
management at google with borg. In EuroSys, pages
1–17.
Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra,
N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q.,
Yang, F., and Zhou, L. (2018). Gandiva: Introspective
cluster scheduling for deep learning. In OSDI, pages
595–610.
Xiao, W., Ren, S., Li, Y., Zhang, Y., Hou, P., Li, Z., Feng,
Y., Lin, W., and Jia, Y. (2020). AntMan: Dynamic
scaling on GPU clusters for deep learning. In OSDI,
pages 533–548.
Xu, D., Anguelov, D., and Jain, A. (2017). PointFusion:
Deep sensor fusion for 3d bounding box estimation.
CoRR, abs/1711.10871.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R. R., and Le, Q. V. (2019). XLNet: Generalized
autoregressive pretraining for language understand-
ing. In Wallach, H., Larochelle, H., Beygelzimer,
A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 32, pages 5753–5763. Curran Associates, Inc.
Yeh, T.-A., Chen, H.-H., and Chou, J. (2020). KubeShare:
A framework to manage GPUs as first-class and shared
resources in container cloud. In Proceedings of the
29th International Symposium on High-Performance
Parallel and Distributed Computing, pages 173–184.
Yu, P. and Chowdhury, M. (2019). Salus: Fine-grained
GPU sharing primitives for deep learning applica-
tions. CoRR, abs/1902.04610.
Zoph, B., Cubuk, E. D., Ghiasi, G., Lin, T., Shlens, J., and
Le, Q. V. (2019). Learning data augmentation strate-
gies for object detection. CoRR, abs/1906.11172.
CLOSER 2021 - 11th International Conference on Cloud Computing and Services Science