DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads
Min-Chi Chiang, Jerry Chou
2021
Abstract
The recent success of deep learning applications is driven by the computing power of GPUs. However, as deep learning workflows become increasingly complicated and resource-intensive, managing expensive GPU resources for Machine Learning (ML) workloads becomes a critical problem. Most existing resource managers focus on a single type of workload, such as batch processing or web services, and lack runtime optimization and awareness of application performance. This paper therefore proposes a set of dynamic runtime management techniques (auto-scaling, job preemption, workload-aware scheduling, and elastic GPU sharing) to handle a mixture of ML workloads consisting of modeling, training, and inference jobs. The proposed system is implemented as a set of extended operators on Kubernetes and is fully transparent to, and compatible with, both application code and deep learning frameworks. Our experiments on AWS GPU clusters show that our approach outperforms native Kubernetes, improving system throughput by 60% and reducing training time by 70% without causing any SLA violations on inference services.
Paper Citation
in Harvard Style
Chiang M. and Chou J. (2021). DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads. In Proceedings of the 11th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-510-4, pages 122-132. DOI: 10.5220/0010483401220132
in Bibtex Style
@conference{closer21,
author={Min-Chi Chiang and Jerry Chou},
title={DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads},
booktitle={Proceedings of the 11th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2021},
pages={122-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010483401220132},
isbn={978-989-758-510-4},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 11th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads
SN - 978-989-758-510-4
AU - Chiang M.
AU - Chou J.
PY - 2021
SP - 122
EP - 132
DO - 10.5220/0010483401220132