Authors:
Thomas Renner, Lauritz Thamsen and Odej Kao
Affiliation:
Complex and Distributed IT Systems, Technische Universität Berlin, Germany
Keyword(s):
Cluster Resource Management, Distributed Data Analytics, Distributed Dataflow Systems, Adaptive Resource Management.
Abstract:
Many distributed data analysis jobs are executed repeatedly in production clusters; examples include daily batch jobs and iterative programs. These recurring jobs present an opportunity to learn workload characteristics through continuous, fine-grained cluster monitoring: based on detailed profiles of resource utilization, data placement, and job runtimes, resource management can adapt to the actual workload. In this paper, we present a system architecture that provides four mechanisms for adaptive resource management, encompassing data placement, resource allocation, and both container and job scheduling. In particular, we extended Apache Hadoop's scheduling and data placement to improve resource utilization and job runtimes for recurring analytics jobs. Furthermore, we developed a Hadoop job submission tool that allows users to reserve resources for specific target runtimes, using historical data from cluster monitoring for its runtime predictions.