SELF-LEARNING PREDICTION SYSTEM FOR OPTIMISATION

OF WORKLOAD MANAGEMENT IN A MAINFRAME

OPERATING SYSTEM

Michael Bensch

, Dominik Brugger

, Wolfgang Rosenstiel

, Martin Bogdan

1,2

, Wilhelm Spruth

1,2

Department of Computer Engineering, Tübingen University, Sand 13, Tübingen, Germany

Department of Computer Engineering, Leipzig University, Johannisgasse 26, Leipzig, Germany

Peter Baeuerle

IBM Germany Development Lab, Schönaicher Str. 220, 71032 Böblingen, Germany

Keywords: Workload management, time series prediction, neural networks, feature selection.

Abstract: We present a framework for extraction and prediction of online workload data from a workload manager of

a mainframe operating system. To boost overall system performance, the prediction will be incorporated

into the workload manager to take preventive action before a bottleneck develops. Model and feature

selection automatically create a prediction model based on given training data, thereby keeping the system

flexible. We tailor data extraction, preprocessing and training to this specific task, keeping in mind the non-

stationarity of business processes. Using error measures suited to our task, we show that our approach is

promising. To conclude, we discuss our first results and give an outlook on future work.

1 INTRODUCTION

In a large system environment in mainframe

computing like the IBM z-Series many different

workloads of different types compete for the

available resources. The types of workload cover

things like online transaction processing, database

queries or batch jobs as well as online timesharing

users. The resources that are needed by these

workloads are hard resources like CPU capacities,

main memory or I/O channels as well as soft

resources like available server processes that serve

transactions (Vaupel et al., 2004).

Every installation wants to make the best use of

its resources and maintain the highest possible

throughput. To make this possible the Workload

Manager (WLM) was introduced into the z/OS

operating system (Aman et al., 1997). With the

Workload Manager, one defines performance goals

and assigns a business importance to each goal. The

user defines the goals for work in business terms,

and the system decides how many resources, such as

CPU and storage, should be given to it in order to

meet the predefined goal. The Workload Manager

will constantly monitor the system and adapt

processing to meet the goals.

Due to the growth and rapid change of the

workload requirements in today’s information

processing, the challenge for workload management

is to assign the required resource to the correct

workload exactly when it is needed – or even before

it is needed – to avoid any delays. This is especially

important as workload management nowadays not

only assigns available resources, but increasingly

provides additional resources when they are needed.

Any over- or underprovisioning is a direct cost

factor.

Thus, a load prediction system that adapts itself

to each customer's specific behaviour is required.

Based on such load predictions the z/OS operating

system could provide and assign the right resources

to the important workload right in time instead of

waiting until they are needed or even longer (Bigus

et al., 2000) to improve performance and throughput

while optimising resources usage and minimising

costs.

This paper describes the project and the results

achieved regarding the prediction quality of the self-

212

Bensch M., Brugger D., Rosenstiel W., Bogdan M., Spruth W. and Baeuerle P. (2007).

SELF-LEARNING PREDICTION SYSTEM FOR OPTIMISATION OF WORKLOAD MANAGEMENT IN A MAINFRAME OPERATING SYSTEM.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - AIDSS, pages 212-218

DOI: 10.5220/0002392102120218

 SciTePress

learning system. One has to take into account that

the process being modelled is non-stationary. Future

investigations will consider questions of re-learning

when the customer's specific behaviour is changing,

and how far explicit knowledge can be incorporated

to improve the prediction quality.

The paper is organised as follows: Section 2

introduces the Workload Manager responsible for

the resource allocation which will be influenced by

the load prediction. Section 3 explains the data

extraction and training process we use. Results are

presented in Section 4, followed by a summary in

Section 5.

2 WORKLOAD MANAGER

The z/OS Workload Manager (WLM) is a

functionality of the base control program of the z/OS

operating system for IBM Mainframes. Its basic

functions are dynamic allocation and re-allocation of

system resources to the different workloads running

on clusters of z/OS driven systems, called Sysplex.

Resources could be dispatching priorities of CPU

access, real storage, software servers of transactions

and so forth. They are permanently allocated and re-

allocated between the workloads based on a policy

of business goals that was defined by the customer.

Units of work could be things like batch jobs, online

transactions like CICS, DB2 or web-transactions, or

interactive TSO work (see Figure 1).

The goals are defined in different service classes.

Each incoming workload is classified into a service

class, based on customer defined classification rules.

An example is shown in Figure 2. Each service class

has an importance between 1 (very important) and 5

(very unimportant) or it is discretionary.

The WLM tries to fulfil each service class’s goal

by constantly allocating/reallocating resources to the

workloads. More important goals are preferred over

less important ones, when they compete for

resources. The central measurement parameter of

goal achievement is the so-called Performance Index

(PI) for each service class.

A PI of 1 means, that the goal of this service

class is exactly met. If the goal is over achieved, the

PI becomes less than 1. A PI > 1 indicates that the

goal is not achieved.

3 MODUS OPERANDI

An overview of our data processing, feature

selection, training, and prediction method is given in

Figure 3. Detailed explanations of these methods can

be found in the following subsections.

3.1 Data

The data gathering and feature extraction steps are

specific to z/OS. The preprocessing methods are

universal and applicable to other domains as well.

3.1.1 Gathering

The prediction of the PI of a selected service class

must be based on those input data, which have the

biggest influence on the achievement of the

predefined goal of that service class. One important

goal of this project was to find out, which features of

Figure 1: z/OS Workload Manager.

Figure 3: Classification of workloads into service classes.

Figure 2: Overview of the training and prediction method.

Preprocessing

Manual

Feature

Extraction

alidate

Model

Train Selected

Model

on all Data

Feature

Selection

Train

Model

Data

Gathering

Predict

Workload

Test Error

Interpret

Result

Model Selection

SELF-LEARNING PREDICTION SYSTEM FOR OPTIMISATION OF WORKLOAD MANAGEMENT IN A

MAINFRAME OPERATING SYSTEM

213

the data are the most relevant for the prediction

quality.

Some of the candidates for those data features

are:

• The historical time series of the PI that we

want to predict itself.

• The time series of the PIs of other service

classes.

• % CPU usage (% of total time where CPU

was active)

• % workflow (% execution speed)

• % delay (% of time where processes waited

for resources)

• Number of (active) users

• Unreferenced interval count (UIC, measure

of memory contention)

• Transaction ended rate

• Number of CPUs

Those data, the so called SMF Data, were gathered

on repeating intervals by using a standard add-on

product for z/OS, RMF (Resource Measurement

Facility). RMF is able to gather data from each

system of a Sysplex and consolidate the data

Sysplex-wide into one data pool. It can make those

consolidated data available through a TCP/IP

interface by a component called RMF DDS

(Distributed Data Server).

In order to make the data available to the

prediction system we have developed a JAVA based

program, DDSMonitor, that takes the data from

DDS and does some preprocessing and feature

extraction. The architecture of the data gathering

system can be seen in Figure 4.

3.1.2 Preprocessing

The data preprocessing has three major objectives:

- Data reduction

- Data smoothing

- Data scaling

Data reduction and smoothing were done by

average downsampling and calculating a moving

average.

Average downsampling means that the data

points of n intervals are replaced by one data point,

which represents the average of n intervals. This

reduces the data to 1/n. As a trade-off between data

reduction and maintaining enough data precision a

value of n=5 was obtained empirically and used.

Calculating a moving average means that each

data point is replaced by the average of itself plus

the next m data samples. This smoothens the data

series curves and reduces coincidental deviations

from the main slope. A value of m=5 was used.

Figure 5 shows the effect of applying down-

sampling and moving average to the original data for

the “% CPU usage” time series.

To obtain a limited range of data values as input

to the prediction algorithm, a hyperbolic tangent

function is used to scale the data d (see Equation 1).

))4.1(5.0tanh(2 −⋅

⋅

dscaled

(1)

3.1.3 Manual Feature Extraction

After the preprocessing step, we use our domain

knowledge to manually discard some of the features

which are known to be irrelevant for the PI

prediction. This step speeds up the model selection

phase, which includes an automated feature selection

step (described in the following subsection).

3.2 Feature Selection

To reduce the number of features of the training data

(which amounts to reducing the complexity of the

problem), we employ Recursive Feature Elimination

(Guyon et al., 2002), which is based on Support

Vector methods. We have already employed feature

DDSMonitor

Distributed Data Server (DDS)

Sysplex Data Server

RMF Data Gatherer

TCP/IP

Figure 4: Resource Measurement Facility (RMF).

12 24 36

100

Percent CPU usage

Hours

12 24 36

100

Hours

Percent CPU usage

Figure 5: Original "% CPU usage" time series (left) an

after downsampling and applying moving average (right).

ICEIS 2007 - International Conference on Enterprise Information Systems

214

selection successfully for quality control problems

(Bensch et al., 2005). Since every load configuration

is individual depending on the customer, this step

should be repeated whenever load characteristics

have changed significantly. The selected features

(system parameters) are considered to be relevant for

predicting the PI.

As can be seen in Figure 6, many feature types

are highly correlated over all service classes (dark

bands) and thus do not contribute any additional

information to the prediction. Ideally, the feature

selection algorithm will retain only one feature of

such a highly correlated set.

Investigations showed that it suffices to merely

take into account the features belonging to the same

service class as the PI to be predicted (see Figure

11). An example of the three most important features

to predict the PI (except the PI itself) which were

selected by Recursive Feature Elimination is shown

in Figure 7.

3.3 Self-learning Models

To predict the PI time series, various prediction

models are compared. Momentarily two artificial

neural network (ANN) types are used, a feedforward

topology and the FlexNet. In addition, Support

Vector Regression (SVR) is employed. These

prediction methods are introduced in the following

subsections.

3.3.1 Feedforward Network

We use a standard multi-layer feedforward network

(number of layers and neurons optimised by model

selection) with a backpropagation learning method

(Rumelhart et al., 1986). A regularised error

measure is used to prevent overfitting, displayed in

Equation 2.

mswmsemsereg ⋅−

⋅

)1(

(2)

mse refers to the typically used mean squared error

of prediction to true value during training, msw is

the mean of the sum of squares of the network

weights and biases and γ is the regularisation

parameter. Our model selection generally found

values for γ in the range 0.8 ≤ γ < 1. To speed up

convergence, we use conjugate gradient descent with

Polak-Ribière updates of the weights (Polak and

Ribière, 1969). Figure 8 shows a typical configura-

tion of the feedforward network used.

3.3.2 FlexNet Network

FlexNet is a flexible, easy to use neural network

construction and training algorithm. Starting with

input and output neuron layers only, the network

structure is incrementally defined during the training

process. An example can be seen in Figure 9.

FlexNet determines the best suited position/layers

for competing groups of candidate neurons in the

current network before adding them. As this allows

5 10 15 20 25 30 35 40 45 50

-0.8

-0.6

-0.4

-0.2

0.2

0.4

0.6

0.8

Figure 6: Cross-correlation of features numbered 1 to 50.

A dark block indicates high correlation of the

corresponding feature on the y-axis with another feature

on the x-axis. Thus, the diagonal shows a feature's

correlation with itself.

24 48 72

-1

-0.5

0.5

24 48 72

-1

-0.5

0.5

Hours

TCP U%

TAEndedRateSCP_CICS

UsingSCP% _BA TCH

LPI_BATCH

Figure 7: Performance index of one week (top) with 3

important features selected for the prediction (bottom).

PI (t, ...)

CPU (t,

...)

Usage (t,

...)

Inp

ut:

Performance

Index

Series

Neural

Network

Outp

ut:

Figure 8: Feedforward network.

SELF-LEARNING PREDICTION SYSTEM FOR OPTIMISATION OF WORKLOAD MANAGEMENT IN A

MAINFRAME OPERATING SYSTEM

215

for new neurons being added to both, new and

existing layers, the created networks are not

necessarily as deep and narrow as networks

constructed by Cascade Correlation. FlexNet creates

networks with as many layers and hidden units as

needed to solve a given problem (Mohraz and

Protzel, 1996). The FlexNet allows us to skip some

of our model selection steps (see Subsection 3.4.1).

3.3.3 Support Vector Regression

A standard ν-SVR algorithm was compared to the

ANN approaches. First results did not reach the

prediction accuracy of the ANN methods, as the

regression did not predict the periods of high

fluctuation of the PI. The reason for this will be the

object of further investigation.

3.4 Training Process

Our training process consists of two phases: A

model selection phase, in which we determine the

prediction model best suited to the data available,

and a final training phase in which the prediction

method is trained for application to a future PI time

series. Note that due to the non-stationarity of the

underlying business processes, model selection and

final training have to be repeated at regular intervals.

The following two subsections explain the training

phases.

3.4.1 Model selection

Model selection is a vital step for the accuracy of the

later prediction. A review is given by (Kearns et al.,

1997). We estimate the prediction error of various

prediction models for yet unknown data by a

modified cross-validation. This method keeps the

training data in chronological order and is called

sliding window validation (see Figure 10). The

models we tested were combinations of hyper-

parameters for the prediction methods (number of

input/hidden neurons and γ for FF-net, ν and C for

SVR), and the size of the input data window. The

FlexNet does not require model selection for the

number of input/hidden neurons since it develops its

topology automatically.

The model with the lowest validation error is

used for the final training.

3.4.2 Final Training

The final training phase consists of training the

model on all available data that were collected from

the system. Typical load characteristics are repeated

on a daily basis. However, their details differ by

some amount from day to day. The first four

weekdays were used as a training set to predict the

fifth weekday, which thus was the test set. For

online operation we would use all five days as

training set.

3.5 Prediction Method

Our aim was to predict system workload over a time

of 7 hours, i.e. 50 samples (after preprocessing, one

sample holds data of 500 seconds). The system was

trained with data vectors of length m = w × f,

whereby w is the length of the input window,

typically 20-40 samples, and f is the number of

features (up to 5).

Results are evaluated with two error measures:

The number of predicted samples outside a

predefined error band (Prediction Error or PE,

defined in accordance with WLM experts), and the

mean average percentage error (MAPE) commonly

used for evaluating electricity load predictions

(Ortiz-Arroyo et al., 2005).

Figure 9: FlexNet topology.

Output

Hidden Unit 1

Input1

Input2

Bias +1

Hidden Unit 2

train

val.

train

val.

train

val.

10x

Length of PI measurement

testtrain

Figure 10: Sliding window validation.

ICEIS 2007 - International Conference on Enterprise Information Systems

216

4 RESULTS

We present results for two data sets. Dataset A is a

synthetic simulation of a 3-week load scenario and

dataset B is a real-time measurement of system

parameters during a realistic load scenario. All

results were obtained with smoothed data as

explained in Subsection 3.1.2.

We first examined whether the prediction

accuracy stays constant when only features from the

batch to be predicted (batch2a in this example) are

used. Apart from the fact that the use of more than

1-2 additional features actually decreased the

prediction accuracy, we could ascertain that there is

only a minimal loss when limiting the features to a

particular batch, as can be seen in Figure 11.

The FlexNet returns better results on the

synthetic data, whereas the FFN achieved

improvement on the real data. Figure 12 shows the

test result with a FFN trained on the first 4 days,

using a 40-sample time lag, 20 hidden neurons,

γ = 0.7 and a prediction over 50 samples into the

future. Figure 13 zooms in on the last day to display

the prediction and the fault tolerance band we use to

evaluate the result.

Regarding the PE and MAPE, the best results

achieved with a single PI as training input and with

additional features for a 7-hour prediction (50

samples) are displayed in Table 1.

Table 1: Results for 50-sample prediction.

Features Data PE MAPE Method

A 28 170 FlexNet

PI only

B 52 20.7 FFN

A 54 57.9 FlexNet

Many features

B 34 12.6 FFN

Table 2 shows the same results for an 8-minute

(single-sample) prediction. As expected, the results

for single-sample prediction are better than for the

50-sample prediction. Note that prediction over 50

samples is generally considered to be a difficult

problem. In general, employing feature selection to

select the best three features and adding these to the

PI, improved prediction results.

Table 2: Results for single-sample prediction.

Features Data PE MAPE Method

A 2.5 190 FlexNet

PI only

B 13 6 FFN

A 5.6 28 FlexNet

Many features

B 14.4 6.7 FlexNet

Figure 11: PE error (round markers) and MAPE error

(square markers) in percent for various numbers o

features shown on x-axis. The use of predicted batch2a

features only (dotted lines) is compared to the use o

features from all batches (solid lines).

1 2 3 4 5 6 7 8 9 10

100

Number of features used for prediction

Error

PE batch2a

PE all batches

MAPE batch2a

MAPE all batches

Figure 12: Performance Index time series over three days

(dashed line) with prediction of last day (solid line).

24 48 72

0.4

0.6

0.8

1.2

1.4

1.6

1.8

2.2

2.4

Hours

Performance Index

PI timeseries

Prediction

Figure 13: 50-sample prediction (corresponds to 7 hours)

for dataset B shown over the course of one day, with a

fault tolerance band.

6 12 18 24

0.4

0.6

0.8

1.2

1.4

1.6

1.8

2.2

2.4

Hours

Performance Index

Fault tolerance

PI prediction

SELF-LEARNING PREDICTION SYSTEM FOR OPTIMISATION OF WORKLOAD MANAGEMENT IN A

MAINFRAME OPERATING SYSTEM

217

5 CONCLUSION AND FURTHER

WORK

Early results seem promising, as the system could

predict the tendencies of the workload behaviour as

it changed over time. However, we will shorten the

prediction horizon to gain higher prediction

accuracy in future. A prediction horizon of 2-4

hours, rather than the 7 hours used so far, is

sufficient for the future applications that are planned

to exploit that prediction system. We also plan to

investigate the inclusion of calendaring data into the

prediction method, to include prior knowledge about

load peaks on special days of the month.

The results of this project show that it is possible

to predict the PI, as a relevant performance

indicator, of a complex mainframe system cluster,

like a z/OS Sysplex. This enables us to develop

several functionalities that do resource assignment

or resource provisioning to workloads right in time.

This can avoid resource contention, like CPU or

memory usage, significantly. Especially in big

mainframe environments with very large numbers of

competing workloads this can improve the

throughput and optimal resource usage significantly

and thus optimise data processing costs.

Further work in this area is to analyse and

develop an automatic relearning environment for the

prediction of non-stationary processes. This way, the

prediction system will be able to permanently

improve its prediction quality for each customer by

learning more and more about its particular

environment and workload behaviour. As an

additional benefit, relearning adjusts the prediction

to changes in typical customer workloads, e.g. when

business related changes take effect.

We expect further improvements, when

additional explicit knowledge is incorporated, e.g.

changes or higher workload at special days of the

month or upcoming events.

No other results in the field of operating system

workload management prediction are known to us,

hindering the comparison of our early results with

others.

ACKNOWLEDGEMENTS

The authors thank Clemens Gebhard and Sarah

Kleeberg for their participation in this project.

REFERENCES

Aman, J., Eilert, C. K., Emmes, D., Yocom, P.,

Dillenberger, D., 1997. Adaptive algorithms for

managing a distributed data processing workload. In

IBM Systems Journal, 36(2): 242-283.

Bensch, M., Schröder, M., Bogdan, M., Rosenstiel, W.,

2005. Feature Selection for High-Dimensional Indus-

trial Data. In Proceedings of European Symposium on

Artificial Neural Networks (ESANN), Bruges.

Bigus, J. P., Hellerstein, J. L., Jayram, T. S., Squillante,

M. S., 2000. AutoTune: A Generic Agent for Auto-

mated Performance Tuning. In Practical Application

of Intelligent Agents and Multi Agent Technology.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene

Selection for Cancer Classification using Support

Vector Machines. Machine Learning, 46(1): 389-422.

Kearns, M., Mansour, Y., Ng, A. Y., Ron, D., 1997. An

Experimental and Theoretical Comparison of Model

Selection Methods. Machine Learning, 27(1): 7–50.

Mohraz, K., Protzel, P., 1996. FlexNet: A Flexible Neural

Network Construction Algorithm. In Proceedings of

European Symposium on Artificial Neural Networks

(ESANN), Bruges.

Ortiz-Arroyo, D., Skov, M. K., Huynh, Q., 2005. Accurate

Electricity Load Forecasting with Artificial Neural

Networks. In Computational Intelligence for

Modelling, Control and Automation, 1: 94-99.

Polak, E., Ribiere, G., 1969. Note sur la Convergence de

Methodes de Directions Conjugees. Revue Francaise

d'Informatique et de Recherche Operationnelle, 3: 35-

43.

Rumelhart, D., Hinton, G., Williams, R., 1986. Learning

internal representations by error propagation. In D.

Rumelhart and J. McClelland, editors, Parallel

Distributed Processing, pages 318-362. MIT Press.

Vaupel, R., Teuffel, M., 2004. Das Betriebssystem z/OS

und die zSeries – Die Darstellung eines modernen

Großrechnersystems. ISBN 3-486-27528-3, Olden-

bourg Wissenschaftsverlag, Germany.

ICEIS 2007 - International Conference on Enterprise Information Systems

218