different applications, the probability can be
converted into relative likelihood levels, such as 1 to
7 to denote extremely low, very low, low, medium,
high, very high, and extremely high, with different
thresholds. The impact of a risk depends on the
context of the application. Since Cloud services are
based on the Virtual Machines hosted on Cloud
hardware resources, in our work of managing the
reliability risk of Cloud services, physical host
failure is considered as the threat to the QoS of a
Cloud service, and the impact is modelled as the
number of Virtual Machines allocated to the
physical hosts that would potentially be affected in
case of physical host failures. In order to fit impacts
into risk calculations they are given a scale, such as
1 to 7, to indicate the level of the impact.
The final risk value is calculated as the likelihood
level multiplied by the impact level, and the product
is then converted into a score on a scale of 1-7 to
indicate the overall risk level.
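The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold values, the linear impact scaling, and the ceiling-based rescaling of the product are hypothetical choices, since the text only says that thresholds differ between applications.

```python
from bisect import bisect_right

# Hypothetical thresholds separating the seven likelihood levels
# (the paper notes that thresholds are application-specific).
LIKELIHOOD_THRESHOLDS = [0.01, 0.05, 0.15, 0.35, 0.60, 0.85]

LEVEL_NAMES = ["extremely low", "very low", "low", "medium",
               "high", "very high", "extremely high"]


def likelihood_level(pof):
    """Map a probability of failure in [0, 1] to a level from 1 to 7."""
    if not 0.0 <= pof <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return bisect_right(LIKELIHOOD_THRESHOLDS, pof) + 1


def impact_level(affected_vms, max_vms):
    """Map the number of affected VMs to a level from 1 to 7.

    Assumption: a linear scale relative to max_vms.
    """
    if max_vms <= 0:
        raise ValueError("max_vms must be positive")
    fraction = min(max(affected_vms / max_vms, 0.0), 1.0)
    return min(7, int(fraction * 7) + 1)


def risk_level(likelihood, impact):
    """Multiply the two levels (1..49) and rescale the product to 1..7.

    Assumption: ceiling division by 7 is used as the rescaling rule.
    """
    product = likelihood * impact
    return -(-product // 7)


if __name__ == "__main__":
    l = likelihood_level(0.5)
    i = impact_level(5, 10)
    r = risk_level(l, i)
    print(LEVEL_NAMES[l - 1], LEVEL_NAMES[i - 1], LEVEL_NAMES[r - 1])
```

Any monotonic mapping could replace the two assumed scaling rules without changing the overall structure of the calculation.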
In order to support large-scale and flexible
Virtual Machine scheduling optimisation, in this
paper we propose an evolutionary Cultural
Algorithm (CA) (Reynolds, 1994) based risk-aware
Virtual Machine allocation algorithm to minimize
the risk of physical host failure. A CA framework
consists of three major components: a population
space, an external belief space, and a communication
protocol that defines the interactions between the
two spaces. Based on these components, a CA
controls a dual interdependent inheritance process
that harnesses the evolution of individuals both at
the macro-evolutionary level, within the belief
space, and at the micro-evolutionary level, within
the population space. Our case study indicates that
this dual interdependent inheritance process can
effectively support scheduling optimisation in a
large-scale search space, in comparison with
traditional Genetic Algorithms.
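To make the three components concrete, the following is a minimal generic CA skeleton in Python. It is a sketch under stated assumptions, not the algorithm proposed in this paper: the function names, the belief space contents (a best-so-far solution plus the per-gene values seen in accepted individuals), the acceptance ratio, the 0.8 influence probability, and the toy objective with hypothetical host failure probabilities are all illustrative choices.

```python
import random


def cultural_algorithm(fitness, n_genes, n_values, pop_size=30,
                       generations=50, accept_ratio=0.2, seed=0):
    """Minimise `fitness` over integer vectors (e.g. VM -> host index)."""
    rng = random.Random(seed)
    # Population space: candidate allocations.
    population = [[rng.randrange(n_values) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # Belief space: best-so-far solution and, for each gene, the set of
    # values observed in accepted (high-performing) individuals.
    belief = {"best": list(min(population, key=fitness)),
              "allowed": [set() for _ in range(n_genes)]}
    n_accept = max(1, int(accept_ratio * pop_size))
    for _ in range(generations):
        population.sort(key=fitness)
        # Acceptance: the top individuals update the belief space.
        accepted = population[:n_accept]
        belief["allowed"] = [{ind[g] for ind in accepted}
                             for g in range(n_genes)]
        if fitness(accepted[0]) < fitness(belief["best"]):
            belief["best"] = list(accepted[0])
        # Influence: one gene per child is redrawn, usually from the
        # values recorded in the belief space, otherwise at random.
        children = []
        for parent in population:
            child = list(parent)
            g = rng.randrange(n_genes)
            if rng.random() < 0.8:
                child[g] = rng.choice(sorted(belief["allowed"][g]))
            else:
                child[g] = rng.randrange(n_values)
            children.append(child)
        # Elitist selection over parents and children.
        population = sorted(population + children, key=fitness)[:pop_size]
    return min(population + [belief["best"]], key=fitness)


# Toy objective (hypothetical values): expected number of affected VMs,
# given an assumed probability of failure for each of three hosts.
HOST_POF = [0.30, 0.05, 0.10]


def expected_affected_vms(allocation):
    return sum(HOST_POF[host] for host in allocation)


if __name__ == "__main__":
    best = cultural_algorithm(expected_affected_vms, n_genes=6, n_values=3)
    print(best, expected_affected_vms(best))
```

The acceptance step corresponds to the flow from the population space to the belief space, and the influence step to the flow in the opposite direction; together they play the role of the communication protocol.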
In Section 2, the historical data based
modelling of the physical host failure threat is
introduced; this provides a basis for assessing the
risk associated with Virtual Machine allocations.
In Section 3, a specific risk mitigation strategy is
identified and formulated as a risk impact
minimisation problem, which is solved using the
search and optimisation mechanisms of an
evolutionary Cultural Algorithm. Section 4
presents the main contribution of the work: the
design and implementation of an effective
Cultural Algorithm to support large-scale and
flexible Virtual Machine scheduling optimisation,
and a demonstration of its performance through
empirical comparison with a traditional
Genetic Algorithm (GA). Section 5 briefly
reviews closely related work on general risk
management frameworks for Cloud service
provision and on approaches specific to Virtual
Machine scheduling. Finally, Section 6 concludes
the current work in progress and discusses future
work.
2 MODELLING PHYSICAL HOST
FAILURE THREAT
In order to calculate the Probability of Failure (PoF)
of a physical host, gathering data relating to the past
and current status of cloud resources is an essential
activity. Monitoring resource failures is crucial in
the design of reliable systems; for example,
knowledge of failure characteristics can be used in
resource management to improve resource availability.
Furthermore, calculating the risk of failure of a
resource depends on past failures as well.
There are various events that cause a resource to
fail. Cloud resources may fail as a result of a failure
of one or more of the resource components, such as
CPU or memory; this is known as hardware failure.
Another event which can result in a resource failure
is the failure of the operating system or programs
installed on the resource; this type is known as
software failure. The third event is the failure of
communication with the resource; this is referred to
as network failure. Finally, another event is the
disturbance to the building hosting the resource,
such as a power cut or an air conditioning failure;
this type of event is known as environment failure.
Sometimes, it is difficult to pinpoint the exact cause
of the failure, i.e. whether it is hardware, software,
network, or environment failure; this is therefore
referred to as unknown failure.
The Time To Fail (TTF) of a physical host is
modelled as a lifetime random variable X whose
value is always greater than zero. Given that the
physical host has been up until time t, its Probability
of Failure (PoF) during a future time interval of
length x is the conditional probability P{X<=t+x|t},
i.e. the probability that X<=t+x given that X>t. In
order to calculate P{X<=t+x|t}, the general
methodology is based on the following 5 steps:
Step 1: Collect observed historical data
representing TTFs;
Step 2: Find a probability distribution model of
TTF of the physical host by data distribution fitting;
Step 3: Estimate the particular parameters of the
risk model by analysing the observations on the
physical host;