A Full Private VM Pool Capacity Failure occurs
when the resources of a private cloud are fully
utilized, i.e. when all VMs are allocated to other cloud
applications. Such a failure arises when the
infrastructure of the private cloud is overloaded.
Such a situation impacts both the cloud provider and
cloud application providers. The former will not be
able to meet one of its core needs, scalability, while
the latter will find that their needs are unmet.
To detect a full private VM pool capacity failure,
it is necessary to monitor the number of idle VMs in
the cloud. If no idle VMs (or only a limited number)
are available in the private cloud, then mitigating steps
should be taken. In terms of recovery, when only a
private cloud is used, there is no solution for
recovery from this type of failure. The cloud
provider will often simply reject any request for
VMs needed for new cloud applications until VMs
become available. The cloud provider can solve this
issue by scaling the infrastructure out (i.e. adding
new physical resources), but this solution can cause
optimization issues in the longer term, i.e. the
infrastructure may subsequently be underutilized.
On the other hand, a hybrid cloud solution is more
efficient in terms of time and resource utilization.
Cloud providers can scale their cloud infrastructure
dynamically based on demand. This provides an
opportunity to scale up and down based on need, so
the utilization of resources can be optimized.
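The detection and placement logic described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the names `PoolState`, `classify_pool` and `place_request`, and the `low_watermark` parameter (the "limited number" of idle VMs), are assumptions introduced here for the example.

```python
from enum import Enum


class PoolState(Enum):
    OK = "ok"
    LOW = "low"    # only a limited number of idle VMs remain
    FULL = "full"  # no idle VMs remain


def classify_pool(idle_vms: int, low_watermark: int = 2) -> PoolState:
    """Classify private pool capacity from the current idle VM count.

    low_watermark is a hypothetical threshold for "a limited number".
    """
    if idle_vms < 0:
        raise ValueError("idle_vms must be non-negative")
    if idle_vms == 0:
        return PoolState.FULL
    if idle_vms <= low_watermark:
        return PoolState.LOW
    return PoolState.OK


def place_request(idle_vms: int, hybrid: bool, low_watermark: int = 2) -> str:
    """Decide where a new VM request goes: 'private', 'public' or 'reject'."""
    state = classify_pool(idle_vms, low_watermark)
    if state is not PoolState.FULL:
        return "private"
    # Pool is full: a private-only cloud can only reject the request,
    # whereas a hybrid cloud can place it on the public cloud.
    return "public" if hybrid else "reject"
```

In this sketch the `LOW` state does not change placement; it only marks the point at which mitigating steps could be triggered before the pool becomes full.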
A Cloud Outage Failure has been known since
the emergence of cloud computing (Google, 2010;
Laing, 2012). In this failure, cloud application
providers and users cannot access services and
applications. The cause of this failure varies. It can
be a network partition of the data center, an outage
of the power supply or even a bug in cloud
infrastructure software. This failure can have a
major impact on cloud providers and end users
because all running services in the data center
become effectively unavailable. There are many
detection mechanisms that can be used here, e.g.
pinging, where a dummy message is sent to a
suspected machine and a reply is expected. Recovering
from this failure is a challenge. There is no solution
in a single cloud model (i.e. private cloud); however,
a hybrid cloud model can address it to some extent,
by launching VMs from healthy clouds, or at least
mitigate its impact on cloud applications and overall
cloud systems (Grozev and Buyya, 2014). However,
this cannot be guaranteed to be supported
autonomously. If a private cloud experiences a total
outage, then an automated process that launches new
VMs on the public cloud by redirecting requests
from the private cloud may be impossible.
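A ping-style probe of the kind mentioned above might look like the following sketch. It makes two assumptions that are not part of the original text: a TCP connection attempt stands in for the dummy message (a true ICMP ping usually needs elevated privileges), and an outage is suspected only when every monitored endpoint fails. The function names `is_reachable` and `outage_suspected` are hypothetical.

```python
import socket
from typing import Callable, Iterable, Tuple


def is_reachable(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def outage_suspected(
    endpoints: Iterable[Tuple[str, int]],
    probe: Callable[[str, int], bool] = is_reachable,
) -> bool:
    """Suspect a cloud outage only if every monitored endpoint fails."""
    targets = list(endpoints)
    return bool(targets) and not any(probe(h, p) for h, p in targets)
```

The probe must run from outside the monitored data center; a monitor hosted inside the failed cloud would itself be unavailable, which is the limitation on autonomous recovery noted above.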
A VM Crash Failure can be caused by
hardware failure, a virtual machine monitor (VMM)
issue, an operating system issue or indeed an
application software issue. The impact of this failure
is downtime of the VM and the inaccessibility of the
cloud applications running on the VM. This can
have major consequences, especially when a cloud
application is running only on the impacted VM.
The situation is less risky when the application is
running on two or more VMs, e.g. as shown in
Figure 1 with a web application running on back end
servers with a load balancer running at the front end.
As in all distributed systems, detecting a VM failure
in cloud computing is non-trivial (Tanenbaum and
Steen, 2006). This is because, even if the suspected
VM is running (apparently) healthily, there may be
other issues such as network partitions or test
messages getting lost due to network issues. As with
cloud outage failures, the mechanism that can be
used to detect a VM crash can be as simple as
pinging.
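Because a single lost test message does not imply a crash, one common way to make ping-based detection more robust is to require several consecutive missed replies before declaring a failure. The sketch below illustrates this idea; the class name `CrashDetector` and the `max_missed` threshold are assumptions made for the example and are not taken from the original text.

```python
class CrashDetector:
    """Declare a VM crashed only after several consecutive missed replies."""

    def __init__(self, max_missed: int = 3) -> None:
        if max_missed < 1:
            raise ValueError("max_missed must be at least 1")
        self.max_missed = max_missed  # hypothetical threshold
        self.missed = 0               # consecutive probes without a reply

    def record(self, reply_received: bool) -> str:
        """Record one probe result; return 'healthy', 'suspected' or 'crashed'."""
        if reply_received:
            self.missed = 0
            return "healthy"
        self.missed += 1
        if self.missed >= self.max_missed:
            return "crashed"
        return "suspected"
```

Even with such a counter, the detector cannot distinguish a crashed VM from one that is healthy but cut off by a network partition; it only reduces false alarms caused by occasional message loss.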
Recovery from this failure in the private cloud
solution can be achieved by launching a new VM
from the single private cloud. However, the request
for a new VM may be rejected if there are no
available VM resources in the private resource pool.
If the cloud application is running on only one VM,
this will make the application unavailable for
potentially unpredictable periods. In contrast, in a
hybrid cloud solution this situation can be avoided if
the system is able to launch new VMs in the public
cloud. In this case, the amount of downtime is
determined by how long it takes to launch a new
VM and install and start the cloud application.
Knowing the temporal thresholds for such
re-establishment is a key aspect of TTR.
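The downtime described above is the sum of three phases: launching the VM, installing the application and starting it. A minimal sketch of that arithmetic is given below. The names `RecoveryTimes`, `downtime_s` and `meets_threshold`, and the idea of comparing against a single `threshold_s` value, are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTimes:
    """Durations (in seconds) of the recovery phases for one replacement VM."""

    launch_s: float   # time to launch the new VM
    install_s: float  # time to install the cloud application
    start_s: float    # time to start the cloud application

    def downtime_s(self) -> float:
        """Total downtime, assuming the phases run one after another."""
        return self.launch_s + self.install_s + self.start_s


def meets_threshold(times: RecoveryTimes, threshold_s: float) -> bool:
    """Check the estimated downtime against a hypothetical time threshold."""
    return times.downtime_s() <= threshold_s
```

For example, with a launch time of 90 s, an install time of 120 s and a start time of 30 s, the estimated downtime is 240 s, which would satisfy a 300 s threshold but not a 200 s one.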
A VM Slowdown Failure is a less problematic
type of failure than other failure types because the
application is still running and can respond, although
the response time may be relatively high. This
failure can be caused by other VM issues.
Another cause may be input/output (I/O) sharing
among multiple VMs running on a physical
machine. Armbrust et al. (2009)
identify I/O sharing as an obstacle for cloud
computing that can unpredictably affect the overall
system performance. They claim that sharing CPUs
and memory among different VMs results in
improved performance in cloud computing but that
I/O sharing is a problem.
The effects of a VM slowdown failure on cloud
applications can include a delay in handling requests
and a subsequent decrease in QoS. There are two
possible methods for detecting a VM slowdown