Recovery-Oriented Resource Management in Hybrid Cloud Environments

Yasser Aldwyan (1,2) and Richard O. Sinnott (1)

(1) School of Computing and Information Systems, The University of Melbourne, Australia
(2) Department of Computer Science, Islamic University in Madinah, Saudi Arabia

Keywords: Cloud Computing, Hybrid Cloud, Reliability, Recovery Oriented Computing (ROC), Fault Tolerance, Virtual Infrastructure Management, Resource Management.
Abstract: Cloud-based systems suffer from an increased risk of individual server failures due to their scale. When failures happen, resource utilization and system reliability can be negatively affected. Hybrid cloud models allow local resources in private clouds to be used together with resources from public clouds as and when needed through cloudbursting. There is an urgent need to develop cloudbursting approaches that are cognisant of the reliability and fault tolerance of external cloud environments. Recovery oriented computing (ROC) is an approach for building reliable services that places emphasis on recovery from failures rather than avoiding them completely, since even the most dependable systems will eventually fail. Such fault tolerant techniques aim to reduce the time to recover (TTR). In this paper, we develop a ROC-based fault tolerant approach for managing resources in hybrid clouds by proposing failure models with associated feedback control supporting a local resource-aware resource provisioning algorithm. We present a recovery-oriented virtual infrastructure management system (RVIMS). Results show that RVIMS is more reliable than single cloud environments, even though the TTRs in single cloud environments are about 10% less than those of RVIMS.
1 INTRODUCTION
Cloud computing has become an important
paradigm in the field of Information Technology. It
encompasses important aspects such as self-service,
enhanced access to virtualized resources, i.e. compute (virtual machines, VMs) and storage resources, and on-demand capacity provisioning.
While cloud computing has many benefits, its main
advantage is the ability to provision and release
resources based on workloads (Voorsluys et al.,
2011). However, this ability poses a challenge when
the traditional single cloud model is used and the
data center is overloaded. Hybrid cloud models can
help to overcome such issues by making the cloud
adaptive and allowing seamless utilization of
resources of public clouds along with the local
resources of private clouds.
As in other large-scale distributed systems,
failures in cloud computing are unavoidable due to
their scale (Javadi et al., 2012). When failures occur,
the utilization of resources, reliability and
availability can be adversely affected. Consider a
web application running in a cloud. Figure 1 shows a
typical architectural model of the application that
consists of a load balancer running on a VM (i.e. a front end tier) and a number of web servers deployed
on other VMs in the cloud. The load balancer
distributes the incoming (http) requests for web
pages evenly across the back end servers, and the
back end servers process the requests and send back
the responses which can include web pages or other
web resources. Now, assume that in a given period the load on these web servers increases dramatically. This could result in a performance failure if the cloud management system were not aware of this increase and could not take appropriate action, e.g. by provisioning new VMs to run new web servers. Such a failure could cause the CPU utilization of the web servers to exceed predefined thresholds. The response time would then become high and the overall quality of service (QoS) would decrease. This
situation exemplifies an undesirable form of
utilization, i.e., overutilization of resources.
Furthermore, under-utilization of resources is
another unwanted form of resource utilization. This
situation occurs when the incoming requests
decrease and the cloud management system does not
decrease the number of resources given to the cloud
application. At any given time a VM crash can cause
unavailability of the cloud application and affect the
QoS. To overcome these issues, cloud management
systems need to be fault tolerant and have the ability
to provide cloud applications with appropriate
features, such as auto-scaling and load balancing,
across multiple clouds. This requires that the cloud management system support fault tolerant techniques in hybrid cloud environments.
Most existing fault tolerant approaches attempt
to predict failures and avoid them before they occur.
However, failures will inevitably happen. Thus, in
our work we attempt to detect failures and recover
from them as rapidly as possible, i.e. our proposed fault tolerant techniques focus on reducing the time to recover (TTR). The aim of our research is to
develop an approach for enabling fault tolerant
techniques in hybrid cloud environments that will
enable those environments to be recovery-oriented,
adaptive and self-managed in order to optimize the
utilization of resources, improve availability and reliability, and minimize human intervention.
Figure 1: Example architectural model of cloud-based web
application.
We adopt a recovery-oriented computing (ROC) approach (Berkeley, 2004). Specifically, we first explore failure models for hybrid clouds by identifying failures and the characteristics that can impact TTR. We then develop a feedback control system. To support this, we propose a system model based on control theory (Yixin et al., 2005), which provides a number of mechanisms for designing automated self-managing computing models. The
model periodically monitors the health of resources,
detects failures and recovers from failures. We use
this to establish a recovery-oriented virtual
infrastructure management system (RVIMS). We
apply RVIMS in a hybrid cloud environment and
show how it can subsequently be used to manage
cloud services in fault tolerant hybrid cloud
environments.
The rest of the paper is organized as follows.
Section 2 presents background and an overview of
cloud computing and ROC with focus on feedback
control models for self-managing computing
systems and virtual infrastructure management. In
Section 3 we present failure models for hybrid
clouds. Then in Section 4, the feedback control
model is proposed, including a local resource aware
hybrid cloud provisioning algorithm. In Section 5,
the architecture of RVIMS is presented. The
experiments and results are presented in Section 6.
Finally, conclusions and future work are presented
in Section 7.
2 BACKGROUND AND RELATED
WORK
This section aims to provide an overview of cloud
computing with particular focus on hybrid cloud
models. We introduce relevant concepts and services
and provide a brief overview of ROC, control theory for self-managing computing systems, and virtual infrastructure management systems for hybrid clouds.
Cloud Computing is based on two main
technologies: service oriented architectures (SOA)
and virtualization technology (El-Refaey, 2011).
SOA is an architectural model in which everything
should be provided as a service, including
processing power, networks, storage, IT
infrastructure, software, hardware and other IT
resources. El-Refaey (El-Refaey, 2011) defines
virtualization technology as technology that provides
an abstraction of computing resources: examples
include CPUs, memory, storage and networks. This
has led to the division of physical servers into
multiple virtual machines (VMs). This technology is
significant in cloud computing because it facilitates
the management of resources and improves the
utilization of those resources.
A Hybrid Cloud, also known as a Multi-Cloud
in the related literature (Grozev and Buyya, 2014), is a
combination of two or more different cloud
infrastructures: private, public or community. A
major benefit of the hybrid cloud model is that it
takes the attributes of both public and private clouds
and combines them into a unified, automated, and
well-managed cloud computing offering. A hybrid
cloud takes advantage of a public cloud’s scalability
and cost-effectiveness while also providing the
control and high performance available in a private
cloud. However, utilizing resources from both
models in an optimized way is a major issue in
hybrid clouds, and to ensure a minimum level of
quality of service (QoS), providers must leverage
strategies that fulfil potentially diverse QoS
requirements.
Resource Provisioning is used for exercising
control over VMs or other cloud resources in cloud
systems. This is often used for launching,
suspending and terminating VMs. A hybrid cloud
resource provisioning service is used for
provisioning resources from different clouds. For
instance, if a user requests three VMs in a hybrid
cloud, the resource provisioning can launch one VM
from the private cloud and the other two from the
public cloud. Many factors can be considered when
provisioning resources in hybrid clouds including
local resource awareness. Local resource awareness
allows resource provisioning services to first launch local resources, e.g. resources from private clouds, and then, when local resources are at capacity, launch resources on public clouds (Grozev and Buyya, 2014). This improves scalability; however, it incurs additional complexity and potential monetary cost.
Recovery-Oriented Computing (ROC) is an
approach developed by Berkeley and Stanford for
investigating innovative strategies and techniques
for building highly-dependable Internet services
(Berkeley, 2004). ROC places emphasis on recovery
from failures rather than avoiding failures. The
motivation behind this approach is that even long-
lasting and healthy systems will periodically face
failures. There are three assumptions considered in
the ROC approach: Software and hardware will
definitely fail; not all failures can be predicted in
advance, and individuals can/do make mistakes.
Applying the ROC approach helps a system
designer change their way of thinking from paying
attention to failure avoidance to paying attention to
reducing the time needed to recover from a failure.
This shift of thinking can help create more robust
cloud platforms (Microsoft, 2014a). In (Microsoft,
2014a), the authors propose an approach to design
reliable cloud services based on ROC. They
introduce mean time to recover (MTTR), or simply time to recover (TTR) (Microsoft, 2014b), which is the time needed to re-establish a service after a
failure. Minimizing TTR requires a system to be
recovered to a fully functional state as quickly as
possible.
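As a simple illustration (this formulation is our own summary rather than a definition quoted from (Microsoft, 2014a; Microsoft, 2014b)), MTTR over n observed failures can be written as

    MTTR = \frac{1}{n} \sum_{i=1}^{n} TTR_i, \qquad TTR_i = t_i^{restored} - t_i^{detected},

where t_i^{detected} is the time at which failure i is detected and t_i^{restored} is the time at which the service is fully re-established.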
A hybrid cloud model helps in overcoming
scalability and availability issues. Rather than relying on manual intervention, a better approach is to make such adaptations happen automatically
(Tanenbaum and Steen, 2006). This is often known
as autonomic computing or self-managing systems,
such as IBM’s Autonomic Computing and
Microsoft’s Dynamic Systems Initiative. A main
goal of these systems is to minimize the costs of
operation by increasing automation, i.e. making
systems self-managing without any human
interaction (Yixin et al., 2005). Making cloud
systems self-managing allows them to recover from
failures quickly and, subsequently the TTR can be
reduced.
To make automatic adaptations, monitoring and
adjustments of Cloud systems is required. One way
to achieve this is to organize systems to include
high-level feedback and control systems. These
systems are typically based on control theory, which provides a valuable set of methods for building self-diagnosing, self-repairing, self-healing, self-optimizing, self-configuring and ultimately self-managing computing systems (Yixin et al., 2005,
Tanenbaum and Steen, 2006).
A system that manages virtualized resources is
known as a Virtual Infrastructure Management
System (VIMS) or a virtual infrastructure (VI)
manager (Sotomayor et al., 2009). When designing
and implementing hybrid (or private) clouds,
Sotomayor et al. (Sotomayor et al., 2009) outlined
several features of public clouds that must be
considered: a hybrid cloud must provide a
consistent, identical, homogeneous view of all
virtualized resources without consideration of the
virtualization technology, e.g. Xen or VMware. It
must have control over the entire lifecycle of VMs,
such as VM disk image and software deployment. It
must be adaptive to meet dynamic needs for
resources, such as peak times where resources are
not sufficient for the current demand. Resource
provisioning in hybrid clouds must also be
configurable to different policies in order to meet the
systems’ requirements such as server consolidation
to save power and/or cost optimization or support
high availability demands.
OpenNebula (OpenNebula, 2016) is an example
of a VIMS. In our work, we propose a recovery
oriented virtual infrastructure management system
(RVIMS), which employs the proposed failure and
feedback control models in hybrid cloud
environments.
2.1 Related Work
A standard model of Cloud computing (i.e. a single
cloud) poses a number of challenges (Grozev and
Buyya, 2014). In terms of availability and reliability,
a data center outage can cause mass service
unavailability and all cloud clients will not be able to
access cloud resources (NIST, 2011, Armbrust et al.,
2009, Laing, 2012, Google, 2010). Another
challenge is scalability. This occurs when the cloud
is overloaded. Hybrid clouds, as a kind of Multi-Cloud (Grozev and Buyya, 2014), overcome these issues by making the cloud adaptive and utilizing cloud resources from external public clouds.
Considerable work has been done in the
development of hybrid cloud open-source libraries,
e.g. Apache LibCloud (Libcloud, 2009). These
libraries provide a unified API for managing and
deploying cloud resources, such as VMs and storage
(Grozev and Buyya, 2014). However, they are not
concerned with resource provisioning. Likewise,
cross-cloud management services, such as
RightScale (RightScale, 2006) only offer (unified)
user interfaces and tools for managing different
clouds without implementing resource provisioning.
In terms of application deployment, projects like
Contrail (Cascella et al., 2012) aim at deploying
applications in hybrid cloud environments.
However, they only deal with provisioning and set-
up and do not consider the distribution of workload
and autoscaling of applications.
With regards to resource provisioning, Javadi et
al. (Javadi et al., 2012) propose hybrid cloud
resource provisioning policies in the presence of
resource failures. These policies only consider
resource failure correlations when redirecting user
requests for resources to suitable cloud providers
and not during deployment of VMs for user requests.
Furthermore, in (Mattess et al., 2013), the authors
propose a dynamic provisioning algorithm of
MapReduce applications across hybrid clouds,
however, this does not handle failures. In contrast to others, our hybrid cloud resource provisioning approach considers failures that may occur during the whole lifecycle of cloud applications; applies recovery mechanisms based on recovery-oriented computing (ROC) (Berkeley, 2004); addresses local resource awareness issues (Grozev and Buyya, 2014) to reduce cost; and offers a multi-tier architectural model suited to web-based applications.
Significant efforts have been made in the
development of virtual infrastructure management
systems (Sotomayor et al., 2009). These kinds of
management systems are often called Multi-Cloud
services when they support multiple clouds (Grozev
and Buyya, 2014). Amazon Elastic Compute Cloud
(Amazon EC2) (Amazon, 2016) and Google
Compute Engine (Google, 2016) provide single-
cloud VIMS. On the other hand, Eucalyptus
provides a VIMS across hybrid clouds; however, the
clouds have to be compatible with Amazon Web
Services (AWS) (Eucalyptus, 2008). They also lack
fault tolerant techniques to make them more reliable
and resilient. In contrast, we propose a recovery-
oriented virtual infrastructure management system
(RVIMS) suitable for hybrid cloud environments
that is failure-aware and leverages control theory.
This offers self-managing features for systems
(Yixin et al., 2005). We also facilitate the process of
adding new clouds to the system. Lastly, we develop
a vendor-independent cloud agent that provides
RVIMS with feedback messages to monitor
resources across multiple clouds.
3 FAILURE MODELS FOR
HYBRID CLOUDS
In this section, we explore failure models for hybrid
cloud systems. We identify possible failures and
their characteristics and potential recovery solutions.
Our failure models are adapted from Resilience
Modeling and Analysis (RMA), i.e. an approach for
improving resilience at Microsoft (Microsoft,
2014b). This approach adopts the main ideas behind
ROC, i.e. failures will eventually occur and thus it is
necessary to try to reduce TTR to minimize the
impacts of such failures. There are six types of
failures in the proposed models: full private VM pool
capacity, cloud outage, VM crash, VM slowdown,
VM high load and VM low load failures, as illustrated in Table 1.
Table 1: Failures in hybrid clouds.

Failure                        | Private Cloud Recovery Solution          | Hybrid Cloud Recovery Solution
Full Private VM Pool Capacity  | Request rejected                         | Launch VMs on public clouds
Cloud Outage                   | No solution                              | Launch VMs on healthy clouds
VM Crash                       | Launch a VM on private cloud; or reject  | Launch new VM on private or public clouds
VM Slowdown                    | Launch a VM on private cloud; or reject  | Launch new VM on private or public clouds
VM High Load                   | Launch a VM to distribute the load       | Launch a VM to distribute the load
VM Low Load                    | Decrease the number of running VMs       | Decrease the number of running VMs
A Full Private VM Pool Capacity Failure occurs
when the resources of a private cloud are fully
utilized, i.e. when VMs are allocated to other cloud
applications. Such a failure arises when the
infrastructure of the private cloud is overloaded.
Such a situation impacts both the cloud provider and
cloud application providers. The former will not be
able to meet one of its core needs, scalability, while the latter will find that their needs are unmet.
To detect a full private VM pool capacity failure,
it is necessary to monitor the number of idle VMs in
the cloud. If no idle VMs in the private cloud are
available (or a limited number) then mitigating steps
should be taken. In terms of recovery, when only a
private cloud is used, there is no solution for
recovery from this type of failure. The cloud
provider will often simply reject any request for
VMs needed for new cloud applications until VMs
become available. The cloud provider can solve this
issue by scaling the infrastructure out (i.e. adding
new physical resources), but this solution can cause
optimization issues in the longer term, i.e. the
infrastructure may subsequently be underutilized.
On the other hand, a hybrid cloud solution is more
efficient in terms of time and resource utilization.
Cloud providers can scale their cloud infrastructure
dynamically based on demand. This provides an
opportunity to scale up and down based on need, so
the utilization of resources can be optimized.
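As an illustrative sketch (the function name, VM state attribute and reserve threshold below are our own assumptions, not part of the RVIMS interface), detecting this failure essentially reduces to periodically checking how many idle VMs remain in the private pool:

    def private_pool_full(private_pool, min_idle_reserve=1):
        # Detect a full private VM pool capacity failure: the pool is 'full'
        # when the number of idle VMs drops below a small reserve, so a new
        # request would have to be rejected or burst to a public cloud.
        idle_vms = [vm for vm in private_pool if vm.state == "idle"]
        return len(idle_vms) < min_idle_reserve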
A Cloud Outage Failure has been known since
the emergence of cloud computing (Google, 2010,
Laing, 2012). In this failure, cloud application
providers and users cannot access services and
applications. The cause of this failure varies. It can
be a network partition of the data center, an outage
of the power supply or even a bug in cloud
infrastructure software. This failure can have a
major impact on cloud providers and end users
because all running services in the data center
become effectively unavailable. There are many
detection mechanisms that can be used here, e.g.
pinging, where a dummy message is sent to a suspected machine and a reply is expected. Recovering
from this failure is a challenge. There is no solution
in a single cloud model (i.e. private cloud), however,
a hybrid cloud model can address it to some extent,
by launching VMs from healthy clouds, or at least
mitigate its impact on cloud applications and overall
cloud systems (Grozev and Buyya, 2014). This
cannot be guaranteed to be autonomously supported, however. Thus, if a private cloud experiences a total outage, then an automated process to launch new VMs on the public cloud by redirecting requests from the private cloud may be impossible.
A VM Crash Failure can be caused by a hardware failure, a virtual machine monitor (VMM) issue, an operating system issue or indeed an application software issue. The impact of this failure is downtime of the VM and the inaccessibility of the cloud applications running on the VM. This can cause major issues, especially when a cloud application is running on only the impacted VM.
The situation is less risky when the application is
running on two or more VMs, e.g. as shown in
Figure 1 with a web application running on back end
servers with a load balancer running at the front end.
Like all distributed systems, detecting a VM failure
in cloud computing is non-trivial (Tanenbaum and
Steen, 2006). This is because, even if the suspected
VM is running (apparently) healthily, there may be
other issues such as network partitions or test
messages getting lost due to network issues. As with
cloud outage failures, the mechanism that can be
used to detect a VM crash can be as simple as
pinging.
Recovery from this failure in the private cloud
solution can be achieved by launching a new VM
from the single private cloud. However, the request
for a new VM may be rejected if there are no
available VM resources in the private resource pool.
If the cloud application is running on only one VM,
this will make the application unavailable for
potentially unpredictable periods. In contrast, in a
hybrid cloud solution this situation can be avoided if
the system is able to launch new VMs to the public
cloud. In this case, the amount of downtime is
determined by how long it takes to launch a new
VM and install and start the cloud application.
Knowing the temporal thresholds for such re-
establishment is a key aspect of TTR.
A VM Slowdown Failure is a less problematic type of failure than the other failure types because the application is still running and can respond, although the response time may be relatively high. The cause of this failure can be due to other VM issues. Another cause may be input/output (I/O) sharing
among multiple VMs running on a physical
machine. In (Armbrust et al., 2009), Armbrust et al.
introduce I/O sharing as an obstacle for cloud
computing that can unpredictably affect the overall
system performance. They claim that sharing CPUs and memory among different VMs works reasonably well in cloud computing but that I/O sharing is a problem.
The effects of a VM slowdown failure on cloud
applications can include a delay in handling requests
and a subsequent decrease in QoS. There are two
possible methods for detecting a VM slowdown
failure. One is when a response time exceeds a
predefined threshold. The other is when the number
of requests per second exceeds a given threshold, i.e.
the cloud application responds after a delay.
Regarding failure recovery, the private cloud and the
hybrid cloud solutions are similar to those used for a VM crash failure. The only difference is that the failed VM
will continue to serve, albeit with lower QoS, until
the new VM is ready to use.
A VM High Load Failure occurs when the
demands on a cloud application increase and
consequently the load on the VM increases and it
eventually becomes over-utilized. The cause of this
failure is related to high demands on the cloud
application itself, e.g. if it becomes very popular or
the business running it offers a temporary discounted
price on the offered services. For an application to
be ready for unexpected bursts, it needs to be
scalable dynamically and automatically. The impact
of this failure is a decrease in the QoS of the cloud application and higher response times.
Detecting VM high load failures can be achieved
by monitoring CPU utilization, main memory
utilization and network traffic. A policy including a
set of thresholds for each VM (e.g. CPU and
memory) should be provided before launching the
application in the cloud. A failure occurs when a
resource exceeds a predefined threshold. To recover
from this failure in the private cloud, a new VM can
be launched in order to distribute the workload
evenly on all running VMs for the cloud application.
However, this will be problematic if the private
cloud is overloaded. Hybrid cloud systems can
overcome this issue by launching a new VM on a
public cloud and thus distributing the load to VMs
across multiple clouds.
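To make the detection step concrete, the following sketch (our illustration; the metric names and threshold values are assumptions rather than settings used in the paper) compares a VM's monitored utilization against the upper and lower thresholds in its policy and reports a high or low load failure:

    def check_load_failure(metrics, thresholds):
        # metrics:    e.g. {"cpu": 0.92, "memory": 0.40}  (fractions of capacity)
        # thresholds: e.g. {"cpu": (0.20, 0.80), "memory": (0.20, 0.85)}
        #             mapping each resource to (low, high) utilization bounds.
        for resource, value in metrics.items():
            low, high = thresholds[resource]
            if value > high:
                return "VM_HIGH_LOAD"      # over-utilization: scale out
        if all(value < thresholds[r][0] for r, value in metrics.items()):
            return "VM_LOW_LOAD"           # under-utilization: scale in
        return None                        # no load-related failure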
In contrast to the previous failures, a VM Low Load Failure can introduce another undesirable form of resource utilization. This type of failure is caused when the cloud application encounters lower demands. As a result, the resources will be
underutilized. The failure detection mechanism for
this failure is similar to the one for the VM high load
failure. There is a need for lower load thresholds for
cloud application resources. The failure happens
when the use of a monitored VM resource is below a
predefined threshold. In terms of failure recovery,
there is only one solution, which can be applied in either a private cloud or a hybrid cloud: decreasing the number of running VMs for the cloud application.
4 FEEDBACK CONTROL MODEL
FOR HYBRID CLOUDS
To detect failures and recover from them rapidly to
reduce TTR, there is an urgent need for a fault
tolerant system model based on the failure models
proposed in the previous section. As a consequence,
we propose a self-managed feedback control model
in a hybrid cloud environment by organizing
components in a way that enables monitoring
resources and taking appropriate action in the
presence of failures. We describe the components of
the model and propose local resource-aware hybrid
cloud provisioning, failure detection and failure
recovery algorithms.
4.1 Components of Feedback Control
Model
As shown in Figure 2, the proposed model consists of six components: the cloud interface component, the cloud services and policies component, the provisioner component, the monitoring component, and the fault tolerance components, i.e. the failure detection and failure recovery components.
Figure 2: Feedback control model for hybrid clouds.
The Cloud Interface Component is an entry point
for cloud users (e.g. SaaS/cloud application
providers) to request cloud services upon which to
deploy their cloud applications. This component
receives requests for cloud services. After receiving
requests, the cloud interface component passes them
to the cloud services and policies component. Then,
the cloud interface component waits until it receives
a response from other components as to whether the
cloud platform is able to handle a given request by
provisioning the needed resources. If not, the request
is rejected. In both cases, the response is forwarded
to the cloud users.
The Cloud Services and Policies Component is
responsible for creating and initializing appropriate
cloud services and policy objects based on user
requests and then passing them to the provisioner
component for deployment. Cloud services can be
VMs used to deploy cloud applications, load
balancers used to distribute workloads across
multiple back end VMs or autoscaling services used
to scale up or down based on peak usage times.
Cloud policies are sets of rules or conditions that support failure detection and failure recovery and allow appropriate action to be taken when one or more conditions are met.
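A minimal sketch of what such service and policy objects could look like is given below (the class and field names are hypothetical, chosen only to illustrate the idea; RVIMS does not prescribe this exact structure):

    from dataclasses import dataclass, field

    @dataclass
    class CloudPolicy:
        # Utilization thresholds (fractions) used for high/low load detection.
        cpu_high: float = 0.80
        cpu_low: float = 0.20
        # Maximum acceptable response time before a slowdown failure is flagged.
        max_response_time_secs: float = 2.0

    @dataclass
    class CloudServiceRequest:
        # Number of back-end VMs requested for the cloud application.
        num_vms: int = 3
        # Whether a load balancer and an autoscaling service should be created.
        load_balancer: bool = True
        autoscaling: bool = True
        policy: CloudPolicy = field(default_factory=CloudPolicy)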
The Provisioner Component controls private
and public resource pools in the hybrid cloud
platform. It is responsible for provisioning VMs and
other cloud resources on private and public clouds.
This component provides a hybrid cloud
provisioning service using the provisioning
algorithm (see Algorithm 1). This service has a
number of advantages for the proposed model,
including providing awareness of local resources
and thus reducing monetary costs (Grozev and
Buyya, 2014). Furthermore, it has the ability to
easily add more resource pools either from public or
private clouds. Another benefit of this service is that
it provides a higher level of abstraction by hiding the
lower level communications and their
implementation across clouds.
Algorithm 1: Local Resource Aware Hybrid Cloud Provisioning.

input: nReqVMs  // Number of VMs for a request
// VMs in unavailable clouds will not be considered
nIdleVMs ← getTotalHybridCloudIdleVMs();
listPrivateIdleVMs ← empty list;
listPublicIdleVMs ← empty list;
if nReqVMs ≤ nIdleVMs then
    nPrivateIdleVMs ← getTotalPrivateCloudIdleVMs();
    if nPrivateIdleVMs > 0 then
        // add idle VMs from private pool
        listPrivateIdleVMs.add(privateIdleVMs)
    nRemainingVMs ← nReqVMs – length of listPrivateIdleVMs
    if nRemainingVMs > 0 then
        // add idle VMs from public pool
        listPublicIdleVMs.add(publicIdleVMs)
    if listPrivateIdleVMs is not empty then
        // Launch VMs from private pool
        launchVM(listPrivateIdleVMs)
    if listPublicIdleVMs is not empty then
        // Launch VMs from public pool
        launchVM(listPublicIdleVMs)
else
    reject the request
The provisioner component manages the whole life
cycle of VMs in the model. Firstly it launches,
suspends, migrates and terminates VM instances
across multiple clouds. Secondly it runs
customization scripts. Thirdly it installs cloud agent
software on top of VMs to allow the cloud
management system to monitor the health of the
hybrid cloud resources. The provisioner component
is also able to migrate VM instances from public
clouds to private ones whenever VMs in the private
cloud become free thereby reducing cost.
With regards to the local resource-aware hybrid
cloud-provisioning algorithm shown in Algorithm 1,
the primary parameter is the number of VMs
required for user requests. The algorithm checks
whether the total number of idle VMs in the hybrid
cloud system is sufficient for the request. If the
resources are not sufficient, then the request will be
rejected and the algorithm will exit. Otherwise, the
algorithm will first attempt to provision VMs from
the resource pool of the private cloud. If there are
not enough VMs at that time, the algorithm will
provision VMs from public pools of available public
clouds.
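For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1 under simplifying assumptions (the pools are plain lists of idle VM identifiers with unavailable clouds already filtered out, and launch_vm/reject are hypothetical callbacks rather than RVIMS functions):

    def provision(n_req_vms, private_idle, public_idle, launch_vm, reject):
        # Local resource-aware hybrid cloud provisioning (sketch of Algorithm 1).
        if n_req_vms > len(private_idle) + len(public_idle):
            reject("not enough idle VMs in the hybrid cloud")
            return []
        # Take as many VMs as possible from the (cheaper) private pool first.
        selected = private_idle[:n_req_vms]
        n_remaining = n_req_vms - len(selected)
        # Fill any shortfall from the public pools.
        selected += public_idle[:n_remaining]
        for vm in selected:
            launch_vm(vm)
        return selected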
The Monitoring Component is designed to
monitor VMs in both the private and public clouds.
Each cloud agent running in a VM checks the health
of the VM and periodically sends health messages to
the monitoring component. This component listens
on a (predefined) monitoring port and receives the
VM health messages. It extracts the health
information (e.g. CPU load) and sends it to the
failure detection component so that any failures can
be detected as early as possible.
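On the VM side, the agent loop could be sketched as follows (the message fields, port and reporting interval are assumptions for illustration; the actual agent protocol is not specified in this paper). The psutil library is used here only as one convenient way to read local utilization:

    import json
    import socket
    import time

    import psutil  # third-party library; one way to read local utilization

    def run_cloud_agent(rvims_host, monitoring_port=9999, interval_secs=10):
        # Periodically send VM health messages to the RVIMS monitoring component.
        while True:
            health = {
                "vm_id": socket.gethostname(),
                "cpu": psutil.cpu_percent(interval=1) / 100.0,
                "memory": psutil.virtual_memory().percent / 100.0,
                "timestamp": time.time(),
            }
            with socket.create_connection((rvims_host, monitoring_port)) as sock:
                sock.sendall(json.dumps(health).encode("utf-8"))
            time.sleep(interval_secs)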
The Failure Detection Component detects
failures that occur during the lifetime of cloud
applications running on VMs. It can detect the
occurrence of a failure based on the VM health
information coming from the monitoring component
or on information obtained by direct
communications with VMs (e.g. pinging and
measuring response times). The failure detection
component detects failures using two approaches:
message based failure detection methods (MFDM)
and direct failure detection methods (DFDM). In the
MFDM method, the component receives a health
message from a running VM, extracts the health
information for resources from that VM and then
compares this information with predefined
thresholds provided by cloud policies that are
received initially from the cloud services and
policies component. These policies can vary from
one cloud application to another based on user
requests for the specific cloud services. For DFDM,
the failure detection component continuously
(periodically) pings all running VMs across clouds
in order to check their liveness. If there is a reply,
then the receiving VM is awake and healthy. If there
is no reply, the cloud manager repeats the request to
make sure the problem is not related to a sporadic
network issue. If there is no reply after three
attempts, the failure detector component detects a
VM crash failure. Every time a VM crash failure is
detected, the detector checks if other VMs in the
resource pool have crashed, and if so a cloud outage
failure is flagged and the resource pool of that cloud
is marked as unavailable. This helps the provisioner
to use live and available clouds only.
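The ping-based liveness check of DFDM can be sketched as follows (a simplified illustration that probes a TCP port rather than using ICMP ping; the retry count of three follows the description above, while the port and timeout are assumptions):

    import socket

    def is_vm_alive(host, port=22, attempts=3, timeout_secs=2.0):
        # Probe a VM and retry before declaring a VM crash failure.
        for _ in range(attempts):
            try:
                with socket.create_connection((host, port), timeout=timeout_secs):
                    return True       # reply received: VM is awake and (apparently) healthy
            except OSError:
                continue              # possibly a sporadic network issue; retry
        return False                  # no reply after all attempts: VM crash failure

    def cloud_outage(pool_hosts):
        # Flag a cloud outage when every VM in a cloud's resource pool is
        # unreachable, so the provisioner marks that pool as unavailable.
        return all(not is_vm_alive(host) for host in pool_hosts)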
Another form of DFDM is slowdown detection.
In this form, the failure detection component
periodically sends test messages to all VMs running
a cloud application and measures their response
time. If the response time is more than the
predefined threshold in the policy of the cloud
application, then a VM slowdown failure is
identified. When such a failure occurs, the failure
detection component puts a failure message into the
failure recovery component’s message queue,
prompting the component to take appropriate action
to recover from the failure as soon as possible.
The Failure Recovery Component reads failure
messages in its message queue and handles the
failure accordingly. The means of recovering from
failures depends on the nature of the failure and on the cloud policy and autoscaling service associated with the cloud application in the cloud services and policies component. This component is used to reduce
TTR and thus enhance the utilization of resources.
The input parameter of this component’s
algorithm is a failure message. The algorithm checks
the failed VM object and the autoscaling service in
which the VM is involved. After that, the algorithm
takes the appropriate action based on the type of
failure. If the failure is a VM crash failure, then the
algorithm will unregister the failed VM from the
load balancer, terminate the failed VM, launch a
new VM and, finally, register the new VM with the
load balancer. If the failure is a VM slowdown
failure, then a new VM will be launched and
registered with the load balancer and, finally, the
failed VM will be released. If the failure is a VM
high load failure, then the algorithm will launch and
register a new VM with the load balancer. If the
failure is a VM low load failure, the cloud platform
will look first for a VM running on the public cloud
to be unregistered from the load balancer and then
released. Otherwise, any private VM for the cloud
application will be chosen. The reason for doing this
is to reduce the monetary cost since public VMs are
typically not free. We note that this algorithm
recovers from failures occurring during the lifetime
of the cloud application, while the failures occurring
before the deployment of the cloud application are
handled implicitly by the provisioner component.
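The recovery steps described above can be summarized in the following dispatch sketch (the failure types, the VM attribute is_public and the launch_vm/terminate_vm/load_balancer operations are illustrative placeholders, not RVIMS API calls):

    def recover(failure, app_vms, load_balancer, launch_vm, terminate_vm):
        # Handle one failure message taken from the recovery component's queue.
        if failure.type == "VM_CRASH":
            load_balancer.unregister(failure.vm)   # stop routing traffic to the dead VM
            terminate_vm(failure.vm)
            load_balancer.register(launch_vm())    # replace it with a new VM
        elif failure.type == "VM_SLOWDOWN":
            load_balancer.register(launch_vm())    # bring up the replacement first
            load_balancer.unregister(failure.vm)   # then release the slow VM
            terminate_vm(failure.vm)
        elif failure.type == "VM_HIGH_LOAD":
            load_balancer.register(launch_vm())    # scale out to spread the load
        elif failure.type == "VM_LOW_LOAD":
            # Release a public VM first to reduce monetary cost, else a private one.
            public = [vm for vm in app_vms if vm.is_public]
            victim = public[0] if public else app_vms[0]
            load_balancer.unregister(victim)
            terminate_vm(victim)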
5 SYSTEM ARCHITECTURE
Employing the previously mentioned models in
hybrid clouds requires the realization of a cloud
management system. A central goal of this kind of
system is to provide a high level of abstraction with
which to facilitate the process of managing
heterogeneous resources across different clouds
including monitoring resources in VMs across
different clouds. Ideally it should be possible to add
new public clouds easily. In this section, we present
a system architecture and implementation of a fault
tolerant cloud management system for hybrid cloud
environments supporting a Recovery-Oriented
Virtual Infrastructure Management System
(RVIMS).
Figure 3: Architecture of RVIMS.
RVIMS has a number of key features. First, it
deploys cloud services, such as creating instances of
VMs and volumes, in different cloud platforms. At
present the system supports Google Compute Engine
(Google, 2016) and Amazon (AWS) (Amazon,
2016) clouds and the Australian National eResearch
Collaboration Tools and Resources (Nectar) research
cloud (Nectar, 2016). Secondly, it automates the
process of installing cloud applications and their
dependencies on VMs. Thirdly, it supports
OpenStack based cloud platforms (e.g. Nectar)
(OpenStack, 2010). Fourthly, it can easily support
new cloud platforms. Finally, it can detect and
recover from all failures previously given in Table 1.
As shown in Figure 3, the architecture of RVIMS
consists of the same components found in the
feedback control model section. RVIMS offers a
cloud interface component, which acts as an entry
point for cloud application providers. In the core
part, we have five components: the cloud services
and policies, monitoring, provisioner, failure
detection, and failure recovery components. As
noted, the provisioner component is responsible for
providing higher levels of abstraction for managing
heterogeneous resources more readily.
Figure 4 depicts the interaction between RVIMS and
a VM instance. To understand the RVIMS and VM
interactions, we consider the layered architecture of
a VM. The first layer from the bottom is the
operating system (OS) layer. The OS is chosen
during the process of launching a VM. The layer
above the OS layer is a cloud agent layer. This layer
allows management and monitoring of running VMs
regardless of the cloud platform from which they
were launched. To achieve this, the cloud agent has
to be in the application layer. On top of the cloud
agent layer, we have the cloud application itself.
With regard to the interaction between RVIMS and a
VM, the cloud agent monitors VM resources (i.e.,
CPU, memory, hard disk and network traffic) and
captures the behaviour of the cloud application. It
periodically sends a health message containing this
information to RVIMS. RVIMS reads the
information, detects failures that have occurred or that might subsequently arise, and resolves them.
Figure 4: Interaction between RVIMS and a VM instance.
6 EXPERIMENTS AND RESULTS
This section aims to evaluate RVIMS by considering
a recovery solution in a single (private) cloud as the
baseline and then applying recovery solutions in two
different hybrid cloud environments. The primary
evaluation metric here is TTR, a measurement that
starts when the system detects a failure and ends
when the failure has been successfully resolved.
Note that this work focuses on independent web services; inter-process communication between more complex applications and their deployment management across hybrid clouds remains an area for future work.
6.1 Experimental Testbed and Sample
Application
The RVIMS system was deployed in a hybrid cloud
environment consisting of three resource pools from
different cloud providers: Nectar, Amazon and
Google clouds. Nectar was used as a private cloud
and the others were treated as public clouds. The
private cloud was composed of four VMs, each of
which was an m1.medium instance with two virtual
CPUs (VCPUs) and 8 GB of RAM. Amazon cloud
as a public cloud provider had a t2.large VM with
two VCPUs and 8 GB of RAM. The other public
cloud resource was provisioned from the Google cloud: an n1-standard-2 instance with two VCPUs and 7.5 GB of RAM. Nectar
instances were located in Melbourne, Australia. The
Amazon instance was in Sydney, Australia, and the
Google cloud was in Changhua County, Taiwan.
One VM instance in the private cloud was
running the RVIMS while the others were used to
deploy a sample application. The sample application
for this evaluation was a simple two-tier web server.
The first tier consisted of a load balancer running on
a private VM instance, used to balance the load
related to incoming http requests. The second tier
was composed of back-end web servers running on VM instances in different clouds.
A number of configuration parameters in the
experiments were used to cause deliberate failures.
This allowed us to observe the proposed system's
reactions in the presence of different kinds of
failures. One configuration parameter was used to
shut down a back-end server running on a VM
instance. Another parameter slowed down the
processing procedures of http requests in a back-end
server by adding a sleep statement. Another
configuration parameter increased the load on a VM
instance by causing CPU utilization to exceed the
threshold specified in the user request for VMs. The
final configuration parameter acted to decrease the
load on a VM instance.
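As an example of this configuration-driven fault injection (the parameter names and handler below are our own illustration, not the exact scripts used in the experiments), a slowdown can be emulated by adding an artificial delay to request handling:

    import time

    FAULT_CONFIG = {
        "inject_slowdown_secs": 3.0,   # sleep added to each request to emulate a slowdown
        "shutdown_backend": False,     # stop the back-end server to emulate a VM crash
    }

    def handle_request(request):
        # Simplified back-end handler with an optional injected delay.
        if FAULT_CONFIG["inject_slowdown_secs"] > 0:
            time.sleep(FAULT_CONFIG["inject_slowdown_secs"])
        return "<html>OK</html>"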
User requests for cloud resources consisted of a
number of VMs needed to run a cloud application.
In our experiment, we limit the user request to a
maximum of 3 VMs where one VM runs as a load
balancer while the others run as back-end web
servers. In all tests, all resources were provisioned
initially from the private cloud. When a failure was
injected, there were three ways of handling it (see
Table 2). The first is the private cloud solution,
which handles failures in the private cloud only. The
second is through the hybrid cloud (using the public cloud Amazon), which handles failures in the Amazon cloud. The third is through the hybrid cloud (using the public cloud Google), which handles failures in the Google public cloud. In the second and the third
recovery solutions, we assume the private cloud is
overloaded.
Table 2: Types of failure recovery solutions.

Recovery Solution              | Description
Private cloud only (Nectar)    | Failures resolved in the private cloud only.
Hybrid cloud (Nectar & Amazon) | Failures resolved in the Amazon cloud.
Hybrid cloud (Nectar & Google) | Failures resolved in the Google cloud.
6.2 Test Scenarios: Results and
Discussion
Each test in the experiment involved the user request
(described in the previous section) in the hybrid
cloud environment. A configuration parameter was
used to deliberately cause failures. We immediately
began to measure the TTR of RVIMS. We included a test scenario for each failure in the failure models except for the full private VM pool capacity failure; this exception was made because we were interested here in the failures that occurred during the run time of
our user request. Five tests were conducted for each
recovery solution.
6.2.1 VM Crash and Outage Failures Test
Scenario
The VM crash failure and outage failure are similar.
The impact of each failure is the only difference.
The impact of an outage failure is complete service
unavailability (downtime) while the effect of a VM
crash failure is decreased QoS. As a result, we used
the same test scenario for both. A recovery
mechanism was used to launch new VMs. In the
private cloud solution, the new VM was launched on
the private cloud (Nectar), while in the hybrid cloud
solution, it was launched on a public cloud. As
shown in Figure 5, the results indicate that the
private cloud had the lowest TTR, which means it
was the fastest to recover from failures. In the worst case, the TTR difference between the private cloud and the hybrid cloud (Google) ranged from 10 to 20 seconds. This may be an issue for
critical applications. The Amazon hybrid cloud
solution showed better TTRs than the Google hybrid
cloud solution. However, in the case of cloud outage
failure, there was no recovery at all in the private
cloud solution. Also, when the private cloud was
overloaded, the TTRs increased and became unpredictable.
6.2.2 VM Slowdown Failure Test Scenario
The VM slowdown failure can be triggered by using
the VM slowdown configuration parameter, which adds a sleep statement. The results showed that
TTRs in the private cloud solution were better than
those of the hybrid cloud solutions (see Figure 6). In
one case, a TTR for the Amazon hybrid cloud
solution was the same as that for the private cloud
solution. In another case, a TTR for the Amazon
hybrid cloud solution was the worst (i.e. Test 2).
Notice that TTRs in the private cloud solution were
more stable than those of the other solutions.
Additionally, the Google hybrid cloud solution was
more stable than the Amazon one. Test 1 of the
private cloud and the two other hybrid recovery
solutions showed similar TTRs.
Figure 5: Time to recover from VM crash failure.
Figure 6: Time to recover from VM slowdown failure.
6.2.3 VM High Load Failure Test Scenario
When the requests on the application increase, the
CPU and memory utilization subsequently increase.
In our tests, the VM load increase configuration
parameter was used to increase the CPU and
memory utilization. This caused a VM high load
failure to occur. The failure recovery mechanism
here was to scale the cloud application by adding
new VMs. As shown in Figure 7, the values of TTRs
for the private cloud solution were the smallest of all
of the tests. In terms of hybrid cloud solutions, the
Google hybrid cloud solution showed shorter TTRs
than Amazon. In addition, the Google hybrid cloud
solution was more stable than Amazon. TTRs for the
Google hybrid cloud solution were around 56
seconds while the average TTRs for the private
cloud solution were of the order of 47 seconds.
Figure 7: Time to recover from VM high load failure.
Figure 8: Time to recover from VM low load failure.
6.2.4 VM Low Load Failure Test Scenario
The VM low load failure is almost the opposite of
the VM high load failure. Here, the application load
decreases and thus the CPU and memory utilization
are reduced. To avoid resources being underutilized,
the cloud management systems need to scale down
the applications immediately by terminating idle
VMs. A VM load decrease configuration parameter
can deliberately cause this kind of failure. Moreover,
though this failure has no impact on the cloud
application itself, it affects the resource utilization of
the overall cloud platform. The results depicted in
Figure 8 show that the Amazon hybrid cloud
solution performed better with this kind of failure
than the other recovery solutions. The average TTR
for the Amazon hybrid cloud solution was about 0.3
seconds. The TTR values for the Google hybrid
cloud solution were the highest. Overall, the TTRs
for all solutions were small compared to those for the other failure types.
7 CONCLUSIONS AND FUTURE
WORK
We have presented a fault tolerant approach for
efficiently managing cloud resources, improving
reliability and availability and minimizing human
intervention in hybrid cloud environments. The
methodology behind this approach leverages ideas
from recovery-oriented computing (ROC)
approaches. A key assumption of the ROC approach
is that failures will (eventually) occur. Instead of
attempting to solely rely on predicting failures and
attempting to avoid them completely, one should
also focus on attempting to recover quickly from
them. Thus, the intermediate goal of applying ROC
in our research was to reduce the recovery time after
failures to meet QoS demands.
Our fault tolerant approach supports a range of
failure models and a feedback control system model
that leverages hybrid cloud services for autoscaling,
resilience and load distribution. The RVIMS realizes
the proposed models and heterogeneous cloud
services. This fault tolerant system can overcome
many scalability issues found in single cloud
models.
In our research, we have taken into account
awareness of failures and local resources when
deciding upon resource provisioning. However,
considering only those factors leaves a number of
challenges unmet in achieving optimal resource use.
Further studies are needed to determine how other
factors influence the effectiveness of hybrid clouds.
Furthermore, most resource provisioning algorithms suffer when dealing with big data, including data transfer, limitations of network bandwidth and the topology awareness of clouds. It
is imperative that such issues are addressed in order
to make hybrid clouds more reliable and efficient.
This is one focus of our future work.
Furthermore, augmenting our work with further
(richer) models of fault tolerance and failure
prediction is also an area of future consideration.
Thus, whilst ROC can help certain classes of application to recover, partial failures in long-running applications can have unique requirements that also need to be considered.
Finally, the challenge of bursting to the public cloud can often have implications for what applications and data can be recovered to external resources, e.g. due to the privacy considerations of outsourcing. We shall also consider such demands as
part of a more holistic approach to where and how
RVIMS can be optimally applied.
ACKNOWLEDGEMENTS
The authors would like to express thanks to the
Nectar Research Cloud (www.nectar.org.au) for the
cloud resources used to perform this research.
REFERENCES
Amazon. 2016. Amazon Elastic Compute Cloud [Online].
Available: http://aws.amazon.com/ec2 [Accessed 22-
09-2016].
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz,
R. H., Konwinski, A., Lee, G., Patterson, D. A.,
Rabkin, A., Stoica, I. & Zaharia, M. 2009. Above the
Clouds: A Berkeley View of Cloud Computing. EECS
Department, University of California, Berkeley.
Berkeley. 2004. Recovery-Oriented Computing Overview
[Online]. Available:
http://roc.cs.berkeley.edu/roc_overview.html
[Accessed 03-11-2016].
Cascella, R. G., Morin, C., Harsh, P. & Jegou, Y. 2012.
Contrail: a reliable and trustworthy cloud platform.
Proceedings of the 1st European Workshop on
Dependable Cloud Computing. Sibiu, Romania: ACM.
El-Refaey, M. 2011. Virtual Machines Provisioning and
Migration Services. Cloud Computing. John Wiley &
Sons, Inc.
Eucalyptus. 2008. Eucalyptus Cloud Platform [Online].
Available: www.eucalyptus.com [Accessed 25-09-
2016].
Google. 2010. Google. Post-mortem for February 24th,
2010 outage [Online]. Available:
https://groups.google.com/forum/#!topic/google-
appengine/p2QKJ0OSLc8 [Accessed 02-10-2016].
Google. 2016. Google Compute Engine [Online].
Available: https://cloud.google.com/compute
[Accessed 04-07-2016].
Grozev, N. & Buyya, R. 2014. Inter-Cloud architectures
and application brokering: taxonomy and survey.
Software: Practice and Experience, 44, 369-390.
Javadi, B., Abawajy, J. & Sinnott, R. O. 2012. Hybrid
Cloud resource provisioning policy in the presence of
resource failures. Cloud Computing Technology and
Science (CloudCom), 2012 IEEE 4th International
Conference on, 10-17.
Laing, B. 2012. Summary of Windows Azure Service
Disruption on Feb 29th, 2012 [Online]. Available:
https://azure.microsoft.com/en-us/blog/summary-of-
windows-azure-service-disruption-on-feb-29th-2012
[Accessed 07-07-2016].
Libcloud. 2009. Apache Libcloud [Online]. Available:
http://libcloud.apache.org [Accessed 05-09-2016].
Mattess, M., Calheiros, R. N. & Buyya, R. 2013. Scaling MapReduce Applications Across Hybrid Clouds to Meet Soft Deadlines. Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on, 25-28 March 2013, 629-636.
Microsoft. 2014a. An Introduction to designing reliable
cloud services [Online]. Trustworthy Computing.
Available: https://www.microsoft.com/en-
us/twc/reliability.aspx [Accessed 08-01-2017].
Microsoft. 2014b. Resilience by design for cloud services
[Online]. Trustworthy Computing. Available:
https://www.microsoft.com/en-us/twc/reliability.aspx
[Accessed 02-01-2017].
Nectar. 2016. The Australian National eResearch
Collaboration Tools and Resources (Nectar) Research
Cloud [Online]. Available: http://cloud.nectar.org.au
[Accessed 13-10-2016].
NIST. 2011. The NIST Definition of Cloud Computing
[Online]. Available: http://csrc.nist.gov/publications/
PubsSPs.html [Accessed 9-10-2016].
OpenNebula. 2016. OpenNebula Cloud Platform [Online].
Available: http://opennebula.org [Accessed 13-11-
2016].
OpenStack. 2010. OpenStack Cloud Platform [Online].
Available: https://www.openstack.org [Accessed 10-
10-2016].
RightScale. 2006. RightScale - A Cloud Management
Solution [Online]. Available:
http://www.rightscale.com [Accessed 25-10-2016].
Sotomayor, B., Montero, R. S., Llorente, I. M. & Foster, I.
2009. Virtual Infrastructure Management in Private
and Hybrid Clouds. IEEE Internet Computing, 13, 14-
22.
Tanenbaum, A. S. & Steen, M. v. 2006. Distributed
Systems: Principles and Paradigms (2nd Edition),
Prentice-Hall, Inc.
Voorsluys, W., Broberg, J. & Buyya, R. 2011.
Introduction to Cloud Computing. Cloud Computing.
John Wiley & Sons, Inc.
Yixin, D., Hellerstein, J. L., Parekh, S., Griffith, R.,
Kaiser, G. E. & Phung, D. 2005. A control theory
foundation for self-managing computing systems.
Selected Areas in Communications, IEEE Journal on,
23, 2213-2222.