Business Resiliency Framework for Enterprise Workloads in the Cloud

Valentina Salapura, Ruchi Mahindru and Richard Harper

IBM T.J. Watson Research Center, Yorktown Heights, NY, U.S.A

Keywords: Cloud Computing, High Availability, Enterprise Class Applications, Resiliency.

Abstract: Businesses with enterprise-level workloads - such as Systems Applications and Products (SAP) workloads -

require business level resiliency including high availability, clustering, or physical server appliances. To

enable businesses to use enterprise workloads in a cloud, the IBM Cloud Managed Services (CMS) cloud

offers many SAP enterprise-level workloads for both virtualized and non-virtualized cloud environments.

Based on our experience with enabling resiliency for enterprise-level workloads like SAP and Oracle, we

realize that the end-to-end process is quite cumbersome, complex and expensive. Therefore, it would be

highly beneficial for the customers and the cloud providers to have a systematic business resiliency

framework in place, one that fits the cloud model with an appropriate level of abstraction and

automation while allowing the desired cost benefits. In this paper, we introduce an end-to-end business

resiliency framework and resiliency life cycle. We further introduce an algorithm to determine the optimal

resiliency pattern for enterprise applications using a diverse set of platforms in the IBM CMS cloud

offering.

1 INTRODUCTION

Cloud computing is being rapidly adopted across the

IT industry to reduce the total cost of ownership of

increasingly more demanding workloads. It is

becoming the new de facto environment for many

system deployments in a quest for more agile on-

demand computing with lower total cost of

ownership. Medium and large enterprises, various

agencies and institutions are quickly adopting cloud

computing, with high expectations of resiliency that

have heretofore been associated with the traditional

dedicated datacenters.

Enterprises demand usage of Enterprise

Resource Planning (ERP) (Hossain, 2001)

workloads - such as Systems Applications and

Products (SAP) workloads (Gargeya, 2005). The

ERP workloads are used to manage business

operations and customer relations that are commonly

required for running business back-office operations.

Such workloads are legacy applications which

require an infrastructure with high availability,

clustering, shared storage, or physical server

appliances. Clustering enables redundancy, which in

turn provides resiliency. Setting up such resiliency

features based on legacy processes is quite

cumbersome, as it involves multiple teams

performing different actions leading to expensive

setup and steady state operations.

The business impact of a loss of IT infrastructure can

be huge. Enterprise-class clients, such as banks,

financial institutions, hospitals, governments, utility

companies, etc. can suffer business losses even from

short outages and service interruptions. The cost of

downtime could dissolve a business, or cause

irreparable brand damage, loss of customer data and

reputation. To deliver the level of resiliency needed

by various enterprise applications, a systematic way

and a framework for delivering resilient systems is

needed.

To satisfy a growing need of enterprise

customers to run their enterprise-level workloads in

a cloud environment, IBM Cloud Managed Services

(CMS) (IBM Corporation, 2017; Kochut, 2011)

enables enterprise workloads. IBM CMS is a

premier cloud offering with both shared and

dedicated customer set up, with many resiliency

features built in at the infrastructure and hypervisor

level (Salapura, 2013). CMS provides a unique mix

of virtualized and non-virtualized infrastructure,

diverse types of platforms (e.g., System x and Power

systems), and service level agreement (SLA)

mechanisms. IBM CMS cloud offers a fully

managed solution for many SAP applications in the

cloud.


Salapura, V., Mahindru, R. and Harper, R.

Business Resiliency Framework for Enterprise Workloads in the Cloud.

DOI: 10.5220/0006376707140721

In Proceedings of the 7th International Conference on Cloud Computing and Services Science (CLOSER 2017), pages 686-693

ISBN: 978-989-758-243-1

Figure 1: Business resiliency framework for cloud.

In this paper, we introduce an end-to-end business

resiliency framework we developed in the scope of

IBM CMS cloud for a sample set of resiliency

solutions. We show how various resiliency patterns

can be implemented for enterprise applications for

various supported platforms. An example of these

applications is SAP High-performance Analytic

Appliance (HANA) (Färber, 2012).

2 CLOUD BUSINESS RESILIENCY FRAMEWORK

Cloud computing is highly desirable for its main

attributes like scalability, multi-tenancy, on-demand

computing resources delivered over the network, and

pay-per-use pricing. This offers flexibility in using

as few or as many IT resources as needed at any

point in time. Thus, the users do not need to predict

resources that they might need in future, which

makes cloud infrastructure attractive for businesses.

To ensure resiliency of workloads, a number of

resiliency features are implemented. These features

typically include VM restart upon failure or VM

migration, and high availability clusters (HA

clusters), where multiple OS images are used to

enable continuous operation of enterprise

applications. Implementing HA clusters requires

several resiliency features such as VM anti-

collocation, where VMs are placed on different

physical hosts, or shared storage, because multiple

VMs might need to access the same DB data. It also

avoids distributed solutions that are hard to manage

between the cloud and non-cloud environments with

a part of the workload running in the cloud, and the

other part running in a non-cloud environment.

Given that deployment of such resiliency

features can be complex, it warrants a need for a

structured and ongoing approach to plan, maintain,

test and continuously improve such business

resiliency operations. To address this need, we

introduce an end-to-end business resiliency

framework and the lifecycle we developed in the

scope of IBM CMS cloud.

Each enterprise customer has different

workload requirements and SLAs. Cloud is a multi-tenant

environment with the goal to standardize the

solutions and phases within them as much as

possible to simplify the process associated with

deployment and steady state operations to promote

asset reuse while maintaining low cost. This

objective motivates the need for an end-to-end

business resiliency framework, as described below.

An end-to-end business resiliency framework

allows cloud providers and their customers to define

a comprehensive resiliency plan in the cloud

environment for both cloud native and cloud enabled

workloads. The resiliency framework makes it possible to

systematically assess and evaluate customer

workloads to identify resiliency requirements as

determined by business impact analysis.


Figure 2: Business resiliency lifecycle.

Based on the business risk analysis and resiliency

requirements, and referencing the resiliency

reference architecture, an appropriate resiliency plan

is created which uses the selected cloud resiliency

patterns. For the cloud enabled workloads, the

resiliency plan selects resiliency components, and

gives configuration of the appropriate resilience

elements, such as HA clusters or data replication.

For new applications for which there are no

resiliency patterns available, guidelines are provided

to assist designing resilient applications from scratch

via patterns, reference architectures, and wizards.

As illustrated in Fig. 1, the developed resiliency

plan is deployed across the cloud and non-cloud

environments available to the client. The ongoing

operation of the customer’s resiliency mechanisms is

instrumented, and collected data is analysed to

ensure that the required resiliency and SLA levels

are being met. In addition, the framework provides

recommendations on how to improve the resiliency

posture and/or reduce the cost of resiliency while

maintaining SLAs.

3 CLOUD BUSINESS RESILIENCY LIFECYCLE

The resiliency framework is used for both initial

resiliency deployment, and for ongoing resiliency

optimization. An important component in the

framework is continuous monitoring of the deployed

workload, the environment, risks, costs, and other

parameters. Based on the variations in the workload,

risk updates, impending events and disasters, cost

variations (e.g., cost of datacenter, cost of replication

network, datacenter saturation), or variation in client

workload importance over time, the resiliency plans

are revised and updated.

Any resiliency solution, whether for high

availability or disaster recovery, undergoes a life

cycle, as shown in Fig. 2. Due to space limitations,

only a summary of the key phases is presented here.

The two major phases of the life cycle are “Plan,

Implement, and Test,” and “Manage and Sustain.” In

the former phase the requirements are “Assessed and

Evaluated”, and the resiliency solution is “Planned

and Designed,” leveraging the business resiliency

framework described earlier. At the end of this

phase, the resiliency solution is “Implemented,

Tested, and Deployed” into the production

environment and enters service.

While in service (also called steady state),

resiliency functionality is leveraged to “Protect” the

workload from the anticipated failures. All resiliency

solutions must periodically undergo “Recovery

Test” to ensure that the resiliency mechanisms are

functional. Such tests often reveal weaknesses in the

resiliency solution which in turn requires a

continuous revalidation of the “Plan, Implement, and

Test” life cycle phase to update the weak elements

of the solution.

In addition, while in service the workload may

suffer failures. The resiliency features will engage

and the workload will enter the “Failed Over or

Degraded” state. The exact configuration of this


state of course depends on the resiliency solution in

effect. If the failure did not result in any physical

destruction of the originating environment, then a

“Non-Reconstructive Failback” to that environment

is performed when that environment has been

repaired. However, if the originating environment

has been irremediably damaged, then a

“Reconstructive Failback” process is performed.

This equates strongly to re-entering the “Implement

and Acceptance Test” state.
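The life cycle described in this section can be read as a small state machine. The following is a minimal sketch of that reading, not the implementation used in IBM CMS. The state names follow the text; the event names (such as `origin_repaired`) are hypothetical labels introduced for illustration, and routing a failed recovery test back to "Assess and Evaluate" is our assumption about where the "Plan, Implement, and Test" phase is re-entered.

```python
from enum import Enum


class State(Enum):
    ASSESS_AND_EVALUATE = "Assess and Evaluate"
    PLAN_AND_DESIGN = "Plan and Design"
    IMPLEMENT_TEST_DEPLOY = "Implement, Test, and Deploy"
    PROTECT = "Protect"  # steady state
    RECOVERY_TEST = "Recovery Test"
    FAILED_OVER_OR_DEGRADED = "Failed Over or Degraded"
    NON_RECONSTRUCTIVE_FAILBACK = "Non-Reconstructive Failback"
    RECONSTRUCTIVE_FAILBACK = "Reconstructive Failback"


# Allowed transitions keyed by (current state, event).
# Event names are hypothetical; only the states come from the text.
TRANSITIONS = {
    (State.ASSESS_AND_EVALUATE, "requirements_assessed"): State.PLAN_AND_DESIGN,
    (State.PLAN_AND_DESIGN, "design_approved"): State.IMPLEMENT_TEST_DEPLOY,
    (State.IMPLEMENT_TEST_DEPLOY, "deployed"): State.PROTECT,
    (State.PROTECT, "test_due"): State.RECOVERY_TEST,
    (State.RECOVERY_TEST, "test_passed"): State.PROTECT,
    # Assumption: a revealed weakness restarts the first phase.
    (State.RECOVERY_TEST, "weakness_found"): State.ASSESS_AND_EVALUATE,
    (State.PROTECT, "failure"): State.FAILED_OVER_OR_DEGRADED,
    (State.FAILED_OVER_OR_DEGRADED, "origin_repaired"): State.NON_RECONSTRUCTIVE_FAILBACK,
    (State.FAILED_OVER_OR_DEGRADED, "origin_destroyed"): State.RECONSTRUCTIVE_FAILBACK,
    (State.NON_RECONSTRUCTIVE_FAILBACK, "failback_complete"): State.PROTECT,
    # Reconstructive failback amounts to re-implementing the environment.
    (State.RECONSTRUCTIVE_FAILBACK, "rebuild_started"): State.IMPLEMENT_TEST_DEPLOY,
}


def step(state, event):
    """Return the next state, or raise ValueError for an invalid event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.value!r}"
        ) from None


def run(events, start=State.ASSESS_AND_EVALUATE):
    """Apply a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

Keeping the transitions in a single table makes it easy to check that every state named in the life cycle has a way back to steady state.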

The next sections demonstrate an end-to-end

scenario with the application of the resiliency

framework, along with various phases and states of

the lifecycle.

4 WORKLOADS AND RESILIENCY PATTERNS CHARACTERIZATION

Enterprise applications can be deployed in several

different ways, depending on the features needed,

performance requirements, or if high availability

support is needed. Each of these different

configurations provides a different level of resiliency

inherent to that configuration. For example, SAP

HANA can be deployed in a single-node or multi-node

configuration. As a single-node deployment, it

can be scaled up to include more resources and provide

high availability for data. As a multi-node scale-out

configuration, it can be configured to support high

availability clusters, or not. Each of these

configurations achieves different levels of resiliency,

satisfying different SLA requirements. Also, each of

these configurations has a different cost base.

Since for the cloud environment we want to

provide economy of scale, we want to provide the

required level of resiliency while minimizing cost.

Highly automated environments increase resiliency

by eliminating human errors, and they also

reduce cost.

To select the optimal configuration which

provides the required level of resiliency, we introduce

an algorithm to be used with our resiliency

framework. While the framework has several

phases, design and plan, test, steady state, etc.,

resiliency evaluation and optimization can be

performed in each phase. In this paper, we focus on

the optimization during the “Plan and design” phase.

In the future, we plan to work on optimization for

the other framework phases, such as for “Test and

validate”, and for “Steady state”.

To determine the optimal resilient architecture

for a workload, we use application attributes to

qualify each application. The attributes describe

applications’ properties in terms of memory

consistency, statefulness, distribution and scaling. The attributes we

use are a result of our observation of the workloads

deployed, and attributes that must be considered for

resiliency deployment. This is by no means an

exhaustive list of workload attributes. The

attributes are listed below:

Relaxed Consistency vs. Sequential

consistency: The sequential consistency model requires

every write by any processor to be seen by all processors

in a single overall order of writes

shared by the processors, which can impact

performance. Relaxed consistency requires

programmers to implement the memory consistency

explicitly by applying synchronization.

Stateless vs. Stateful: A stateless application

does not record data generated in one session for use

in the next session. A stateful application must

record changes in state caused by events during a

session.

Distributed vs. Monolithic: A monolithic

application is a single-tiered application in which the

user interface and data access code are within a

single program. A distributed, multitier application uses a client–server

architecture in which the web interface,

application processing, and data management

functions are physically separated.

Scale-up vs. Scale-out: The scale-up (often referred

to as vertical scaling) approach adds more resources

(processors and memory) to a server, providing a

more robust server. The scale-out (or horizontal

scaling) approach adds more servers without

increasing the capacity of individual servers.

To capture characteristics of different workloads,

we distinguish a set of different workload groups.

We characterize different workload groups for each

of the given attributes. For example, we differentiate

between less critical database workloads, financial

databases, and transactional workloads, to name a

few. Other workload groups can be characterized

following our nomenclature. For example, a less

critical database workload can use a relaxed

consistency, and is implemented as a distributed

system that can grow by adding more servers. A

financial database is stricter, and it must preserve the

exact order of transactions thus demanding

sequential consistency.
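The attribute nomenclature above can be captured as a small data structure. The sketch below is illustrative only: the class and field names are our own, and only the attribute values explicitly stated in the text are filled in. Attributes that the text leaves open are kept as `None` rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Consistency(Enum):
    RELAXED = "relaxed"
    SEQUENTIAL = "sequential"


class StateModel(Enum):
    STATELESS = "stateless"
    STATEFUL = "stateful"


class Structure(Enum):
    DISTRIBUTED = "distributed"
    MONOLITHIC = "monolithic"


class Scaling(Enum):
    SCALE_UP = "scale-up"
    SCALE_OUT = "scale-out"


@dataclass(frozen=True)
class WorkloadProfile:
    """One workload group described by the four attributes (hypothetical type)."""
    name: str
    consistency: Optional[Consistency] = None
    state: Optional[StateModel] = None
    structure: Optional[Structure] = None
    scaling: Optional[Scaling] = None

    def unspecified(self):
        """Names of attributes that still have to be provided."""
        names = ("consistency", "state", "structure", "scaling")
        return [n for n in names if getattr(self, n) is None]


# Less critical database: relaxed consistency, distributed, grows by adding servers.
LESS_CRITICAL_DB = WorkloadProfile(
    name="less critical database",
    consistency=Consistency.RELAXED,
    structure=Structure.DISTRIBUTED,
    scaling=Scaling.SCALE_OUT,
)

# Financial database: the text only states that it demands sequential consistency.
FINANCIAL_DB = WorkloadProfile(
    name="financial database",
    consistency=Consistency.SEQUENTIAL,
)
```

Calling `unspecified()` on a profile shows which attributes a user would still need to supply before a resiliency pattern can be matched.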

Cloud providers offer different service level

agreement (SLA) levels to describe the level of

availability. SLAs are contractual obligations and, in

many cases, include penalties for noncompliance.


Table 1: Application characterization based on their attributes.

Typically offered SLA levels are: 99.999%, 99.99%,

99.9%, 98.5%, which describe different allowed

downtime. This translates into a tolerated maximum

downtime from 26.3 seconds per month for the

highest SLA level, to 14.4 hours of downtime per

month for servers with the lowest SLA level

(Schmidt, 2006).
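The relation between an SLA level and the tolerated downtime is simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. The sketch below assumes an average Gregorian month of 365.2425 / 12 days; this month length is our assumption, and the function name is hypothetical. Under it, 99.999% yields roughly 26.3 seconds per month. The value for the lowest level depends on the month length and rounding convention used, so the sketch reports its own figure (about 11 hours for 98.5%) rather than reproducing the 14.4 hours quoted from (Schmidt, 2006).

```python
# Average Gregorian month in seconds (assumption; a 30-day month is also common).
SECONDS_PER_MONTH = 365.2425 / 12 * 24 * 3600


def max_downtime_seconds(sla_percent, period_seconds=SECONDS_PER_MONTH):
    """Tolerated downtime per period for an availability target given in percent."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * period_seconds


for sla in (99.999, 99.99, 99.9, 98.5):
    seconds = max_downtime_seconds(sla)
    print(f"{sla}% -> {seconds:,.1f} s per month ({seconds / 3600:.2f} h)")
```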

Different resiliency patterns achieve different

levels of availability. We distinguish between high

availability (HA) solutions and disaster recovery

(DR) solutions for each SLA level. For example, to

achieve an SLA of 98.5%, the use of virtual or physical

server restart mechanisms is sufficient. To achieve

the highest SLA level, more sophisticated methods

must be used, such as high availability clustering

with servers configured in an active-active

configuration. We list some existing resiliency

patterns for HA and DR for achieving different levels

of availability in Table 2.

Table 2: Service level agreement levels, cost and resiliency solution.

Each resiliency pattern is associated with a

cost base to implement it. Thus, for a higher SLA

level, more resources must be used, which results in

a higher cost solution. For example, using a cluster

of servers to implement a high availability cluster

offers a higher availability solution, but it also costs

more than restarting a single server, as in a lower

level availability solution.

5 RESILIENCY PATTERN OPTIMIZATION

For our resiliency pattern optimization algorithm, we

quantify different resiliency patterns we can use as a

solution architecture to ensure high availability for

workloads.

Each resiliency solution has a range of

availability numbers, cost and recovery time

associated with it. The cost of any solution has

multiple contributing components such as

cost{overhead, operational cost, deployment cost,

maintenance cost, resource cost}. Recovery time is

defined as the range of minutes it takes for the

workload to recover.
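The cost components and the recovery-time range can be represented as plain records. This is a minimal sketch under stated assumptions: the text lists the contributing components but does not say how they combine, so summing them is our assumption, and the class names and units are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PatternCost:
    """Cost components named in the text, in arbitrary but consistent units."""
    overhead: float
    operational: float
    deployment: float
    maintenance: float
    resource: float

    def total(self):
        # Assumption: components are additive; the paper does not specify this.
        return sum(getattr(self, f.name) for f in fields(self))


@dataclass(frozen=True)
class RecoveryTime:
    """Recovery time expressed as a range of minutes."""
    min_minutes: float
    max_minutes: float

    def __post_init__(self):
        if not 0 <= self.min_minutes <= self.max_minutes:
            raise ValueError("expected 0 <= min_minutes <= max_minutes")
```

For instance, `PatternCost(1, 2, 3, 4, 5).total()` adds the five components into a single figure that can be compared across patterns.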

For example, the active replication pattern ensures

advanced high availability but comes at a high

operational cost, whereas virtual machine restart

provides moderate availability at a low operational

cost. However, the operation disruption caused by a restart may not be

acceptable for mission-critical workloads. The

attributes associated with resiliency patterns are

captured by subject matter experts.

When submitting a request for a business

resiliency solution, a user may specify the

application attributes based on the system’s

guidance or select all the standard attributes for a

given application listed in the best practices catalog

by the service provider. We list only a subset of

possible attributes and their mapping to applications.

The mapping is continuously evolving for new

applications and identified attributes. These

attributes may be reprioritized over time, and revised

as learnt through the system to eliminate correlated

attributes.

For a given SLA, our algorithm selects the

optimal resiliency pattern that matches the given

application attributes and the availability while

minimizing the total cost. The combination of the

attributes of a workload and the desired SLA level

drives the cost of the appropriate resiliency solution.

The algorithm performs the following steps:

• For a given workload, enumerate the attributes

of the workload.

• Select the required SLA.

• For the given SLA and attributes, select possible

resiliency patterns.

• From the possible resiliency patterns, select the

lowest-cost pattern for which the desired SLA is met.


• Add it to the library of resiliency pattern solutions for

that application and the given SLA.
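The steps above can be sketched in a few lines of Python. This is a hedged illustration, not the production algorithm: the `ResiliencyPattern` record, the attribute strings, and all availability and cost numbers in `CATALOG` are made up for the example, and looking in the library before searching is our assumption about how the pre-matched solutions are reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResiliencyPattern:
    name: str
    availability: float   # availability the pattern can deliver, in percent
    cost: float           # total cost in arbitrary units
    supports: frozenset   # workload attributes the pattern is compatible with


def select_pattern(workload, attributes, required_sla, patterns, library):
    """Return the lowest-cost pattern that fits the attributes and meets the SLA."""
    attrs = frozenset(attributes)            # step 1: enumerated attributes
    key = (workload, attrs, required_sla)    # step 2: required SLA
    if key in library:                       # reuse a pre-matched solution
        return library[key]
    # Step 3: patterns compatible with the attributes that meet the SLA.
    candidates = [p for p in patterns
                  if attrs <= p.supports and p.availability >= required_sla]
    if not candidates:
        raise LookupError(f"no pattern meets {required_sla}% for {workload}")
    best = min(candidates, key=lambda p: p.cost)   # step 4: lowest cost
    library[key] = best                            # step 5: add to the library
    return best


# Illustrative catalog; the numbers are placeholders, not measured values.
ALL = frozenset({"stateless", "stateful", "scale-up", "scale-out"})
CATALOG = [
    ResiliencyPattern("server restart", 98.5, 1.0, ALL),
    ResiliencyPattern("active-passive HA cluster", 99.9, 3.0, ALL),
    ResiliencyPattern("active-active HA cluster", 99.999, 6.0, ALL),
]

library = {}
choice = select_pattern("demo-db", {"stateful", "scale-up"}, 99.9, CATALOG, library)
print(choice.name)
```

Because the catalog is small and statically defined, a linear scan with `min` is sufficient; a larger catalog could be indexed by SLA level instead.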

This algorithm effectively maps the user provided

input workload attributes to the attributes captured

for each of the resiliency solutions. Every newly

determined resiliency pattern is added to the library

of statically defined pattern-workload mappings,

which contains a pre-matched set of solutions for

the selected combination of attributes.

Each resiliency solution has an embedded

availability model {number of nodes, heartbeat, type

of box, type of storage} that can be adjusted at any

stage of the process.
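To illustrate how such an availability model might be evaluated, the sketch below stores the four parameters named above and applies the textbook parallel-redundancy estimate. The estimate is our assumption and not taken from the paper: it treats node failures as independent, assumes failover always succeeds, and ignores the heartbeat interval, which in practice affects how quickly a failure is detected. All names and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityModel:
    nodes: int                 # number of nodes in the cluster
    heartbeat_seconds: float   # heartbeat interval (not used by the estimate)
    box_type: str              # e.g. "virtual" or "physical" (illustrative)
    storage_type: str          # e.g. "shared" (illustrative)


def cluster_availability(model, node_availability, storage_availability=1.0):
    """Estimate availability as a fraction in [0, 1].

    The service is up if at least one node is up and the storage is up,
    assuming independent failures.
    """
    if model.nodes < 1:
        raise ValueError("a cluster needs at least one node")
    all_nodes_down = (1.0 - node_availability) ** model.nodes
    return (1.0 - all_nodes_down) * storage_availability


two_node = AvailabilityModel(nodes=2, heartbeat_seconds=1.0,
                             box_type="virtual", storage_type="shared")
print(cluster_availability(two_node, node_availability=0.99))
```

Adjusting `nodes` in the model and re-running the estimate shows how adding redundancy moves a solution toward a higher SLA level.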

6 CASE STUDY: BUSINESS RESILIENCY FRAMEWORK FOR HANA

The SAP HANA appliance can be deployed on a single

node server (without high availability), scale-up with

high availability, scale-out without high availability,

or a scale-out multi node cluster to provide high

availability. Due to the complex deployment and high

cost associated with an SAP HANA solution,

it is generally recommended to first scale-up the

solution as much as possible (i.e., to add more

resources to the server) before considering the

option to scale-out (to distribute the application on

multiple servers). Scale-out is primarily available for

analytics workloads like BW on HANA or DataMart

scenarios. Scale-up is generally available for the

transactional workloads like SAP Business Suite on

HANA including ERP, CRM, SRM, SCM, etc.

Figure 3 shows a scenario where a customer

initially needs to host a small-to-medium sized

critical BW analytics application to provide real-time

feeds to its sample users. In the “Plan,

Implement, Test” phase, first the requirements are

assessed and evaluated. Based on the assessment

and evaluation in the “Assess and Evaluate” state, a

scale-up solution with high availability is

planned and designed. The solution is planned,

designed, implemented, tested, and deployed with an

active-passive configuration.

During the “Manage and Sustain” phase, the

solution is maintained in steady state, where it is

monitored for performance and capacity constraints.

The high-availability setup is tested

periodically to proactively validate and fine-tune

it in preparation for an actual failure. In case of an

actual failure of the primary node, the workload is

failed-over to the standby node. The deployed

solution is scaled-up to its maximum capability,

based on the event and capacity monitoring and

recovery test functionality.

Figure 3: Use of elements of the resiliency framework across the resiliency life cycle.

Figure 4: Use of elements of the resiliency framework across the resiliency life cycle.

Over time, based on the performance data collected

during the “Manage and Sustain” phase along with

further assessment and evaluation of the customer’s

growing need to host a large analytics application

with a larger number of real users, a scale-out solution

is selected. It offers high availability to provide

increased benefit with the corresponding increased

investment. The new solution is planned, designed,

implemented and tested with the right-sized business

resiliency solution to cater to the customer

requirements, as shown in Figure 4.

7 RELATED WORK

Enterprise-class customers (e.g., banks, insurers


and airlines) need management services such as

monitoring, patching, backup, change control, high

availability and disaster recovery to support systems

running complex applications with stringent IT

process control and quality-of-service (QoS)

requirements. Such features are typically offered by

IT service providers in strategic outsourcing (SO)

engagements, a business model in which the

provider takes over several aspects of management

of a customer’s datacenter resources, software

assets, and processes. Servers with such support are

characterized as being managed.

This should be contrasted with unmanaged

servers provisioned using basic Amazon Web

Services (AWS) (Miller, 2010; AWS Corporation,

2017) and IBM’s SoftLayer (SoftLayer, 2017)

offerings, where the cloud provider offers automated

server provisioning. To make a server managed,

these cloud service providers have networked with

other service partners that customers can engage to

fill all the gaps up and down the stack. This enables

the user to add services to the provisioned server,

but the cloud provider assumes no responsibility for

their upkeep or the additional services added.

Therefore, it puts the burden on the customer to obtain a

fully managed solution for their enterprise workload

rather than the cloud service providing an end-to-end

fully managed solution for the customers.

AWS provides the IT resources so that the

customers can launch entire SAP enterprise software

stacks on the AWS Cloud. AWS Cloud is SAP

verified and certified. AWS provides highly reliable

services and multiple fault-tolerant Availability

Zones for disaster recovery implementations.

The IBM Cloud Managed Services (CMS)

product (IBM Corporation, 2017) from IBM is an

enterprise cloud which provides managed services

for critical workloads and enterprise-level SLA

mechanisms. CMS supports several software

services, such as CMS4SAP, CMS4ORACLE,

and AMM4SAP.

HANA is fully certified to run on the VMware

platform (King, 2014). vSphere 5.5 has a limitation

in that the largest VM can be created with only 1 TB of

memory. Depending on the usage of the

data, both warm and cold data can reside together on

the disk. This enables extension of the total size of

the SAP HANA database above 1 TB. Currently,

several cloud providers are enabling themselves

to support more options for SAP and SAP HANA

workloads.

In (Dekel, 2003), the authors have described a

system that focuses on performance aware high

availability which is achieved through cloning and

replication of application’s state. Our work focuses

on a resiliency framework to determine and deploy

the optimal resiliency support for a given workload

based on its characteristics.

8 LESSONS LEARNT AND CONCLUSIONS

During enablement of enterprise workloads in

IBM’s CMS cloud, several points became apparent.

The first insight is that each enterprise customer has a

varied set of resiliency requirements for the

workload that they are running, depending on the

nature of their business. Therefore, cloud service

providers must handle such heterogeneous

requirements with the least amount of customization

possible, and deliver on a tight schedule

while maintaining low cost.

The second insight is that a variety of cluster

setup configurations may be possible, and the

required setup may vary from workload to

workload. Additionally, the cluster setup may

evolve over time based on the changing requirements

of the workload. The cloud provider

must also support the application-level replication

technology that the deployed applications

require. As the requirements are highly variable

and may evolve over time as the workload evolves, it

is crucial to systematize and standardize the end-to-end

process of resiliency solution planning,

implementation, testing and delivery.

Another insight is that multiple levels of resiliency

at infrastructure, middleware and application levels

are required for increased system reliability.

Implementing multiple levels of resiliency delivers a

more robust system, while enabling operation of

these different levels of resiliency seamlessly.

Enterprise-class customers, such as banks,

financial institutions, hospitals, governments, utility

companies, etc. can suffer high business losses even

from short outages and service interruptions in the IT

infrastructure. The cost of downtime could dissolve

a business, or cause irreparable brand damage, loss of

customer data and reputation. A structured and

continuously improving mechanism is required to

deliver the level of resiliency needed by the various

enterprise applications.

We introduced an end-to-end business resiliency

framework and resiliency life cycle. We further

discussed various resiliency patterns implemented

for enterprise applications using a diverse set of

platforms in the IBM CMS cloud offering. To


determine the optimal resiliency pattern for various

applications, we introduced an optimization algorithm

that takes into consideration application attributes

and the desired SLA level. We showcased an end-to-end

application of the resiliency framework and

resiliency life cycle for an SAP HANA scenario.

REFERENCES

L. Hossain, J. D. Patrick, and M. A. Rashid, Enterprise

Resource Planning: Global Opportunities and

Challenges. Hershey Park, PA: Idea Group

Publishing. 2001.

Gargeya, VB 2005, ‘Success and failure factors of

adopting SAP in ERP system implementation’,

Business Process Management Journal, Vol.11, No.5,

pp501–516.

IBM Corporation, IBM Cloud Managed Services.

[Online]. Available:

http://www.ibm.com/marketplace/cloud/managed-

cloud/us/en-us. Last Accessed: 2017-03-15.

A. Kochut, Y. Deng, M. R. Head, J. Munson, A. Sailer, H.

Shaikh, C. Tang, A. Amies, M. Beaton, D. Geiss, D.

Herman, H. Macho, S. Pappe, S. Peddle, R. Rendahl,

A. E. T. Reyes, H. Sluiman, B. Snitzer, T. Volin, and

H. Wagner, "Evolution of the IBM cloud: Enabling

an enterprise cloud services ecosystem," IBM Journal

of Research and Development, vol. 55, pp. 397-409,

Nov. 2011.

V. Salapura, R. Harper, and M. Viswanathan. "Resilient

cloud computing." IBM Journal of Research and

Development, vol. 57 no. 5, 2013.

F. Färber, N. May, W. Lehner, P. Große, I. Müller, H.

Rauhe, and J. Dees, "The SAP HANA Database--An

Architecture Overview," in IEEE Data Eng. Bull. vol.

35, no. 1, pp. 28-33, 2012.

F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg,

and W. Lehner, "SAP HANA database: data

management for modern business applications," in

ACM Sigmod Record, vol. 40, no. 4, pp 45-51, 2012.

K. Schmidt, High Availability and Disaster Recovery:

Concepts, Design, Implementation. Springer Science

and Business Media, 2006.

F. P. Miller, A. F. Vandome, and J. McBrewster, Amazon

Web Services. Alpha Press, 2010.

AWS Corporation, Amazon Elastic File System. [Online].

Available: https://aws.amazon.com/efs/. Last

Accessed: 2017-03-15.

SoftLayer. [Online]. Available: http://www.softlayer.com/.

Last Accessed: 2017-03-15.

C. King, “Demystifying Production SAP HANA on

VMware vSphere Implementations” VMWare White

Paper: (2014). [Online]. Available:

http://info.vmware.com/content/31421_Whitepaper_

Reg?asset=whitepaper&cid=70180000000Nlj1&src=

wsite. Last Accessed: 2017-03-15.

E. Dekel, O. Frenkel, G. Goft, Y. Moatti, "Easy:

engineering high availability QoS in wServices",

Reliable Distributed Systems 2003. Proceedings.

22nd International Symposium on, pp. 157-166,

2003, ISSN 1060-9857.
