Practical Aspects for Effective Monitoring of SLAs in Cloud Computing

and Virtual Platforms

Ali Imran Jehangiri, Edwin Yaqub and Ramin Yahyapour

Gesellschaft f¨ur wissenschaftliche Datenverarbeitung mbH G¨ottingen (GWDG),

Am Faßberg 11, 37077 G¨ottingen, Germany

Keywords:

SLA, Cloud, Root Cause, Monitoring, QoS, Analytics, IaaS, PaaS, SaaS.

Abstract:

Cloud computing is transforming the software landscape. Software services are increasingly designed in mod-

ular and decoupled fashion that communicate over a network and are deployed on the Cloud. Cloud offers

three service models namely Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-

as-a-Service (SaaS). Although this allows better management of resources, the Quality of Service (QoS) in dy-

namically changing environments like Cloud must be legally stipulated as a Service Level Agreement (SLA).

This introduces several challenges in the area of SLA enforcement. A key problem is detecting the root cause

of performance problems which may lie in hosted service or deployment platforms (PaaS or IaaS), and ad-

justing resources accordingly. Monitoring and Analytic methods have emerged as promising and inevitable

solutions in this context, but require precise real time monitoring data. Towards this goal, we assess practical

aspects for effective monitoring of SLA-awareservices hosted in Cloud. Wepresent two real-world application

scenarios for deriving requirements and present the prototype of our Monitoring and Analytics framework. We

claim that this work provides necessary foundations for researching SLA-aware root cause analysis algorithms

under realistic setup.

1 INTRODUCTION

Today more and more (monolithic) applications are

decomposed into smaller components which are then

executed as services on virtualized platforms con-

nected via network communication and orchestrated

to deliver the desired functionality. While the foun-

dations for decomposing, executing and orchestrat-

ing were well settled over the past decade, allocating

the needed resources for and steering the execution of

components to deliver the required Quality of Service

(QoS) is an active area of research. A critical aspect

of steering complex service-based applications on vir-

tualized platforms is effective, non-intrusive, low-

footprint monitoring of key performance indicators

at different provisioning tiers typically Infrastructure-

as-a-Service, Platform-as-a-Service and Software-as-

a-Service. These key performance indicators are

assessed to verify that a Service Level Agreement

(SLA) between a customer and a provider is met. Ide-

ally, the assessment goes beyond simply detecting vi-

olations of the agreed terms, but tries to predict and

pre-empt potential violations. The provider enacts

counter-measures to preventor resolve the violation if

it does occur. Deriving effective counter-measures re-

quires precise monitoring information spanning mul-

tiple tiers of the virtualized platform and analysis of

monitoring data to identify the root cause(s) of per-

formance problems.

Monitoring systems have been used for decades

in different computing paradigms. Monitoring solu-

tions for previous computing paradigms pose signiﬁ-

cant limitations for their widespread adoption in large

scale, virtual platforms. The major obstacles with

these monitoring techniques are, their high perfor-

mance overhead, reliability, isolation, limited scala-

bility, reliance on proprietary protocols and technolo-

gies.

Common performance diagnosis procedures de-

pend on system administrator’s domain knowledge

and associated performance best practices. This pro-

cedure is labor intensive, error prone, and not feasi-

ble for virtual platforms. The prior art on detecting

and diagnosing faults in computing systems can be re-

viewed in (Appleby et al., 2001) (Molenkamp, 2002)

(Agarwal et al., 2004) (Chen et al., 2002)(Barham

et al., 2003). These methods do not consider virtual-

ization technologies and are inappropriate for rapidly

447

Imran Jehangiri A., Yaqub E. and Yahyapour R..

Practical Aspects for Effective Monitoring of SLAs in Cloud Computing and Virtual Platforms.

DOI: 10.5220/0004507504470454

In Proceedings of the 3rd International Conference on Cloud Computing and Services Science (CLOSER-2013), pages 447-454

ISBN: 978-989-8565-52-5

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

changing, large scale virtual platforms that by very

nature require effectiveautomated techniques for QoS

fault diagnosis.

Our research motivations are to study the effec-

tiveness and practicality of different techniques for

performance problem diagnosis and SLA based re-

source management of virtual platforms. This is very

well applicable to Cloud Computing where efﬁcient

monitoring is essential to accomplish these tasks. The

remainder of this paper is organized as follows. Sec-

tion 2 presents scenarios of our interest against which

in Section 3, we derive requirements for monitoring

and analytics. Based on this, in Section 4, we present

our Monitoring and Analytics framework prototype

developed as part of the GWDG Cloud Infrastructure.

Section 5 describes related work and ﬁnally, we con-

clude the paper in Section 6 with a summary and fu-

ture plan.

2 PERFORMANCE

MANAGEMENT SCENARIOS

AT GWDG

The GWDG is a joint data processing institute of the

Georg-August-Universit¨atG¨ottingen and Max Planck

Society. GWDG offers a wide range of information

and communication services. GWDG also owns a

state of the art Cloud Infrastructure. The Cloud infras-

tructure consists of 42 physical servers with a total of

2496 CPU cores and 9.75 Terabytes of RAM. Four of

the servers are Fujitsu PY RX200S7 using Intel Xeon

E5-2670. Thirty eight of the servers are Dell Pow-

erEdge C6145 using AMD Interlagos Opteron. The

raw disk capacity of the servers is 18.55 Terabytes.

Additionally, it hosts 1 PetaByte of distributed data

storage. On top of this, GWDG is offering “GWDG

Compute Cloud” and “GWDG Platform Cloud” ser-

vices. Currently, a self-service portal provides single-

click provisioning of pre-conﬁgured services. In fu-

ture, agents will be introduced to automatically nego-

tiate SLAs embodying desired qualities of procured

services, as outlined in our recent research (Yaqub

et al., 2011) (Yaqub et al., 2012).

GWDG Cloud service customers are divided in

two categories. The ﬁrst category are small institutes

and novice individuals. They require simple off the

shelf software services such as WordPress, Moodle,

MeidaWiki, etc. These services are served by Plat-

form Cloud which can automatically scale and moni-

tor them. The second category of customers are large

institutes and advanced customers. They have ad-

ditional performance, availability and scalability re-

quirements on top of multi-tier architectures and as

a result have much more complex large scale dis-

tributed services. This class of customers prefer to

only procure VMs with a pre-installed base operating

system (OS) from the Compute Cloud. These cus-

tomers already have IT staff that administer the sys-

tem, handle support and scalability concerns and do

not require support from Cloud provider to manage

their services running inside VMs.

As part of the motivation for requirement elicita-

tion, we studied two Learning Management Systems

(LMS), which are web based environments created

especially to support, organize and manage teaching

and learning activities of academic institutes.

2.1 Scenario 1: LMS on GWDG

Platform Cloud

Moodle is a free web based LMS. It is a web ap-

plication written in PHP. A simple Moodle installa-

tion comprises the Moodle code executing in a PHP-

capable web server, a database managed by MySQL

and a ﬁle store for uploaded and generated ﬁles. All

three parts can run on a single server or for scalabil-

ity, they can be separated on different web-servers,

a database server and ﬁle server. Moodle is a mod-

ular system, structured as an application core and

supported by numerous plugins that provide speciﬁc

functionality. Customers can install Moodle with a

single click on the web interface of GWDG Platform

Cloud. One of the most important advantages of host-

ing Moodle on GWDG Platform Cloud is the ability

to scale up or down quickly and easily.

GWDG Platform Cloud is based on open source,

community supported version of RedHat OpenShift

Origin PaaS middleware(OpenShift, 2013). It enables

application developers and teams to build, test, de-

ploy, and run applications in the Cloud. Users can

create applications via command line or IDE client

tools. Platform Cloud is a multi-language PaaS that

supports a variety of languages and middleware out

of the box including Java, Ruby, Python, PHP, Perl,

MySQL and PostgreSQL. Platform Cloud is deployed

on top of GWDG Compute Cloud and is in its early

test phase. Figure 1 depicts the resulting dependen-

cies after hosting Moodle on Platform Cloud.

2.2 Scenario 2: LMS on GWDG

Compute Cloud

Electronic Work Space (EWS) is another LMS that

is used by the University of Dortmund. Teachers

and students of the University use EWS to publish

information and materials for lectures, seminars and

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

448

Figure 1: Scenario 1: Services dependencies.

classes. Currently, there are approximately 30,000

registered users of this service. EWS is a Java EE ap-

plication deployed in JBoss Application Server (AS).

Its structure is highly modular and at Dortmund Uni-

versity, it was tailored to interface with popular ph-

pBB forum and MediaWiki servers. Moreover, to fa-

cilitate collaborative editing and management of doc-

uments, a WebDAV server was also attached with it.

For video streaming, it was interfaced with a dedi-

cated video streaming server from the University of

Duisburg-Essen. Information about research projects,

lectures and scientists working at the University is

managed by another platform called “Lehre Studium

Forschung (LSF)”. Students have the possibility to

look at the course catalog and register for courses

at LSF. Data (participant, room, course description)

from LSF is automatically transferred to EWS by a

custom middleware (a Java EE application) that is de-

ployed on a separate JBoss AS. Oracle database is

used as the content repository.

EWS is a complex application and requires multi-

VM deployment to address scalability, load balanc-

ing, and availability requirements. In addition, secu-

rity and privacy are other main concerns when con-

sidering deployment over Cloud infrastructure. For

such complex applications, GWDG Compute Cloud

is more suitable where advance customers procure

VMs with base OS and some monolithic middle-

Figure 2: Scenario 2: Services dependencies.

ware. Applications running inside these VMs appear

as black-box to the Compute Cloud administrators.

The “GWDG Compute Cloud” is a service sim-

ilar to the well-known commercial IaaS like Ama-

zon EC2. It is especially tailored to the needs of

partner institutes. It provides a simpliﬁed web inter-

face for provisioning and managing the virtualized re-

sources (VMs, disk, public IPs). The self-service in-

terface allows customers to choose different VM ﬂa-

vors (in terms of available processors, memory and

storage) and operatingsystems. Customers can access

their VMs directly from a web browser using the Vir-

tual Network Computing (VNC) protocol. Customers

can also attach the public IP address with VMs on

the ﬂy. The GWDG Compute Cloud is based upon

open source products such as OpenStack, KVM, and

Linux. The service is in public test phase and is

available to members of the Max Planck Society and

the University of G¨ottingen. Figure 2 depicts the re-

sulting dependencies after hosting EWS on Compute

Cloud.

2.3 Discussion

The key concern for Cloud customers is the availabil-

ity and performance of SaaS. In scenario 1, we have

a hierarchical dependency between SaaS, PaaS and

IaaS tiers of Cloud. However, in scenario 2, our LMS

PracticalAspectsforEffectiveMonitoringofSLAsinCloudComputingandVirtualPlatforms

449

application (SaaS) is only dependent on IaaS tier of

the Cloud. These dependencies lead to strong corre-

lation between some performance metrics.

If customers of LMS are experiencing perfor-

mance or availability problems, then both Cloud

provider and customer needs to ﬁnd the root cause

of the problem in their domain of responsibility. QoS

degradation may be due to the internal components

of SaaS or the problem could have propagated from

lower layers of the Cloud stack. For example, LMS

might guarantee to its users that the response time of

an HTTP request should be less than 1 second under

a ﬁxed request invocation rate. If users experience

response time greater than 1 second, then possible

cause could lie in the customer domain, e.g., invo-

cation rate is increased or the network between web

server and database is congested, etc., but the prob-

lem could also lie in PaaS domain, e.g., due to a slow

DNS server, contention of resources caused by col-

located applications. The problem could even lie in

the IaaS domain for instance, due to a malfunctioning

virtual network or contention of resources due to col-

located VMs, etc. Therefore, service providers need

an Analytics module to pinpoint the component/tier

responsible for QoS degradation.

Analytics module process monitoring data from a

wide set of components/tiers involved in service de-

livery. It is vital to determine a clear ownership of re-

sponsibility when problems occur. In real world com-

plex scenarios as the ones mentioned above, this is

a very challenging task. Scenario 1 incorporates in-

frastructure, platform and software service tiers. Our

experience shows that all the components from these

tiers need to be monitored. Although in scenario 2,

we only have SaaS and IaaS tiers, but monitoring is

still difﬁcult as the SaaS tier is highly complex and

appears only as a black-box to the IaaS tier. In the

given context, we identify three challenging problems

that we address in this work. These are:

1. Complexity: Usually the environment to monitor

is highly complex due to the complicated nature

of service delivery tiers and hosted applications.

2. Monitoring Isolation and Heterogeneity: Cloud

tiers are best monitored by separate monitoring

technologiesthat should work in isolated but com-

plementary fashion.

3. Scalability: Monitoring parameters grow expo-

nentially to the number of applications and ele-

ments belonging to the Cloud tiers. Hence, scal-

ability of monitoring approaches is of prime con-

cern as well as the method to deploy them auto-

matically.

3 REQUIREMENTS

In the following, requirements derived from the pre-

sented scenarios and general considerations from the

literature (Ejarque et al., 2011) (Jeune et al., 2012)

(Aceto et al., 2012) (Hasselmeyer and D’Heureuse,

2010) are generalised. These requirements have a

generic applicability where a Software-as-a-Service

(SaaS), e.g., Learning Management System (LMS)

is based upon Platform-as-a-Service (PaaS) or

Infrastructure-as-a-Service (IaaS) tier. We believe

that a thorough understanding of these requirements

provides solid foundations for an effective Monitor-

ing and Analytics solution for virtual platforms and

Cloud Computing.

3.1 Monitoring Framework (MF)

Requirements

M1. Scalability. The MF should be scalable i.e. it

can cope with a large number of monitoring data col-

lectors. This requirement is very important in Cloud

Computing scenarios due to a large number of param-

eters to be monitored for potentially large amount of

services and elements of Cloud tiers that may grow

elastically.

M2. Heterogeneous Data. The MF should consider

a heterogeneous group of metrics. The MF must al-

low the collection of service level runtime monitoring

data, virtual IT-infrastructure monitoring data (e.g.,

VM level runtime monitoring), and ﬁne-grained phys-

ical IT-infrastructure monitoring data (e.g., network

links, computing and storage resources).

M3. Polling Interval. The data collection mech-

anism must allow the dynamic customization of the

polling interval. Dynamic nature of virtual platforms

demands gathering of data in a sufﬁciently frequent

manner, meaning that nodes should be monitored

continuously. Naturally, smaller polling intervals in-

troduce signiﬁcant processing overhead inside nodes

themselves. However, long polling intervals do not

provide a clear picture of the monitored components.

Therefore, an optimal trade-off between polling inter-

val and processing overhead is required.

M4. Relationship. In the above mentioned scenario,

clusters of VMs and Physical Machines (PM) serve

many kinds of applications, so there is a hierarchi-

cal relationship between applications, VMs and PMs.

There is also a possibility of migration of VMs and

applications from one node to another, so relation-

ships can be changed dynamically. The metric’s value

must be tagged to showthat they belong to a particular

instance (e.g., application), and what is their relation

to other instances (e.g., VM and PM).

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

450

M5. Data Repository. The MF requires a data

repository where raw monitoring data needs to be

stored after collection. The original data set must

be stored without down-sampling for auditing pur-

poses. The stored, raw monitoring data can be re-

trieved by consumers to perform QoS fault diagno-

sis, SLA validation, plot rendering, and as an input

for ﬁne grained resource management. The database

must be distributed in order to avoid single point of

failure. Moreover, it must be scalable, and allow to

store thousands of metrics and potentially billions of

data points.

M6. Non-intrusive. The MF must be able to retrieve

data non-intrusivelyfrom a variety of sources (for VM

via libvirt API, for a host via cgroups, for the net-

work via SNMP, for Java applications via JMX, etc.).

Collection mechanism should easily be extensible by

adding more plugins.

M7. Interface. The MF should provide a REST inter-

face that allows access to the current monitoring data

in a uniform and easy way, by abstracting the com-

plexity of underlying monitoring systems. A standard

uniﬁed interface for common management and mon-

itoring tasks can make different virtualization tech-

nologies and Cloud providers interoperable. A REST

interface is a good choice due to ease of implemen-

tation, low overhead and good scalability due to its

session-less architecture.

3.2 Analytics Engine (AE) Requirement

Collecting monitoring data is essential but not sufﬁ-

cient per se to explain the observed performance of

services. In the next phase, we need to analyze and

verify data in light of Service Level Agreement (SLA)

between a customer and a provider. Ideally, the anal-

ysis goes beyond simply detecting violation of agreed

terms and predicts potential violations.

General requirements for an Analytics Engine

(AE) are detailed below.

A1. Data Source. AE must be able to fetch moni-

toring data recorded in the database. Further, it must

be able to query the Cloud middlewares (e.g., that of

OpenStack and OpenShift) and application APIs to

know the current status of the services.

A2. Proactive. AE must support the proactive man-

agement of resources. Proactive management needs

short term and medium term predictions for the evo-

lution of most relevant metrics.

A3. Alerts. Certain QoS metrics need to be processed

in real time and alerts should be triggered when these

QoS metrics are violated or approach certain thresh-

old values.

A4. Event Correlation. Detecting the root cause of

QoS faults and taking effective counter measures re-

quires monitoring information spanning multiple tiers

of the virtualized platforms. Quick incomprehensive

analysis of monitoring data of individual tiers does

not reveal the root cause(s) of the problem precisely

enough. Therefore, Analytics need to exhaustively

aggregate runtime data from different sources and

consolidate information at a high level of abstraction.

A5: Identiﬁcation of Inﬂuential Metrics. Identiﬁca-

tion of the metrics which strongly inﬂuence the QoS

helps in decreasing the monitoring footprint and anal-

ysis complexity.

4 MONITORING AND

ANALYTICS FRAMEWORK

PROTOTYPE

Our initial monitoring prototype focused on require-

ments (M1-M6). We conducted a thorough analysis

of technologies to be used by our framework. In de-

ciding upon technology, our criteria included de-facto

industry standards that are capable of providing a high

degree of ﬂexibility and scalability to our architec-

ture. Figure 3 gives a high level view of our Moni-

toring and Analytics framework. Our framework uses

OpenTSDB(OpenTSDB, 2013) for collecting, aggre-

gating and storing data. OpenTSDB uses the HBase

distributed database system in order to persistently

store incoming data for hosts and applications. HBase

is a highly scaleable database designed to run on a

cluster of computers. HBase scales horizontally as

one adds more machines to the cluster. OpenTSDB

makes data collection linearly scalable by placing the

burden of the collection on the hosts being monitored.

Each host uses tcollector client side collection library

for sending data to OpenTSDB. Tcollector does all

of the connection management work of sending data

to OpenTSDB and de-duplication of repeated values.

We instrumented all OpenStack (OpenStack, 2013)

compute nodes with tcollector framework. Compute

node speciﬁc resource utilization metrics are gathered

by collectors those are part of the base package. Our

OpenStack environmentutilizes KVM as the hypervi-

sor and libvirt for Virtualization management. Libvirt

API can provide us monitoring information of hosted

VMs. We wrote a custom collector that retrieves data

non-intrusively for VMs via libvirt API. The system

resources and security containers provided by Open

Shift are gears and nodes. The nodes run the user

applications in contained environments called gears.

A gear is a unit of CPU, memory, and disk-space

on which application components can run. To en-

PracticalAspectsforEffectiveMonitoringofSLAsinCloudComputingandVirtualPlatforms

451

Figure 3: Monitoring and Analysis framework prototype.

able us to share resources, multiple gears run on a

single node. For performance analysis of hosted ap-

plications, we want to track and report utilization of

gears, where as node utilization is already monitored

by OpenStack collector. To track the utilization of

gears, we instrumented all nodes of PaaS with tcollec-

tor. Linux kernel cgroups are used on OpenShift node

hosts to contain application processes and to fairly al-

locate resources. We wrote a custom collector that

retrieves data non-intrusively for gears via cgroups

ﬁle system. Additionally, monitoring data can be col-

lected from REST APIs of Cloud Services by Service

Subscribers component.

The Analytics component is based on the ESPER

complex event processing (CEP) framework(ESPER,

2013). Analytics component leverages Esper to fore-

cast the evolution of metrics by using Holt-Winters

forecasting. Analytics component implements the

SLA surveillance function and the proper alarms are

triggered when SLAs get violated. Analytics frame-

work’s listener components retrieve service related

events from different APIs. The analytics publisher

component sends alerts to SLA Management compo-

nent for taking corrective measures.

Custom dashboard interfaces are developed for

GWDG Platform and Compute portals. These dash-

boards allow end users to view particular met-

rics of running VMs and applications recorded by

OpenTSDB. For plotting data, we used Flot - a plot-

ting library for jQuery JavaScript framework.

5 RELATED WORK

Discussion of related work is divided into areas of

Monitoring Systems for Large Scale Cloud Platforms

and Root Cause Analysis.

Ganglia (Massie, 2004) is a scalable distributed

monitoring system for high performance computing

systems. Nagios(Nagios, 2013) is an open source

solution to monitor hosts and services (for example

HTTP, FTP, etc.) It can monitor more or less ev-

erything for which a sensor exists and is extensible

through a plug-in mechanism. MonALISA(Stratan

et al., 2009) provides a distributed monitoring ser-

vice based on a scalable dynamic distributed architec-

ture. These systems are designed mainly for monitor-

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

452

ing distributed systems and Grids, but do not address

the requirements of Cloud monitoring, e.g., elasticity.

Elasticity is a fundamentalproperty of Cloud comput-

ing and is not considered as a requirement by these

systems. As a result these system can not cope with

dynamic changes of monitored entities. Moreover,

none of these systems provides built-in advanced an-

alytics e.g. complex event processing (CEP), stream

processing or machine learning algorithms. Open

source Cloud platforms like OpenNebula, OpenStack,

Cloud Foundry and OpenShift Origin offer only very

basic monitoring, which is not very useful for root

cause analysis of performance anomalies. Katsaros

et al., presents service-oriented approach for collect-

ing and storing monitoring data from a physical and

virtual infrastructure. The proposed solution extends

Nagios with a RESTful interface (Katsaros et al.,

2011). Rak et al., presents a brief overview of the

mOSAIC API, that can be used to build up a cus-

tom monitoring system for a given Cloud application

(Rak et al., 2011). Aceto et al., provides speciﬁc anal-

ysis on deﬁnitions, issues and future directions for

Cloud monitoring (Aceto et al., 2012). Hasselmeyer

and D’Heureuse proposes a monitoring infrastructure

that was designed with scalability, multi-tenancy, dy-

namism and simplicity as major design goals (Has-

selmeyer and D’Heureuse, 2010). Most of the above

mentioned monitoring techniques address one spe-

ciﬁc functional tier at a time. This makes them inad-

equate in real world domains, where changes in one

tier effect others.

Root cause analysis is known throughout the lit-

erature. Commercial service management systems

like HP OpenView (OpenView, 2013) or IBM Tivoli

(Tivoli, 2013) can help in root cause analysis. These

tools use expert systems with rules and machine learn-

ing techniques. Magpie (Barham et al., 2003) uses

machine learning to build a probabilistic model of re-

quest behavior moving through the distributed sys-

tem to analyze system performance. InteMon (Hoke

et al., 2006) is an intelligent monitoring system tar-

geting large data centers. It tries to spot correlations

and redundancies by using the concept of hidden vari-

ables. Gruschke et al., introduces an approach for

event correlation that uses a dependency graph to rep-

resent correlation knowledge (Gruschke and Others,

1998). Hanemann introduced a hybrid event Correla-

tion Engine that uses a rule-based and case-based rea-

soner for service fault diagnosis (Hanemann, 2007).

These root cause analysis techniques do not consider

the Virtualization technologies, therefore would fail

to address the challenges of root cause analysis in

Virtualized cloud systems. However, there are few

recent research efforts exploring root cause analysis

issues in virtualized systems like clouds. CloudPD

(Sharma et al., 2012) is a fault detection system for

shared utility Cloud. It uses a layered online learn-

ing approach and pre-computed fault signatures to di-

agnose anomalies. It uses an end-to-end feedback

loop that allows problem remediation to be integrated

with cloud steady state management systems. Peer-

Watch (Kang et al., 2010) utilizes a statistical tech-

nique, canonical correlation analysis (CCA), to model

the correlation between multiple application instances

to detect and localize faults. DAPA(Kang et al.,

2012) is an initial prototype of the application per-

formance diagnostic framework, that is able to lo-

calize the most suspicious attributes of the virtual

machines and physical hosts that are related to the

SLA violations. It utilizes Least Angle Regression

(LARS) and k-means clustering algorithm for their

prototype. PREPARE (Tan et al., 2012) incorporates

Tree-Augmented Naive (TAN) Bayesian network for

anomaly prediction, learning-based cause inference

and predictive prevention actuation to minimize the

performance anomaly penalty.

Root cause analysis work in context of Cloud

Computing is at an early stage. Most existing Cloud

monitoring and Analytics techniques address tier-

speciﬁc issues. These techniques can not deal with

real-world scenarios, where changes in one tier often

affect other tiers.

6 CONCLUSIONS AND FUTURE

WORKS

In this paper, we presented two real-world application

scenarios for deriving requirements and presented the

prototype of our Monitoring and Analytics frame-

work. The architecture is designed with Cloud-scale

scalability and ﬂexibility as major design goals.

We believe that the architecture provides all the

desired features for SLA-aware root cause analysis in

a Cloud environment. The Analytics component of

the framework is under active development. On the

research track, our focus is on time series forecasting,

feature selection algorithms (O’Hara and Sillanp¨a¨a,

2009), and fault localization techniques. In near fu-

ture, we plan to evaluate our algorithms by deploying

them in our prototype.

REFERENCES

Aceto, G., Botta, A., Donato, W. D., and Pescap`e, A.

(2012). Cloud Monitoring: deﬁnitions, issues and fu-

ture directions. In IEEE CLOUDNET 2012.

PracticalAspectsforEffectiveMonitoringofSLAsinCloudComputingandVirtualPlatforms

453

Agarwal, M., Appleby, K., Gupta, M., and Kar, G. (2004).

Problem determination using dependency graphs and

run-time behavior models. Utility Computing, pages

171–182.

Appleby, K., Goldszmidt, G., and Steinder, M. (2001).

Yemanja-a layered event correlation engine for multi-

domain server farms. In Integrated Network Man-

agement Proceedings, 2001 IEEE/IFIP International

Symposium on, volume 00, pages 329–344. IEEE.

Barham, P., Isaacs, R., and Mortier, R. (2003). Magpie:

Online modelling and performance-aware systems. In

In Proceedings of the Ninth Workshop on Hot Topics

in Operating Systems.

Chen, M., Kiciman, E., and Fratkin, E. (2002). Pinpoint:

Problem determination in large, dynamic internet ser-

vices. In In Proc. 2002 Intl. Conf. on Dependable Sys-

tems and Networks.

Ejarque, J., Fit´o, J. O., Katsaros, G., Luis, J., and Martinez,

P. (2011). OPTIMIS Deliverable Requirements Anal-

ysis ( M16 ). Technical report, NTUA, ATOS, SCAI,

SAP, BT, CITY, LUH, 451G, FLEXIANT, ULEEDS.

ESPER (2013). Home page of esper. http://esper.

codehaus.org/index.html. [Online; accessed 26-

March-2013].

Gruschke, B. and Others (1998). Integrated event manage-

ment: Event correlation using dependency graphs. In

Proceedings of the 9th IFIP/IEEE International Work-

shop on Distributed Systems: Operations & Manage-

ment (DSOM 98), pages 130–141.

Hanemann, A. (2007). Automated IT Service Fault Diag-

nosis Based on Event Correlation Techniques. PhD

thesis.

Hasselmeyer, P. and D’Heureuse, N. (2010). Towards holis-

tic multi-tenant monitoring for virtual data centers.

2010 IEEE/IFIP Network Operations and Manage-

ment Symposium Workshops, pages 350–356.

Hoke, E., Sun, J., Strunk, J., and Ganger, G. (2006). Inte-

Mon: continuous mining of sensor data in large-scale

self-infrastructures. ACM SIGOPS Operating Systems

Review, 40(3):38–44.

Jeune, G. L., Garc´ıa, E., Perib´a˜nez, J. M., and Mu˜noz,

H. (2012). 4CaaSt Scientiﬁc and Technical Report

D5.1.1. Technical report, Seventh Framework Pro-

gramme.

Kang, H., Chen, H., and Jiang, G. (2010). PeerWatch: a

fault detection and diagnosis tool for virtualized con-

solidation systems. In Proceedings of the 7th inter-

national conference on Autonomic computing, pages

119–128.

Kang, H., Zhu, X., and Wong, J. (2012). DAPA: diagnos-

ing application performance anomalies for virtualized

infrastructures. 2nd USENIX workshop on Hot-ICE.

Katsaros, G., K¨ubert, R., and Gallizo, G. (2011). Building a

Service-Oriented Monitoring Framework with REST

and Nagios. 2011 IEEE International Conference on

Services Computing, 567:426–431.

Massie, M. (2004). The ganglia distributed monitoring sys-

tem: design, implementation, and experience. Parallel

Computing, 30(7):817–840.

Molenkamp, G. (2002). Diagnosing quality of service faults

in distributed applications. Performance, Computing,

and Communications Conference, 2002. 21st IEEE

International.

Nagios (2013). Home page of nagios. http://

www.nagios.org/. [Online; accessed 26-March-2013].

O’Hara, R. B. and Sillanp¨a¨a, M. J. (2009). A review of

Bayesian variable selection methods: what, how and

which. Bayesian Analysis, 4(1):85–117.

OpenShift (2013). Home page of openshift. https://

www.openshift.com/. [Online; accessed 26-March-

2013].

OpenStack (2013). Home page of openstack. http://

www.openstack.org/. [Online; accessed 26-March-

2013].

OpenTSDB (2013). Home page of opentsdb. http://

opentsdb.net/. [Online; accessed 26-March-2013].

OpenView (2013). Hp openview — wikipedia, the free

encyclopedia. http://en.wikipedia.org/w/index.php?

title=HP OpenView&oldid=547020972. [Online; ac-

cessed 26-March-2013].

Rak, M., Venticinque, S., Mhr, T., Echevarria, G., and Es-

nal, G. (2011). Cloud Application Monitoring: The

mOSAIC Approach. 2011 IEEE Third International

Conference on Cloud Computing Technology and Sci-

ence, pages 758–763.

Sharma, B., Jayachandran, P., Verma, A., and Das, C.

(2012). CloudPD: Problem Determination and Diag-

nosis in Shared Dynamic Clouds. cse.psu.edu, pages

1–30.

Stratan, I. L., Newman, H., Voicu, R., Cirstoiu, C., Grigo-

ras, C., Dobre, C., Muraru, A., Costan, A., Dediu, M.,

and C. (2009). MONALISA: An Agent based , Dy-

namic Service System to Monitor , Control and Op-

timize Grid based Applications The Distributed Ser-

vices. Computer Physics Communications, 180:2472–

2498.

Tan, Y., Nguyen, H., and Shen, Z. (2012). PREPARE: Pre-

dictive Performance Anomaly Prevention for Virtual-

ized Cloud Systems. In Distributed Computing Sys-

tems (ICDCS), 2012 IEEE 32nd International Confer-

ence on, number Vcl.

Tivoli (2013). Home page of ibm tivoli. http://

www.tivoli.com/. [Online; accessed 26-March-2013].

Yaqub, E., Wieder, P., Kotsokalis, C., Mazza, V., Pasquale,

L., Rueda, J. L., G´omez, S. G., and Chimeno, A. E.

(2011). A generic platform for conducting sla negoti-

ations. In Service Level Agreements for Cloud Com-

puting, pages 187–206. Springer.

Yaqub, E., Yahyapour, R., Wieder, P., and Lu, K. (2012).

A protocol development framework for sla negotia-

tions in cloud and service computing. In Service

Level Agreements for Cloud Computing, pages 1–15.

Springer.

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

454