Clustering the Cloud
A Model for (Self-)Tuning of Cloud Data Management Systems
Siba Mohammad, Eike Schallehn and Sebastian Breß
Institute for Technical and Business Information Systems, Otto-von-Guericke-University, Magdeburg, Germany
Keywords:
Cloud Data Management, Tradeoff, Optimization, Tuning, Self-tuning, Logical Cluster.
Abstract:
Popularity and complexity of cloud data management systems are increasing rapidly, so providing sophisticated features becomes more important. The focus of this paper is on (self-)tuning, where we contribute the following: (1) we illustrate why (self-)tuning for cloud data management is necessary, yet a much more complex task than for traditional data management, and (2) we propose a model to solve some of the outlined problems by clustering nodes into zones across data management layers for applications with similar requirements.
1 INTRODUCTION
Cloud-based data management and data processing solutions differ significantly from conventional database management systems (DBMSs): they mainly focus on providing the scalability and availability required by cloud applications, while disregarding typical DBMS features such as complex query interfaces, transactional consistency management, and strict data models.
While tuning conventional DBMSs is a typical administration task, and its automation as self-tuning has gained attention in industrial and academic research, the (self-)tuning of cloud data management systems (CDMSs) is still in its infancy. Because of the typical shared-nothing architectures with data partitioning and replication, some performance aspects can easily be addressed for the overall system. Nevertheless, the typical multi-layered architecture of several component systems adds complexity to the tuning tasks. Moreover, if several applications with different and possibly changing requirements use the same database cluster, there is little chance to tune for a specific application. Current research efforts on (self-)tuning of CDMSs include approaches that focus on specific aspects such as MapReduce performance (Ahmad et al., 2012), energy conservation (Bostoen et al., 2012), and minimal cluster size (Herodotou et al., 2011). To the best of our knowledge, only the work of Florescu and Kossmann (Florescu and Kossmann, 2009) provides an overall view of the tuning problem for databases in the cloud and proposes an architecture designed for what they call the new optimization problem. The idea of creating zones within a cluster, which we present as a model for (self-)tuning, is a generalization of the concept of hot and cold zones introduced in GreenHDFS (Kaushik and Bhandarkar, 2010).
2 TUNING GOALS AND TRADEOFFS
What makes optimization of a CDMS a complicated task is the fact that it is not a monolithic system, but rather a combination of loosely coupled systems that complement each other. Thus, optimizing a CDMS for a certain goal becomes a task that spans several system layers as well as each individual layer, where decisions in one layer affect the possibilities and decisions in other layers.
Tuning Tradeoffs. Next, we summarize the fundamental tradeoffs for a CDMS:
Read Performance versus Write Performance. In the cloud, challenges related to partitioning and replication add to the classical tradeoff between optimizing for reads and optimizing for writes (Cooper et al., 2010).
Latency versus Durability. Many CDMSs choose
to write to memory and sync to disk later. This
lowers latency but could result in data loss in the
case of failure (Cooper et al., 2010).
[Figure 1: Architecture for Cloud Data Management Systems (CDMS), extended from (Mohammad et al., 2012). Layers from top to bottom, with example tuning parameters: Users/Applications (consistency level, data modeling); Query Language (query optimization); Distributed Processing System (# MapReduce tasks, # input streams); Structured Data System (replica placement, conflict resolution, partitioning, (key, row) caching, compaction); Distributed Storage System (read-ahead buffer, compression, # (NameNode, DataNode) per RPC call); Operating System (JVM heap, OS page cache, swapping); Hardware (# CPU cores, RAM size, disk I/O).]
Security versus Performance. Existing models for securing cloud data rely mainly on encryption. This affects system performance (Yu et al., 2010), because it adds overhead to reading and writing data and increases the overall latency.
Resource Utilization versus Performance. Reducing operating cost is important for private and public cloud operators. This usually results in trading performance for higher resource utilization (Chen et al., 2010).
CAP Tradeoffs. The CAP theorem (Fox et al., 1997) states that consistency, availability, and partition tolerance are the systemic requirements for designing and deploying applications in a distributed environment. It also states that a scalable system can fulfill at most two of these three properties. In the context of CDMSs, this in most cases means sacrificing consistency. More details about this and other implications of CAP on CDMSs can be found in (Mohammad et al., 2012); the sketch below illustrates how replica and quorum choices trade off these properties.
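The following minimal sketch (our illustration, not taken from any cited system) models the Dynamo-style quorum rule: with N replicas, read quorum R, and write quorum W, reads are strongly consistent only if R + W > N, so smaller quorums buy read or write latency at the price of consistency.

```python
# Minimal sketch of Dynamo-style quorum tradeoffs (illustrative only).
def quorum_properties(n, r, w):
    """Summarize what a replica/quorum configuration buys and costs."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("quorums must lie between 1 and the replica count")
    return {
        "strong_consistency": r + w > n,  # read and write quorums overlap
        "read_fault_tolerance": n - r,    # replica failures a read survives
        "write_fault_tolerance": n - w,   # replica failures a write survives
    }

print(quorum_properties(3, 1, 1))  # fast reads and writes, eventual consistency
print(quorum_properties(3, 2, 2))  # quorum reads/writes, strong consistency
print(quorum_properties(3, 1, 3))  # fast reads, durable but slow writes
```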
Optimization Goals/Opportunities. Next, we
summarize the fundamental optimization goals for
CDMSs:
Performance. Performance is crucial for CDMSs. It is affected by several aspects, such as the geographical distribution of data, replication, and consistency and durability requirements. Overhead caused by events such as data compaction and scaling the DB cluster also impacts the overall performance.
Availability and Fault Tolerance. CDMSs work on clusters of nodes where failure is the normal case, not the exception. Most of these systems support fault tolerance and recovery. However, the promised 99.9% availability is not always sufficient. Leading cloud storage providers, such as Amazon, Salesforce.com, and Rackspace, have had several outages in recent years, causing major websites and businesses to go out of service. Outages result not only in service timeouts, but also in unrecoverable data losses (Dahbur et al., 2011).
Consistency Level. For CDMSs, consistency is not a yes-or-no question. There are several levels of consistency, and one can tune them even at the granularity of a single query or data object (see the sketch after this list). Depending on the data type, strict consistency is not always a requirement (Kraska et al., 2009). Since decisions about the level of consistency affect system performance and availability, it is important to determine the highest achievable level of consistency for specific performance requirements.
Minimum Resource Consumption. It is very important to minimize resource consumption, but within specified performance thresholds. The first perspective on this goal is monetary cost. The second perspective is energy efficiency. The energy consumption of data centers, whether for cooling or for operating machines, is high and is estimated to increase by 18% every year (Zhang et al., 2010). Moreover, up to 35% of this energy consumption is caused by the storage subsystems. Hence, minimizing energy consumption is becoming more important (Bostoen et al., 2012).
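The per-query consistency tuning mentioned above can be illustrated with the DataStax Python driver for Cassandra (our example, not part of the paper); the contact point, keyspace, and table names are hypothetical.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])           # hypothetical contact point
session = cluster.connect("app_keyspace")  # hypothetical keyspace

# Latency-sensitive lookup: one replica suffices, consistency is relaxed.
fast = SimpleStatement("SELECT * FROM users WHERE id = %s",
                       consistency_level=ConsistencyLevel.ONE)
session.execute(fast, (42,))

# Critical read of the same table: a quorum of replicas must agree.
strict = SimpleStatement("SELECT * FROM users WHERE id = %s",
                         consistency_level=ConsistencyLevel.QUORUM)
session.execute(strict, (42,))
```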
ClusteringtheCloud-AModelfor(Self-)TuningofCloudDataManagementSystems
521
[Figure 2: Illustration of the complexity of (self-)tuning multi-layered CDMSs according to divergent requirements and different and changing workloads: local and horizontal (self-)tuning within each tier, and vertical (self-)tuning across tiers, serving multiple applications with their workloads and requirements.]
3 SELF-TUNING BASED ON CLUSTERING
An example of a cloud data management system is a Cassandra cluster built on top of the Hadoop Distributed File System (HDFS). Consider the case that this cluster should be optimized regarding its read performance. Among the factors that have to be dealt with are the following (Capriolo, 2011): indexes, bloom filters, consistency level, replica conflict resolution, caching (row and key caching, Java Virtual Machine (JVM) heap, Operating System (OS) page cache, and swapping), compaction, compression, and hardware deployment (Random Access Memory (RAM), Central Processing Unit (CPU), disk, and number of nodes). In addition, if the application mainly issues range queries, an order-preserving partitioning technique is preferable. If the application uses Cassandra as input for MapReduce, further factors such as the number of input streams and the number of map and reduce tasks have to be considered. Figure 1 illustrates how tuning works across several layers, with parameter examples on the right side.

As a step toward (self-)tuning CDMSs for divergent requirements across several layers, we propose the concept of logically clustering shared-nothing systems into zones, applying the basic principle of divide and conquer. As illustrated in Figure 2, tuning becomes even more complicated when the DB cluster serves different workload types with different optimization goals: either one application with shifting workloads or several applications with different workloads. Horizontal (self-)tuning includes aspects such as partitioning/load balancing, replication/update strategies, automatic scaling, etc., which are typically better supported because the processes of a single component type within one layer are homogeneous. Nevertheless, there are still open research questions, e.g., for heterogeneous resources/hardware across nodes. Vertical (self-)tuning maps application requirements, expressed as optimization goals, service levels, strategies, etc., to specific tuning measures on each level of the architecture (a sketch follows below). To support this complex process of (self-)tuning, we suggest creating logical clusters within a CDMS by clustering nodes based on application requirements, as illustrated in Figure 3. A logical cluster (LC) is identified by application requirements and assigned specific physical resources. This is a generalization of the concepts introduced by Kaushik and Bhandarkar (Kaushik and Bhandarkar, 2010). Their model relies on an energy-aware, data-classification-based data placement strategy to define two zones, hot and cold, with the main purpose of energy saving.
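The following minimal sketch (our illustration; all knob names and values are hypothetical) shows what vertical (self-)tuning amounts to: a mapping from an application-level optimization goal to concrete tuning measures on several layers of the stack in Figure 1.

```python
# Hypothetical per-layer tuning plans for two optimization goals.
TUNING_PLANS = {
    "read-latency": {
        "structured data system": {"row_cache": "on", "compaction": "leveled"},
        "distributed storage":    {"read_ahead_kb": 512, "compression": "on"},
        "operating system":       {"swappiness": 0},
    },
    "write-throughput": {
        "structured data system": {"row_cache": "off", "compaction": "size-tiered"},
        "distributed storage":    {"read_ahead_kb": 128, "compression": "off"},
        "operating system":       {"swappiness": 0},
    },
}

def vertical_tuning_plan(goal):
    """Map an application-level optimization goal to tuning measures per layer."""
    return TUNING_PLANS[goal]

print(vertical_tuning_plan("read-latency"))
```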
Clustering based on Application Requirements. We classify applications based on the following criteria: data, workload, optimization goals, and thresholds. Then, for each class, an LC is created and assigned specific physical resources. There are common workload patterns defined by the write/read mix. We take further aspects into consideration, such as the size and freshness of the accessed data, to define a list of workload types. Typical workload classes are on-line access and batch processing. On-line access is characterized by heavy updates, reading the latest data, range reads (which for small ranges are equivalent to lookups), and random reads (lookups). Batch processing scenarios typically perform batch writes and scans. Zhang et al. (Zhang et al., 2010) discuss challenges in analyzing and defining data traffic patterns in the cloud computing environment.
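A minimal sketch of this classification step (our illustration; all class names, fields, and thresholds are hypothetical): applications are mapped to a coarse workload class and grouped into one LC per class/goal combination.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AppProfile:
    name: str
    read_ratio: float  # fraction of reads in the workload mix
    scan_heavy: bool   # dominated by scans / batch access?
    goal: str          # e.g. "latency", "throughput", "energy"

def workload_class(p):
    """Map a profile to a coarse workload class (hypothetical rules)."""
    if p.scan_heavy:
        return "batch-processing"
    return "online-read-mostly" if p.read_ratio >= 0.8 else "online-update-heavy"

def build_logical_clusters(profiles):
    """Create one logical cluster per (workload class, goal) combination."""
    clusters = defaultdict(list)
    for p in profiles:
        clusters[(workload_class(p), p.goal)].append(p.name)
    return dict(clusters)

apps = [AppProfile("dashboard", 0.95, False, "latency"),
        AppProfile("orders",    0.40, False, "latency"),
        AppProfile("analytics", 0.99, True,  "throughput")]
print(build_logical_clusters(apps))
```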
CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience
522
[Figure 3: Clustering applications with similar requirements and workloads (divide and conquer) to ease tuning: each logical cluster applies its own horizontal and vertical (self-)tuning for the applications it serves.]
Clustering based on Physical Resources. Finding the right cluster size, deciding when to scale, and minimizing the system overhead resulting from adding new nodes have to be addressed. All CDM solutions support scalability to accommodate application growth. However, some systems, such as Cassandra and Yahoo! PNUTS, show performance degradation and need time to stabilize after new nodes are added (Cooper et al., 2010). For this reason, and to conserve cost and energy, increasing the cluster size should not always be the first suggested solution for performance problems. Since the assumption of homogeneous clusters does not hold in real deployments, CDM is evolving towards adopting heterogeneity. In addition, systems supporting heterogeneity allow performance to be improved by adding nodes with higher capacity instead of upgrading all nodes within a cluster at once (DeCandia et al., 2007). However, performance only improves with resource-aware scheduling and load balancing: various studies (Rasool and Down, 2012; Ahmad et al., 2012) show that current implementations of data-intensive applications do not take heterogeneous nodes into consideration and suffer performance degradation. A capacity-proportional balancing rule is sketched below.
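As a minimal sketch of such resource-aware balancing (our illustration; node names and capacity scores are hypothetical), each node is assigned a share of the load proportional to its capacity, instead of the uniform share a homogeneity-assuming scheduler would use.

```python
def proportional_shares(capacities):
    """Fraction of the total load each node should receive."""
    total = sum(capacities.values())
    return {node: cap / total for node, cap in capacities.items()}

# Two older nodes plus one newer node with 2.5x their capacity:
nodes = {"node-a": 1.0, "node-b": 1.0, "node-c": 2.5}
for node, share in proportional_shares(nodes).items():
    print(f"{node}: {share:.0%} of the load")
```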
Alternatives for Cluster Structures and Clustering Strategies. While the outlined proposal of logical clusters is already supported in a static manner, our research focuses on dynamic clustering to support workload-based optimization. Dynamic clustering requires support for splitting and merging or re-computing clusters. Furthermore, an LC may contain other LCs, so hierarchical structures are desirable for the clustering process as well as for the mapping to clustering criteria (a sketch of such a structure follows below). Moreover, the possibility of building clusters on only some layers, or independent clusters across layers, should be considered.
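A minimal sketch of such a hierarchical, dynamic cluster structure (our illustration; names and operations are hypothetical), with the split and merge operations mentioned above:

```python
from dataclasses import dataclass, field

@dataclass
class LogicalCluster:
    name: str
    nodes: set = field(default_factory=set)       # physical node ids
    children: list = field(default_factory=list)  # nested logical clusters

    def split(self, name_a, nodes_a, name_b):
        """Split into two child LCs; nodes_a goes to the first child."""
        a = LogicalCluster(name_a, set(nodes_a))
        b = LogicalCluster(name_b, self.nodes - set(nodes_a))
        self.children = [a, b]
        return a, b

    def merge(self):
        """Collapse all child LCs back into this cluster."""
        for child in self.children:
            self.nodes |= child.nodes
        self.children = []

root = LogicalCluster("cluster", {"n1", "n2", "n3", "n4"})
online, batch = root.split("online-lc", {"n1", "n2"}, "batch-lc")
```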
4 CONCLUSIONS
In this paper, we outlined the tuning tradeoff decisions and optimization goals for cloud data management systems. The complexity of (self-)tuning for these systems results from their highly distributed, multi-layered architecture. (Self-)tuning gets even more complicated when one cloud database cluster serves one application with shifting workloads or several applications with multiple workloads. With the aim of supporting (self-)tuning in such cases, we suggested a general model for creating logical clusters within a cloud DB system. To create logical clusters, we rely on clustering applications based on data, workload, optimization goals, and thresholds. Finally, we briefly discussed different problems and alternatives for this model.
REFERENCES
Ahmad, F., Chakradhar, S. T., Raghunathan, A., and Vijaykumar, T. N. (2012). Tarazu: Optimizing MapReduce on heterogeneous clusters. SIGARCH, 40(1):61–74.
Bostoen, T., Mullender, S., and Berbers, Y. (2012). Analysis of disk power management for data-center storage systems. In e-Energy, pages 2:1–2:10. ACM.
Capriolo, E. (2011). Cassandra High Performance Cookbook. Packt Publishing.
Chen, Y., Ganapathi, A. S., Griffith, R., and Katz, R. H. (2010). Towards understanding cloud performance tradeoffs using statistical workload analysis and replay. Technical report, EECS Department, University of California, Berkeley.
Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. (2010). Benchmarking cloud serving systems with YCSB. In SoCC, pages 143–154. ACM.
Dahbur, K., Mohammad, B., and Tarakji, A. B. (2011). A survey of risks, threats and vulnerabilities in cloud computing. In Proceedings of ISWSA, pages 12:1–12:6. ACM.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. Symposium on Operating Systems Principles, 41(6):205–220.
Florescu, D. and Kossmann, D. (2009). Rethinking cost and performance of database systems. SIGMOD Record, 38(1):43–48.
Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and Gauthier, P. (1997). Cluster-based scalable network services. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP '97), pages 78–91. ACM.
Herodotou, H., Dong, F., and Babu, S. (2011). No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics. In Proceedings of SOSP, pages 18:1–18:14. ACM.
Kaushik, R. T. and Bhandarkar, M. (2010). GreenHDFS: Towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In HotPower, pages 1–9. USENIX Association.
Kraska, T., Hentschel, M., Alonso, G., and Kossmann, D. (2009). Consistency rationing in the cloud: Pay only when it matters. In VLDB, pages 253–264. VLDB Endowment.
Mohammad, S., Breß, S., and Schallehn, E. (2012). Cloud data management: A short overview and comparison of current approaches. In GvD, pages 41–46. CEUR-WS.
Rasool, A. and Down, D. G. (2012). An adaptive scheduling algorithm for dynamic heterogeneous Hadoop systems.
Yu, S., Wang, C., Ren, K., and Lou, W. (2010). Achieving secure, scalable, and fine-grained data access control in cloud computing. In INFOCOM, pages 534–542. IEEE.
Zhang, Q., Cheng, L., and Boutaba, R. (2010). Cloud computing: State of the art and research challenges. Internet Services and Applications, 1(1):7–18.
CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience
524