cess pattern) of the database, (2) inferring the optimal
partitioning scheme for the database through machine
learning techniques, and (3) proposing the most suit-
able replication protocol to be run on each partition
according to the current user demands.
The remainder of this paper is organized as fol-
lows. Section 2 stresses the need for applying a proper
partitioning schema and using the appropriate replica-
tion protocol on a transactional cloud database. Sec-
tion 3 presents the proposed system architecture. Sec-
tion 4 details the experiments performed. Finally,
Section 5 concludes the paper.
2 PARTITIONING THE CLOUD
As clarified in (Brewer, 2012), a distributed system
may implement a weak form of consistency (Vogels,
2009), a relaxed degree of availability (White, 2012),
and reasonable network partition tolerance (DeCandia
et al., 2007) while remaining scalable. Indeed, most existing cloud data repositories follow this approach (DeCandia et al., 2007; White,
2012). However, those applications that demand strict transactional support cannot straightforwardly fit this model, since they generally require strong consistency to guarantee correct executions (Birman, 2012) while still demanding the appealing characteristics offered by the cloud, and therefore refuse to relax availability constraints. In this context, tolerance to network partitions must be addressed carefully to truly benefit from the cloud's features, which means practitioners must be very cautious when defining a partitioning scheme. This section (1) stresses the importance of managing data partitions, (2) describes the proposed graph-based partitioning technique applied in the presented load balancer, and (3) suggests supervised machine learning techniques as an effective way to address this matter.
2.1 Motivation and Related Work
Partitioning is a very effective way to achieve high scalability while preserving data consistency in a distributed database. Transactions that are executed within a single data partition require no interaction with the rest of the partitions, hence reducing the communication overhead (Aguilera et al., 2009; Curino et al., 2010; Das et al., 2010). However, configuring the partitioning scheme to minimize multi-partition transactions while avoiding the extra costs of resource misuse requires a judicious criterion that carefully accounts for the workload pattern to determine the optimal data partitions.
However, if these partitions are not replicated, the system is exposed to a single point of failure or suffers performance limitations due to bottleneck effects. Consequently, besides deciding the most suitable partitioning strategy, it is important to study the nature of the workload that generated each partition to determine
the most appropriate replication protocol. Roughly
speaking, in an update-intensive data partition, the
ideal candidate will be an update-everywhere repli-
cation protocol (Wiesmann and Schiper, 2005); oth-
erwise, the candidate will probably be a primary copy
replication protocol (Daudjee and Salem, 2006).
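As a rough illustration, the following Python sketch encodes this decision rule; the Partition record, the 0.5 threshold, and the protocol labels are our own illustrative assumptions, not artifacts of the proposed system.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    reads: int
    updates: int

def choose_protocol(p: Partition, update_threshold: float = 0.5) -> str:
    """Pick a replication protocol from the partition's read/update mix."""
    total = p.reads + p.updates
    update_ratio = p.updates / total if total else 0.0
    if update_ratio >= update_threshold:
        return "update-everywhere"  # update-intensive partition
    return "primary-copy"           # read-mostly partition

print(choose_protocol(Partition("P1", reads=200, updates=800)))  # update-everywhere
print(choose_protocol(Partition("P2", reads=900, updates=100)))  # primary-copy
```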
Overall, there is a strong connection between the number of partitions, the database workload, and
the replication protocol running on each partition.
The following subsection discusses a graph-based ap-
proach used to infer these three features.
2.2 Graph-based Partitioning
We propose a structure based on undirected graphs,
which can be used to determine the best partitioning scheme. This proposal will be explained by means of the example depicted in Figure 1, which defines a database consisting of two tables, PERSON and DEGREE, and a sample workload of four transactions.
This workload will drive the construction of the undirected graph structure shown in Figure 1. More specifically, each tuple of PERSON and DEGREE is represented by a node, whereas each edge between two nodes reflects that they are accessed within the same transaction. The weight of an edge is increased according to the number of transactions accessing the two connected nodes. In addition, there is a counter associated with each node representing the number of transactions that access it.
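To make the construction concrete, here is a minimal Python sketch of such a structure, assuming each transaction is summarized by the set of tuple identifiers it accesses; the class and field names are ours, chosen for illustration.

```python
from collections import defaultdict

class AccessGraph:
    """Undirected co-access graph: one node per tuple, a counter per
    node, and a weighted edge per pair of co-accessed tuples."""

    def __init__(self):
        # node id -> number of transactions that access it
        self.node_counter = defaultdict(int)
        # frozenset({a, b}) -> number of transactions accessing both
        self.edge_weight = defaultdict(int)

    def add_transaction(self, accessed):
        """Register one transaction from the set of tuple ids it accesses."""
        accessed = set(accessed)
        for node in accessed:
            self.node_counter[node] += 1       # bump each node's counter
        for a in accessed:
            for b in accessed:
                if a < b:                      # count each unordered pair once
                    self.edge_weight[frozenset((a, b))] += 1
```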
The first statement of transaction 1 accesses the second tuple of PERSON and the first tuple of DEGREE. Hence, in Figure 1 we draw two nodes identified by the ID field of each tuple (i.e., node 2 and node 4) and plot an edge between them with an initial weight of one. In addition, we set the counters of nodes 2 and 4 to one to reflect the number of times they have been accessed. The next operation of transaction 1 modifies node 1, so we connect it with the previous nodes (2 and 4) using edges of weight one and set the counter of node 1 to one. Finally, the last operation of transaction 1 is a SELECT that accesses node 3. We have to connect node 3 with nodes 1, 2, and 4 with edges of weight one and increase the counter of node 3 by one. Likewise, we proceed in the same way with the rest of the transactions. For the sake of clarity, in Figure 1 we have surrounded the nodes accessed by each transaction.
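Feeding transaction 1 of the example into the sketch above reproduces the state just described; note that the sketch registers the transaction's whole access set at once, which yields the same final weights and counters as the per-statement construction (the other three transactions of the workload are omitted here).

```python
g = AccessGraph()
g.add_transaction({2, 4, 1, 3})  # transaction 1 touches nodes 2 and 4, then 1, then 3

print(g.node_counter[3])                 # 1: node 3 is accessed by one transaction
print(g.edge_weight[frozenset((2, 4))])  # 1: nodes 2 and 4 co-accessed once
```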