How to integrate clean replicas into a highly available system should be a concern when designing replication mechanisms, even though, as observed in (Vilaca et al., 2009), this is often not the case in the literature (e.g., (Elnikety et al., 2005; Kemme and Alonso, 2000)). However, CDBA's goal of providing highly available database appliances requires a suitable recovery mechanism to be designed and implemented.
The replication approach (active vs. passive) and whether the replication protocol is synchronous or asynchronous have a considerable impact on the design of the recovery protocol, namely in determining when a recovered replica can be considered available. For example, asynchronous replication protocols may tolerate a degree of staleness in some replicas, which may enable less costly recovery strategies or allow recovered replicas to be considered available sooner without disrupting the expected consistency level.
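For illustration, under an asynchronous protocol a recovered replica could be declared available as soon as its replication lag fits a configured staleness bound. The sketch below is a minimal illustration; the class, its names and the threshold semantics are assumptions, not part of CDBA:

    // Sketch: declaring a recovering replica available under bounded
    // staleness. All names and the threshold are hypothetical.
    final class AvailabilityPolicy {
        private final long maxStalenessMs;

        AvailabilityPolicy(long maxStalenessMs) {
            this.maxStalenessMs = maxStalenessMs;
        }

        /** A replica is available once its apply lag fits the bound. */
        boolean isAvailable(long lastAppliedTimestampMs, long latestLogTimestampMs) {
            long lagMs = latestLogTimestampMs - lastAppliedTimestampMs;
            return lagMs <= maxStalenessMs;
        }
        // Synchronous replication corresponds to maxStalenessMs == 0:
        // a replica is available only after applying every logged request.
    }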
An important design decision is whether recovery is to be done online (Kemme et al., 2001) or offline (Amir, 1995). Offline recovery means that the system becomes unavailable whenever a replica needs to be recovered. Online recovery, on the other hand, means that the system remains available during the recovery process. Again, CDBA's goal of providing highly available database appliances requires recovery to be done online.
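To make the distinction concrete, the sketch below shows request routing under online recovery, where a replica undergoing recovery never blocks clients; the state names and classes are hypothetical, not CDBA's actual design:

    import java.util.List;

    // Sketch of online recovery from the router's point of view.
    enum ReplicaState { AVAILABLE, RECOVERING, FAILED }

    final class Replica {
        final String id;
        volatile ReplicaState state;
        Replica(String id, ReplicaState state) { this.id = id; this.state = state; }
    }

    final class Router {
        /** Clients are served as long as one replica is available. */
        static Replica route(List<Replica> replicas) {
            for (Replica r : replicas) {
                if (r.state == ReplicaState.AVAILABLE) {
                    return r; // a RECOVERING replica never blocks the system
                }
            }
            // Offline recovery would instead stop the whole system whenever
            // any replica is being recovered.
            throw new IllegalStateException("no available replica");
        }
    }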
Achieving the main goal of the recovery process requires some form of state transfer to the recovering replica. Several approaches can be undertaken, ranging from transferring state in bulk to applying it incrementally, from a single donor or from multiple donors. These approaches define trade-offs between how quickly a replica is recovered and the impact of recovery on the performance of the overall system. Different approaches may be selected based on a multitude of factors: how far behind the recovering replica is; whether its state is consistent, even if outdated; the number of available donors; etc. The impact of these factors on recovery has been evaluated and analysed in (Vilaca et al., 2009). It may also be possible to take advantage of workload patterns and/or data partitioning to improve recovery, namely when applying missing updates to a recovering replica (Jiménez-Peris et al., 2002).
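For illustration only, the selection among these strategies can be framed as a decision rule over the factors above; the thresholds and strategy names below are assumptions and are not taken from (Vilaca et al., 2009):

    // Sketch: choosing a state-transfer strategy for a recovering
    // replica. Thresholds and enum values are illustrative assumptions.
    enum TransferStrategy { INCREMENTAL_SINGLE_DONOR, INCREMENTAL_MULTI_DONOR, BULK }

    final class RecoveryPlanner {
        static TransferStrategy choose(long missedUpdates, boolean stateConsistent, int donors) {
            if (!stateConsistent) {
                return TransferStrategy.BULK; // inconsistent state: restart from a full copy
            }
            if (missedUpdates > 1_000_000) {
                return TransferStrategy.BULK; // too far behind: replay would cost more than a dump
            }
            // Outdated but consistent: ship only the missing updates,
            // spreading the load when more than one donor is available.
            return donors > 1 ? TransferStrategy.INCREMENTAL_MULTI_DONOR
                              : TransferStrategy.INCREMENTAL_SINGLE_DONOR;
        }
    }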
Intra-datacentre recovery, in the context of CDBA, presents different challenges from the target models generally considered when designing recovery protocols. The main differences are: the substantial computing power of each replica, due to the use of state-of-the-art, many-core Bullion hardware, as opposed to commodity or COTS servers; a very low-latency, high-bandwidth communication network connecting replicas, as opposed to higher-latency LAN or WAN networks; and the number of available replicas, which is restricted to two, the minimum to provide high availability, as opposed to quorum-based solutions, which require a minimum of three replicas. These differences have a fundamental impact on the design of the replication mechanism and, consequently, of the recovery mechanism, as they open up the possibility of exploring unusual trade-offs between throughput and availability.
2.1 Replication in CloudDBAppliance
Due to the interdependence of the replication and re-
covery mechanisms, this section presents an overview
of CDBA’s intra-datacentre replication protocol. Fur-
ther details can be found in (Ferreira et al., 2019).
The recovery mechanisms considered were designed and implemented as part of the replication middleware layer that isolates users from the underlying operational database. To achieve this, the middleware is placed as a top-tier layer, intercepting SQL statements and performing all the steps required to accommodate the replication mechanisms. The replication middleware removes non-determinism from requests and ensures these are totally ordered. The middleware approach simplifies integration and makes it possible to use this replication mechanism beyond the CloudDBAppliance project, as it offers a completely decoupled solution. This can be done by embedding the middleware in the JDBC driver used for communication between clients and database servers.
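As a minimal sketch of what such interception could look like at the JDBC boundary, the following assumes a single source of non-determinism (NOW()) and a local sequencer; all names are hypothetical, and in CDBA total order is actually established by the distributed log described below:

    import java.sql.Timestamp;
    import java.time.Instant;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: middleware-side treatment of an intercepted SQL statement.
    final class StatementInterceptor {
        private final AtomicLong sequencer = new AtomicLong();

        /** An intercepted request, made deterministic and tagged with an order. */
        record OrderedRequest(long sequence, String sql) {}

        OrderedRequest intercept(String sql) {
            // Remove one source of non-determinism: pin NOW() to a concrete
            // timestamp so every replica executes exactly the same statement.
            String deterministic = sql.replaceAll(
                "(?i)NOW\\(\\)", "TIMESTAMP '" + Timestamp.from(Instant.now()) + "'");
            // Tag the request with a monotonically increasing sequence number;
            // here it is process-local, standing in for the log's total order.
            return new OrderedRequest(sequencer.incrementAndGet(), deterministic);
        }
    }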
The replication middleware relies on a JDBC-enabled API that defines the key interfaces between a client-side proxy and a server-side stub. These interfaces were selected by reusing V-JDBC for JDBC request interception and hand-over between client proxies and the server. As detailed and assessed in (Ferreira et al., 2019), choosing V-JDBC allowed for a flexible environment in which the transport protocol can be customised according to the application itself.
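The following is a deliberately simplified, hypothetical rendering of the proxy/stub split with a pluggable transport; V-JDBC's real interfaces differ, but the customisation point is analogous:

    import java.io.IOException;

    // Hypothetical simplification of the proxy/stub split.
    interface Transport {
        byte[] call(byte[] serializedJdbcRequest) throws IOException;
    }

    // Client-side proxy: serializes JDBC calls and hands them to a
    // transport chosen per application; the server-side stub replays them.
    final class ClientProxy {
        private final Transport transport;
        ClientProxy(Transport transport) { this.transport = transport; }

        byte[] execute(byte[] request) throws IOException {
            return transport.call(request);
        }
    }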
The replication architecture, depicted in Figure 1, is based on a set of reliable distributed logs. These decouple clients from the replication manager instances and the operational databases. The log allows requests to be stored reliably and with total-order guarantees. Briefly, when a client application issues a SQL request, it is forwarded to a write-proxy that acts as a handler for the distributed log structure. Once reliably stored, requests are pulled from the distributed log by the replication manager instances, which push them for execution at their local operational database instances. The distributed log, conceptually considered as part of the
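To make this flow concrete, the sketch below shows a replication manager's pull-and-apply loop against an abstract log interface; the ReliableLog API is a hypothetical stand-in for the actual distributed log used in CDBA:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;

    // Hypothetical stand-in for the reliable distributed log.
    interface ReliableLog {
        /** Returns requests stored after the given offset, in total order. */
        List<String> pull(long afterOffset);
    }

    // Sketch of a replication manager instance: pull totally ordered
    // requests and apply them to the local operational database.
    final class ReplicationManager {
        private final ReliableLog log;
        private final Connection localDb;
        private long offset;

        ReplicationManager(ReliableLog log, Connection localDb) {
            this.log = log;
            this.localDb = localDb;
        }

        /** One iteration of the pull/push loop. */
        void applyPending() throws SQLException {
            for (String sql : log.pull(offset)) {
                try (Statement stmt = localDb.createStatement()) {
                    stmt.execute(sql); // push for execution at the local instance
                }
                offset++; // advance only after the request has been executed
            }
        }
    }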