Another argument in favor of performing Hadoop backups is that its native replication is designed to work within a single datacenter (Grishchenko, 2015). In line with (Khoshkholghi et al., 2014; Hua et al., 2016; Xu et al., 2014), it is a severe threat to the system if backup data can only be stored in the same site as the primary systems. To achieve higher fault tolerance and reliability, original data and backup replicas must be stored in geographically separated locations.
According to (Grishchenko, 2015), target datamarts (result datasets) and aggregated reports must also be backed up; they usually represent gigabytes or terabytes of data.
That said, and in agreement with (Kothuri, 2016; Barot et al., 2015; Grishchenko, 2015), the following Hadoop elements are recommended for backup:
• Data Sets: raw data and result datasets; metadata (NameNode, Hive, HBase and others);
• Applications: system and user applications;
• Configuration: configuration of various Hadoop
components.
As shown in Figure 3, previously proposed Hadoop backup solutions (Grishchenko, 2015; Barot et al., 2015; Kothuri, 2016) use a secondary cluster to serve as a safe replica. However, acquiring and maintaining a second cluster is much more expensive than storing data in tape libraries, NAS, or even cloud object storage. This becomes clear when considering cloud pricing: while storage can cost US$0.023 per GB per month on basic plans (Amazon Web Services, 2017b), the least expensive Hadoop cluster machine is charged US$0.053 per hour (Amazon Web Services, 2017a).
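To put these rates in perspective, a rough estimate (assuming a 30-day month of 720 hours at the quoted prices) gives:

\[
720\,\mathrm{h} \times \mathrm{US}\$0.053/\mathrm{h} \approx \mathrm{US}\$38.16,
\qquad
\frac{\mathrm{US}\$38.16}{\mathrm{US}\$0.023/\mathrm{GB}} \approx 1659\ \mathrm{GB}
\]

That is, keeping a single backup-cluster node running for one month costs roughly as much as storing 1.6 TB in object storage over the same period, before counting the remaining nodes of the cluster.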
Figure 3: Previous Hadoop HDFS distcp Replica Backup Technique (a production cluster and a backup cluster, each with its own NameNode, synchronized via distcp for backup & restore).
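The replica technique of Figure 3 can be sketched with a small driver script; the following is a minimal illustration in Python, assuming the hadoop binary is on the PATH and using hypothetical NameNode hostnames and paths:

import subprocess

# Hypothetical endpoints; replace with the real NameNode addresses.
SRC = "hdfs://prod-nn:8020/user/data"
DST = "hdfs://backup-nn:8020/user/data"

def replicate():
    # -update copies only files that changed since the last run;
    # -delete removes destination files absent from the source,
    # keeping the replica an exact mirror of the production state.
    subprocess.run(
        ["hadoop", "distcp", "-update", "-delete", SRC, DST],
        check=True,
    )

if __name__ == "__main__":
    replicate()

Note that such a script only maintains a mirror; it keeps no history of earlier states.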
Another problem with the previously proposed solutions involves inter-cluster replicas, which only save the current state of the data. This does not protect against undesired modifications or previously unnoticed data loss, providing a poor Recovery Point Objective (RPO).
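This limitation can be seen in a toy sketch (illustrative Python, hypothetical paths, not tied to any particular tool): a mirror replica tracks the primary exactly, so an undesired change propagates on the next synchronization, whereas a point-in-time snapshot still holds the earlier state.

primary = {"/data/a": "v1", "/data/b": "v1"}

mirror = dict(primary)         # replica after a distcp-style sync
snapshots = [dict(primary)]    # immutable point-in-time copy

del primary["/data/b"]         # unnoticed, undesired modification
mirror = dict(primary)         # the next sync replicates the damage

print("/data/b" in mirror)         # False: lost on the replica too
print("/data/b" in snapshots[-1])  # True: recoverable from snapshot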
As noted by (Grishchenko, 2015), the most prominent challenge of backing up a Hadoop cluster involves HDFS datasets, which may contain petabytes of information; the backup duration is therefore one of the most crucial comparison metrics.
Hadoop application binaries and configurations are very small, both in size and in number of files, compared with a typical HDFS workload or even with other applications. They can easily be protected by a file-level backup (e.g., using the Bacula client), and their performance impact is ignored in this study.
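As one possible realization (a sketch only; the Name and paths below are illustrative and depend on the Hadoop distribution), a Bacula FileSet covering binaries and configuration could look like:

FileSet {
  Name = "HadoopConfigAndBinaries"
  Include {
    Options {
      signature = MD5       # checksum each file for later verification
      compression = GZIP    # small text files compress well
    }
    File = /etc/hadoop      # component configuration
    File = /usr/lib/hadoop  # application binaries
  }
}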
In the opinion of (Kothuri, 2016), no proper out-of-the-box point-in-time recovery solution for Hadoop existed, at least until now.
In line with (Khoshkholghi et al., 2014), disaster recovery is a persistent problem in information technology platforms, and it is even more crucial in distributed systems and cloud computing. Service providers must deliver services to their customers even if a data center is down (due to a disaster). Researchers have shown increasing interest in disaster recovery using cloud computing in the past few years, and a considerable amount of literature has been published in this area.
As described by (Alhazmi and Malaiya, 2013), Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the two main parameters that all recovery mechanisms should observe: the lower the RPO and RTO values, the higher the achievable business continuity. RPO may be interpreted as the amount of data lost in a disaster, while RTO refers to the time frame between disruption and restoration of service.
As demonstrated by Equation 1, the Recovery Point Objective value is inversely proportional to the frequency of backups completed over time, where $F_B$ represents the Frequency of Backup; for example, moving from daily to hourly backups shrinks the worst-case data loss from 24 hours to one hour of work.

\begin{equation}
RPO \propto \frac{1}{F_B}
\end{equation}
On the other hand, as shown by Equation 2, the Recovery Time Objective formula usually includes a fraction of the RPO, the readiness of the backup, and five failover step delays that depend on the backup capabilities.

\begin{equation}
RTO = \mathit{fraction\ of\ RPO} + j_{min} + S_1 + S_2 + S_3 + S_4 + S_5
\end{equation}
The variables used in Equation 2 are:
• fraction of RPO: computation time lost since the last backup;
• $j_{min}$: depends on the service readiness of the backup;
• $S_1$: hardware setup time;