system namespace, which is where we maintain the
file system tree and the metadata for all the files and
directories. This information is persistently stored on
the local disks in two files: the namespace image
and the edit log. The NN also keeps track of the DNs
on which all the blocks for a given file are located.
However, it does not store block locations
persistently, since this information is reconstructed
from the DNs when the system starts (White, 2009).
The single point of failure (SPoF) in an HDFS
cluster is the NN: the loss of any other node, whether
intermittent or permanent, does not result in data
loss, whereas the loss of the NN makes the cluster
unavailable, and the permanent loss of the NN data
would render the HDFS cluster inoperable (Yahoo,
2010). For this reason, it is important to make the
NN resilient to failure, and HDFS provides a manual
mechanism for achieving this.
The steps of the mechanism are as follows. First,
a backup is made of the files that make up the
persistent state of the file system metadata; the usual
configuration is to write to the local disk as well as
to a remote NFS mount. Then, a
secondary NameNode (SNN) must be run, which
will periodically merge the namespace image with
the edit log. This is necessary to prevent the edit log
from becoming too large. In HDFS, it is
recommended that the SNN run on a separate
physical computer, since this merge requires as
much CPU and memory as the NN (Apache, 2010).
However, when a failure occurs, manual intervention
is necessary to copy the NN metadata files from the
NFS mount to the SNN, which then becomes the
new NN.
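As an illustration of the first step, the following minimal sketch shows how the NN metadata directories might be configured to span a local disk and a remote NFS mount. The property name dfs.name.dir is taken from the Hadoop 0.20-era configuration, and the directory paths are hypothetical placeholders rather than a prescribed layout.

import org.apache.hadoop.conf.Configuration;

// Sketch: point the NN's persistent metadata (namespace image and edit log)
// at more than one directory, e.g. a local disk plus an NFS mount, so that
// a copy survives the loss of the NN machine. Paths are hypothetical.
public class NameNodeDirsSketch {
    public static Configuration redundantNameDirs() {
        Configuration conf = new Configuration();
        // Comma-separated list: each directory receives a full copy of the
        // namespace image and edit log.
        conf.set("dfs.name.dir", "/data/dfs/name,/mnt/nfs/dfs/name");
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(redundantNameDirs().get("dfs.name.dir"));
    }
}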
Efforts are currently planned to convert the SNN
into a standby node which, besides handling the
merge, would maintain an up-to-date state of the
namespace by processing a constant stream of edits
from the NN, and would also take over the role of
the checkpoint node (which creates the checkpoints
of the namespace). This standby node approach has
been named the Backup Node (BN) (Apache, 2008).
To resolve the SPoF, the BN would receive a real-
time stream of edits from the NN, which would
allow constant updating of the namespace state. The
BN would also perform the checkpointing function,
keeping the HDFS namespace available in memory
and removing the current need to store the
namespace on disk. Finally, the BN proposal would
provide a standby node which, coupled with an
automatic switching (failover) function, would
eliminate potential data loss, unavailability, and
manual intervention in the event of HDFS NN
failures.
However, if the BN fails, what will take its
place?
3 PROPOSED SOLUTION
In this paper, we propose a distributed solution to
the problem of NN and BN failures, which makes
use of a coordination scheme and a leader election
function among BN replicas. This can be achieved
using a service, such as ZooKeeper, for maintaining
configuration information, for naming, and for
distributed synchronization and group coordination.
3.1 Distributed Applications with ZooKeeper
ZooKeeper is a service that allows distributed
processes to coordinate with each other through a
shared hierarchical namespace of data registers. It
has proven useful for large distributed systems
applications (Apache, 2008).
One of the main failure recovery problems in
distributed applications is partial failure: when a
message is sent across the network and the network
fails, the sender does not know whether the message
was received or whether the receiver's process died
before handling it; that is, the sender cannot
determine the reason for the failure. ZooKeeper
provides a set of tools that help protect distributed
applications when this type of failure occurs.
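As a minimal sketch of these tools, the following Java fragment (using the standard ZooKeeper client API) registers a process under an ephemeral znode and places a watch on it, so that a peer is explicitly notified when the session behind the znode dies rather than being left to guess the reason for the failure. The connection string and znode paths are hypothetical.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: a process announces itself with an ephemeral znode; a peer
// watching that znode is notified when the session behind it dies, so the
// failure is detected rather than left ambiguous. Hosts and paths are
// hypothetical.
public class LivenessSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        // session-level events (connected, expired, ...)
                    }
                });

        // Parent znode must exist before creating children under it.
        try {
            zk.create("/hdfs-ha", new byte[0], Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) { }

        // Ephemeral: removed automatically by ZooKeeper if this session ends.
        zk.create("/hdfs-ha/bn-1", new byte[0], Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);

        // A peer (shown in the same process here for brevity) watches the
        // znode and is told when it disappears.
        zk.exists("/hdfs-ha/bn-1", new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.out.println("bn-1 is gone - start recovery");
                }
            }
        });

        Thread.sleep(60000);   // keep the session alive for the demo
        zk.close();
    }
}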
Also of interest is that ZooKeeper runs on a
cluster of computers called an ensemble and is
designed to be highly available through its replicated
mode, which gives it great potential for solving the
SPoF problem of HDFS. We propose to use it to
design and manage a high-availability BN cluster.
With this approach, if the Primary Backup Node
(PBN) fails, an election mechanism is initiated to
choose a new PBN from among a number of
Replicated Backup Nodes (RBN), as shown in
Figure 1.
Figure 1: ZooKeeper Service for PBN Election.
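One way to realize the PBN election is the standard ZooKeeper leader-election recipe based on ephemeral sequential znodes. The sketch below illustrates the idea for RBN replicas; the znode paths, class names, and timeouts are illustrative assumptions rather than details of our implementation.

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the standard ZooKeeper leader-election recipe applied to choosing
// a Primary Backup Node (PBN) among Replicated Backup Nodes (RBN).
// Paths, names, and timeouts are illustrative assumptions.
public class PbnElection implements Watcher {
    private static final String ELECTION = "/pbn-election";
    private final ZooKeeper zk;
    private String myNode;   // e.g. "n_0000000003"

    public PbnElection(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 5000, this);
        try {   // make sure the election parent znode exists
            zk.create(ELECTION, new byte[0], Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) { }
    }

    // Each RBN volunteers by creating an ephemeral sequential znode.
    public void volunteer() throws Exception {
        String path = zk.create(ELECTION + "/n_", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring(path.lastIndexOf('/') + 1);
        checkLeadership();
    }

    // The RBN with the lowest sequence number acts as PBN; every other RBN
    // watches its immediate predecessor and re-checks when it disappears.
    private void checkLeadership() throws Exception {
        List<String> children = zk.getChildren(ELECTION, false);
        Collections.sort(children);
        if (children.get(0).equals(myNode)) {
            System.out.println("this RBN is now the PBN");   // become primary
        } else {
            int idx = Collections.binarySearch(children, myNode);
            String predecessor = children.get(idx - 1);
            if (zk.exists(ELECTION + "/" + predecessor, this) == null) {
                checkLeadership();   // predecessor vanished in the meantime
            }
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
            try {
                checkLeadership();
            } catch (Exception e) {
                e.printStackTrace();   // a real RBN would retry or resign
            }
        }
    }
}

In this sketch, each RBN would construct PbnElection with the ensemble's connection string and call volunteer(); automatic failover then reduces to the deletion of the former PBN's ephemeral znode triggering a new leadership check on the surviving replicas.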