A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s standpoint. If, instead, you are seeking information on configuring and operating this feature, please refer to the CDH4 High Availability Guide.
Since the beginning of the project, HDFS has been designed around a very simple architecture: a master daemon, called the NameNode, stores filesystem metadata, while slave daemons, called DataNodes, store the filesystem data. The NameNode is highly reliable and efficient, and the simple architecture is what has allowed HDFS to reliably store petabytes of production-critical data in thousands of clusters for many years; however, for quite some time, the NameNode was also a single point of failure (SPOF) for an HDFS cluster. Since the first beta release of CDH4 in February, this issue has been addressed by the introduction of a Standby NameNode, which provides automatic hot failover capability to a backup. For a detailed discussion of the design of the HA NameNode, please refer to the earlier post by my colleague Aaron Myers.
Limitations of NameNode HA in Previous Versions
As described in the March blog post, NameNode High Availability relies on shared storage - in particular, it requires some place in which to store the HDFS edit log which can be written by the Active NameNode, and simultaneously read by the Standby NameNode. In addition, the shared storage must itself be highly available — if it becomes inaccessible, the Active NameNode will no longer be able to continue taking namespace edits.
In versions of HDFS prior to CDH4.1, we required that this shared storage be provided in the form of an NFS mount, typically on an enterprise-grade NAS device. For some organizations, this fit well with their existing operational practices, and indeed we have several customers running highly available NameNode setups in production environments. However, other customers and community members found a number of limitations with the NFS-based storage:
Custom hardware – the hardware requirements of a NAS device can be expensive. Additionally, fencing configurations may require a remotely controllable Power Distribution Unit (PDU) or other specialized hardware. In addition to the financial expenses, there may be operational costs: many organizations choose not to deploy NAS devices or other custom hardware in their datacenter.
Complex deployment – even after HDFS is installed, the administrator must take extra steps to configure NFS mounts, custom fencing scripts, etc. This complicates HA deployment and may even cause unavailability if misconfigured.
Poor NFS client implementations – Many versions of Linux include NFS client implementations that are buggy and difficult to configure. For example, it is easy for an administrator to misconfigure mount options in such a way that the NameNodes will freeze unrecoverably in some outage scenarios.
External dependencies - depending on a NAS device for storage requires that operators monitor and maintain one more piece of infrastructure. At the minimum, this involves configuring extra alerts and metrics, and in some organizations may also introduce an inter-team dependency: the operations team responsible for storage may be part of a different organizational unit as those responsible for Hadoop Operations.
Removing These Limitations
Given the above limitations and downsides, we evaluated many options and created a short list of requirements for a viable replacement:
No requirement for special hardware - like the rest of Hadoop, we should depend only on commodity hardware — in particular, on physical nodes that are part of existing clusters.
No requirement for custom fencing configuration - fencing methods such as STONITH require custom hardware; instead, we should rely only on software methods.
No SPOFs - since the goal here is HA, we don’t want to simply push the HA requirement onto another component.
Given the requirement to avoid SPOFs and custom hardware, we knew that any design we decided upon would involve storing multiple replicas of the metadata on multiple commodity nodes. Given this, we added the following additional requirements:
Configurable for any number of failures - rather than designing a system which only tolerates a single failure, we should give operators the flexibility to choose their desired level of resiliency, by adding extra replicas of the metadata.
One slow replica should not affect latency - because the metadata write path is a critical component for the performance of NameNode operations, we need to ensure that the latency remains low. If we have several replicas, we need to ensure that a failure or slow disk on one of the replicas does not impact the latency of the system.
Adding journal replicas should not negatively impact latency - if we allow administrators to configure extra replicas to tolerate several simultaneous failures, this should not result in an adverse impact on performance.
As a company focused on making Hadoop easier to deploy and operate, we also considered the following operational requirements:
Consistency with other Hadoop components- any new components introduced by the design should operate similarly to existing components; for example, they should use XML-based configuration files, log4j logging, and the same metrics framework.
Operations-focused metrics - since the system is a critical part of NameNode operation, we put a high emphasis on exposing metrics. The new system needs to expose all important metrics so that it can be operated in a long-lived production cluster and give early warnings of any problems — long before they cause unavailability.
Security - CDH offers comprehensive security, including encryption on the wire and strong authentication via Kerberos. Any new components introduced for the design must uphold the same standards as the rest of the stack: for customers requiring encryption, it is just as important to encrypt metadata as it is to encrypt data.
After discussion internally at Cloudera, with our customers, and with the community, we designed a system called QuorumJournalManager. This system is based around a simple idea: rather than store the HDFS edit logs in a single location (eg an NFS filer), instead store them on several locations, and use a distributed protocol to ensure that these several locations stay correctly synchronized. In our system, the remote storage is on a new type of HDFS daemon called the JournalNode. The NameNode acts as a client and writes edits to a set of JournalNodes, and considers the edits committed when they have been replicated successfully to a majority of these nodes.