This long run

Distributed systems in theory and in practice.

Monday, May 25, 2015

CAP: if all you have is a timeout, everything looks like a partition

“I suppose it is tempting, if the only tool you have is a timeout, to treat everything as if it were a partition.” - Abraham Maslow, quoted from memory

This post is part of the CAP theorem series. You may want to start by my post on ACID vs. CAP if you have a database background but have never really been exposed to the CAP theorem. The post discussing some traps in the ‘Availability’ and ‘Consistency’ definition of CAP should also be used as an introduction if you know CAP but haven’t looked at its formal definition.

Partitions in the CAP theorem are network partitions, and nothing else. There’re not node failure or process crash. But does this matter in real-life? Let’s answer this question by looking at an optimisation implemented on Apache HBase and Apache ZooKeeper when handling process crash. This is applicable by anyone using ZooKeeper.

Apache HBase, ZooKeeper and failure detection

Using ZooKeeper website: “ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.”

And HBase website: “Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.“ HBase is an open source implementation of Google’s Bigtable. Cluster size varies from 5 to x000 nodes.

ZooKeeper is used by HBase for a few things, one of them being, like in Bigtable, detecting failures. ZooKeeper manages failures by giving the user the possibility to create “ephemeral nodes”. “These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted.” And then “Expirations happen when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes).”

Ephemeral znodes are a key element of many use cases for ZooKeeper, such as distributed locks. The timeout means that if the process holding the lock crashes or is partitioned, the lock will be freed only after a few dozen seconds--typically 20 seconds.

HBase uses ZooKeeper and ephemeral znodes for failure detection (while the exclusion mechanism is not in ZooKeeper, as described in [V10]). The ZooKeeper timeout represents the major cost for MTTR in a failure scenario, and this for any type of failure: process crash, node failure, network partitions.

A tortoise can be fast--sometimes.

Network partition compared to process crash

There are two issues with network partition. First, it is complex to detect. Second, by definition, you don’t know what’s going on on the other side of the partition: it can be dead (then the CAP theorem does not apply), or not (then you have to choose between consistency and availability).

If all you have is a timeout, then you have no information about what happened: the process might be dead, might be definitively partitioned… or might be already back to life and kicking. A ‘partition-tolerant’ application has to expect the worse.

But a process crash is much simpler than a partition:

The node itself is still there, hence it can communicate the process’ death immediately.
Since the process is dead there is no risk to have a split brain situation.

So we can distinguish both cases we have an easy improvement.

Implementation details

There is a simple implementation here: HBase, as many Unix applications are, is launched with a shell script. The last line of the script launches the process. So we just have to add an ultimate line to this script to request ZooKeeper to delete the ephemeral znode. That’s implemented by the HBase scripts.

So simple?

Nearly. The script things is a little bit more complex than what I just said, because the processes are launched as background tasks. So the work in the script is in the order of 20 lines rather than just one.

More importantly, it also means that the software looking after the node timeout must have an API to allow an external update of the nodes status. It’s the case with ZooKeeper with this ephemeral znode deletion, even if it’s a little bit hacky: it would be better to expire the session instead of deleting a fixed znode.

Alternatives

ZOOKEEPER-922 is a very interesting change request. I copy/paste the plain clear description here: “In the case when a client connection is closed due to socket error instead of the client calling close explicitly, it would be nice to enable the session associated with that client to timeout faster than the negotiated session timeout. This would enable a zookeeper ensemble that is acting as a dynamic discovery provider to remove ephemeral nodes for crashed clients quickly, while allowing for a longer heartbeat-based timeout for java clients that need to do long stop-the-world GC.” These socket errors are clear because if the socket is closed while the remote node is still up, you do have immediate and (somehow) clear exceptions. So this is also leveraging the fact that the node is still up.

That’s exactly what we want. But change request contains also the analysis of the possible solutions:

“How do you deal with the following race condition?”
“There may be a way to do this reliably, but we need to vet the design first.”
“What happens if the client disconnects from one server and moves to another?”

This is CAP in action. A closed socket does not imply that the remote process is dead. Maybe the process is dead--that’s the case you want to optimize for, but you’re not sure, and sometimes it’s not dead (in this case, because the ZooKeeper client can try to connect to another node). The analysis is complex, error prone and time consuming. The ZooKeeper team tends to hold the commit when in doubt, and this change request is still open.

Is adding this extra case for process crash useful?

If your processes never crash, or are never killed by the operating system or by the administration team as a last resort, it’s not very useful. Also, if having walking-dead nodes in the cluster is not an issue, early death detection does not help much--this is more a CP thing than an AP one. But remember that most eventual consistency databases around have “tunable consistency” and can be used in a CP mode.

Interestingly, both Google’s Bigtable and Apache HBase teams added this independently to their implementation--it is not present in the original Bigtable paper. At Facebook--a large HBase user, it’s also reported as being very useful when doing a rolling upgrade: if anything goes wrong, it’s possible to kill9 the suspicious processes without impacting availability. Immediate notification, hence recovery, simplifies system administration. [V12] also mentions “a technique that does involve killing, which is used to increase the availability of Web servers and other services, is to deploy a local script that periodically checks if the application process is running. If not, or if the process has erratic behavior (such as very high CPU usage), the script restarts the application, killing it first if necessary.” This technique can be generalized if a crash is detected immediately.

Conclusion

Consensus servers, like ZooKeeper, rely on “unreliable failure detector”, as defined by Chandra and Toueg in 1996 [V3]. The simplest and most common design for these failure detectors is timeout-based: that’s the one ZooKeeper comes with. But there are a lot of performance benefits found when improving these failure detectors. Here we saw a cheap crash detector. But some network equipment can help as well [V11], with callbacks to allow for integration with your failure detector. Falcon [V12], “coordinating a network of spies, each monitoring a layer of the system” is also an interesting approach. Once again, this means that openness is a key requirement for a failure detector.

Keep in mind that the ‘any failure is a partition’ motto hides many possible improvements. Distinguishing the different cases and using all available information from all possible sources is key to improve system availability.