Jay Taylor's notes

What are the ways to design a self healing distributed system? Please provide practical references. - Quora

Original source (www.quora.com)
Tags: distributed-systems architecture self-healing www.quora.com
Clipped on: 2017-11-07

What are the ways to design a self healing distributed system? Please provide practical references.

4 Answers
  • Design your data structures (in memory and on disk) to facilitate detection and repair of errors - checksums, journals, snapshots and redundant copies, etc.
  • Use simpler protocols wherever possible.  Being able to prove correctness is good, but the real point in this context is to limit the number of possible states that your self-heal code has to deal with.  Nothing's worse than having the self-heal code itself be incorrect and/or unverifiable.
  • Maximize the resources available for self-healing.  Again, this is where many of Drew's points - especially #3 and #7 - become even more important.
  • Look for trouble.  Compare copies/fragments, look at logs, and so on to verify that things happened as they should and left the system in a correct state.  If they did not, repair the damage.
  • Make trouble.  In addition to Drew's suggestions about model checking or simulation, I recommend Netflix's Simian Army approach of injecting faults even into a live system.  As painful as it might be to deal with an error because the "chaos monkey" triggered an incorrect recovery path, it's even more painful when that same recovery path gets triggered some other way and you don't even know how or when.
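
To make the checksum, redundant-copy and "look for trouble" points concrete, here is a minimal sketch in Python. The names write_replicas and scrub and the in-memory dict layout are my own illustration (assumptions, not from the answer); a real system would keep the copies on separate disks or nodes.

```python
import hashlib


def checksum(data):
    """Return a SHA-256 hex digest used to detect corruption."""
    return hashlib.sha256(data).hexdigest()


def write_replicas(data, copies=3):
    """Store redundant copies, each with the checksum taken at write time.

    Hypothetical helper: the replicas are plain dicts held in memory.
    """
    return [{"data": data, "sum": checksum(data)} for _ in range(copies)]


def scrub(replicas):
    """Verify every copy and repair damaged ones from an intact one.

    Returns the number of replicas repaired. Raises RuntimeError when no
    copy passes verification, because there is nothing to heal from.
    """
    intact = [r for r in replicas if checksum(r["data"]) == r["sum"]]
    if not intact:
        raise RuntimeError("no intact replica left; cannot self-heal")
    source = intact[0]
    repaired = 0
    for replica in replicas:
        if checksum(replica["data"]) != replica["sum"]:
            # Overwrite the damaged copy with a known-good one.
            replica["data"] = source["data"]
            replica["sum"] = source["sum"]
            repaired += 1
    return repaired
```

Running scrub periodically is one way to find damage before a client read does. The sketch assumes all intact copies hold the same version of the data; handling divergent versions needs more than a checksum.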
3k Views · 9 Upvotes
Assume Everything Will Fail

Build your system with the assumption that every call to any external service (network/disk/etc) will fail. Be sure to handle that case.
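
As a rough sketch of what "handle that case" can look like in Python: wrap the external call, retry with backoff, and fall back to a default when every attempt fails. The function name call_with_retry and its parameters are hypothetical, chosen only for this example.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1,
                    fallback=None, sleep=time.sleep):
    """Run operation(), retrying on I/O errors with exponential backoff.

    Returns the fallback value when every attempt fails. The sleep
    argument is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            # Network and disk failures in Python surface as OSError
            # subclasses (ConnectionError, TimeoutError, and so on).
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback
```

Whether a fallback value is acceptable depends on the call. For some operations the right move is to surface the error instead of hiding it.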

Ensure Everything Will Fail

Take a page out of Netflix's book and build tools to inject failures into your system so you can be sure your system can recover.
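
This is not Netflix's tooling, just a small Python sketch of the idea: wrap a call so that it fails at a configurable rate, then check that the caller recovers. FaultInjector and its parameters are names I made up for illustration.

```python
import random


class FaultInjector:
    """Wrap callables so they fail at a configurable rate (hypothetical helper)."""

    def __init__(self, failure_rate, rng=None):
        self.failure_rate = failure_rate   # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()  # pass a seeded Random for repeatable runs

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            # random() returns a float in [0.0, 1.0), so a rate of 1.0
            # always injects a fault and a rate of 0.0 never does.
            if self.rng.random() < self.failure_rate:
                raise ConnectionError("injected fault")
            return func(*args, **kwargs)
        return wrapper
```

For example, FaultInjector(0.05).wrap(fetch_user) would make roughly one in twenty calls to a (hypothetical) fetch_user fail, which exercises the recovery path on a regular basis.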

Isolate Failures

Build your system such that failures in one area don't ripple out into other areas. Keeping various parts of the system loosely coupled will help with this.
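
One common way to get this kind of isolation is a bulkhead: cap the number of in-flight calls to each dependency so a slow one cannot use up every worker. That is a technique I am swapping in here, not something the answer names. A minimal Python sketch, with hypothetical names:

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    """Limit concurrent calls to one dependency (illustrative sketch)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing, so callers of a struggling
        # dependency do not pile up and starve the rest of the system.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("too many in-flight calls")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own Bulkhead instance means trouble in one of them only exhausts that compartment.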

Circuit Breaker Pattern

When a part of the system fails make sure you can (automatically) shut off traffic to that part of the system. This will help with isolation and will help speed recovery of the failed area.
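
A minimal Python sketch of the pattern, assuming a simple consecutive-failure threshold and a fixed reset timeout. The class and parameter names are mine, and this is not how Hystrix implements it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable so tests can control time
        self.failures = 0         # consecutive failures seen so far
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the failing component receives no traffic, which gives it room to recover. The first successful trial call after the timeout closes the circuit again.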

Real World Examples

I would look into what Netflix is doing with Hystrix.

Also look at Release It! (The Pragmatic Bookshelf).
616 Views · 6 Upvotes · Answer requested by Goms Muthuvinayagam