Jay Taylor's notes
What are the ways to design a self healing distributed system? Please provide practical references. - Quora
Original source (www.quora.com)
Clipped on: 2017-11-07
What are the ways to design a self healing distributed system? Please provide practical references.
4 Answers
- Design your data structures (in memory and on disk) to facilitate detection and repair of errors - checksums, journals, snapshots and redundant copies, etc.
- Use simpler protocols wherever possible. Being able to prove correctness is good, but the real point in this context is to limit the number of possible states that your self-heal code has to deal with. Nothing's worse than having the self-heal code itself be incorrect and/or unverifiable.
- Maximize the resources available for self-healing. Again, this is where many of Drew's points - especially #3 and #7 - become even more important.
- Look for trouble. Compare copies/fragments, look at logs, and so on to verify that things happened as they should and left things in a correct state. If not, repair it.
- Make trouble. In addition to Drew's suggestions about model checking or simulation, I recommend Netflix's Simian Army approach of injecting faults even into a live system. As painful as it might be to deal with an error because the "chaos monkey" triggered an incorrect recovery path, it's even more painful when that same recovery path gets triggered some other way and you don't even know how or when.
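
As a rough illustration of the "checksums and redundant copies" and "look for trouble" points above, here is a minimal Python sketch. The `Replica` class and `scrub` function are hypothetical names made up for this example, and the in-memory dict stands in for real storage; a production system would also need to handle concurrent writers, versioning, and tie-breaking.

```python
import hashlib
from collections import Counter


def checksum(data: bytes) -> str:
    """Content checksum stored next to the data so corruption is detectable."""
    return hashlib.sha256(data).hexdigest()


class Replica:
    """Hypothetical in-memory replica: each value is stored with its checksum."""

    def __init__(self):
        self.blocks = {}  # key -> (data, stored_checksum)

    def put(self, key, data: bytes):
        self.blocks[key] = (data, checksum(data))

    def get_verified(self, key):
        """Return the data only if it still matches its checksum, else None."""
        entry = self.blocks.get(key)
        if entry is None:
            return None
        data, stored = entry
        return data if checksum(data) == stored else None


def scrub(replicas, key) -> int:
    """Compare copies of `key` and repair bad or missing ones.

    The most common verified copy wins. Returns the number of replicas repaired.
    """
    copies = [r.get_verified(key) for r in replicas]
    votes = Counter(c for c in copies if c is not None)
    if not votes:
        raise RuntimeError(f"no verified copy of {key!r} left to repair from")
    winner, _ = votes.most_common(1)[0]
    repaired = 0
    for replica, copy in zip(replicas, copies):
        if copy != winner:
            replica.put(key, winner)  # overwrite the bad or missing copy
            repaired += 1
    return repaired
```

Running `scrub` periodically over every key is one simple way to "look for trouble" before a client reads a bad copy.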
Assume Everything Will Fail
Build your system with the assumption that every call to any external service (network, disk, etc.) will fail. Be sure to handle that case.
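
One common way to handle that case is to wrap each external call in a bounded retry with backoff and a fallback value. The sketch below is only an illustration: `call_with_retries` and its parameters are names invented here, and treating `OSError` (which covers `ConnectionError` and `TimeoutError` in Python 3) as the retryable error class is an assumption you would adjust for your own client library.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=0.1, fallback=None,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fn(); on a retryable error, back off exponentially and try again.

    Returns `fallback` if every attempt fails, so the caller always gets an
    answer instead of an unhandled exception.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt < attempts - 1:
                # Waits base_delay, then 2x, then 4x, ... between attempts.
                sleep(base_delay * (2 ** attempt))
    return fallback
```

Retries only make sense for operations that are safe to repeat; for anything non-idempotent you would want to fail fast or deduplicate on the receiving side.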
Ensure Everything Will Fail
Take a page out of Netflix's book and build tools to inject failures into your system so you can be sure your system can recover.
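
A very small version of such a tool is a wrapper that makes a dependency fail some fraction of the time. This is a sketch of the general idea only, not how Netflix's tooling works; `with_faults` and `InjectedFault` are hypothetical names introduced for the example.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure when a fault is injected."""


def with_faults(fn, failure_rate, rng=None):
    """Wrap fn so that roughly `failure_rate` of calls raise InjectedFault.

    `rng` can be a seeded random.Random for reproducible test runs.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {getattr(fn, '__name__', 'call')}")
        return fn(*args, **kwargs)

    return wrapper
```

Wrapping a client function this way in a staging environment (for example `fetch = with_faults(fetch, 0.05)`) exercises the recovery paths on purpose rather than waiting for a real outage to find out whether they work.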
Isolate Failures
Build your system such that failures in one area don't ripple out into other areas. Keeping various parts of the system loosely coupled will help with this.
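
One concrete isolation technique is a "bulkhead": cap how many calls may be in flight to each dependency, so a slow or failing one cannot use up all of the caller's capacity. The answer does not name this technique, so treat it as one possible reading; `Bulkhead` and `BulkheadFull` are hypothetical names for this sketch.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency already has its maximum calls in flight."""


class Bulkhead:
    """Limits concurrent calls to one dependency; use one instance per dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing, so callers are never
        # blocked behind a dependency that has stopped responding.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("dependency is saturated; call rejected")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()  # free the slot on success or failure
```

With a separate `Bulkhead` per downstream service, a hang in one service shows up as fast `BulkheadFull` errors for that service only, while calls to the others proceed normally.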
Circuit Breaker Pattern
When a part of the system fails, make sure you can (automatically) shut off traffic to that part of the system. This will help with isolation and will help speed recovery of the failed area.
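
A minimal sketch of the pattern in Python follows. It is not Hystrix's implementation; the class name, the consecutive-failure threshold, and the single trial call after a timeout are simplifying assumptions made for this example, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: shut off traffic to the failing part of the system.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and let one trial call probe recovery.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers get an immediate `CircuitOpenError` they can turn into a fallback response, and the failing service gets a quiet period in which to recover.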
Real World Examples
I would look into what Netflix is doing with Hystrix:
- Introducing Hystrix for Resilience Engineering
- Fault Tolerance in a High Volume, Distributed System
- Making the Netflix API More Resilient
- Hystrix
Also take a look at Release It! from The Pragmatic Bookshelf.