What are the ways to design a self-healing distributed system? Please provide practical references.
Design your data structures (in memory and on disk) to facilitate detection and repair of errors - checksums, journals, snapshots and redundant copies, etc.
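To make the checksum-plus-redundant-copies idea concrete, here is a minimal sketch in Python. The names seal, unseal and read_and_repair are hypothetical, and it assumes each record is stored as a SHA-256 digest followed by the payload, with the redundant copies held in a plain list. A real system would keep the copies on separate disks or nodes.

```python
import hashlib
from typing import List, Optional

DIGEST_LEN = 32  # SHA-256 digest size in bytes


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest so corruption is detectable."""
    return hashlib.sha256(payload).digest() + payload


def unseal(record: bytes) -> Optional[bytes]:
    """Return the payload if the stored digest matches it, otherwise None."""
    if len(record) < DIGEST_LEN:
        return None
    digest, payload = record[:DIGEST_LEN], record[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    return payload


def read_and_repair(copies: List[bytes]) -> Optional[bytes]:
    """Read from redundant copies and overwrite corrupt ones with a good one.

    Illustrative assumption: the copies live in an in-memory list.
    """
    good = next((c for c in copies if unseal(c) is not None), None)
    if good is None:
        return None  # every copy is damaged, so there is nothing to repair from
    for i, c in enumerate(copies):
        if unseal(c) is None:
            copies[i] = good  # repair in place
    return unseal(good)
```

The point of the layout is that detection needs no outside information: each record carries enough to tell whether it is intact, and any intact copy is enough to repair the others.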
Use simpler protocols wherever possible. Proving correctness is worthwhile, but the real point in this context is to limit the number of possible states that your self-heal code has to deal with. Nothing's worse than having the self-heal code itself be incorrect and/or unverifiable.
Maximize the resources available for self-healing. Again, this is where many of Drew's points - especially #3 and #7 - become even more important.
Look for trouble. Compare copies/fragments, look at logs, and so on to verify that things happened as they should and left the system in a correct state. If they didn't, repair the damage.
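As a rough illustration of "compare copies and repair", the sketch below scrubs a set of replicas and fixes any copy that disagrees with a strict majority. The function name scrub and the dict-per-replica model are hypothetical, and majority voting is my assumption for the example. Real systems more often decide the winner by version numbers, checksums or Merkle trees.

```python
from collections import Counter
from typing import Dict, List, Tuple


def scrub(replicas: List[Dict[str, bytes]]) -> List[Tuple[int, str]]:
    """Compare every key across replicas and repair minority or missing copies.

    Returns a list of (replica_index, key) pairs that were repaired.
    Assumption for illustration: a strict majority of replicas is trusted.
    """
    repaired = []
    keys = set().union(*(r.keys() for r in replicas))
    for key in sorted(keys):
        votes = Counter(r[key] for r in replicas if key in r)
        value, count = votes.most_common(1)[0]
        if count * 2 <= len(replicas):
            continue  # no strict majority: do not guess, leave it for escalation
        for i, r in enumerate(replicas):
            if r.get(key) != value:
                r[key] = value
                repaired.append((i, key))
    return repaired
```

Run something like this in the background on a schedule. Note that it refuses to act when there is no clear winner: a scrubber that guesses can spread corruption instead of removing it.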
Make trouble. In addition to Drew's suggestions about model checking or simulation, I recommend Netflix's Simian Army approach of injecting faults even into live systems. As painful as it might be to deal with an error because the "chaos monkey" triggered an incorrect recovery path, it's even more painful when that same recovery path gets triggered some other way and you don't even know how or when.
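Fault injection doesn't have to start with a full production setup. Below is a toy sketch, not Netflix's actual tooling: FlakyStore and with_retry are hypothetical names, the store is just a dict, and the only fault injected is a ConnectionError raised on a configurable fraction of calls. The idea is to force the recovery path (here, a simple retry loop) to run on purpose.

```python
import random


class FlakyStore:
    """Dict-backed store that fails a configurable fraction of calls on purpose."""

    def __init__(self, failure_rate: float, seed: int = 0):
        self._data = {}
        self._rng = random.Random(seed)  # seeded so that runs are repeatable
        self._failure_rate = failure_rate
        self.injected = 0  # number of faults injected so far

    def _maybe_fail(self):
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")

    def put(self, key, value):
        self._maybe_fail()
        self._data[key] = value

    def get(self, key):
        self._maybe_fail()
        return self._data[key]


def with_retry(op, attempts=5):
    """Recovery path under test: retry on ConnectionError, re-raise on the last try."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

A typical use would be to run the normal test suite against FlakyStore(0.2) and check that results are still correct and that the injected counter is non-zero, so you know the recovery code really was exercised.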