AWS is the leader when it comes to the cloud, and for good reason: it is well ahead in the quality and breadth of the services it offers.
However, when a service runs at the scale of AWS, some failures are to be expected. According to AWS, EBS volumes are designed for 99.999% availability.
The annual failure rate (AFR) is 0.1% - 0.2%, where failure means a complete or partial failure of the volume. For example, if you had 1,000 EBS volumes, you should expect 1 or 2 of them to fail per year. In our experience, partial failure is significantly more common than a complete loss. Even so, a partial loss can take a long time to resolve and can still be debilitating to a business.
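To make these figures concrete, here is a minimal sketch of the arithmetic in Python. The function names are ours, chosen for illustration. The "at least one failure" estimate also assumes that volumes fail independently of each other, which is a simplification and not something AWS states.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def expected_failures(volumes: int, afr: float) -> float:
    """Expected number of volumes that fail in a year at the given AFR."""
    return volumes * afr


def chance_of_any_failure(volumes: int, afr: float) -> float:
    """Probability that at least one volume fails in a year.

    Assumption (ours): failures are independent across volumes.
    """
    return 1 - (1 - afr) ** volumes


def yearly_downtime_minutes(availability: float) -> float:
    """Downtime per year implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for afr in (0.001, 0.002):  # the 0.1% - 0.2% range quoted above
        print(
            f"AFR {afr:.1%}: "
            f"{expected_failures(1000, afr):.1f} expected failures, "
            f"{chance_of_any_failure(1000, afr):.0%} chance of at least one"
        )
    print(f"99.999% availability: {yearly_downtime_minutes(0.99999):.2f} minutes of downtime per year")
```

Under these assumptions, a fleet of 1,000 volumes is more likely than not to see at least one failure in any given year, which is why planning for failure matters even at a low AFR.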
Over the years, some AWS failures have made news headlines because of the havoc they caused for companies and their users. These incidents put a spotlight on AWS’ imperfections.
In 2011, a major AWS failure took down hundreds of sites, including Quora and Reddit. From this outage, Netflix learnt to always be prepared by intentionally simulating failures with a tool called Chaos Monkey. In 2013, one of the biggest failures happened when the AWS U.S. East data center went down, affecting Netflix, Airbnb, Instagram and Amazon.com itself. Just weeks ago, on February 28th, AWS S3 storage had a major outage due to high error rates, again in the U.S. East data center. Prominent companies including Atlassian, Slack and Expedia were hit.
Introduction to High Availability and Disaster Recovery
In the real world, insurance is used to manage risk when a natural disaster such as a hurricane or flood strikes. In the database world, there are two critical methods of insurance. High Availability (HA) replicates the latest database version virtually instantly. Disaster Recovery (DR) offers continuous protection by saving every database change, allowing database restoration to any point in time.
In what follows, we’ll dig deeper into what disaster recovery and high availability are, as well as how we’ve implemented them for Citus Cloud.
What are High Availability and Disaster Recovery?
High availability and disaster recovery both protect your data, and they are distinct but inter-related: each covers a different kind of failure, and the two are typically used together. The difference between them is that HA keeps a secondary database replica (often referred to as a stand-by or follower) ready to take over at any moment, whereas DR just writes to cold storage (in the case of Amazon, that’s S3), so the main database takes longer to recover its data.
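The contrast can be illustrated with a toy model in Python. This is only a sketch of the idea, not how Citus Cloud or Postgres implements either mechanism: the `Primary` class, the in-memory `standby` dictionary and the `archive` list (standing in for cold storage such as S3) are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    timestamp: int
    key: str
    value: str


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    standby: dict = field(default_factory=dict)  # HA: warm copy kept up to date
    archive: list = field(default_factory=list)  # DR: append-only log of every change

    def write(self, timestamp: int, key: str, value: str) -> None:
        self.data[key] = value
        self.standby[key] = value  # HA: replicate the latest state right away
        self.archive.append(Change(timestamp, key, value))  # DR: save the change


def failover(primary: Primary) -> dict:
    """HA: the standby already holds the latest state, so takeover is immediate."""
    return dict(primary.standby)


def restore_to(archive: list, target_time: int) -> dict:
    """DR: rebuild state by replaying archived changes up to a point in time."""
    restored = {}
    for change in archive:  # the archive is in write order
        if change.timestamp > target_time:
            break
        restored[change.key] = change.value
    return restored


if __name__ == "__main__":
    db = Primary()
    db.write(1, "plan", "free")
    db.write(2, "plan", "pro")
    db.write(3, "plan", "deleted by mistake")

    print(failover(db))               # latest state, including the mistake
    print(restore_to(db.archive, 2))  # state as of time 2, before the mistake
```

The toy model shows the trade-off: the standby can take over without replaying anything but only ever reflects the latest state, while the archive must be replayed before it is usable but can be stopped at any earlier point in time.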
Overview of High Availability