Designing High Availability Systems
# CHAPTER 11
1. Chapter Introduction
In a FAANG interview, the interviewer will constantly try to break your system. They will ask: "What happens if this server dies? What happens if the database crashes? What happens if an earthquake takes out the entire AWS US-East-1 region?" If your answer is "the system goes down," you fail. This chapter explains how to design Highly Available (HA) systems using redundancy, fault tolerance, and multi-region disaster recovery.
2. What is High Availability (The "9s")
Availability is the percentage of time a system is fully operational. It is measured in "nines."
- 99% (Two 9s): System is down for ~3.65 days a year. (Unacceptable for most businesses).
- 99.9% (Three 9s): System is down for ~8.76 hours a year. (Standard for internal tools).
- 99.99% (Four 9s): System is down for ~52.6 minutes a year. (Standard enterprise goal).
- 99.999% (Five 9s): System is down for ~5.26 minutes a year. (The holy grail: Telecoms, Pacemakers, Aviation).
*To achieve Five 9s, human intervention is too slow. The system must automatically detect failures and heal itself.*
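The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_minutes_per_year` is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

The same function can be run in reverse as a sanity check on any uptime claim: divide the observed downtime by `MINUTES_PER_YEAR` and subtract from 1.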
3. Redundancy (Eliminating SPOFs)
A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. The core tenet of High Availability is: Eliminate all SPOFs through Redundancy.
- *Web Tier Redundancy:* Run 3 web servers instead of 1. If Server A dies, the Load Balancer routes traffic to B and C.
- *Load Balancer Redundancy:* Run an Active-Passive pair of load balancers, so the passive one takes over if the active one dies.
- *Database Redundancy:* Run a Master-Slave replication cluster. If the Master dies, promote a Slave.
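The web tier case can be sketched as a toy round-robin balancer that skips any server marked unhealthy. This is an illustration only: the `LoadBalancer` class and its methods are hypothetical, and a real load balancer would discover failures through its own health checks rather than explicit `mark_down` calls.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.health = {name: True for name in servers}
        self._ring = cycle(servers)

    def mark_down(self, name):
        self.health[name] = False

    def mark_up(self, name):
        self.health[name] = True

    def route(self):
        # Try each server at most once per request.
        for _ in range(len(self.health)):
            candidate = next(self._ring)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["A", "B", "C"])
lb.mark_down("A")                      # Server A dies
routes = [lb.route() for _ in range(4)]
print(routes)                          # traffic alternates between B and C
```

With a single server in the list, one `mark_down` call is enough to make `route()` raise, which is exactly what a SPOF looks like.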
4. Fault Tolerance vs. High Availability
These terms are often confused.
- High Availability: The system might experience a brief hiccup (for example, 2 seconds) when a server dies while the load balancer re-routes traffic, but it stays online.
- Fault Tolerance: The system is designed to have *zero* interruption. If a component fails, a perfectly mirrored component seamlessly takes its place with zero dropped packets. This typically requires very expensive, specialized hardware.
5. Multi-AZ and Multi-Region Architectures
Cloud providers like AWS define physical locations in two ways:
- 1. Availability Zones (AZ): One or more distinct data centers within the same region, physically separated from each other (e.g., US-East-1a and US-East-1b). They have independent power and cooling. *Rule:* Always deploy your servers across at least 2 AZs. If one AZ loses power, the servers in the other AZ keep serving traffic.
- 2. Regions: Entirely different geographic areas (e.g., Virginia vs. Tokyo).
Disaster Recovery (Multi-Region): What if a hurricane hits Virginia and takes out all AZs in US-East-1? A truly Highly Available system uses Multi-Region Deployment. You replicate your entire architecture in US-West (California). You use DNS failover (for example, Amazon Route 53 health checks) to monitor the Virginia region. If it goes dark, DNS automatically points traffic to California once clients' cached DNS records expire.
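The failover decision itself is simple to sketch. The example below only mimics the routing logic (answer with the first healthy region in priority order); it does not call Route 53 or any real DNS API, and the `Region` type, the `resolve` function, and the `example.com` endpoints are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    endpoint: str
    healthy: bool = True


def resolve(regions_by_priority):
    """Return the endpoint of the first healthy region, mimicking failover routing."""
    for region in regions_by_priority:
        if region.healthy:
            return region.endpoint
    raise RuntimeError("all regions are unhealthy")


virginia = Region("us-east", "east.example.com")
california = Region("us-west", "west.example.com")
regions = [virginia, california]       # Virginia is the primary

before = resolve(regions)              # normal operation: Virginia answers
virginia.healthy = False               # health checks against Virginia fail
after = resolve(regions)               # traffic now resolves to California
```

In practice the switch is not instantaneous: resolvers keep serving the old answer until the record's TTL expires, so a short TTL on the failover record matters.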
6. The Chaos Monkey
How do you know your system is highly available? You break it on purpose. Netflix popularized Chaos Engineering with a tool called Chaos Monkey, which randomly terminates production servers during business hours. If the engineering team built proper redundancy, the system auto-heals and users never notice. If the system crashes, the engineers learn where their SPOFs are.
7. Real-World Scenario: The Payment Gateway Crash
*Scenario:* An e-commerce site processes payments via Stripe. Stripe's API goes down globally for 1 hour. The e-commerce site cannot process orders and loses $1M in revenue. *The HA Fix (Graceful Degradation):* Do not fail hard. Implement a fallback mechanism. If the primary Stripe API is unreachable, the system automatically routes the payment request to a backup gateway (like PayPal or Braintree), or drops the order into a Kafka queue to be processed the moment Stripe comes back online.
8. Mini Project: Audit for SPOFs
Look at the following architecture and identify the Single Points of Failure:
Client -> Active Load Balancer -> [App Server 1, App Server 2] -> Single MySQL Database
*Answers:*
- 1. The Load Balancer (needs an Active-Passive pair).
- 2. The MySQL Database (needs a Master-Slave replication setup).
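The same audit can be automated in the spirit of the Chaos Monkey: remove one component at a time and check whether the client can still reach a database. This is a rough sketch under simplifying assumptions; the graph encoding, the node names, and the helpers `reachable` and `find_spofs` are made up for this exercise.

```python
def reachable(graph, start, goals, removed):
    """Depth-first search from start to any goal, ignoring the removed node."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node in goals:
            return True
        for neighbor in graph.get(node, []):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def find_spofs(graph, client, data_nodes):
    """Return every component whose removal cuts the client off from all data nodes."""
    components = set(graph)
    for neighbors in graph.values():
        components.update(neighbors)
    components.discard(client)
    spofs = []
    for component in sorted(components):
        goals = set(data_nodes) - {component}
        if not reachable(graph, client, goals, removed=component):
            spofs.append(component)
    return spofs


# The architecture from the exercise.
original = {
    "client": ["lb"],
    "lb": ["app1", "app2"],
    "app1": ["mysql"],
    "app2": ["mysql"],
}
# The fixed version: a load balancer pair and a replicated database.
fixed = {
    "client": ["lb_active", "lb_passive"],
    "lb_active": ["app1", "app2"],
    "lb_passive": ["app1", "app2"],
    "app1": ["mysql_master", "mysql_slave"],
    "app2": ["mysql_master", "mysql_slave"],
}

print(find_spofs(original, "client", {"mysql"}))
print(find_spofs(fixed, "client", {"mysql_master", "mysql_slave"}))
```

The first call reports the load balancer and the database, matching the answers above; the second reports nothing. The model only checks connectivity, so it says nothing about whether failover between the redundant components is actually automated.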
9. Common Mistakes
- Ignoring the Database: Having 100 redundant stateless web servers is useless if they all connect to a single database server. The database is usually the hardest component to make highly available.
- Manual Failover: Writing a runbook that says "If the master database dies, an engineer must wake up at 3 AM and manually promote the replica." Four 9s allows only about 52 minutes of downtime per year, so a single slow manual failover can consume the entire budget.
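Automated failover replaces the 3 AM runbook with a monitor that promotes a replica on its own. The sketch below shows only the decision logic, assuming health-check results are fed in from elsewhere; the `FailoverMonitor` class, the threshold of three failed checks, and the `db-*` names are assumptions for illustration.

```python
class FailoverMonitor:
    """Promotes a replica after several consecutive failed health checks."""

    def __init__(self, primary, replicas, failure_threshold=3):
        self.primary = primary
        self.replicas = list(replicas)
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record_check(self, primary_is_healthy):
        """Feed in one health-check result; return the current primary."""
        if primary_is_healthy:
            self.consecutive_failures = 0
            return self.primary
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.replicas:
            # Promote the first replica. A real system would pick the most
            # up-to-date replica and fence off the old primary first.
            self.primary = self.replicas.pop(0)
            self.consecutive_failures = 0
        return self.primary


monitor = FailoverMonitor("db-1", ["db-2", "db-3"])
results = [monitor.record_check(False) for _ in range(3)]
print(results)  # the third consecutive failure triggers promotion of db-2
```

Requiring several consecutive failures is a deliberate trade-off: it delays failover by a few check intervals, but it avoids promoting a replica because of one dropped health check.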
10. Best Practices
- Rate Limiting and Circuit Breakers: High availability isn't just about hardware failure; it's also about surviving malicious attacks and misbehaving dependencies. Implement Rate Limiting to limit the impact of abusive traffic such as DDoS attacks. Use the Circuit Breaker pattern to stop calling a downstream service that is repeatedly failing, preventing cascading system crashes.
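A circuit breaker pairs naturally with the fallback idea from the payment gateway scenario: after repeated failures it stops calling the primary and goes straight to the fallback until a cool-down has passed. The following is a minimal single-threaded sketch, not a production implementation; the `CircuitBreaker` class, its parameters, and the default thresholds are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the failing dependency
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

In the payment scenario, `primary` would be the call to the main gateway and `fallback` would either call the backup gateway or enqueue the order for later processing. Both are passed in as plain callables here so the sketch does not depend on any particular payment SDK.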
11. Exercises
- 1. If your system is down for 10 hours a year, how many "nines" of availability are you achieving?
- 2. Explain the difference between an Availability Zone (AZ) and a Region in AWS.
12. MCQs
What does the term "Five Nines" (99.999%) of Availability mean?
What is a Single Point of Failure (SPOF)?
How do you eliminate a Single Point of Failure?
What is the difference between an AWS Availability Zone (AZ) and a Region?
What is Multi-Region Disaster Recovery?
What is "Chaos Engineering" (e.g., Netflix's Chaos Monkey)?
What is "Graceful Degradation"?
If you have 50 Web Servers behind a single Load Balancer, is the system Highly Available?
What is a "Circuit Breaker" pattern in Microservices?
Why is manual human failover (e.g., an engineer waking up to restart a server) unacceptable for High Availability systems?
14. Interview Questions
- Q: "Design a globally available URL shortener. If the AWS US-East region goes completely offline, how does a user in New York still get redirected?"
15. FAQs
- Q: Does High Availability cost a lot of money?