CHAPTER 11 Beginner

Designing High Availability Systems

Updated: May 18, 2026
5 min read

1. Chapter Introduction

In a FAANG interview, the interviewer will constantly try to break your system. They will ask: "What happens if this server dies? What happens if the database crashes? What happens if an earthquake takes out the entire AWS us-east-1 region?" If your answer is "the system goes down," you fail. This chapter explains how to design Highly Available (HA) systems using redundancy, fault tolerance, and multi-region disaster recovery.

2. What is High Availability (The "9s")

Availability is the percentage of time a system is fully operational. It is measured in "nines."
  • 99% (Two 9s): System is down for ~3.65 days a year. (Unacceptable for most customer-facing businesses).
  • 99.9% (Three 9s): System is down for ~8.8 hours a year. (Standard for internal tools).
  • 99.99% (Four 9s): System is down for ~52.6 minutes a year. (Standard enterprise goal).
  • 99.999% (Five 9s): System is down for ~5.3 minutes a year. (The holy grail: Telecoms, Pacemakers, Aviation).

*To achieve Five 9s, human intervention is too slow. The system must automatically detect failures and heal itself.*
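The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget in minutes per year for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Each extra nine divides the downtime budget by ten, which is why the jump from Four 9s to Five 9s leaves only about five minutes a year for every failure combined.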

3. Redundancy (Eliminating SPOFs)

A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. The core tenet of High Availability is: Eliminate all SPOFs through Redundancy.
  • *Web Tier Redundancy:* Run 3 web servers instead of 1. If Server A dies, the Load Balancer routes traffic to B and C.
  • *Load Balancer Redundancy:* Run an Active-Passive pair of load balancers. If the active one dies, the passive one takes over its traffic.
  • *Database Redundancy:* Run a Master-Slave replication cluster. If the Master dies, promote a Slave.
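Why does redundancy help so much? A rough way to see it is to compute the availability of a tier that stays up as long as at least one copy is up. The sketch below assumes failures are independent, which real outages often are not (a bad deploy can take out every copy at once), so treat the numbers as an upper bound. The function names are illustrative.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of a tier with `copies` redundant components.

    The tier is down only if every copy is down at the same time.
    Assumes independent failures.
    """
    return 1 - (1 - single) ** copies


def serial_availability(*tiers: float) -> float:
    """Availability of a request path that needs every tier to be up."""
    result = 1.0
    for availability in tiers:
        result *= availability
    return result


if __name__ == "__main__":
    web = parallel_availability(0.99, 3)  # three 99% web servers
    db = 0.99                             # one 99% database (a SPOF)
    print(f"web tier: {web:.6f}")
    print(f"whole path: {serial_availability(web, db):.6f}")
```

Note how the single database drags the whole path back down to roughly 99%, no matter how many web servers you add.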

4. Fault Tolerance vs. High Availability

These terms are often confused.
  • High Availability: The system might experience a brief 2-second hiccup when a server dies while the load balancer re-routes traffic, but it stays online.
  • Fault Tolerance: The system is designed to have *zero* interruption. If a component fails, a perfectly mirrored component seamlessly takes its place with zero dropped packets. This typically requires expensive, specialized hardware.
*Most internet web applications aim for High Availability, not strict Fault Tolerance.*

5. Multi-AZ and Multi-Region Architectures

Cloud providers like AWS define physical locations in two ways:
  1. Availability Zones (AZs): One or more physically separate data centers within the same Region (e.g., us-east-1a and us-east-1b). They have independent power and cooling. *Rule:* Always deploy your servers across at least 2 AZs. If one data center loses power, the load balancer shifts traffic to the servers in the other AZ.
  2. Regions: Entirely different geographic areas (e.g., Virginia vs. Tokyo), each containing multiple AZs.

Disaster Recovery (Multi-Region): What if a hurricane hits Virginia and takes out all AZs in us-east-1? A truly Highly Available system uses Multi-Region Deployment. You replicate your entire architecture in a second Region such as us-west-1 (Northern California). You use DNS failover (for example, Amazon Route 53 health checks) to monitor the Virginia Region. If it goes dark, DNS starts answering with the California endpoints, and traffic shifts over as cached DNS records expire.
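The failover decision itself is simple: answer with the first healthy Region in priority order. The sketch below is a plain Python simulation of that logic, not the Route 53 API. The endpoint names and the `is_healthy` callback are hypothetical.

```python
from typing import Callable, List, Optional

# Hypothetical endpoints in priority order (primary first). Names are illustrative only.
REGION_ENDPOINTS = ["app.us-east-1.example.com", "app.us-west-1.example.com"]


def pick_endpoint(
    endpoints: List[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None if all fail."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    down = {"app.us-east-1.example.com"}  # simulate the primary Region going dark
    print(pick_endpoint(REGION_ENDPOINTS, lambda e: e not in down))
    # prints "app.us-west-1.example.com"
```

In a real deployment the health check would be an HTTP probe run by the DNS provider, and clients would only see the change once their cached record expires, so a short TTL matters.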

6. The Chaos Monkey

How do you know your system is highly available? You break it on purpose. Netflix popularized this practice, known as Chaos Engineering, with a tool called Chaos Monkey. It randomly terminates production instances during business hours, when engineers are available to respond. If the engineering team built proper redundancy, the system auto-heals and users never notice. If the system crashes, the engineers learn where their SPOFs are.
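To make the idea concrete, here is a toy chaos experiment in plain Python. It is a simulation under stated assumptions, not Netflix's actual tool: the `Cluster` class, the server names, and `chaos_round` are all hypothetical.

```python
import random


class Cluster:
    """A toy pool of servers behind a load balancer."""

    def __init__(self, servers):
        self.healthy = set(servers)

    def terminate(self, server):
        self.healthy.discard(server)

    def handle_request(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers: outage")
        # Deterministic stand-in for the load balancer's choice of server.
        return min(self.healthy)


def chaos_round(cluster, rng):
    """Terminate one randomly chosen healthy server and return its name."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.terminate(victim)
    return victim


if __name__ == "__main__":
    cluster = Cluster(["web-a", "web-b", "web-c"])
    victim = chaos_round(cluster, random.Random())
    print(f"terminated {victim}, request served by {cluster.handle_request()}")
```

Run the same experiment against a cluster with a single server and `handle_request` raises, which is exactly the SPOF the experiment is meant to expose.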

7. Real-World Scenario: The Payment Gateway Crash

*Scenario:* An e-commerce site processes payments via Stripe. Stripe's API goes down globally for 1 hour. The e-commerce site cannot process orders and loses $1M in revenue.

*The HA Fix (Graceful Degradation):* Do not fail hard. Implement a fallback mechanism. If the primary Stripe API is unreachable, the system automatically routes the payment request to a backup gateway (like PayPal or Braintree). Alternatively, it accepts the order and places it in a durable queue (such as Kafka) to be processed the moment Stripe comes back online.
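A minimal sketch of this fallback chain, combining both options: try each gateway in order, and queue the order if none responds. Everything here is hypothetical. The gateway functions are stand-ins rather than real Stripe or PayPal client calls, and a `deque` stands in for a durable queue like Kafka.

```python
from collections import deque


class GatewayUnavailable(Exception):
    """Raised by a gateway stand-in when its API cannot be reached."""


def charge_with_fallback(order, gateways, retry_queue):
    """Try each (name, charge_fn) pair in order; queue the order if all fail."""
    for name, charge in gateways:
        try:
            charge(order)
            return f"charged via {name}"
        except GatewayUnavailable:
            continue  # degrade gracefully: move on to the next gateway
    retry_queue.append(order)  # process later, once a gateway recovers
    return "queued"


def primary_down(order):
    raise GatewayUnavailable("primary gateway unreachable")


def backup_ok(order):
    return None  # pretend the charge succeeded


if __name__ == "__main__":
    queue = deque()
    order = {"id": 1, "amount_cents": 4999}
    print(charge_with_fallback(order, [("primary", primary_down), ("backup", backup_ok)], queue))
    # prints "charged via backup"
    print(charge_with_fallback(order, [("primary", primary_down)], queue))
    # prints "queued"
```

Only the "gateway unreachable" error triggers the fallback. A declined card should still fail immediately, since retrying it elsewhere would not help.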

8. Mini Project: Audit for SPOFs

Look at the following architecture and identify the Single Points of Failure:

Client -> Active Load Balancer -> [App Server 1, App Server 2] -> Single MySQL Database

*Answers:*
  1. The Load Balancer (needs an Active-Passive pair).
  2. The MySQL Database (needs a Master-Slave replication setup).
*Note: The App Servers are redundant (there are two of them).*
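The same audit can be expressed as a tiny check: describe each tier by how many instances it runs, and flag any tier with fewer than two. This is a deliberately simplified sketch with hypothetical names. It only counts instances and does not catch shared dependencies such as a single AZ or a single network link.

```python
from typing import Dict, List


def find_spofs(tiers: Dict[str, int]) -> List[str]:
    """Return the names of tiers that run fewer than two instances."""
    return [name for name, count in tiers.items() if count < 2]


if __name__ == "__main__":
    architecture = {
        "load balancer": 1,
        "app server": 2,
        "mysql database": 1,
    }
    print(find_spofs(architecture))
    # prints ['load balancer', 'mysql database']
```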

9. Common Mistakes

  • Ignoring the Database: Having 100 redundant stateless web servers is useless if they all connect to a single database server. The database is usually the hardest tier to make highly available, because it holds state that must be replicated.
  • Manual Failover: Writing a runbook that says "If the master database dies, an engineer must wake up at 3 AM and manually promote the replica." Human failover is too slow to achieve Four 9s of availability: the yearly budget is only about 52 minutes, and a single manual recovery can use up most of it.
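Automated failover replaces the 3 AM runbook with a monitor that promotes a replica after a few consecutive failed health checks. The sketch below shows only that decision logic, with hypothetical names. Real failover tooling also has to deal with replication lag and split-brain, which this example ignores.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DbCluster:
    primary: str
    replicas: List[str] = field(default_factory=list)
    consecutive_failures: int = 0


def observe_health_check(cluster: DbCluster, primary_healthy: bool, max_failures: int = 3) -> bool:
    """Record one health check result. Return True if a replica was promoted."""
    if primary_healthy:
        cluster.consecutive_failures = 0
        return False
    cluster.consecutive_failures += 1
    if cluster.consecutive_failures >= max_failures and cluster.replicas:
        # Promote the first replica; the failed primary is dropped from the cluster.
        cluster.primary = cluster.replicas.pop(0)
        cluster.consecutive_failures = 0
        return True
    return False


if __name__ == "__main__":
    cluster = DbCluster(primary="db-1", replicas=["db-2", "db-3"])
    for healthy in (True, False, False, False):
        promoted = observe_health_check(cluster, healthy)
    print(promoted, cluster.primary)
    # prints "True db-2"
```

Requiring several consecutive failures avoids promoting a replica because of one dropped health check.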

10. Best Practices

  • Rate Limiting and Circuit Breakers: High availability isn't just about hardware failure; it's also about surviving traffic spikes and malicious attacks. Implement Rate Limiting to shed abusive or excessive traffic (it helps against application-layer floods, though large DDoS attacks also need network-level protection). Use the Circuit Breaker pattern to stop calling a downstream service that is repeatedly failing, preventing cascading system crashes.
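A Circuit Breaker can be sketched in a few lines. The version below is a minimal illustration, not a production library: the class name, thresholds, and injectable `clock` parameter are assumptions made for the example. After a set number of consecutive failures it "opens" and fails fast, then lets one trial call through once the timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage is a matter of wrapping the remote call, for example `breaker.call(fetch_inventory, item_id)` where `fetch_inventory` is whatever function talks to the failing service. Failing fast keeps threads and connections from piling up behind a dependency that is already down.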

11. Exercises

  1. If your system is down for 10 hours a year, how many "nines" of availability are you achieving?
  2. Explain the difference between an Availability Zone (AZ) and a Region in AWS.

12. MCQs

Question 1

What does the term "Five Nines" (99.999%) of Availability mean?

Question 2

What is a Single Point of Failure (SPOF)?

Question 3

How do you eliminate a Single Point of Failure?

Question 4

What is the difference between an AWS Availability Zone (AZ) and a Region?

Question 5

What is Multi-Region Disaster Recovery?

Question 6

What is "Chaos Engineering" (e.g., Netflix's Chaos Monkey)?

Question 7

What is "Graceful Degradation"?

Question 8

If you have 50 Web Servers behind a single Load Balancer, is the system Highly Available?

Question 9

What is a "Circuit Breaker" pattern in Microservices?

Question 10

Why is manual human failover (e.g., an engineer waking up to restart a server) unacceptable for High Availability systems?

14. Interview Questions

  • Q: "Design a globally available URL shortener. If the AWS US-East region goes completely offline, how does a user in New York still get redirected?"

15. FAQs

  • Q: Does High Availability cost a lot of money?
A: Yes. Running a full copy of your stack in a second Region can roughly double your infrastructure bill. You must weigh the cost of the infrastructure against the financial cost of 1 hour of downtime for your specific business.

16. Summary

High Availability ensures a system stays online despite inevitable hardware failures, aiming for 99.99% uptime. The core strategy is eliminating Single Points of Failure (SPOFs) through redundancy at the Load Balancer, Web, and Database tiers. To survive localized outages, deploy across multiple Availability Zones. To survive natural disasters, deploy across multiple geographic Regions and utilize DNS failover.

17. Next Chapter Recommendation

We know we need database redundancy, but replicating massive relational databases is incredibly complex. In Chapter 12: Database Scaling and Sharding, we will dive deep into Master-Slave replication, Read Replicas, and the ultimate scaling technique: Sharding.
