AWS Backup and Disaster Recovery
# CHAPTER 26
1. Introduction
Hardware fails. Data centers lose power. Humans make mistakes and accidentally delete critical production databases. If you do not have a robust disaster recovery plan, a single failure can destroy a company instantly. In the cloud, hope is not a strategy; replication is. In this chapter, we will explore the mechanisms of Data Durability, the importance of EBS Snapshots, and how to utilize AWS Backup to automate a bulletproof Disaster Recovery (DR) strategy across multiple geographic Regions.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
- Understand how to create EC2/EBS Snapshots.
- Utilize AWS Backup for centralized backup management.
- Understand Cross-Region Replication for ultimate disaster recovery.
- Differentiate between the 4 main Disaster Recovery strategies.
3. Beginner-Friendly Explanation
Imagine writing a 500-page novel.
- No Backup: You save it only on your laptop. If you spill coffee on the laptop, the book is gone forever.
- RPO (How much data you can afford to lose): You decide to save a copy to a USB drive every night at midnight. If you spill coffee at 5:00 PM, you lose everything you wrote that specific day, but you have yesterday's copy. Your RPO is 24 hours.
- RTO (How fast you need to recover): Your laptop dies. You must drive to the store, buy a new laptop, and copy the files from the USB. This takes 4 hours. Your RTO is 4 hours.
Disaster Recovery is the science of minimizing RPO (saving data frequently) and minimizing RTO (restoring the system rapidly) at the lowest possible cost.
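The novel example can be worked through as simple date arithmetic. This is a minimal sketch: the function names are invented for illustration, and splitting the 4-hour recovery into three steps is an assumption (the text only gives the total).

```python
from datetime import datetime, timedelta


def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written after the last backup is lost when the failure hits."""
    return failure - last_backup


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of work."""
    return backup_interval


def recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Achieved RTO is the total time of every step needed to be working again."""
    return sum(steps.values(), timedelta())


# Nightly backup at midnight, coffee spill at 5:00 PM the same day.
last_backup = datetime(2024, 1, 10, 0, 0)
spill = datetime(2024, 1, 10, 17, 0)

lost = data_lost(last_backup, spill)
rpo = worst_case_rpo(timedelta(hours=24))

# Hypothetical breakdown of the 4-hour recovery described in the text.
rto = recovery_time({
    "drive_to_store": timedelta(hours=1),
    "buy_laptop": timedelta(hours=1),
    "copy_from_usb": timedelta(hours=2),
})

print(f"Lost in this incident: {lost}, worst case (RPO): {rpo}, RTO: {rto}")
```

In this incident 17 hours of writing are lost, which is within the 24-hour RPO. Backing up more often shrinks the RPO; pre-buying a spare laptop would shrink the RTO.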
4. EBS Snapshots
An EBS Snapshot is a point-in-time backup of an EBS volume, the virtual hard drive attached to an EC2 instance. When you click "Create Snapshot," AWS captures the state of the volume and stores it in Amazon S3 (in storage that AWS manages, so it does not appear in your own buckets). Snapshots are incremental. The first snapshot copies all the data on the volume, say 50 GB. The next day, if only 2 GB of blocks have changed, the second snapshot stores only that 2 GB difference, which greatly reduces storage cost.
If an EC2 instance crashes, you can use the Snapshot to create a brand-new EBS volume with your data intact and attach it to a new server.
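To see why incremental snapshots are cheaper, the 50 GB / 2 GB example can be extended to a month. This is a simplified model, assuming the same 2 GB changes every day, no snapshots are deleted, and no compression is applied; the function names are illustrative only.

```python
def incremental_storage_gb(volume_gb: float, daily_change_gb: float, days: int) -> float:
    """First snapshot stores all the data; each later one stores only changed blocks."""
    if days < 1:
        return 0
    return volume_gb + daily_change_gb * (days - 1)


def full_copy_storage_gb(volume_gb: float, days: int) -> float:
    """Naive alternative: store a complete copy of the drive every day."""
    return volume_gb * days


incremental = incremental_storage_gb(volume_gb=50, daily_change_gb=2, days=30)
full = full_copy_storage_gb(volume_gb=50, days=30)

print(f"Incremental: {incremental} GB, full copies: {full} GB")
```

Under these assumptions, 30 daily snapshots need 50 + 29 × 2 = 108 GB, compared with 1,500 GB for 30 full copies.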
5. AWS Backup
Managing snapshots manually across 50 servers and 10 databases is chaotic. AWS Backup is a centralized, fully managed service. You create a Backup Plan: *"Every night at 2:00 AM, take a snapshot of every EC2 instance and RDS database that has the tag Environment: Production. Keep the backups for 30 days, then delete them."*
AWS Backup automates the schedule and retention that your RPO depends on.
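The logic of that Backup Plan can be pictured in plain Python. This is a conceptual sketch of what the plan decides, not how AWS Backup is implemented; the resource inventory and IDs are made up for illustration.

```python
from datetime import date

RETENTION_DAYS = 30
SELECTION_TAG = ("Environment", "Production")

# Hypothetical inventory of resources and their tags.
resources = [
    {"id": "i-web-1", "type": "EC2", "tags": {"Environment": "Production"}},
    {"id": "i-dev-1", "type": "EC2", "tags": {"Environment": "Dev"}},
    {"id": "db-orders", "type": "RDS", "tags": {"Environment": "Production"}},
    {"id": "db-scratch", "type": "RDS", "tags": {}},
]


def selected_ids(resources: list[dict], tag: tuple[str, str] = SELECTION_TAG) -> list[str]:
    """Return the IDs of resources carrying the tag the plan selects on."""
    key, value = tag
    return [r["id"] for r in resources if r["tags"].get(key) == value]


def expired(backup_dates: list[date], today: date, retention_days: int = RETENTION_DAYS) -> list[date]:
    """Return the backups that have reached the retention limit and should be deleted."""
    return [d for d in backup_dates if (today - d).days >= retention_days]


to_back_up = selected_ids(resources)
to_delete = expired(
    [date(2024, 2, 15), date(2024, 3, 1), date(2024, 3, 2)],
    today=date(2024, 3, 31),
)

print("Back up tonight:", to_back_up)
print("Delete as expired:", to_delete)
```

Tagging a new server with Environment: Production is enough to bring it under the plan; nobody has to remember to add it to a list.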
6. Cross-Region Replication (The Ultimate Fail-Safe)
What if a meteor strikes the us-east-1 (N. Virginia) AWS Region, destroying all the data centers? If your live database AND your backup snapshots are both in N. Virginia, you lose the company.
True Disaster Recovery requires Cross-Region Replication.
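You configure AWS Backup to take the snapshot in N. Virginia and then copy it over the AWS global network to a backup vault in a second Region, such as us-west-1 (N. California). AWS Backup calls this feature cross-Region copy. The copy starts once the backup completes, so it is not instantaneous, but your data can now survive the loss of an entire Region.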
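In an AWS Backup plan, a cross-Region copy is expressed as a copy action attached to a backup rule. The sketch below builds such a rule as a plain dictionary using the field names of the boto3 create_backup_plan request (CopyActions, DestinationBackupVaultArn, Lifecycle). The helper names, the account ID, and the vault name dr-vault are placeholders, and nothing here calls AWS.

```python
def region_of(vault_arn: str) -> str:
    """Extract the Region from arn:aws:backup:<region>:<account>:backup-vault:<name>."""
    parts = vault_arn.split(":")
    if len(parts) < 7 or parts[2] != "backup" or parts[5] != "backup-vault":
        raise ValueError(f"not a backup vault ARN: {vault_arn}")
    return parts[3]


def add_cross_region_copy(rule: dict, source_region: str,
                          dest_vault_arn: str, delete_after_days: int) -> dict:
    """Return a copy of the rule with a copy action to a vault in another Region."""
    if region_of(dest_vault_arn) == source_region:
        raise ValueError("destination vault is in the source Region; no DR benefit")
    copy_action = {
        "DestinationBackupVaultArn": dest_vault_arn,
        "Lifecycle": {"DeleteAfterDays": delete_after_days},
    }
    return {**rule, "CopyActions": rule.get("CopyActions", []) + [copy_action]}


# Placeholder account ID and vault name.
DR_VAULT = "arn:aws:backup:us-west-1:123456789012:backup-vault:dr-vault"

nightly_rule = {"RuleName": "Nightly-Rule", "TargetBackupVaultName": "Default"}
dr_rule = add_cross_region_copy(nightly_rule, "us-east-1", DR_VAULT, delete_after_days=30)

print(dr_rule["CopyActions"])
```

The check on the destination Region guards against the failure mode described above: a "backup copy" that lives in the same Region as the original.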
7. The 4 Disaster Recovery Strategies
You must balance cost vs. recovery speed (RTO):
- 1. Backup and Restore (Cheapest, Slowest): Take daily snapshots. If a disaster hits, you manually launch new EC2 instances and restore the databases from the snapshots. (RTO: Hours).
- 2. Pilot Light (Cheap, Faster): You keep a tiny, minimal version of your app running in a second Region (like a pilot light on a stove). In a disaster, you "turn the gas on," rapidly scaling up the tiny app to full size. (RTO: Tens of Minutes).
- 3. Warm Standby (Expensive, Fast): You keep a scaled-down but fully functional version of your app running in a second Region at all times. In a disaster, you redirect DNS to it and scale it up to full capacity. (RTO: Minutes).
- 4. Multi-Site Active/Active (Massive Cost, Instant): You run 100% full production environments in both N. Virginia and California simultaneously. Traffic flows to both. If Virginia dies, California takes 100% of the traffic instantly. (RTO: Seconds/Zero).
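The trade-off can be framed as "pick the cheapest strategy whose RTO still meets the business requirement." The sketch below does exactly that. The numeric RTO values are illustrative assumptions standing in for "hours," "tens of minutes," "minutes," and "seconds"; real figures depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float  # illustrative assumption, not an AWS figure


# Ordered cheapest first, matching the list above.
STRATEGIES = [
    Strategy("Backup and Restore", 240),
    Strategy("Pilot Light", 30),
    Strategy("Warm Standby", 5),
    Strategy("Multi-Site Active/Active", 0.1),
]


def cheapest_strategy(required_rto_minutes: float) -> Strategy:
    """Return the first (cheapest) strategy that recovers within the required RTO."""
    for strategy in STRATEGIES:
        if strategy.typical_rto_minutes <= required_rto_minutes:
            return strategy
    # Nothing is fast enough on paper; Active/Active is the fastest option there is.
    return STRATEGIES[-1]


for requirement in (480, 60, 10, 1):
    print(f"RTO {requirement} min -> {cheapest_strategy(requirement).name}")
```

A team that can tolerate eight hours of downtime has no reason to pay for Warm Standby; a payment processor that cannot tolerate one minute has little choice but Active/Active.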
8. Mini Project: Configure Automated Backups
Let's automate the safety of our EC2 server.
Step-by-Step Conceptual Tutorial:
- 1. Open the AWS Console and search for AWS Backup.
- 2. Click Create backup plan. Choose Build a new plan. Name it Daily-Production-Backup.
- 3. Add a Backup Rule:
  - Rule Name: Nightly-Rule.
  - Frequency: Daily.
  - Backup window: Start at 05:00 UTC (middle of the night).
  - Lifecycle: Transition to cold storage: Never. Expire (delete) after: 30 Days. (This prevents you from paying for 10-year-old backups.)
  - Click Save Rule.
- 4. Assign Resources: Tell the plan what to back up. Click "Assign resources".
- 5. Assignment name: Prod-Servers.
- 6. Resource selection: Choose Tags. Key: Environment, Value: Production.
- 7. Click Assign.
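The same plan can be created from code instead of the console. The sketch below builds the request bodies for the boto3 AWS Backup client (create_backup_plan and create_backup_selection) from the values used in the steps above. Targeting the vault named Default, the helper names, and the IAM role ARN shown in the usage comment are assumptions; the backup client is passed in so the request shapes can be inspected without AWS credentials.

```python
def build_backup_plan() -> dict:
    """Request body for create_backup_plan, mirroring steps 2 and 3."""
    return {
        "BackupPlanName": "Daily-Production-Backup",
        "Rules": [
            {
                "RuleName": "Nightly-Rule",
                "TargetBackupVaultName": "Default",  # assumed vault name
                "ScheduleExpression": "cron(0 5 ? * * *)",  # daily at 05:00 UTC
                # No MoveToColdStorageAfterDays key: never transition to cold storage.
                "Lifecycle": {"DeleteAfterDays": 30},
            }
        ],
    }


def build_backup_selection(iam_role_arn: str) -> dict:
    """Request body for create_backup_selection, mirroring steps 4 to 6."""
    return {
        "SelectionName": "Prod-Servers",
        "IamRoleArn": iam_role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "Environment",
                "ConditionValue": "Production",
            }
        ],
    }


def apply_plan(backup_client, iam_role_arn: str) -> str:
    """Create the plan, assign resources by tag, and return the new plan ID."""
    plan = backup_client.create_backup_plan(BackupPlan=build_backup_plan())
    backup_client.create_backup_selection(
        BackupPlanId=plan["BackupPlanId"],
        BackupSelection=build_backup_selection(iam_role_arn),
    )
    return plan["BackupPlanId"]


# Usage against a real account (requires boto3 and credentials; role ARN is a placeholder):
#   import boto3
#   apply_plan(boto3.client("backup"),
#              "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole")
```

Keeping the plan in code makes it reviewable and repeatable, which matters once the same policy has to exist in more than one account or Region.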
9. Best Practices
- Test Your Backups: A backup is useless if it is corrupted and you don't realize it until the disaster happens. Run "Game Days" on a regular schedule, for example every 6 months: intentionally shut down a server and prove that the engineering team can restore the database from a Snapshot within the required RTO.
10. Common Mistakes
- Confusing Snapshots with AMIs: An AMI (Amazon Machine Image) is a blueprint for the *entire* server (OS + configuration + attached hard drives), used to launch new servers. A Snapshot is a backup of a *single* hard drive. You can create an AMI from a Snapshot, but they serve different immediate purposes.
11. Exercises
- 1. Define the difference between Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
- 2. Why are EBS Snapshots considered highly cost-effective compared to making full raw copies of a hard drive every day?
12. MCQs with Answers
- 1. An enterprise company requires that if a catastrophic natural disaster destroys their primary AWS Region, they must be able to restore their database in a completely different geographical location. What AWS Backup feature MUST be enabled to satisfy this requirement?
- 2. Which Disaster Recovery strategy involves maintaining a scaled-down, minimal version of the core architecture running constantly in a secondary region, ready to be scaled up rapidly during an emergency?
13. Interview Questions
- Q: Explain the concept of Incremental Backups in AWS EBS Snapshots. How does this mechanism drastically reduce monthly storage costs for a database that is backed up daily?
- Q: Contrast the "Backup and Restore" DR strategy with the "Multi-Site Active/Active" strategy. In what business scenario would a company be forced to pay the massive premium required for Active/Active?