Disaster Recovery
Introduction
Disaster recovery (DR) restores systems after region-wide failures, data corruption, or ransomware. Key metrics: RPO (Recovery Point Objective — max acceptable data loss) and RTO (Recovery Time Objective — max downtime). Strategies: backup/restore, pilot light, warm standby, active-active multi-region.
HLD interviews typically probe multi-AZ versus multi-region trade-offs, DNS failover, database promotion, and chaos testing. Targets vary by domain: banking typically demands an RPO of minutes or less, while analytics workloads may tolerate hours.
This lesson covers DR tiers, runbooks, and testing culture.
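One way to make the strategy choice concrete is to pick the cheapest tier whose typical recovery time still fits the RTO target. The Python sketch below does that; the per-strategy recovery times are illustrative assumptions rather than vendor figures, and `cheapest_strategy` is a hypothetical helper name.

```python
# Typical recovery time per strategy, listed cheapest first.
# These minute values are assumptions chosen for illustration only.
STRATEGIES = [
    ("backup/restore", 240),
    ("pilot light", 60),
    ("warm standby", 10),
    ("active-active", 0),
]


def cheapest_strategy(rto_target_minutes: float) -> str:
    """Return the cheapest strategy whose typical recovery time fits the RTO."""
    for name, typical_recovery_minutes in STRATEGIES:
        if typical_recovery_minutes <= rto_target_minutes:
            return name
    # Nothing cheaper fits, so fall back to the most capable tier.
    return "active-active"


if __name__ == "__main__":
    for rto in (480, 30, 1):
        print(f"RTO {rto} min -> {cheapest_strategy(rto)}")
```

The function keys on RTO only. RPO is driven mostly by backup frequency and replication mode, so it deserves a separate check.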
Understanding the topic
Key concepts
- Multi-AZ: isolates failures within a single region; an RDS Multi-AZ failover typically completes in one to two minutes.
- Multi-region: geographic DR; with asynchronous replication, the replication lag sets the floor on achievable RPO.
- Backup: snapshots plus WAL archiving for point-in-time recovery (PITR); test a restore at least quarterly.
- Active-passive: the secondary region runs scaled down until failover.
- Active-active: both regions serve traffic, which makes conflict resolution harder.
- Chaos engineering: validates failover before a real disaster does.
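The claim that async replication lag sets the RPO floor can be checked against monitoring data. The sketch below assumes lag samples (in seconds) have already been collected from the replica; the function names and sample values are hypothetical.

```python
from typing import List, Sequence


def worst_case_lag_seconds(lag_samples: Sequence[float]) -> float:
    """Largest observed lag: the most data a failover could have lost."""
    if not lag_samples:
        raise ValueError("need at least one lag sample")
    return max(lag_samples)


def rpo_breaches(lag_samples: Sequence[float], rpo_target_seconds: float) -> List[float]:
    """Samples where a failover at that moment would have exceeded the RPO."""
    return [lag for lag in lag_samples if lag > rpo_target_seconds]


if __name__ == "__main__":
    samples = [2.0, 5.5, 41.0, 3.2]  # made-up lag readings in seconds
    print("worst case:", worst_case_lag_seconds(samples))
    print("breaches of a 30s RPO:", rpo_breaches(samples, 30.0))
```

Alerting on the worst case rather than the average matters here, because a disaster does not wait for a quiet moment.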
Internal architecture
Architecture overview
flowchart LR
    Primary[Region A] -->|replicate| Secondary[Region B]
    Secondary -->|failover| Traffic
Step-by-step explanation
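- The primary region serves all traffic; the secondary holds an asynchronous DB replica whose lag is monitored against an RPO threshold of under 30 seconds.
- Route 53 health checks detect primary failure and fail DNS over to the secondary region's ALB.
- S3 Cross-Region Replication copies static assets and backups to the secondary region.
- Kafka MirrorMaker 2 replicates critical topics.
- Runbook: promote the replica, switch DNS (record TTL kept at 60 seconds so clients pick up the change quickly), and run smoke tests.
- Game days twice a year execute a full regional failover drill.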
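The runbook steps above can be sketched as a small orchestration function. This is a minimal illustration, not a production failover controller: every callable passed in (`primary_healthy`, `promote_replica`, `switch_dns`, `smoke_test`) is a hypothetical hook you would wire to your own tooling, and the threshold of three failed probes is an assumption.

```python
from typing import Callable, List


def run_failover(
    primary_healthy: Callable[[], bool],
    promote_replica: Callable[[], None],
    switch_dns: Callable[[], None],
    smoke_test: Callable[[], bool],
    failure_threshold: int = 3,
) -> List[str]:
    """Fail over only after `failure_threshold` consecutive failed probes."""
    # A real loop would sleep between probes; omitted to keep the sketch short.
    for attempt in range(1, failure_threshold + 1):
        if primary_healthy():
            return [f"check {attempt}: primary healthy, no failover"]

    log = [f"{failure_threshold} consecutive health check failures"]
    promote_replica()
    log.append("replica promoted")
    switch_dns()
    log.append("DNS switched to secondary")
    if smoke_test():
        log.append("smoke tests passed")
    else:
        log.append("smoke tests FAILED: escalate to on-call")
    return log


if __name__ == "__main__":
    steps = run_failover(
        primary_healthy=lambda: False,   # simulate a dead primary
        promote_replica=lambda: None,
        switch_dns=lambda: None,
        smoke_test=lambda: True,
    )
    print("\n".join(steps))
```

Requiring several consecutive failures before acting guards against failing over on a single dropped probe, at the price of a slightly longer detection time.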
Informative example
Infrastructure DR snippet: an RDS cross-region read replica with Route 53 failover, written as conceptual YAML (illustrative keys, not real Terraform syntax):
disaster_recovery:
  rpo_target_seconds: 60
  rto_target_minutes: 15
  strategy: warm_standby
  database:
    primary_region: us-east-1
    dr_replica_region: us-west-2
    replication: async
    promotion_runbook: docs/dr/db-promote.md
  dns:
    record: api.shop.example.com
    primary: us-east-1-alb
    secondary: us-west-2-alb
    health_check_path: /actuator/health/readiness
    failover_policy: secondary_on_primary_failure
  backups:
    rds_snapshots: daily
    retention_days: 35
    restore_test_schedule: quarterly
State RPO/RTO aloud in interviews. Match spend to tier — not everything needs active-active.
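When stating an RTO, it helps to show where the minutes go. The sketch below adds up a rough failover time budget and compares it with the 15-minute target from the snippet. The 60-second TTL comes from the runbook step earlier in this lesson; the probe interval, failure threshold, promotion time, and smoke test duration are assumed values for illustration.

```python
def estimated_rto_seconds(
    check_interval_s: int,
    failure_threshold: int,
    promotion_s: int,
    dns_ttl_s: int,
    smoke_test_s: int,
) -> int:
    """Rough sequential budget: detect, promote, wait out DNS caches, verify."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + promotion_s + dns_ttl_s + smoke_test_s


if __name__ == "__main__":
    rto_target_s = 15 * 60  # rto_target_minutes: 15

    # Assumed: 30s probes, 3 failures to trip, 5 min promotion, 2 min smoke tests.
    with_short_ttl = estimated_rto_seconds(30, 3, 300, 60, 120)
    with_long_ttl = estimated_rto_seconds(30, 3, 300, 3600, 120)

    print(with_short_ttl, with_short_ttl <= rto_target_s)  # 570 True
    print(with_long_ttl, with_long_ttl <= rto_target_s)    # 4110 False
```

With a one-hour TTL the DNS cache alone blows the budget, which is why the TTL should be lowered well before any failover is needed.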
Real-world use
Real-world use cases
- Banking core: active-passive across regions with near-zero RPO, paid for with the cost and write latency of synchronous replication.
- E-commerce: warm standby; an RTO of 30 minutes is acceptable outside peak periods.
- OTT: catalog metadata served active-active across regions, with origin failover.
- Healthcare: HIPAA requires documented and tested backup restore procedures.
Best practices
- Automate failover where possible — manual steps fail at 3am.
- Test restores: a backup that has never been restored is wishful thinking.
- Document RPO/RTO per tier-1 service.
- Treat multi-AZ as the minimum for every production service to limit blast radius.
- Encrypt backups; immutable backup vault against ransomware.
- Run game days with cross-team participation.
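Restore testing is easy to let slip, so a scheduled check can flag it. The sketch below assumes a quarterly cadence approximated as 90 days and a hypothetical record of when each backup set was last restored successfully.

```python
from datetime import date, timedelta


def restore_test_overdue(last_tested: date, today: date, max_age_days: int = 90) -> bool:
    """True when the last successful restore test is older than the allowed age."""
    return (today - last_tested) > timedelta(days=max_age_days)


if __name__ == "__main__":
    today = date(2024, 5, 1)
    # Hypothetical records: backup set name -> date of last successful restore.
    last_restore = {
        "orders-db": date(2024, 1, 1),
        "catalog-db": date(2024, 3, 1),
    }
    for name, tested in last_restore.items():
        status = "OVERDUE" if restore_test_overdue(tested, today) else "ok"
        print(f"{name}: {status}")
```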
Common mistakes
- DR region never tested: failover then fails on subtle config drift.
- DNS TTLs set to hours: traffic shifts slowly during failover.
- Assuming multiple AZs give region-level redundancy; they do not survive a regional outage.
- Active-active without a conflict resolution design.
- No runbook owner: steps and contacts go stale.
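Config drift between regions, the first mistake in the list above, can be caught by diffing the two regions' settings on a schedule. The sketch below assumes each region's configuration has already been exported to a flat dictionary; the keys and values are invented for illustration.

```python
MISSING = "<missing>"


def config_drift(primary: dict, secondary: dict, ignore=frozenset()) -> dict:
    """Return {key: (primary_value, secondary_value)} for every differing key."""
    drift = {}
    for key in sorted(set(primary) | set(secondary)):
        if key in ignore:
            continue  # intentional difference, e.g. a scaled-down standby
        p = primary.get(key, MISSING)
        s = secondary.get(key, MISSING)
        if p != s:
            drift[key] = (p, s)
    return drift


if __name__ == "__main__":
    primary = {"db_engine_version": "15.4", "instance_count": 6,
               "tls_policy": "TLS1.2", "feature_flag_x": True}
    secondary = {"db_engine_version": "15.2", "instance_count": 2,
                 "tls_policy": "TLS1.2"}
    # instance_count differs on purpose in a warm standby, so it is ignored.
    print(config_drift(primary, secondary, ignore={"instance_count"}))
```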
Advanced interview questions
Q1 (Beginner): What is the difference between RPO and RTO?
Q2 (Beginner): Multi-AZ vs multi-region?
Q3 (Intermediate): Warm standby vs pilot light?
Q4 (Intermediate): How do you measure replication lag for RPO?
Q5 (Advanced): Design DR for a global payment API with a 99.99% SLA.
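For the 99.99% SLA question, it helps to convert the percentage into a downtime budget before choosing a strategy. The sketch below does the arithmetic, assuming a 365-day year and treating every failover as costing the full RTO; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(sla_percent: float) -> float:
    """Minutes of downtime per year that the SLA permits."""
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR


def failovers_within_budget(sla_percent: float, rto_minutes: float) -> int:
    """How many full-RTO outages fit in the yearly budget."""
    return int(downtime_budget_minutes(sla_percent) // rto_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.99)
    print(f"99.99% allows about {budget:.2f} minutes of downtime per year")
    print("15-minute failovers that fit:", failovers_within_budget(99.99, 15))
```

At 99.99% the budget is roughly 52.6 minutes a year, so a 15-minute RTO leaves room for only three incidents. That is the usual argument for automating failover or moving toward active-active for a payment API.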
Summary
A DR plan defines RPO/RTO targets and the failover architecture that meets them. Use multi-AZ for AZ failures and multi-region for geographic disasters. Test backup restores and run failover drills regularly. Active-active costs more, so reserve it for the highest SLA tiers. Runbooks and automation reduce RTO under stress. Case studies apply the full HLD toolkit to real products.