High-Level Design Tutorial, Lesson 34: “Scalable systems, HLD interviews & case studies”

Disaster Recovery

Introduction

Disaster recovery (DR) restores systems after region-wide failures, data corruption, or ransomware. Key metrics: RPO (Recovery Point Objective — max acceptable data loss) and RTO (Recovery Time Objective — max downtime). Strategies: backup/restore, pilot light, warm standby, active-active multi-region.
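
To make the two metrics concrete, here is a minimal sketch that checks one incident against its objectives. The `IncidentTimeline` class, the `assess` function, and all timestamps are hypothetical names and values chosen for illustration, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    last_replicated_at: datetime  # newest write known to be safe in the DR copy
    failed_at: datetime           # moment the primary stopped serving
    restored_at: datetime         # moment service was healthy again


def assess(timeline: IncidentTimeline, rpo: timedelta, rto: timedelta) -> dict:
    """Compare actual data loss and downtime with the stated objectives."""
    data_loss = timeline.failed_at - timeline.last_replicated_at
    downtime = timeline.restored_at - timeline.failed_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    # Made-up incident: 20 s of unreplicated writes, 12 min of downtime.
    t = IncidentTimeline(
        last_replicated_at=datetime(2024, 1, 1, 11, 59, 40),
        failed_at=datetime(2024, 1, 1, 12, 0, 0),
        restored_at=datetime(2024, 1, 1, 12, 12, 0),
    )
    print(assess(t, rpo=timedelta(seconds=60), rto=timedelta(minutes=15)))
```

RPO is measured backwards from the failure (how much recent data is gone), while RTO is measured forwards from it (how long until service returns).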

HLD interviews ask about multi-AZ versus multi-region, DNS failover, database promotion, and chaos testing. Banking demands an RPO of minutes or less (near zero for ledger data); analytics may tolerate hours.

This lesson covers DR tiers, runbooks, and testing culture.

Understanding the topic

Key concepts

Multi-AZ: failure isolation within a single region; an RDS Multi-AZ failover typically completes in about one to two minutes.
Multi-region: geographic DR; with async replication, the replication lag sets the floor on achievable RPO.
Backup: snapshots plus WAL archiving for point-in-time recovery (PITR); test a restore at least quarterly.
Active-passive: the secondary region runs scaled down until failover.
Active-active: both regions serve traffic, which makes conflict resolution harder.
Chaos engineering: validates failover before a real disaster does.
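
To show why active-active needs an explicit conflict rule, here is a minimal sketch of one common option, last-write-wins (LWW). It is only one possible technique, swapped in for illustration; the `Versioned` class and `merge_lww` function are hypothetical names. LWW silently discards the losing write, so it is usually unsuitable for money movement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # writer's wall-clock time; clock skew is a known weakness
    region: str        # tie-breaker so both regions pick the same winner


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Return the winning write; the result does not depend on argument order."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    east = Versioned("shipped", 1000, "us-east-1")
    west = Versioned("cancelled", 1005, "us-west-2")
    # Both regions apply the same rule, so they converge on the same value.
    print(merge_lww(east, west).value, merge_lww(west, east).value)
```

The region name breaks timestamp ties deterministically, which is what lets both sides converge without coordinating.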

Internal architecture

Architecture overview

text

flowchart LR
  Traffic[Client traffic] -->|normal| Primary[Region A]
  Primary -->|replicate| Secondary[Region B]
  Traffic -->|on failover| Secondary

Step-by-step explanation

The primary region is active; the secondary region hosts an async DB replica, with replication lag monitored against an RPO of under 30 s.
Route 53 health checks fail DNS over to the secondary ALB when the primary fails.
S3 cross-region replication copies static assets and backups.
Kafka MirrorMaker 2 replicates critical topics to the secondary region.
Runbook: promote the replica, switch DNS (records kept at a 60 s TTL so the change propagates quickly), then run smoke tests.
Game days twice a year execute a full regional failover drill.
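
The runbook order above can be sketched as a small orchestrator. This is a minimal illustration, not a real failover tool: `run_failover` and the lambda stand-ins are hypothetical, and in practice each step would call your own cloud automation.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure so a human can take over."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break
    return log


if __name__ == "__main__":
    # Stand-ins for real automation; each returns True on success.
    steps = [
        ("promote replica", lambda: True),
        ("switch DNS to secondary", lambda: True),
        ("smoke tests", lambda: True),
    ]
    for line in run_failover(steps):
        print(line)
```

Stopping at the first failed step matters: switching DNS to a region whose database was never promoted would send traffic to a read-only replica.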

Informative example

Infrastructure DR snippet: an RDS cross-region read replica and Route 53 failover, written as conceptual YAML (illustrative keys, not real Terraform or CloudFormation syntax):

yaml

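# Targets that every other setting below has to satisfy
disaster_recovery:
  rpo_target_seconds: 60
  rto_target_minutes: 15
  strategy: warm_standby

# Async cross-region replica: replication lag must stay under the RPO target
database:
  primary_region: us-east-1
  dr_replica_region: us-west-2
  replication: async
  promotion_runbook: docs/dr/db-promote.md

# Failover routing: the secondary answers only when the primary health check fails
dns:
  record: api.shop.example.com
  primary: us-east-1-alb
  secondary: us-west-2-alb
  health_check_path: /actuator/health/readiness
  failover_policy: secondary_on_primary_failure

# Last line of defense against corruption or ransomware
backups:
  rds_snapshots: daily
  retention_days: 35
  restore_test_schedule: quarterly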

State RPO/RTO aloud in interviews. Match spend to tier — not everything needs active-active.
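
One way to reason about matching spend to tier is to pick the cheapest strategy whose limits still fit the targets. The sketch below assumes illustrative limits per strategy; the numbers in `STRATEGIES` are placeholders for discussion, not benchmarks or vendor figures, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    min_rto_minutes: float  # assumed fastest recovery for this tier (illustrative)
    min_rpo_seconds: float  # assumed smallest data-loss window (illustrative)


# Ordered cheapest first. All limits are placeholder assumptions.
STRATEGIES = [
    Strategy("backup_restore", min_rto_minutes=240, min_rpo_seconds=3600),
    Strategy("pilot_light", min_rto_minutes=60, min_rpo_seconds=300),
    Strategy("warm_standby", min_rto_minutes=10, min_rpo_seconds=30),
    Strategy("active_active", min_rto_minutes=1, min_rpo_seconds=0),
]


def cheapest_strategy(rto_minutes: float, rpo_seconds: float) -> str:
    """Pick the first (cheapest) strategy whose assumed limits fit the targets."""
    for s in STRATEGIES:
        if s.min_rto_minutes <= rto_minutes and s.min_rpo_seconds <= rpo_seconds:
            return s.name
    raise ValueError("no listed strategy meets these targets")


if __name__ == "__main__":
    # Targets taken from the YAML example: RTO 15 min, RPO 60 s.
    print(cheapest_strategy(rto_minutes=15, rpo_seconds=60))
```

Under these assumed limits, the targets from the YAML example land on warm standby rather than active-active, which is the point of tiering: pay for the stricter strategy only when the targets demand it.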

Real-world use

Real-world use cases

Banking core: active-passive across regions with near-zero RPO, paid for with the latency and cost of synchronous replication.
E-commerce: warm standby; an RTO of 30 min is acceptable outside peak periods.
OTT streaming: catalog metadata served active-active across regions, with origin failover.
Healthcare: HIPAA requires documented, tested backup and restore procedures.

Best practices

Automate failover where possible — manual steps fail at 3am.
Test restores — backup without restore test is wishful thinking.
Document RPO and RTO for every tier-1 service.
Limit blast radius: deploy multi-AZ as the minimum, always.
Encrypt backups and keep them in an immutable backup vault to resist ransomware.
Run game days with cross-team participation.

Common mistakes

DR region never tested, so failover fails on subtle config drift.
DNS TTLs set to hours, so traffic shifts slowly during failover.
Assuming multi-AZ redundancy is equivalent to multi-region redundancy.
Active-active without a conflict resolution design.
No runbook owner, so steps and contacts go stale.
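
The DNS TTL mistake is easy to quantify. As a rough sketch, the time before clients stop reaching the failed region is about the health-check detection time plus one TTL. The function name is hypothetical; the 30 s interval and threshold of 3 match Route 53's standard health-check defaults, and the model assumes resolvers honor the TTL, which some do not.

```python
def worst_case_shift_seconds(check_interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Approximate time until clients stop resolving to the failed region."""
    detection = check_interval_s * failure_threshold  # consecutive failed checks
    return detection + dns_ttl_s                      # cached answers expire after one TTL


if __name__ == "__main__":
    for ttl in (60, 3600):
        total = worst_case_shift_seconds(check_interval_s=30, failure_threshold=3, dns_ttl_s=ttl)
        print(f"TTL {ttl}s: about {total / 60:.1f} min before traffic has shifted")
```

With a 60 s TTL the estimate is 150 s; with a one-hour TTL it is 3690 s, which alone would exceed a 15-minute RTO.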

Advanced interview questions

Q1 (Beginner): RPO vs RTO?

RPO is the maximum acceptable window of data loss; RTO is the maximum acceptable downtime before service is restored.

Q2 (Beginner): Multi-AZ vs multi-region?

Multi-AZ survives the failure of one availability zone within a region; multi-region survives the loss of an entire region.

Q3 (Intermediate): Warm standby vs pilot light?

Pilot light keeps only a minimal core running in the DR region; warm standby keeps a scaled-down but working subset running, so it is ready faster at a higher cost.

Q4 (Intermediate): How do you measure replication lag for RPO?

Monitor database replica lag in seconds and alert when it exceeds the RPO budget.
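
A minimal sketch of that alerting rule follows. It assumes lag samples in seconds are already being collected from whatever the database exposes (for example the RDS `ReplicaLag` CloudWatch metric); `classify_lag`, `worst_status`, and the 50% warning fraction are hypothetical choices for illustration.

```python
from typing import Iterable


def classify_lag(lag_seconds: float, rpo_budget_s: float, warn_fraction: float = 0.5) -> str:
    """Return 'ok', 'warn', or 'breach' for one lag sample."""
    if lag_seconds > rpo_budget_s:
        return "breach"
    if lag_seconds > rpo_budget_s * warn_fraction:
        return "warn"
    return "ok"


def worst_status(samples: Iterable[float], rpo_budget_s: float) -> str:
    """Summarise a window of samples by its most severe status."""
    order = {"ok": 0, "warn": 1, "breach": 2}
    statuses = [classify_lag(s, rpo_budget_s) for s in samples]
    return max(statuses, key=order.__getitem__, default="ok")


if __name__ == "__main__":
    # Made-up samples against a 60 s RPO budget.
    print(worst_status([2.0, 5.0, 40.0, 12.0], rpo_budget_s=60))
```

Warning before the budget is exhausted gives operators time to react while the RPO is still being met.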

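Q5 (Advanced): DR design for a global payment API with a 99.99% SLA.

Run active-active across two regions. Replicate the ledger shards synchronously with quorum writes and everything else asynchronously. Use Route 53 latency-based routing with health-check failover. Target RPO 0 for money movement and RPO 60 s for analytics. Hold a quarterly game day and keep immutable backups.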
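
It helps to turn the SLA into a number before defending the design. The sketch below computes the yearly downtime budget for an SLA and the combined availability of several regions, under the simplifying assumptions that regions fail independently and any single region can carry all traffic. The 99.9% per-region figure is an illustrative input, not a provider guarantee.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(sla: float) -> float:
    """Minutes of downtime per year allowed by an availability SLA."""
    return (1 - sla) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Availability if regions fail independently and any one can serve all traffic."""
    return 1 - (1 - per_region) ** regions


if __name__ == "__main__":
    print(round(downtime_budget_minutes(0.9999), 1), "minutes per year at 99.99%")
    print(round(combined_availability(0.999, 2), 6), "with two regions at 99.9% each")
```

A 99.99% SLA leaves roughly 52.6 minutes of downtime per year, so a single failover that takes 15 minutes already consumes more than a quarter of the annual budget.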

Summary

DR plans define RPO, RTO, and the failover architecture. Use multi-AZ for AZ failure and multi-region for geographic disaster. Test backup restores and run failover drills regularly. Active-active costs more, so reserve it for the highest SLA tiers. Runbooks and automation reduce RTO under stress. Case studies apply the full HLD toolkit to real products.
