High-Level Design Tutorial · Lesson 34

    Disaster Recovery

    Disaster recovery (DR) restores systems after region-wide failures, data corruption, or ransomware.

    Introduction

    Disaster recovery (DR) restores systems after region-wide failures, data corruption, or ransomware. Two metrics drive every DR decision: RPO (Recovery Point Objective), the maximum acceptable window of data loss, and RTO (Recovery Time Objective), the maximum acceptable downtime. The main strategies, in increasing order of cost and recovery speed, are backup/restore, pilot light, warm standby, and active-active multi-region.
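
To make the two metrics concrete, here is a minimal Python sketch that computes the achieved RPO and RTO from an incident timeline. The `Incident` class, its field names, and the drill timestamps are illustrative assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_recoverable_point: datetime  # newest data that survived (e.g. last replicated write)
    failure_time: datetime            # when the primary became unavailable
    service_restored: datetime        # when users could be served again


def achieved_rpo(incident: Incident) -> timedelta:
    """Data-loss window: writes made after the last recoverable point are lost."""
    return incident.failure_time - incident.last_recoverable_point


def achieved_rto(incident: Incident) -> timedelta:
    """Downtime: from the failure until service is restored."""
    return incident.service_restored - incident.failure_time


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the data-loss window and the downtime stay within target."""
    return achieved_rpo(incident) <= rpo and achieved_rto(incident) <= rto


# Hypothetical drill timeline (made-up timestamps).
drill = Incident(
    last_recoverable_point=datetime(2024, 1, 1, 3, 0, 0),
    failure_time=datetime(2024, 1, 1, 3, 0, 20),
    service_restored=datetime(2024, 1, 1, 3, 12, 0),
)
print(achieved_rpo(drill))  # 0:00:20
print(achieved_rto(drill))  # 0:11:40
print(meets_objectives(drill, rpo=timedelta(seconds=60), rto=timedelta(minutes=15)))  # True
```

Measuring both numbers after every drill shows whether the stated objectives are actually being met.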

    HLD interviews ask about multi-AZ vs multi-region trade-offs, DNS failover, database promotion, and chaos testing. Banking workloads demand an RPO of minutes or less; analytics workloads may tolerate hours.

    This lesson covers DR tiers, runbooks, and testing culture.

    Understanding the topic

    Key concepts

    • Multi-AZ: failure isolation within a single region. An RDS Multi-AZ failover typically completes within a minute or two.
    • Multi-region: geographic DR. With async replication, the replication lag sets the floor on the achievable RPO.
    • Backup: snapshots plus WAL archiving for point-in-time recovery (PITR). Test a restore at least quarterly.
    • Active-passive: the secondary region runs scaled down until failover.
    • Active-active: both regions serve traffic, which makes conflict resolution harder.
    • Chaos engineering: validates failover before a real disaster does.
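
The point that replication lag bounds the RPO can be sketched as a simple check. This is a minimal illustration: the function name `rpo_status`, the 50% warning ratio, and the sample values are assumptions, and real lag samples would come from the database's own replication metrics.

```python
def rpo_status(lag_samples_s: list[float], rpo_budget_s: float,
               warn_ratio: float = 0.5) -> str:
    """Classify replica lag against the RPO budget.

    The worst observed lag approximates how much data a failover
    performed right now could lose.
    """
    if not lag_samples_s:
        # No samples is itself a problem: the replica may be disconnected.
        return "unknown"
    worst = max(lag_samples_s)
    if worst > rpo_budget_s:
        return "breach"
    if worst > warn_ratio * rpo_budget_s:
        return "warning"
    return "ok"


print(rpo_status([2.0, 5.5, 3.1], rpo_budget_s=60))  # ok
print(rpo_status([12.0, 41.0, 18.0], rpo_budget_s=60))  # warning
print(rpo_status([12.0, 75.0], rpo_budget_s=60))  # breach
```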
    Internal architecture

    Architecture overview

    text
    flowchart LR
    Primary[Region A] -->|replicate| Secondary[Region B]
    Secondary -->|failover| Traffic

    Step-by-step explanation

    1. The primary region serves traffic; the secondary region holds an async DB replica whose lag is monitored and kept under 30 s to protect the RPO.
    2. Route 53 health checks fail DNS over to the secondary ALB when the primary becomes unhealthy.
    3. S3 cross-region replication copies static assets and backups.
    4. Kafka MirrorMaker 2 replicates critical topics.
    5. Runbook: promote the replica, switch DNS (records use a 60 s TTL so the change propagates quickly), and verify with smoke tests.
    6. Twice-yearly game days execute a full regional failover drill.
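
The runbook in step 5 can be sketched as an ordered list of automated steps that stops at the first failure and reports elapsed time against the RTO. Everything here is hypothetical: `run_failover` and the three placeholder actions stand in for real cloud API calls and are not an actual library.

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_failover(steps: list[Step], rto_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Run runbook steps in order and stop at the first failure.

    Returns which steps completed and, on success, whether the RTO budget held.
    """
    started = clock()
    completed: list[str] = []
    for name, action in steps:
        if not action():
            return {"ok": False, "failed_step": name, "completed": completed,
                    "elapsed_s": clock() - started}
        completed.append(name)
    elapsed = clock() - started
    return {"ok": True, "failed_step": None, "completed": completed,
            "elapsed_s": elapsed, "within_rto": elapsed <= rto_seconds}


# Placeholder actions (hypothetical); real ones would call the cloud provider.
def promote_replica() -> bool:
    return True


def switch_dns_to_secondary() -> bool:
    return True


def run_smoke_tests() -> bool:
    return True


result = run_failover(
    [("promote_replica", promote_replica),
     ("switch_dns", switch_dns_to_secondary),
     ("smoke_tests", run_smoke_tests)],
    rto_seconds=15 * 60,
)
print(result["ok"], result["completed"])
```

Stopping at the first failed step keeps the system in a known state, so an operator can pick up from a named step instead of guessing what ran.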

    Informative example

    Infrastructure DR snippet: an RDS cross-region read replica with Route 53 failover, written as conceptual YAML (an illustrative schema, not actual Terraform or CloudFormation syntax):

    yaml
    disaster_recovery:
      rpo_target_seconds: 60
      rto_target_minutes: 15
      strategy: warm_standby
      database:
        primary_region: us-east-1
        dr_replica_region: us-west-2
        replication: async
        promotion_runbook: docs/dr/db-promote.md
      dns:
        record: api.shop.example.com
        primary: us-east-1-alb
        secondary: us-west-2-alb
        health_check_path: /actuator/health/readiness
        failover_policy: secondary_on_primary_failure
      backups:
        rds_snapshots: daily
        retention_days: 35
        restore_test_schedule: quarterly
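
A config like this can be sanity-checked before it is trusted. The sketch below mirrors part of the YAML as a Python dict (in practice a YAML parser would load it) and applies a few checks. The function `validate_dr_config` and the specific rules are assumptions chosen for illustration, not a standard.

```python
# Hypothetical in-memory mirror of the YAML config above.
CONFIG = {
    "disaster_recovery": {
        "rpo_target_seconds": 60,
        "rto_target_minutes": 15,
        "strategy": "warm_standby",
        "database": {
            "primary_region": "us-east-1",
            "dr_replica_region": "us-west-2",
            "replication": "async",
        },
        "dns": {"primary": "us-east-1-alb", "secondary": "us-west-2-alb"},
        "backups": {"retention_days": 35, "restore_test_schedule": "quarterly"},
    }
}


def validate_dr_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    dr = cfg["disaster_recovery"]
    db = dr["database"]
    if db["primary_region"] == db["dr_replica_region"]:
        problems.append("DR replica must be in a different region than the primary")
    if db["replication"] == "async" and dr["rpo_target_seconds"] <= 0:
        problems.append("async replication cannot guarantee a zero RPO")
    if dr["dns"]["primary"] == dr["dns"]["secondary"]:
        problems.append("DNS failover target must differ from the primary")
    if not dr["backups"].get("restore_test_schedule"):
        problems.append("backups have no restore test schedule")
    return problems


print(validate_dr_config(CONFIG))  # []
```

Running checks like these in CI is one way to catch config drift before a drill does.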

    State RPO/RTO aloud in interviews. Match spend to tier — not everything needs active-active.

    Real-world use

    Real-world use cases

    • Banking core: active-passive across regions with near-zero RPO, paid for with the latency and cost of synchronous replication.
    • E-commerce: warm standby, where an RTO of 30 min is acceptable outside peak periods.
    • OTT streaming: catalog metadata served active-active across regions, with origin failover.
    • Healthcare: HIPAA requires documented, tested backup and restore procedures.

    Best practices

    • Automate failover where possible — manual steps fail at 3am.
    • Test restores — backup without restore test is wishful thinking.
    • Document RPO/RTO for every tier-1 service.
    • Deploy across multiple AZs as the minimum baseline to limit blast radius.
    • Encrypt backups and store them in an immutable backup vault to protect against ransomware.
    • Run game days with cross-team participation.
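
The "test restores" practice can be illustrated with a small self-contained example. SQLite stands in here for a production database purely for illustration; the `orders` table and the `backup_and_verify` helper are hypothetical. The idea is that a backup only counts once the restored copy has been opened and checked.

```python
import sqlite3


def backup_and_verify(source: sqlite3.Connection) -> bool:
    """Copy the database, then verify the copy by actually reading it."""
    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # sqlite3's built-in online backup API
    integrity = restored.execute("PRAGMA integrity_check").fetchone()[0]
    src_rows = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    dst_rows = restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    restored.close()
    return integrity == "ok" and src_rows == dst_rows


# Tiny example database with made-up rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,), (3.25,)])
db.commit()

print(backup_and_verify(db))  # True
```

A real restore test would also time the restore, since that duration feeds directly into the RTO.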

    Common mistakes

    • The DR region is never tested, so failover breaks on subtle config drift.
    • DNS TTLs set to hours, which slows the traffic shift during failover.
    • Assuming that multiple AZs provide the same redundancy as multiple regions.
    • Running active-active without a conflict resolution design.
    • No runbook owner, so contacts and steps go stale.
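
The DNS TTL mistake is easy to quantify. As a rough sketch, the worst-case time before clients reach the secondary is the failure detection time plus the record TTL. The 30 s check interval and failure threshold of 3 are assumed inputs, and the model ignores resolvers or clients that cache beyond the TTL.

```python
def worst_case_shift_seconds(check_interval_s: int, failure_threshold: int,
                             dns_ttl_s: int) -> int:
    """Upper bound on the time before clients resolve to the secondary.

    Detection takes `failure_threshold` consecutive failed checks; after the
    record flips, cached answers can live for up to one full TTL.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s


print(worst_case_shift_seconds(30, 3, dns_ttl_s=60))    # 150
print(worst_case_shift_seconds(30, 3, dns_ttl_s=3600))  # 3690
```

Under these assumptions a one-hour TTL alone exceeds a 15-minute RTO, while a 60 s TTL leaves most of the budget for database promotion and smoke tests.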

    Advanced interview questions

    Q1 (Beginner): RPO vs RTO?
    RPO is the maximum acceptable window of data loss; RTO is the maximum acceptable downtime before service is restored.
    Q2 (Beginner): Multi-AZ vs multi-region?
    Multi-AZ survives the failure of one AZ within a region; multi-region survives the loss of an entire region.
    Q3 (Intermediate): Warm standby vs pilot light?
    Pilot light keeps only a minimal core running in the DR region; warm standby keeps a scaled-down subset running that is ready faster, at higher cost.
    Q4 (Intermediate): How do you measure replication lag for RPO?
    Monitor DB replica lag in seconds and alert when it exceeds the RPO budget.
    Q5 (Advanced): Design DR for a global payment API with a 99.99% SLA.
    Run active-active in two regions. Replicate ledger shards synchronously with quorum writes and everything else asynchronously. Use Route 53 latency-based routing with failover. Target RPO 0 for money movement and RPO 60 s for analytics. Run a quarterly game day and keep immutable backups.
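
For Q5 it helps to translate the SLA into a downtime budget. The arithmetic below is a sketch; the 365-day year and 30-day month are assumed measurement windows, and real SLAs define their own.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of allowed downtime in the period for a given availability SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


print(round(downtime_budget_minutes(99.99, 365), 2))  # 52.56
print(round(downtime_budget_minutes(99.99, 30), 2))   # 4.32
```

With roughly 4.3 minutes per 30-day window, a single failover with a 15-minute RTO would already exceed the budget, which is one reason the answer reaches for active-active instead of warm standby.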

    Summary

    DR plans define RPO/RTO and failover architecture. Multi-AZ for AZ failure; multi-region for geographic disaster. Test backup restore and failover drills regularly. Active-active costs more — reserve for highest SLA tiers. Runbooks and automation reduce RTO during stress. Case studies apply full HLD toolkit to real products.
