High-Level Design Tutorial: Lesson 4

    Capacity Estimation


    Introduction

    Capacity estimation (back-of-the-envelope math) turns vague scale statements into concrete resource needs. Interviewers expect you to estimate daily active users (DAU), queries per second (QPS), storage growth, and bandwidth — then use those numbers to justify databases, caches, and shard counts.

    You do not need a calculator or exact figures. Order-of-magnitude correctness matters: knowing you need terabytes vs petabytes, thousands vs millions of QPS, separates senior candidates from those who pick Cassandra because it sounds scalable.

    This lesson walks through standard formulas, sensible assumptions, and how to connect estimates to architectural decisions in HLD interviews.

    Understanding the topic

    Key concepts

    • DAU × actions per user per day ÷ 86,400 (seconds per day) ≈ average QPS; multiply by a peak factor (2–3×) to get peak QPS.
    • Storage = records per day × record size × retention period; account for indexes (often 2–3× raw data).
    • Bandwidth = QPS × average response payload; a CDN reduces origin bandwidth for static assets.
    • Cache memory: size the cache for the hottest ~20% of data if access is Zipf-like (a few keys receive most reads).
    • Round aggressively and lean on power-of-two shortcuts (2^10 ≈ 10^3, 2^20 ≈ 10^6): 37,842 → ~40k.
    • Estimates expose bottlenecks: a single PostgreSQL primary handling ~10k simple writes/s is a useful planning anchor.
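
The formulas in the list above can be sketched as a handful of small helpers. This is a minimal illustration, not a standard library: the class name `EnvelopeMath`, its method names, and every input in `main` (10M DAU, 20 actions per user, 1 KB records, 2 KB responses, 2× index overhead) are assumptions chosen only to make the arithmetic concrete.

```java
final class EnvelopeMath {
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Average QPS = DAU x actions per user per day / 86,400. */
    static double averageQps(long dau, double actionsPerUserPerDay) {
        return dau * actionsPerUserPerDay / SECONDS_PER_DAY;
    }

    /** Peak QPS = average QPS x peak factor (2-3x for typical consumer traffic). */
    static double peakQps(double averageQps, double peakFactor) {
        return averageQps * peakFactor;
    }

    /** Storage = records per day x bytes per record x retention days x index overhead. */
    static double storageBytes(double recordsPerDay, double bytesPerRecord,
                               int retentionDays, double indexFactor) {
        return recordsPerDay * bytesPerRecord * retentionDays * indexFactor;
    }

    /** Bandwidth in bytes per second = QPS x average response payload. */
    static double bandwidthBytesPerSec(double qps, double payloadBytes) {
        return qps * payloadBytes;
    }

    /** Cache memory = hot fraction (for example 0.2) x dataset size. */
    static double cacheBytes(double datasetBytes, double hotFraction) {
        return datasetBytes * hotFraction;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, stated aloud as assumptions.
        double avg = averageQps(10_000_000L, 20);                 // about 2.3k
        double peak = peakQps(avg, 3.0);                          // about 7k
        double storage = storageBytes(20_000_000, 1_000, 365, 2); // about 14.6 TB
        double egress = bandwidthBytesPerSec(peak, 2_000);        // about 14 MB/s
        double cache = cacheBytes(storage, 0.2);                  // about 2.9 TB

        System.out.printf("avg QPS ~ %.0f, peak QPS ~ %.0f%n", avg, peak);
        System.out.printf("storage ~ %.1f TB, cache ~ %.1f TB%n", storage / 1e12, cache / 1e12);
        System.out.printf("egress ~ %.1f MB/s%n", egress / 1e6);
    }
}
```

Keeping each formula in its own method mirrors how the estimate is presented on a whiteboard: one line per quantity, with the inputs visible.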

    Internal architecture

    Architecture overview

    text
    flowchart LR
    DAU --> QPS
    QPS --> Storage
    Storage --> Bandwidth

    Step-by-step explanation

    1. Start from users: total users, DAU percentage, geographic distribution.
    2. Derive operations: reads/writes per session, peak hours, seasonal spikes.
    3. Compute QPS (peak), storage (5-year), bandwidth (egress + ingress for uploads).
    4. Compare against single-node limits; introduce sharding, cache, CDN where exceeded.
    5. Sanity-check the rough cost ($/GB/month, $/vCPU) if the interviewer cares about budget.
    6. Document formulas on the whiteboard — interviewers follow the math.
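
Step 4 above (comparing an estimate against a single-node limit) can be sketched as follows. This is an illustrative helper under stated assumptions: `ShardPlanner` and `nodesNeeded` are hypothetical names, the 10k writes/s per-node limit is the planning anchor quoted in this lesson, and the 30% headroom is the low end of the 30–50% range suggested under best practices.

```java
final class ShardPlanner {
    /**
     * Number of nodes needed so that each node runs at no more than
     * (1 - headroom) of its per-node limit at peak.
     */
    static long nodesNeeded(double peakQps, double perNodeLimitQps, double headroom) {
        double usablePerNode = perNodeLimitQps * (1.0 - headroom);
        return Math.max(1L, (long) Math.ceil(peakQps / usablePerNode));
    }

    public static void main(String[] args) {
        double perNodeLimit = 10_000; // planning anchor: simple writes/s on one primary
        double headroom = 0.30;       // assumed reserve for failures and deploys

        // 3,500 peak writes/s fits on one node with room to spare.
        System.out.println("3.5k writes/s -> nodes: " + nodesNeeded(3_500, perNodeLimit, headroom));
        // 50,000 peak writes/s exceeds one node, so the write path needs sharding.
        System.out.println("50k writes/s  -> nodes: " + nodesNeeded(50_000, perNodeLimit, headroom));
    }
}
```

A result of 1 means the single-node design holds; anything larger is the cue to introduce sharding and to discuss how keys are distributed.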

    Informative example

    URL shortener estimation — a classic interview exercise showing read-heavy QPS and storage math:

    java
    public final class CapacityEstimate {
        // Assumptions stated aloud in the interview
        static final long DAU = 100_000_000L;
        static final int SHORTENS_PER_USER_PER_DAY = 1;
        static final int REDIRECTS_PER_USER_PER_DAY = 10;
        static final double PEAK_FACTOR = 3.0;

        // Average QPS = daily operations / 86,400 seconds; peak = average x peak factor
        static long peakQps(long dailyOps) {
            double avg = dailyOps / 86_400.0;
            return Math.round(avg * PEAK_FACTOR);
        }

        public static void main(String[] args) {
            long writesDay = DAU * SHORTENS_PER_USER_PER_DAY;
            long readsDay = DAU * REDIRECTS_PER_USER_PER_DAY;
            System.out.println("Write QPS peak ~ " + peakQps(writesDay)); // ~3.5k
            System.out.println("Read QPS peak ~ " + peakQps(readsDay)); // ~35k

            long bytesPerUrl = 500; // slug + long URL + metadata
            // Raw data only: no indexes or replicas included
            long storage5yr = writesDay * 365 * 5 * bytesPerUrl;
            System.out.println("Storage 5yr ~ " + storage5yr / 1_000_000_000_000L + " TB"); // ~91 TB
        }
    }

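    At ~35k peak read QPS, put a Redis cache and a CDN in front of redirects. The ~3.5k peak write QPS is within the ~10k writes/s anchor for one SQL primary, but roughly 91 TB of raw data over five years argues for a sharded SQL database or a Cassandra cluster. Always tie numbers to components.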
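
To size the cache and redirect traffic for the same URL shortener, a rough sketch follows. It reuses the example's inputs (1B redirects per day, 500 bytes per URL, 3× peak factor) and adds two assumptions of its own: the cache holds the hottest 20% of one day's redirects, treated as distinct URLs (an upper bound), and each redirect response is about the size of one stored record. `ShortenerSizing` and its methods are hypothetical names.

```java
final class ShortenerSizing {
    static final long READS_PER_DAY = 1_000_000_000L; // 100M DAU x 10 redirects
    static final long BYTES_PER_URL = 500;            // slug + long URL + metadata
    static final double PEAK_FACTOR = 3.0;

    /** Cache memory for the hot fraction of one day's redirects (assumed distinct URLs). */
    static long cacheBytes(double hotFraction) {
        return Math.round(READS_PER_DAY * hotFraction * BYTES_PER_URL);
    }

    /** Peak redirect traffic in bytes per second, assuming ~500-byte responses. */
    static long peakRedirectBytesPerSec() {
        long peakQps = Math.round(READS_PER_DAY / 86_400.0 * PEAK_FACTOR);
        return peakQps * BYTES_PER_URL;
    }

    public static void main(String[] args) {
        System.out.println("Cache ~ " + cacheBytes(0.2) / 1_000_000_000L + " GB");
        System.out.println("Peak redirect traffic ~ "
                + peakRedirectBytesPerSec() / 1_000_000L + " MB/s");
    }
}
```

Under these assumptions the cache is on the order of 100 GB, which fits a small Redis cluster, and redirect traffic is in the tens of MB/s, so request rate rather than bandwidth is the constraint here.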

    Real-world use

    Real-world use cases

    • Banking: transaction TPS limits drive mainframe vs distributed ledger choices.
    • Social: fan-out write amplification affects Kafka partition sizing.
    • Video OTT: egress bandwidth dominates cost — CDN and adaptive bitrate mandatory.
    • Food delivery: peak lunch window QPS drives auto-scaling policies.

    Best practices

    • State every assumption; interviewers often adjust DAU to test adaptability.
    • Separate read and write QPS — they scale differently.
    • Include growth (YoY) for storage, not just launch-day snapshot.
    • Use industry anchors: 1M QPS is huge; 100 QPS fits one modest server.
    • Round up to leave 30–50% headroom for failures and deploys.
    • Connect cache size to working set, not total dataset.
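
The growth point above can be made concrete with a short sketch that compares a flat five-year estimate against one with year-over-year growth. The 18.25 TB first-year figure comes from the URL shortener inputs (100M writes per day × 365 × 500 bytes); the 20% growth rate and the names `GrowthEstimate` and `cumulativeStorageTb` are assumptions for illustration.

```java
final class GrowthEstimate {
    /**
     * Total storage after the given number of years when each year's new data
     * is (1 + yoyGrowth) times the previous year's.
     */
    static double cumulativeStorageTb(double firstYearTb, double yoyGrowth, int years) {
        double total = 0.0;
        double yearly = firstYearTb;
        for (int y = 0; y < years; y++) {
            total += yearly;
            yearly *= 1.0 + yoyGrowth;
        }
        return total;
    }

    public static void main(String[] args) {
        double firstYearTb = 18.25; // 100M writes/day x 365 days x 500 bytes
        System.out.printf("Flat 5yr       ~ %.0f TB%n", cumulativeStorageTb(firstYearTb, 0.0, 5));
        System.out.printf("20%% YoY growth ~ %.0f TB%n", cumulativeStorageTb(firstYearTb, 0.2, 5));
    }
}
```

With these inputs, growth adds roughly half again to the flat figure, which is enough to change a shard count.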

    Common mistakes

    • Using total users instead of DAU for traffic math.
    • Forgetting the peak factor: sizing for average QPS underestimates peak load by 2–3×.
    • Ignoring metadata, indexes, and replicas in storage estimates.
    • Assuming infinite horizontal scale without discussing hot keys.
    • False precision (37,842.117 QPS) without explaining inputs.
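
One way to avoid false precision is to round every intermediate result to a single significant figure before saying it aloud. The helper below is a minimal sketch of that habit; `Rounding` and `toOneSigFig` are hypothetical names, and the method handles only non-negative values.

```java
final class Rounding {
    /** Rounds a non-negative value to one significant figure, e.g. 37,842 -> 40,000. */
    static long toOneSigFig(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        if (value == 0) {
            return 0;
        }
        // Find the power of ten that matches the leading digit's position.
        long magnitude = 1;
        while (value / magnitude >= 10) {
            magnitude *= 10;
        }
        return Math.round((double) value / magnitude) * magnitude;
    }

    public static void main(String[] args) {
        System.out.println(toOneSigFig(37_842)); // 40000
        System.out.println(toOneSigFig(86_400)); // 90000
    }
}
```

One significant figure is deliberately coarse; keep a second digit where it changes a decision, such as 3.5k versus 3k writes per second.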

    Advanced interview questions

    Q1 (Beginner): Why do capacity estimates matter in HLD?
    They justify component choices (cache, sharding, CDN) and show you can connect users to infrastructure.
    Q2 (Beginner): How do you convert DAU to QPS?
    DAU × ops per user per day ÷ 86,400 for average; multiply by the peak factor for peak QPS.
    Q3 (Intermediate): Estimate storage for 5 years of tweets.
    DAU × tweets/day × bytes/tweet × 365 × 5 × replication/index factor; state separate assumptions for media and text.
    Q4 (Intermediate): What peak factor do you use?
    Typically 2–3× average for consumer apps; higher for flash sales or live events. Always say it depends on traffic shape.
    Q5 (Advanced): Design a capacity plan for 1B daily notifications.
    Compute write QPS to the notification queue, fan-out to channels, storage for delivery logs, retry DLQ size, and regional partition counts with hot-tenant isolation.
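
The Q3 formula can be worked through with placeholder numbers. Every input here is an assumption made for illustration, not a real platform figure: 200M DAU, 2 tweets per user per day, 300 bytes per text-only tweet, and a combined replication/index factor of 3. `TweetStorage` and `fiveYearBytes` are hypothetical names.

```java
final class TweetStorage {
    /** DAU x tweets/day x bytes/tweet x 365 x 5 x replication/index factor. */
    static long fiveYearBytes(long dau, int tweetsPerUserPerDay,
                              int bytesPerTweet, int overheadFactor) {
        return dau * tweetsPerUserPerDay * bytesPerTweet * 365L * 5L * overheadFactor;
    }

    public static void main(String[] args) {
        // Text only; media would be estimated separately with its own byte size.
        long bytes = fiveYearBytes(200_000_000L, 2, 300, 3);
        System.out.println("Tweets 5yr ~ " + bytes / 1_000_000_000_000L + " TB");
    }
}
```

Under these assumptions text alone lands in the hundreds of terabytes, so the answer to state is "under a petabyte for text, with media estimated separately".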

    Summary

    Back-of-envelope math validates architecture before deep dives. DAU → daily ops → QPS with peak factor is the core flow. Storage and bandwidth estimates prevent surprise bottlenecks. Round aggressively; order of magnitude beats false precision. Always link estimates to specific components (cache, shards, CDN). Practice three classic problems: Twitter, URL shortener, YouTube.
