Capacity Estimation
Capacity estimation (back-of-the-envelope math) turns vague scale statements into concrete resource needs.
Introduction
Capacity estimation (back-of-the-envelope math) turns vague scale statements into concrete resource needs. Interviewers expect you to estimate daily active users (DAU), queries per second (QPS), storage growth, and bandwidth — then use those numbers to justify databases, caches, and shard counts.
You do not need a calculator or exact figures. Order-of-magnitude correctness matters: knowing you need terabytes vs petabytes, thousands vs millions of QPS, separates senior candidates from those who pick Cassandra because it sounds scalable.
This lesson walks through standard formulas, sensible assumptions, and how to connect estimates to architectural decisions in HLD interviews.
Understanding the topic
Key concepts
- DAU × actions per user per day ÷ 86400 ≈ average QPS; multiply by peak factor (2–3×).
- Storage = records × record size × retention; account for indexes (often 2–3× raw data).
- Bandwidth = QPS × average response payload; CDN reduces origin bandwidth for static assets.
- Memory for cache: cache hottest 20% of data if Zipf-like access applies.
- Use powers of two and round aggressively — 37,842 → ~40k.
- Estimates validate bottlenecks: single PostgreSQL primary ~10k simple writes/s is a planning anchor.
flowchart LRDAU --> QPSQPS --> StorageStorage --> Bandwidth
Internal architecture
Architecture overview
flowchart LRDAU --> QPSQPS --> StorageStorage --> Bandwidth
Step-by-step explanation
- Start from users: total users, DAU percentage, geographic distribution.
- Derive operations: reads/writes per session, peak hours, seasonal spikes.
- Compute QPS (peak), storage (5-year), bandwidth (egress + ingress for uploads).
- Compare against single-node limits; introduce sharding, cache, CDN where exceeded.
- Sanity-check: cost rough order ($/GB/month, $/vCPU) if interviewer cares about budget.
- Document formulas on the whiteboard — interviewers follow the math.
Informative example
URL shortener estimation — a classic interview exercise showing read-heavy QPS and storage math:
public final class CapacityEstimate {// Assumptions stated aloud in interviewstatic final long DAU = 100_000_000L;static final int SHORTENS_PER_USER_PER_DAY = 1;static final int REDIRECTS_PER_USER_PER_DAY = 10;static final double PEAK_FACTOR = 3.0;static long peakQps(long dailyOps) {double avg = dailyOps / 86_400.0;return Math.round(avg * PEAK_FACTOR);}public static void main(String[] args) {long writesDay = DAU * SHORTENS_PER_USER_PER_DAY;long readsDay = DAU * REDIRECTS_PER_USER_PER_DAY;System.out.println("Write QPS peak ~ " + peakQps(writesDay)); // ~3.5kSystem.out.println("Read QPS peak ~ " + peakQps(readsDay)); // ~35klong bytesPerUrl = 500; // slug + long URL + metadatalong storage5yr = writesDay * 365 * 5 * bytesPerUrl;System.out.println("Storage 5yr ~ " + storage5yr / 1_000_000_000_000L + " TB");}}
35k read QPS → Redis cache + CDN for redirects; ~3.5k writes → single sharded SQL or Cassandra cluster. Always tie numbers to components.
Real-world use
Real-world use cases
- Banking: transaction TPS limits drive mainframe vs distributed ledger choices.
- Social: fan-out write amplification affects Kafka partition sizing.
- Video OTT: egress bandwidth dominates cost — CDN and adaptive bitrate mandatory.
- Food delivery: peak lunch window QPS drives auto-scaling policies.
Best practices
- State every assumption; interviewers often adjust DAU to test adaptability.
- Separate read and write QPS — they scale differently.
- Include growth (YoY) for storage, not just launch-day snapshot.
- Use industry anchors: 1M QPS is huge; 100 QPS fits one modest server.
- Round up for headroom (30–50%) for failures and deploys.
- Connect cache size to working set, not total dataset.
Common mistakes
- Using total users instead of DAU for traffic math.
- Forgetting peak factor — average QPS underestimates infra by 3×.
- Ignoring metadata, indexes, and replicas in storage estimates.
- Assuming infinite horizontal scale without discussing hot keys.
- Precise false precision (37,842.117 QPS) without explaining inputs.
Advanced interview questions
Q1BeginnerWhy do capacity estimates matter in HLD?
Q2BeginnerHow convert DAU to QPS?
Q3IntermediateEstimate storage for 5 years of tweets.
Q4IntermediateWhat peak factor do you use?
Q5AdvancedDesign capacity plan for 1B daily notifications.
Summary
Back-of-envelope math validates architecture before deep dives. DAU → daily ops → QPS with peak factor is the core flow. Storage and bandwidth estimates prevent surprise bottlenecks. Round aggressively; order of magnitude beats false precision. Always link estimates to specific components (cache, shards, CDN). Practice three classic problems: Twitter, URL shortener, YouTube.