How do you estimate QPS in system design interviews?

Multiply DAU by daily actions per user, divide by 86400 for average QPS, then apply a 2–3× peak multiplier.

High-Level Design Tutorial · Lesson 4 · ~6 min read · “Scalable systems, HLD interviews & case studies”

Capacity Estimation

Capacity estimation (back-of-the-envelope math) turns vague scale statements into concrete resource needs.

Introduction

Capacity estimation (back-of-the-envelope math) turns vague scale statements into concrete resource needs. Interviewers expect you to estimate daily active users (DAU), queries per second (QPS), storage growth, and bandwidth — then use those numbers to justify databases, caches, and shard counts.

You do not need a calculator or exact figures. Order-of-magnitude correctness matters: knowing you need terabytes vs petabytes, thousands vs millions of QPS, separates senior candidates from those who pick Cassandra because it sounds scalable.

This lesson walks through standard formulas, sensible assumptions, and how to connect estimates to architectural decisions in HLD interviews.

Understanding the topic

Key concepts

DAU × actions per user per day ÷ 86400 ≈ average QPS; multiply by peak factor (2–3×).
Storage = records per day × record size × retention (in days); account for indexes (often 2–3× raw data).
Bandwidth = QPS × average response payload; CDN reduces origin bandwidth for static assets.
Memory for cache: cache hottest 20% of data if Zipf-like access applies.
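Round aggressively to convenient powers of ten (and use powers of two for memory sizes): 37,842 → ~40k.
Estimates validate bottlenecks: a single PostgreSQL primary handling roughly 10k simple writes/s is a useful planning anchor, though real throughput depends on hardware, schema, and durability settings.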
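
The formulas in this list can be written down as a handful of small functions. The following is a minimal sketch, not a definitive implementation: the class name `CapacityFormulas`, the method names, and the sample inputs in `main` (10M DAU, 20 reads per user per day, 2 KB responses) are illustrative assumptions that do not come from the lesson.

```java
final class CapacityFormulas {
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Average QPS = DAU x actions per user per day / 86,400. */
    static double averageQps(long dau, double actionsPerUserPerDay) {
        return dau * actionsPerUserPerDay / SECONDS_PER_DAY;
    }

    /** Peak QPS = average QPS x peak factor (the lesson suggests 2-3x). */
    static double peakQps(double averageQps, double peakFactor) {
        return averageQps * peakFactor;
    }

    /** Storage in bytes = records per day x record size x retention days x index factor. */
    static double storageBytes(long recordsPerDay, long bytesPerRecord,
                               int retentionDays, double indexFactor) {
        return (double) recordsPerDay * bytesPerRecord * retentionDays * indexFactor;
    }

    /** Bandwidth in bytes per second = QPS x average response payload. */
    static double bandwidthBytesPerSecond(double qps, long avgResponseBytes) {
        return qps * avgResponseBytes;
    }

    /** Cache size in bytes when only the hottest fraction (e.g. 0.2) is kept in memory. */
    static double cacheBytes(double datasetBytes, double hotFraction) {
        return datasetBytes * hotFraction;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, chosen only to exercise the formulas.
        double avg = averageQps(10_000_000L, 20);
        double peak = peakQps(avg, 3.0);
        System.out.printf("avg QPS ~ %.0f, peak QPS ~ %.0f%n", avg, peak);
        System.out.printf("peak egress ~ %.1f MB/s%n",
                bandwidthBytesPerSecond(peak, 2_000) / 1e6);
    }
}
```

Keeping each formula in its own function mirrors how the math is usually spoken aloud in an interview: one assumption, one multiplication, one rounded result.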

Internal architecture

Architecture overview

text

flowchart LR
  DAU --> QPS
  QPS --> Storage
  Storage --> Bandwidth

Step-by-step explanation

Start from users: total users, DAU percentage, geographic distribution.
Derive operations: reads/writes per session, peak hours, seasonal spikes.
Compute QPS (peak), storage (5-year), bandwidth (egress + ingress for uploads).
Compare against single-node limits; introduce sharding, cache, CDN where exceeded.
Sanity-check: cost rough order ($/GB/month, $/vCPU) if interviewer cares about budget.
Document formulas on the whiteboard — interviewers follow the math.
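
The steps above can be sketched as one pass from users to a node count. This is a rough illustration under stated assumptions: the class `ShardPlan`, the 500M total users with 20% daily activity, the 5 writes per user per day, and the way headroom is modeled (reserving a fraction of each node's capacity) are all hypothetical. The 10k writes/s per-node limit reuses the PostgreSQL planning anchor from the key concepts.

```java
final class ShardPlan {
    /** Step 1: DAU from total users and the fraction active on a given day. */
    static long dau(long totalUsers, double dailyActiveFraction) {
        return Math.round(totalUsers * dailyActiveFraction);
    }

    /** Steps 2-3: daily operations converted to peak QPS. */
    static double peakQps(long dau, double opsPerUserPerDay, double peakFactor) {
        return dau * opsPerUserPerDay / 86_400.0 * peakFactor;
    }

    /** Step 4: nodes needed so each node stays under its limit after reserving headroom. */
    static int nodesNeeded(double peakQps, double perNodeLimitQps, double headroomFraction) {
        double usablePerNode = perNodeLimitQps * (1.0 - headroomFraction);
        return (int) Math.max(1, Math.ceil(peakQps / usablePerNode));
    }

    public static void main(String[] args) {
        long dau = dau(500_000_000L, 0.2);            // hypothetical: 100M DAU
        double peakWrites = peakQps(dau, 5, 3.0);     // hypothetical: 5 writes per user per day
        int writeNodes = nodesNeeded(peakWrites, 10_000, 0.3);
        System.out.printf("DAU ~ %d, peak write QPS ~ %.0f, write nodes ~ %d%n",
                dau, peakWrites, writeNodes);
    }
}
```

With these inputs the peak write load is about 17k QPS, which exceeds one node's usable 7k writes/s, so the sketch arrives at three write shards. Changing any single assumption and rerunning the chain is the same exercise an interviewer sets when they adjust DAU mid-discussion.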

Informative example

URL shortener estimation — a classic interview exercise showing read-heavy QPS and storage math:

java

public final class CapacityEstimate {
    // Assumptions stated aloud in interview
    static final long DAU = 100_000_000L;
    static final int SHORTENS_PER_USER_PER_DAY = 1;
    static final int REDIRECTS_PER_USER_PER_DAY = 10;
    static final double PEAK_FACTOR = 3.0;

    static long peakQps(long dailyOps) {
        double avg = dailyOps / 86_400.0;
        return Math.round(avg * PEAK_FACTOR);
    }

    public static void main(String[] args) {
        long writesDay = DAU * SHORTENS_PER_USER_PER_DAY;
        long readsDay = DAU * REDIRECTS_PER_USER_PER_DAY;
        System.out.println("Write QPS peak ~ " + peakQps(writesDay));   // ~3.5k
        System.out.println("Read QPS peak ~ " + peakQps(readsDay));     // ~35k
        long bytesPerUrl = 500; // slug + long URL + metadata
        long storage5yr = writesDay * 365 * 5 * bytesPerUrl; // raw data only, before indexes and replicas
        System.out.println("Storage 5yr ~ " + storage5yr / 1_000_000_000_000L + " TB"); // ~91 TB
    }
}

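About 35k peak read QPS calls for a Redis cache plus a CDN for redirects; about 3.5k peak write QPS fits a sharded SQL database or a Cassandra cluster, and the roughly 91 TB of raw data over five years is more than a single node should hold. Always tie numbers to components.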
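
To show how the read estimate feeds the cache decision, the sketch below computes how much read traffic still reaches the database and how much redirect egress the service carries. The 90% cache hit ratio is a hypothetical figure that the lesson does not state, and the class name `UrlShortenerFollowUp` is illustrative. The 35k peak read QPS and the 500-byte record size come from the example above.

```java
final class UrlShortenerFollowUp {
    /** Read QPS that misses the cache and reaches the database. */
    static double originReadQps(double peakReadQps, double cacheHitRatio) {
        return peakReadQps * (1.0 - cacheHitRatio);
    }

    /** Egress in MB/s = QPS x bytes per response / 1,000,000. */
    static double egressMegabytesPerSecond(double qps, long bytesPerResponse) {
        return qps * bytesPerResponse / 1_000_000.0;
    }

    public static void main(String[] args) {
        double peakReadQps = 35_000;   // from the URL shortener estimate
        double hitRatio = 0.9;         // hypothetical cache hit ratio
        System.out.printf("DB read QPS after cache ~ %.0f%n",
                originReadQps(peakReadQps, hitRatio));
        System.out.printf("Redirect egress ~ %.1f MB/s%n",
                egressMegabytesPerSecond(peakReadQps, 500));
    }
}
```

Under that assumed hit ratio, the database sees about 3.5k read QPS instead of 35k, which is the kind of before-and-after number that justifies adding the cache.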

Real-world use

Real-world use cases

Banking: transaction TPS limits drive mainframe vs distributed ledger choices.
Social: fan-out write amplification affects Kafka partition sizing.
Video OTT: egress bandwidth dominates cost — CDN and adaptive bitrate mandatory.
Food delivery: peak lunch window QPS drives auto-scaling policies.

Best practices

State every assumption; interviewers often adjust DAU to test adaptability.
Separate read and write QPS — they scale differently.
Include growth (YoY) for storage, not just launch-day snapshot.
Use industry anchors: 1M QPS is huge; 100 QPS fits one modest server.
Add 30–50% headroom on top of peak estimates for failures and deploys.
Connect cache size to working set, not total dataset.
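
Two of these practices, growth and headroom, change the arithmetic enough to be worth sketching. The example below assumes a hypothetical 30% year-over-year growth rate and compounds it once per year. The 50 GB/day starting volume matches the URL shortener example (100M writes per day at 500 bytes each); the class and method names are illustrative.

```java
final class GrowthEstimate {
    /** Total bytes written over the given years, with daily volume growing once per year. */
    static double storageWithGrowth(double bytesPerDayYearOne, double yoyGrowth, int years) {
        double total = 0;
        double perDay = bytesPerDayYearOne;
        for (int year = 0; year < years; year++) {
            total += perDay * 365;
            perDay *= 1.0 + yoyGrowth;   // next year's daily volume
        }
        return total;
    }

    /** Adds a headroom fraction (e.g. 0.3 to 0.5) on top of an estimate. */
    static double withHeadroom(double value, double headroomFraction) {
        return value * (1.0 + headroomFraction);
    }

    public static void main(String[] args) {
        double bytesPerDay = 50e9;   // 100M URLs per day x 500 bytes
        double flat = storageWithGrowth(bytesPerDay, 0.0, 5);
        double growing = storageWithGrowth(bytesPerDay, 0.3, 5);   // hypothetical 30% YoY
        System.out.printf("5yr flat ~ %.0f TB, with growth ~ %.0f TB%n",
                flat / 1e12, growing / 1e12);
        System.out.printf("with 30%% headroom ~ %.0f TB%n", withHeadroom(growing, 0.3) / 1e12);
    }
}
```

With these inputs the five-year total rises from about 91 TB flat to about 165 TB with growth, which is why a launch-day snapshot tends to understate storage.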

Common mistakes

Using total users instead of DAU for traffic math.
Forgetting the peak factor: average QPS underestimates peak load by 2–3×.
Ignoring metadata, indexes, and replicas in storage estimates.
Assuming infinite horizontal scale without discussing hot keys.
False precision (37,842.117 QPS) without explaining inputs.
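
One way to avoid false precision is to round every intermediate result to one significant figure before saying it aloud. The helper below is a small sketch of that habit; the class and method names are illustrative and not part of the lesson.

```java
final class Rounding {
    /** Rounds a positive value to one significant figure, e.g. 37,842.117 -> 40,000. */
    static long roundToOneSignificantFigure(double value) {
        if (value <= 0) {
            return 0;
        }
        // Largest power of ten that does not exceed the value.
        double magnitude = Math.pow(10, Math.floor(Math.log10(value)));
        return Math.round(value / magnitude) * (long) magnitude;
    }

    public static void main(String[] args) {
        System.out.println(roundToOneSignificantFigure(37_842.117));   // 40000
    }
}
```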

Advanced interview questions

Q1 (Beginner): Why do capacity estimates matter in HLD?

They justify component choices — cache, sharding, CDN — and show you can connect users to infrastructure.

Q2 (Beginner): How do you convert DAU to QPS?

DAU × ops per user per day ÷ 86400 for average; multiply by peak factor for peak QPS.

Q3 (Intermediate): Estimate storage for 5 years of tweets.

DAU × tweets/day × bytes/tweet × 365 × 5 × replication/index factor — state assumptions for media vs text.
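
A worked version of this answer might look like the sketch below. Every input is a hypothetical assumption to be stated aloud in the interview, not a real platform figure: 200M DAU, 2 tweets per user per day, 300 bytes per text-only tweet, and a combined replication and index factor of 3. Media would be estimated separately.

```java
final class TweetStorageEstimate {
    /** DAU x tweets/day x bytes/tweet x 365 x 5 x replication/index factor. */
    static double fiveYearBytes(long dau, double tweetsPerUserPerDay,
                                long bytesPerTweet, double replicationIndexFactor) {
        return dau * tweetsPerUserPerDay * bytesPerTweet * 365 * 5 * replicationIndexFactor;
    }

    public static void main(String[] args) {
        // All four inputs are hypothetical and text-only.
        double bytes = fiveYearBytes(200_000_000L, 2, 300, 3.0);
        System.out.printf("5-year tweet text storage ~ %.0f TB%n", bytes / 1e12);
    }
}
```

With these inputs the total is about 657 TB, so the rounded answer is "several hundred terabytes, under a petabyte" for text alone.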

Q4 (Intermediate): What peak factor do you use?

Typically 2–3× average for consumer apps; higher for flash sales or live events — always say it depends on traffic shape.

Q5 (Advanced): Design a capacity plan for 1B daily notifications.

Compute write QPS to notification queue, fan-out to channels, storage for delivery logs, retry DLQ size, and regional partition counts with hot-tenant isolation.
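
The first few numbers in this answer can be sketched as follows. Only the 1B notifications per day comes from the question. The peak factor of 3, the 1.5 channels per notification, the 200-byte delivery log entry, the 30-day log retention, and the 5k messages/s per queue partition are hypothetical assumptions, and the class name is illustrative.

```java
final class NotificationCapacity {
    /** Peak write QPS into the notification queue. */
    static double peakQps(long notificationsPerDay, double peakFactor) {
        return notificationsPerDay / 86_400.0 * peakFactor;
    }

    /** Queue partitions needed at an assumed per-partition throughput. */
    static int partitions(double peakQps, double perPartitionQps) {
        return (int) Math.ceil(peakQps / perPartitionQps);
    }

    /** Delivery-log bytes = notifications/day x channels x bytes per log x retention days. */
    static double logBytes(long notificationsPerDay, double channelsPerNotification,
                           long bytesPerLog, int retentionDays) {
        return notificationsPerDay * channelsPerNotification * bytesPerLog * retentionDays;
    }

    public static void main(String[] args) {
        long perDay = 1_000_000_000L;
        double peak = peakQps(perDay, 3.0);   // hypothetical peak factor
        System.out.printf("peak queue writes ~ %.0f QPS%n", peak);
        System.out.printf("peak channel deliveries ~ %.0f/s%n", peak * 1.5);
        System.out.printf("partitions ~ %d%n", partitions(peak, 5_000));
        System.out.printf("30-day delivery logs ~ %.0f TB%n",
                logBytes(perDay, 1.5, 200, 30) / 1e12);
    }
}
```

Under these assumptions the queue takes about 35k writes/s at peak, needs about 7 partitions, and accumulates about 9 TB of delivery logs per 30 days. Retry and dead-letter queue sizing and hot-tenant isolation would build on the same numbers.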

Summary

Back-of-envelope math validates architecture before deep dives. DAU → daily ops → QPS with peak factor is the core flow. Storage and bandwidth estimates prevent surprise bottlenecks. Round aggressively; order of magnitude beats false precision. Always link estimates to specific components (cache, shards, CDN). Practice three classic problems: Twitter, URL shortener, YouTube.
