High-Level Design Tutorial · Lesson 10

    Horizontal vs Vertical Scaling

    Introduction

    Scaling increases system capacity to meet load. Vertical scaling (scale up) adds CPU, RAM, or faster disks to existing machines. Horizontal scaling (scale out) adds more machines and distributes work across them. Most cloud-native HLD solutions scale out; vertical scaling remains useful for databases and quick wins.

    Interviewers expect you to pick the right lever per component: stateless API servers scale out easily, while relational primary databases scale up first and are sharded later. Understanding the limits of each approach prevents designs that assume infinite linear scale.

    This lesson compares trade-offs, auto-scaling patterns, and how scaling choices interact with load balancers and data tiers.

    Understanding the topic

    Key concepts

    • Vertical: simpler operations and no partitioning issues, but a hard ceiling (the largest available instance) and usually downtime during a resize.
    • Horizontal: near-linear capacity for stateless tiers and better fault tolerance, but it requires load distribution.
    • Stateless services: keep session state in a shared store such as Redis so that any instance can handle any request.
    • Stateful tiers: databases and Kafka brokers need careful horizontal strategies (sharding, replication factor).
    • Auto-scaling: the Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU, RPS, or custom metrics.
    • Diminishing returns: by Amdahl's law, the serial portion of the work caps the achievable speedup.
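The diminishing-returns point can be made concrete with Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that parallelizes and n is the number of nodes. The sketch below is only an illustration; the function name and the 90% parallel fraction are assumptions, not figures from this lesson.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the work (0.0 to 1.0) that can be spread out.
    workers: number of nodes the parallel share is divided across.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed workload: 90% of the work parallelizes, 10% stays serial.
    for nodes in (1, 2, 10, 100, 1000):
        print(f"{nodes:>5} nodes -> {amdahl_speedup(0.9, nodes):.2f}x")
```

With 10% serial work, the speedup approaches but never exceeds 10x, however many nodes are added.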

    Internal architecture

    Architecture overview

    text
    flowchart LR
      subgraph Vertical
        VM1[More CPU RAM]
      end
      subgraph Horizontal
        LB[Load balancer] --> VM2
        LB --> VM3
        LB --> VM4
      end

    Step-by-step explanation

    1. Scale the database vertically first, until metrics show CPU or I/O saturation near the largest available instance size.
    2. Add read replicas (horizontal read scaling) before sharding writes.
    3. Scale the API tier horizontally behind a load balancer (LB); run at least two instances in separate availability zones (AZs) for high availability.
    4. Put a connection pooler (for example PgBouncer) in front of the database when many app instances connect to it.
    5. Cache hot reads in a Redis cluster, sharded horizontally by key hash.
    6. Define scaling triggers: CPU above 70%, a p99 latency SLO breach, or a queue depth threshold.
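The triggers in step 6 can be expressed as a small check. This is a minimal sketch: the 70% CPU threshold comes from the step above, while the class names, the 300 ms SLO, and the queue depth of 1000 are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    cpu_utilization: float  # fraction of requested CPU, 0.0 to 1.0
    p99_latency_ms: float
    queue_depth: int


@dataclass
class Triggers:
    cpu_threshold: float = 0.70   # from step 6
    p99_slo_ms: float = 300.0     # hypothetical SLO
    max_queue_depth: int = 1000   # hypothetical threshold


def breached_triggers(metrics: Metrics, triggers: Triggers) -> List[str]:
    """Return the names of every trigger the current metrics breach."""
    reasons = []
    if metrics.cpu_utilization > triggers.cpu_threshold:
        reasons.append("cpu")
    if metrics.p99_latency_ms > triggers.p99_slo_ms:
        reasons.append("p99_latency")
    if metrics.queue_depth > triggers.max_queue_depth:
        reasons.append("queue_depth")
    return reasons


def should_scale_out(metrics: Metrics, triggers: Triggers) -> bool:
    """Scale out as soon as any single trigger is breached."""
    return bool(breached_triggers(metrics, triggers))
```

Returning the list of breached triggers, rather than only a boolean, makes it easier to log why a scale-out happened.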

    Informative example

    A Kubernetes HPA manifest that scales a Spring Boot deployment on CPU and a custom RPS metric:

    yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: catalog-api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: catalog-api
      minReplicas: 3
      maxReplicas: 40
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 65
        # Pods metrics need a custom metrics adapter (for example Prometheus Adapter).
        - type: Pods
          pods:
            metric:
              name: http_requests_per_second
            target:
              type: AverageValue
              averageValue: "800"
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60   # react quickly to rising load
        scaleDown:
          stabilizationWindowSeconds: 300  # wait 5 minutes before removing pods

    Address database scaling separately: an HPA on the app tier does not fix a saturated PostgreSQL primary. In an interview, mention read replicas and connection limits.
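To see how the two metrics in the manifest combine, the sketch below follows the replica formula given in the Kubernetes HPA documentation: desired = ceil(current replicas × current value / target value), with the largest proposal across metrics winning. It is a simplification; the function names are assumptions, and it ignores stabilization windows, pod readiness, and missing metrics.

```python
import math
from typing import List, Tuple


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count proposed by one metric."""
    ratio = current_value / target_value
    # Within the tolerance band (0.1 by default) no change is proposed.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_recommendation(current_replicas: int,
                       metrics: List[Tuple[float, float]],
                       min_replicas: int, max_replicas: int) -> int:
    """Take the largest per-metric proposal, clamped to the min/max bounds.

    metrics: list of (current_value, target_value) pairs.
    """
    proposals = [desired_replicas(current_replicas, current, target)
                 for current, target in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # Assumed readings: CPU at 80% (target 65), 1200 RPS per pod (target 800).
    print(hpa_recommendation(3, [(80, 65), (1200, 800)], 3, 40))
```

With 3 replicas and these assumed readings, CPU proposes 4 replicas and RPS proposes 5, so the sketch returns 5.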

    Real-world use

    Real-world use cases

    • OTT video origin: horizontal edge caches; vertical transcode workers with GPUs.
    • E-commerce flash sale: horizontal pod burst + queue absorption.
    • Banking batch jobs: vertical high-memory nodes for overnight reconciliation.
    • Social feed: horizontal stateless feed generators; sharded Cassandra for writes.

    Best practices

    • Make app tier stateless before scaling out.
    • Load test to find knee in curve — don't over-provision blindly.
    • Scale down slowly to avoid flapping during traffic dips.
    • Monitor DB connections per scale-out event.
    • Use multi-AZ horizontal spread for availability, not just capacity.
    • Document max shard/instance limits in capacity estimates.
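The "scale down slowly" practice can be sketched as a stabilization window: act on the highest replica recommendation seen within the window, so that a brief dip in traffic does not remove pods. The class below is loosely modeled on that idea and is not the Kubernetes implementation; its name and interface are assumptions.

```python
from collections import deque


class ScaleDownStabilizer:
    """Delay scale-down until lower recommendations persist for a full window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended_replicas)

    def stabilize(self, now: float, recommendation: int) -> int:
        """Record a recommendation and return the replica count to apply."""
        self._history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # The highest recent recommendation wins: scale-up is immediate,
        # scale-down waits until every higher recommendation has expired.
        return max(replicas for _, replicas in self._history)
```

With a 300-second window, a recommendation of 10 followed by several recommendations of 4 keeps the count at 10 until the original recommendation is more than 300 seconds old.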

    Common mistakes

    • Scaling out sticky-session apps without a shared session store.
    • Adding 100 app pods while the database remains a single small instance.
    • Ignoring cold start time: new pods respond slowly during scale-up if JVM warmup is heavy.
    • Assuming linear scale when all writes are synchronized through one primary database.
    • Vertically scaling a production database during peak traffic without a replica failover plan.
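The "100 app pods, one small database" mistake is easy to check with arithmetic: each app instance holds its own connection pool, so total connections grow with every scale-out. A minimal sketch, with hypothetical function names and numbers:

```python
def total_db_connections(app_instances: int, pool_size_per_instance: int) -> int:
    """Connections the app tier can open when every pool is full."""
    return app_instances * pool_size_per_instance


def max_app_instances(db_max_connections: int, pool_size_per_instance: int,
                      reserved_connections: int = 0) -> int:
    """Largest app instance count that fits within the database connection limit.

    reserved_connections: slots kept free for admin, replication, or monitoring.
    """
    if pool_size_per_instance < 1:
        raise ValueError("pool_size_per_instance must be at least 1")
    usable = db_max_connections - reserved_connections
    if usable <= 0:
        return 0
    return usable // pool_size_per_instance


if __name__ == "__main__":
    # Assumed setup: 100 pods, 10 connections per pod, database limit of 100.
    print(total_db_connections(100, 10))
    print(max_app_instances(100, 10, reserved_connections=10))
```

In this assumed setup, 100 pods would request 1000 connections, while only 9 pods fit within a limit of 100 with 10 connections reserved. This gap is what a connection pooler such as PgBouncer is meant to absorb.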

    Advanced interview questions

    Q1 (Beginner): What is the difference between horizontal and vertical scaling?
    Vertical scaling adds resources to one node; horizontal scaling adds more nodes to distribute load.
    Q2 (Beginner): Which tier is easiest to scale horizontally?
    Stateless application/API servers behind a load balancer.
    Q3 (Intermediate): Why do databases resist horizontal write scaling?
    Strong consistency and ordering through a single primary limit write parallelism, so scaling writes requires sharding.
    Q4 (Intermediate): What triggers auto-scaling in Kubernetes?
    Metrics such as CPU, memory, custom RPS, or queue depth, compared against the targets in the HPA spec.
    Q5 (Advanced): How would you design for a 10× traffic spike within 5 minutes?
    Pre-warmed minimum replicas, HPA on RPS, a Redis cache, a Kafka buffer, a CDN for static content, database read replicas, rate limits on non-critical paths, and validation through load testing.
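Q3 points to sharding as the way past a single primary. A minimal sketch of hash-based shard routing follows, assuming a fixed shard count and a hypothetical `shard_for_key` helper. It is a generic illustration, not the scheme of any particular database; Redis Cluster, for instance, maps keys to hash slots with its own algorithm.

```python
import hashlib


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) with a stable hash."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # hashlib gives the same result in every process, unlike Python's
    # built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for user_id in ("user:1", "user:2", "user:3"):
        print(user_id, "-> shard", shard_for_key(user_id, 4))
```

Plain modulo routing remaps most keys when the shard count changes, which is why production systems often use consistent hashing or fixed hash slots instead.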

    Summary

    Scale out stateless tiers; scale up or shard stateful data stores. Auto-scaling ties infrastructure to SLOs and cost control. Connection pools and caches protect databases during horizontal app growth. Multi-AZ horizontal deployment improves both availability and capacity. Capacity estimation tells you when to pull each lever. Load balancers, covered in the next lesson, distribute traffic across horizontally scaled app instances.
