High-Level Design Tutorial · Lesson 10 · “Scalable systems, HLD interviews & case studies”

Horizontal vs Vertical Scaling

Introduction

Scaling increases system capacity to meet load. Vertical scaling (scale up) adds CPU, RAM, or faster disks to existing machines. Horizontal scaling (scale out) adds more machines and distributes work across them. Most cloud-native HLD solutions scale out; vertical scaling remains useful for databases and quick wins.

Interviewers expect you to pick the right lever per component: stateless API servers scale out easily; relational primary databases usually scale up first, then add read replicas, then shard. Understanding the limits of each approach prevents designs that assume infinite linear scale.

This lesson compares trade-offs, auto-scaling patterns, and how scaling choices interact with load balancers and data tiers.

Understanding the topic

Key concepts

Vertical: simpler ops and no partitioning issues, but a hard ceiling (the largest available instance) and usually downtime or a failover during resize.
Horizontal: near-linear capacity for stateless tiers; requires load distribution; tolerates the loss of individual nodes.
Stateless services: keep session state in an external store such as Redis so that any instance can handle any request.
Stateful tiers: databases and Kafka brokers need careful horizontal strategies (sharding, replication factor).
Auto-scaling: the Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU, RPS, or custom metrics.
Diminishing returns: Amdahl's law says the serial portion of the work caps the achievable speedup.
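
The diminishing-returns point can be made concrete with a small calculation. This is a minimal sketch of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that parallelizes and n is the number of nodes. The function name and the 95% figure are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the work parallelizes across nodes.
    for n in (1, 2, 8, 32, 1024):
        print(f"{n:>5} nodes -> {amdahl_speedup(0.95, n):.2f}x")
```

With a 5% serial portion, the speedup can never exceed 1 / 0.05 = 20x, no matter how many nodes are added.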

Internal architecture

Architecture overview

text

flowchart LR
  subgraph Vertical
    VM1[More CPU RAM]
  end
  subgraph Horizontal
    LB[Load balancer] --> VM2
    LB --> VM3
    LB --> VM4
  end

Step-by-step explanation

Start by scaling the DB vertically until metrics show CPU or I/O saturation near the largest instance size.
Add read replicas (horizontal read scaling) before sharding writes.
Scale the API tier horizontally behind a load balancer, with at least two instances in different availability zones for HA.
Use a connection pooler (such as PgBouncer) when many app instances connect to the DB.
Cache hot reads in a Redis cluster (sharded horizontally by key hash).
Define scaling triggers: CPU > 70%, p99 latency SLO breach, queue depth threshold.
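
The caching step above shards by key hash. A minimal sketch of that idea follows, assuming a plain hash-modulo scheme. The function names are illustrative, and this is not how Redis Cluster itself places keys (it maps keys to 16384 fixed hash slots using CRC16).

```python
import hashlib


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's salted hash())."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys: list[str], old_count: int, new_count: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        shard_for_key(k, old_count) != shard_for_key(k, new_count) for k in keys
    )
    return moved / len(keys)


if __name__ == "__main__":
    sample_keys = [f"product:{i}" for i in range(10_000)]  # hypothetical keys
    print("shard for product:42 ->", shard_for_key("product:42", 4))
    print("moved when growing 4 -> 5 shards:", moved_fraction(sample_keys, 4, 5))
```

Plain modulo remaps most keys when the shard count changes, which is why production systems tend to use fixed hash slots or consistent hashing instead.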

Informative example

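A Kubernetes HPA manifest that scales a Spring Boot deployment on CPU utilization and a custom requests-per-second metric. The Pods metric assumes a metrics adapter (for example, Prometheus Adapter) exposes http_requests_per_second through the custom metrics API: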

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog-api
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "800"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
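
To see how the targets in the manifest turn into a replica count, here is a rough sketch of the calculation described in the Kubernetes HPA documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default), with the largest proposal across metrics winning. The function names and sample numbers are illustrative; the real controller also handles missing metrics, unready pods, and the behavior policies.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica proposal for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


def hpa_recommendation(current_replicas: int,
                       metrics: list[tuple[float, float]],
                       min_replicas: int, max_replicas: int) -> int:
    """Take the largest per-metric proposal, clamped to [min, max]."""
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # Hypothetical snapshot: 10 pods, 90% CPU vs a 65% target,
    # and 1200 RPS per pod vs an 800 RPS target.
    print(hpa_recommendation(10, [(90, 65), (1200, 800)], 3, 40))
```

In this snapshot the CPU metric proposes 14 pods and the RPS metric proposes 15, so the sketch recommends 15.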

Address DB scaling separately: an HPA on the app tier does not fix a saturated PostgreSQL primary. In an interview, mention read replicas and connection limits.
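
The connection-limit concern is simple arithmetic. The sketch below assumes each pod holds its own pool of 10 connections (HikariCP's default maximum pool size) and a PostgreSQL primary left at its default max_connections of 100. The function names and the number of reserved connections are hypothetical.

```python
def db_connections_needed(pods: int, pool_size_per_pod: int) -> int:
    """Worst-case connections opened against the primary by the app tier."""
    return pods * pool_size_per_pod


def max_pods_for_db(max_connections: int, reserved: int,
                    pool_size_per_pod: int) -> int:
    """Largest pod count the primary can accept without a pooler in between."""
    usable = max_connections - reserved
    return max(0, usable // pool_size_per_pod)


if __name__ == "__main__":
    # maxReplicas from the manifest above, 10 connections per pod.
    print("connections at max scale:", db_connections_needed(40, 10))
    # Keep 10 connections back for admin and replication (assumed figure).
    print("pods the DB can take directly:", max_pods_for_db(100, 10, 10))
```

At 40 pods the app tier could ask for 400 connections, four times what this primary accepts, which is the gap a pooler such as PgBouncer or a smaller per-pod pool has to close.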

Real-world use

Real-world use cases

OTT video origin: horizontal edge caches; vertical transcode workers with GPUs.
E-commerce flash sale: horizontal pod burst + queue absorption.
Banking batch jobs: vertical high-memory nodes for overnight reconciliation.
Social feed: horizontal stateless feed generators; sharded Cassandra for writes.

Best practices

Make app tier stateless before scaling out.
Load test to find the knee in the latency/throughput curve instead of over-provisioning blindly.
Scale down slowly to avoid flapping during traffic dips.
Monitor DB connection counts after each scale-out event.
Use multi-AZ horizontal spread for availability, not just capacity.
Document max shard/instance limits in capacity estimates.
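
"Scale down slowly" is what a scale-down stabilization window provides. Below is a minimal sketch of the idea, assuming the behavior described for the Kubernetes HPA: during the window, the highest recent recommendation wins, so a brief dip does not remove pods. The class name and timings are illustrative.

```python
from collections import deque


class ScaleDownStabilizer:
    """Holds replicas at the highest recommendation seen within the window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended_replicas)

    def stabilize(self, now: float, recommended: int) -> int:
        self._history.append((now, recommended))
        cutoff = now - self.window_seconds
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer(window_seconds=300)
    # Hypothetical timeline: a traffic dip at t=60s, sustained low load later.
    for t, rec in [(0, 20), (60, 8), (200, 10), (301, 8), (600, 8)]:
        print(f"t={t:>3}s recommended={rec:>2} applied={stabilizer.stabilize(t, rec)}")
```

Scale-up is unaffected in this sketch: a higher recommendation immediately becomes the maximum in the window.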

Common mistakes

Scaling out sticky-session apps without a shared session store.
Adding 100 app pods while the DB remains a single small instance.
Ignoring cold-start time: new pods respond slowly during scale-up when JVM warmup is heavy.
Assuming linear scale with synchronized writes to one primary DB.
Vertically scaling a production DB during peak traffic without a replica failover plan.
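
The first mistake is easiest to see in code. This is a minimal sketch, assuming session state lives in a shared store so that any instance can serve any request. The in-memory dictionary stands in for an external store such as Redis, and all class and method names are hypothetical.

```python
import secrets


class SharedSessionStore:
    """In-memory stand-in for an external session store such as Redis."""

    def __init__(self):
        self._data = {}

    def save(self, session_id: str, session: dict) -> None:
        self._data[session_id] = dict(session)

    def load(self, session_id: str):
        session = self._data.get(session_id)
        return dict(session) if session is not None else None


class AppInstance:
    """A stateless app server: it keeps no session data of its own."""

    def __init__(self, name: str, store: SharedSessionStore):
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        session_id = secrets.token_hex(16)
        self.store.save(session_id, {"user": user})
        return session_id

    def whoami(self, session_id: str) -> str:
        session = self.store.load(session_id)
        if session is None:
            return f"{self.name}: not logged in"
        return f"{self.name}: hello {session['user']}"


if __name__ == "__main__":
    store = SharedSessionStore()
    pod_a, pod_b = AppInstance("pod-a", store), AppInstance("pod-b", store)
    sid = pod_a.login("asha")   # login lands on pod-a
    print(pod_b.whoami(sid))    # the next request lands on pod-b and still works
```

If each instance kept sessions in local memory instead, the second request would fail whenever the load balancer routed it to a different pod.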

Advanced interview questions

Q1 (Beginner): What is the difference between horizontal and vertical scaling?

Vertical adds resources to one node; horizontal adds more nodes to distribute load.

Q2 (Beginner): Which tier is easiest to scale horizontally?

Stateless application/API servers behind a load balancer.

Q3 (Intermediate): Why do databases resist horizontal write scaling?

Strong consistency and single-primary write ordering limit write parallelism, so scaling writes requires sharding.

Q4 (Intermediate): What triggers auto-scaling in Kubernetes?

Metrics such as CPU, memory, custom RPS, or queue depth, compared against the targets in the HPA spec.

Q5 (Advanced): How would you design for a 10× traffic spike within 5 minutes?

Raise minimum replicas ahead of the spike (pre-warmed), scale the HPA on RPS, cache hot reads in Redis, buffer writes in Kafka, serve static assets from a CDN, add DB read replicas, rate-limit non-critical paths, and validate the plan with a load test.
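
A quick capacity check helps back up the Q5 answer. The sketch below assumes each pod sustains about 800 RPS (the per-pod target used in the HPA manifest earlier) and a hypothetical baseline of 2,400 RPS. The function name and the 25% headroom are illustrative.

```python
import math


def replicas_for_load(peak_rps: float, rps_per_pod: float,
                      headroom: float = 0.0) -> int:
    """Pods needed to serve peak_rps, with fractional headroom (0.25 = 25%)."""
    if rps_per_pod <= 0:
        raise ValueError("rps_per_pod must be positive")
    return math.ceil(peak_rps * (1.0 + headroom) / rps_per_pod)


if __name__ == "__main__":
    baseline_rps = 2_400           # hypothetical steady-state traffic
    spike_rps = 10 * baseline_rps  # the 10x spike from the question
    print("baseline pods:", replicas_for_load(baseline_rps, 800))
    print("spike pods:", replicas_for_load(spike_rps, 800))
    print("spike pods with 25% headroom:", replicas_for_load(spike_rps, 800, 0.25))
```

Under these assumptions the spike needs 30 pods, or 38 with headroom, which fits under a maxReplicas of 40 but leaves little margin. That is a reason to pre-warm replicas rather than rely on the autoscaler alone.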

Summary

Scale out stateless tiers; scale up or shard stateful data stores. Auto-scaling ties infrastructure to SLOs and cost control. Connection pools and caches protect databases during horizontal app growth. Multi-AZ horizontal deployment improves availability and capacity. Capacity estimation informs when each lever activates. Load balancers, which distribute traffic across horizontally scaled app instances, are covered in the next lesson.
