High-Level Design Tutorial 0/42 lessons ~6 min read Lesson 16“Scalable systems, HLD interviews & case studies”

Database Sharding

Sharding (horizontal partitioning) splits data across multiple database instances so no single node holds the entire dataset or write load.

Course progress0%

Focus

10 guided sections

Practice signal

Examples included

Career prep

Interview Q&A included

Introduction

Sharding (horizontal partitioning) splits data across multiple database instances so no single node holds the entire dataset or write load. Each shard owns a subset of keys (user_id % N, geographic region, tenant_id). Sharding enables write scale beyond one primary but adds routing complexity, cross-shard query pain, and rebalancing operations.

HLD interviews introduce sharding when capacity estimates exceed single-node limits — often 1–10 TB or tens of thousands of writes per second depending on workload. Always mention shard key selection; a bad key creates hot shards worse than no sharding.

This lesson covers sharding strategies, routing layers, and alternatives (Cassandra, Vitess, Citus) that hide some complexity.

Understanding the topic

Key concepts

Shard key: column(s) determining shard — must match majority of queries for locality.
Hash sharding: even distribution; range sharding: locality, risk of hot ranges.
Directory-based sharding: lookup table mapping key → shard (flexible, SPOF risk).
Cross-shard queries expensive — design APIs to avoid them or use scatter-gather.
Rebalancing: consistent hashing minimizes data movement on shard add.
Application-level vs proxy (Vitess, ProxySQL) vs native (Cassandra partitions).

text

flowchart LR
  Router --> Shard1[(Shard 1)]
  Router --> Shard2[(Shard 2)]
  Router --> Shard3[(Shard 3)]

Internal architecture

Architecture overview

text

flowchart LR
  Router --> Shard1[(Shard 1)]
  Router --> Shard2[(Shard 2)]
  Router --> Shard3[(Shard 3)]

Step-by-step explanation

Router layer (app logic or middleware) computes shard from userId hash.
Each shard: primary + replicas; no cross-shard foreign keys.
Global tables (small reference data) replicated to all shards or separate lookup service.
Auto-increment IDs replaced with UUID/snowflake — global uniqueness without coordination.
Monitoring per-shard QPS, storage, replication lag — detect hot shards.
Migration: dual-write or backfill with feature flag cutover per shard.

Informative example

Application-level shard routing with consistent hashing in Java:

java

@Component
public class ShardRouter {
    private final List<DataSource> shards;

    public ShardRouter(@Qualifier("shardDataSources") List<DataSource> shards) {
        this.shards = List.copyOf(shards);
    }

    public DataSource route(String userId) {
        int slot = Math.floorMod(userId.hashCode(), shards.size());
        return shards.get(slot);
    }

    public Connection connectionForUser(String userId) throws SQLException {
        return route(userId).getConnection();
    }
}

@Repository
public class MessageRepository {
    private final ShardRouter router;

    public MessageRepository(ShardRouter router) { this.router = router; }

    public void save(String userId, Message msg) throws SQLException {
        try (var conn = router.connectionForUser(userId);
             var ps = conn.prepareStatement("INSERT INTO messages(id, user_id, body) VALUES (?,?,?)")) {
            ps.setString(1, msg.id());
            ps.setString(2, userId);
            ps.setString(3, msg.body());
            ps.executeUpdate();
        }
    }
}

Shard on userId if queries are user-scoped. Avoid shard on timestamp alone — hot latest shard.

Real-world use

Real-world use cases

Multi-tenant SaaS: shard by tenant_id for large enterprise customers.
Social messaging: shard by conversation_id or user_id.
Fintech ledger: shard by account_id with strict per-shard transactions.
Gaming leaderboards: separate shard strategy from user profile shards.

Best practices

Choose shard key from dominant query pattern — co-locate related rows.
Plan for 2× expected data when picking shard count — resharding is painful.
Avoid cross-shard transactions; use sagas for rare cross-shard workflows.
Use connection pool per shard; monitor saturation independently.
Canary new shards with synthetic traffic before full rebalance.
Consider managed sharding (Vitess, DynamoDB) before bespoke routers.

Common mistakes

Hot shard from celebrity user or sequential tenant onboarding.
Shard key change requiring full data migration surprise.
JOIN across shards in application loop — latency explosion.
Underestimating operational overhead of N independent PostgreSQL clusters.
Sharding before exhausting cache and read replicas.

Advanced interview questions

Q1BeginnerWhat is database sharding?

Horizontally splitting data across multiple database instances by a shard key.

Q2BeginnerWhy is shard key selection critical?

Bad keys cause uneven load (hot shards) or force expensive cross-shard queries.

Q3IntermediateHash vs range sharding?

Hash distributes evenly; range keeps related keys local but risks hot ranges on latest data.

Q4IntermediateHow handle cross-shard query?

Avoid in design; scatter-gather parallel queries; or maintain global index/search service.

Q5AdvancedShard design for 500M user chat history.

Shard by userId hash to 256 shards, messages table per shard, snowflake IDs, no cross-shard threads — group chats via separate mapping service or co-shard by conversationId if size bounded.

Summary

Sharding splits data for write/storage scale beyond one node. Shard key drives distribution and query locality. Cross-shard operations are expensive — design them away. Rebalancing and migrations need explicit plans. Managed solutions reduce custom router risk. Replication provides read scale and HA within each shard.

Ready to mark this lesson complete?Track your journey across the entire course.