High-Level Design Tutorial 0/42 lessons ~6 min read Lesson 17

    Database Replication

    Database replication copies data from a primary (leader) to one or more replicas (followers) for read scaling, high availability, and geographic proximity.

    Course progress0%
    Focus
    10 guided sections
    Practice signal
    Examples included
    Career prep
    Interview Q&A included

    Introduction

    Database replication copies data from a primary (leader) to one or more replicas (followers) for read scaling, high availability, and geographic proximity. PostgreSQL streaming replication, MySQL binlog replication, and MongoDB replica sets are standard patterns.

    Replication is asynchronous by default — replicas may lag milliseconds to seconds. Strongly consistent reads require routing to primary or sync replication with latency cost. HLD interviews pair replication with failover, read-your-writes semantics, and disaster recovery.

    This lesson contrasts sync vs async, single-leader vs multi-leader, and how apps route reads and writes.

    Understanding the topic

    Key concepts

    • Single-leader: all writes to primary; replicas serve read-only traffic.
    • Async replication: primary acks before replica confirms — faster, possible stale reads.
    • Sync replication: quorum before commit — stronger durability, higher write latency.
    • Failover: promote replica to primary on leader failure — risk of split-brain without fencing.
    • Read replicas offload analytics, search indexing, reporting queries.
    • Replication ≠ backup — deleted data replicates; need PITR snapshots too.
    text
    flowchart LR
    Primary -->|WAL| Replica1
    Primary -->|WAL| Replica2

    Internal architecture

    Architecture overview

    text
    flowchart LR
    Primary -->|WAL| Replica1
    Primary -->|WAL| Replica2

    Step-by-step explanation

    1. App writes → primary PostgreSQL; reads → round-robin replicas via connection pool router.
    2. Critical reads (balance after transfer) → primary or causal consistency token.
    3. Replica in secondary AZ/region for DR; async lag monitored (seconds SLO).
    4. Automated failover with Patroni/etcd or RDS Multi-AZ.
    5. Connection string separates write endpoint vs read endpoint.
    6. CDC from replica binlog to Kafka for search index without primary load.

    Informative example

    Spring routing datasource — writes to primary, reads to replica with transaction awareness:

    java
    @Configuration
    public class DatabaseRoutingConfig {
    @Bean
    public DataSource routingDataSource(
    @Qualifier("primaryDataSource") DataSource primary,
    @Qualifier("replicaDataSource") DataSource replica) {
    var map = new HashMap<Object, Object>();
    map.put(RoutingKey.PRIMARY, primary);
    map.put(RoutingKey.REPLICA, replica);
    var routing = new AbstractRoutingDataSource() {
    @Override
    protected Object determineCurrentLookupKey() {
    return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
    ? RoutingKey.REPLICA : RoutingKey.PRIMARY;
    }
    };
    routing.setTargetDataSources(map);
    routing.setDefaultTargetDataSource(primary);
    return routing;
    }
    }
    // @Transactional(readOnly = true) on query services -> replica

    After write, user expecting immediate read may hit stale replica — route session to primary briefly or use sync replication for money paths.

    Real-world use

    Real-world use cases

    • E-commerce product catalog reads from replicas during sales.
    • Banking statements generated from replica without impacting OLTP primary.
    • Global app: read replica in EU for GDPR data residency reads.
    • Social analytics dashboards on replica, not production primary.

    Best practices

    • Monitor replication lag alert > 5s for user-facing replicas.
    • Use read-only transactions for replica routing in ORMs.
    • Test failover quarterly — promoted replica data loss window understood.
    • Combine replication with backups (WAL archiving, snapshots).
    • Limit replica count — each adds primary replication overhead.
    • Document which queries require strong consistency → primary.

    Common mistakes

    • Reading balance from lagging replica after deposit — user sees wrong funds.
    • No automated failover — manual promotion during outage extends downtime.
    • Treating replica as backup — corruption or DROP TABLE replicates.
    • Too many replicas saturating primary replication bandwidth.
    • Split-brain two primaries accepting writes after network partition.

    Advanced interview questions

    Q1BeginnerWhy replicate a database?
    Read scaling, high availability, geographic reads, and disaster recovery.
    Q2BeginnerAsync vs sync replication?
    Async is faster with possible stale replicas; sync waits for replica ack — stronger durability, higher latency.
    Q3IntermediateWhat is replication lag?
    Delay between primary commit and replica applying change — causes stale read risk.
    Q4IntermediateRead-your-writes after signup?
    Route user's session reads to primary briefly, or use sync replica, or track session version token.
    Q5AdvancedDesign replication for payment ledger.
    Sync replica in same region for DR, async cross-region with lag alert, all balance reads from primary or linearizable store, automated failover with fencing, PITR backups.

    Summary

    Replication copies data from primary to replicas for scale and HA. Async replication enables stale reads — route carefully. Failover requires split-brain prevention and runbooks. Replicas offload read-heavy workloads, not replace backups. CDC from replicas feeds search and analytics pipelines. Partitioning complements replication for large tables.

    Ready to mark this lesson complete?Track your journey across the entire course.