High-Level Design Tutorial 0/42 lessons ~6 min read Lesson 17“Scalable systems, HLD interviews & case studies”

Database Replication

Database replication copies data from a primary (leader) to one or more replicas (followers) for read scaling, high availability, and geographic proximity.

Course progress0%

Focus

10 guided sections

Practice signal

Examples included

Career prep

Interview Q&A included

Introduction

Database replication copies data from a primary (leader) to one or more replicas (followers) for read scaling, high availability, and geographic proximity. PostgreSQL streaming replication, MySQL binlog replication, and MongoDB replica sets are standard patterns.

Replication is asynchronous by default — replicas may lag milliseconds to seconds. Strongly consistent reads require routing to primary or sync replication with latency cost. HLD interviews pair replication with failover, read-your-writes semantics, and disaster recovery.

This lesson contrasts sync vs async, single-leader vs multi-leader, and how apps route reads and writes.

Understanding the topic

Key concepts

Single-leader: all writes to primary; replicas serve read-only traffic.
Async replication: primary acks before replica confirms — faster, possible stale reads.
Sync replication: quorum before commit — stronger durability, higher write latency.
Failover: promote replica to primary on leader failure — risk of split-brain without fencing.
Read replicas offload analytics, search indexing, reporting queries.
Replication ≠ backup — deleted data replicates; need PITR snapshots too.

text

flowchart LR
  Primary -->|WAL| Replica1
  Primary -->|WAL| Replica2

Internal architecture

Architecture overview

text

flowchart LR
  Primary -->|WAL| Replica1
  Primary -->|WAL| Replica2

Step-by-step explanation

App writes → primary PostgreSQL; reads → round-robin replicas via connection pool router.
Critical reads (balance after transfer) → primary or causal consistency token.
Replica in secondary AZ/region for DR; async lag monitored (seconds SLO).
Automated failover with Patroni/etcd or RDS Multi-AZ.
Connection string separates write endpoint vs read endpoint.
CDC from replica binlog to Kafka for search index without primary load.

Informative example

Spring routing datasource — writes to primary, reads to replica with transaction awareness:

java

@Configuration
public class DatabaseRoutingConfig {

    @Bean
    public DataSource routingDataSource(
            @Qualifier("primaryDataSource") DataSource primary,
            @Qualifier("replicaDataSource") DataSource replica) {
        var map = new HashMap<Object, Object>();
        map.put(RoutingKey.PRIMARY, primary);
        map.put(RoutingKey.REPLICA, replica);
        var routing = new AbstractRoutingDataSource() {
            @Override
            protected Object determineCurrentLookupKey() {
                return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                    ? RoutingKey.REPLICA : RoutingKey.PRIMARY;
            }
        };
        routing.setTargetDataSources(map);
        routing.setDefaultTargetDataSource(primary);
        return routing;
    }
}

// @Transactional(readOnly = true) on query services -> replica

After write, user expecting immediate read may hit stale replica — route session to primary briefly or use sync replication for money paths.

Real-world use

Real-world use cases

E-commerce product catalog reads from replicas during sales.
Banking statements generated from replica without impacting OLTP primary.
Global app: read replica in EU for GDPR data residency reads.
Social analytics dashboards on replica, not production primary.

Best practices

Monitor replication lag alert > 5s for user-facing replicas.
Use read-only transactions for replica routing in ORMs.
Test failover quarterly — promoted replica data loss window understood.
Combine replication with backups (WAL archiving, snapshots).
Limit replica count — each adds primary replication overhead.
Document which queries require strong consistency → primary.

Common mistakes

Reading balance from lagging replica after deposit — user sees wrong funds.
No automated failover — manual promotion during outage extends downtime.
Treating replica as backup — corruption or DROP TABLE replicates.
Too many replicas saturating primary replication bandwidth.
Split-brain two primaries accepting writes after network partition.

Advanced interview questions

Q1BeginnerWhy replicate a database?

Read scaling, high availability, geographic reads, and disaster recovery.

Q2BeginnerAsync vs sync replication?

Async is faster with possible stale replicas; sync waits for replica ack — stronger durability, higher latency.

Q3IntermediateWhat is replication lag?

Delay between primary commit and replica applying change — causes stale read risk.

Q4IntermediateRead-your-writes after signup?

Route user's session reads to primary briefly, or use sync replica, or track session version token.

Q5AdvancedDesign replication for payment ledger.

Sync replica in same region for DR, async cross-region with lag alert, all balance reads from primary or linearizable store, automated failover with fencing, PITR backups.

Summary

Replication copies data from primary to replicas for scale and HA. Async replication enables stale reads — route carefully. Failover requires split-brain prevention and runbooks. Replicas offload read-heavy workloads, not replace backups. CDC from replicas feeds search and analytics pipelines. Partitioning complements replication for large tables.

Ready to mark this lesson complete?Track your journey across the entire course.