High-Level Design Tutorial, Lesson 19

    CAP Theorem

    Introduction

    The CAP theorem states that a distributed data store cannot simultaneously provide all three of: Consistency (every read sees the latest write), Availability (every request gets a response), and Partition tolerance (the system keeps working despite network splits). During a partition, you choose CP or AP.

    In interviews, CAP works as shorthand for trade-off thinking rather than a rigid label for a whole system. Modern systems offer tunable consistency (consistency levels such as QUORUM in Cassandra, read concern in MongoDB). Partition tolerance is mandatory in distributed systems; the real choice is consistency vs availability during failure.

    This lesson applies CAP to database selection, multi-region design, and explaining eventual consistency to interviewers.

    Understanding the topic

    Key concepts

    • Partition: a network split isolates nodes, so messages between AZs or regions are lost or delayed.
    • CP: reject requests or return errors to preserve consistency (ZooKeeper, etcd, SQL primary with synchronous replication).
    • AP: accept writes on both sides of the partition and resolve conflicts later (Dynamo-style stores, Cassandra at low consistency levels).
    • Eventual consistency: replicas converge once writes stop; until then, reads may be stale.
    • PACELC extension: if Partition, choose Availability or Consistency; Else, choose Latency or Consistency.
    • Linearizability: the strongest single-object guarantee, and expensive to provide across regions.
    text
    flowchart TB
    C[Consistency]
    A[Availability]
    P[Partition Tolerance]
    C --- P
    A --- P
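
To make the CP and AP definitions above concrete, here is a minimal sketch of how each mode might treat a write while a partition is in effect. `Replica`, `ToyRegister`, and `CapDemo` are hypothetical names invented for this illustration; no real datastore works exactly this way.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replica holding a single value (hypothetical, for illustration only). */
class Replica {
    String value = "v0";
    boolean reachable = true; // false = cut off from the client by a partition
}

/** Hypothetical register that contrasts CP and AP write handling. */
class ToyRegister {
    enum Mode { CP, AP }

    private final List<Replica> replicas;
    private final Mode mode;

    ToyRegister(List<Replica> replicas, Mode mode) {
        this.replicas = replicas;
        this.mode = mode;
    }

    /** Returns true if the write was accepted, false if it was rejected. */
    boolean write(String newValue) {
        List<Replica> reachable = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.reachable) reachable.add(r);
        }
        int majority = replicas.size() / 2 + 1;
        if (mode == Mode.CP && reachable.size() < majority) {
            return false; // CP: refuse the write rather than let replicas diverge
        }
        if (reachable.isEmpty()) {
            return false; // no replica can take the write at all
        }
        for (Replica r : reachable) {
            r.value = newValue; // AP, or CP with a majority: write what we can reach
        }
        return true;
    }
}

class CapDemo {
    /** Three replicas; the client can reach only the first one. */
    static List<Replica> partitionedCluster() {
        List<Replica> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) replicas.add(new Replica());
        replicas.get(1).reachable = false;
        replicas.get(2).reachable = false;
        return replicas;
    }

    public static void main(String[] args) {
        List<Replica> cp = partitionedCluster();
        boolean cpAccepted = new ToyRegister(cp, ToyRegister.Mode.CP).write("v1");
        System.out.println("CP accepted=" + cpAccepted + " replica0=" + cp.get(0).value);

        List<Replica> ap = partitionedCluster();
        boolean apAccepted = new ToyRegister(ap, ToyRegister.Mode.AP).write("v1");
        System.out.println("AP accepted=" + apAccepted
            + " replica0=" + ap.get(0).value + " replica1=" + ap.get(1).value);
    }
}
```

In CP mode the write is rejected and every replica still holds `v0`. In AP mode the write succeeds on the reachable replica, and the other two stay stale until the partition heals and some reconciliation step runs.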

    Internal architecture

    Step-by-step explanation

    1. Payment ledger: CP. Single primary with a synchronous replica quorum; fail the request rather than risk a double-spend.
    2. Social like count: AP. Increment asynchronously and display eventually; availability takes priority.
    3. Multi-region active-active: an AP choice that needs conflict resolution (LWW, vector clocks, CRDTs).
    4. Convergence: Cassandra uses read repair and hinted handoff to bring replicas back in sync.
    5. The SLA defines acceptable staleness, which drives AP vs CP per feature.
    6. During a partition, health checks and fencing prevent split-brain writes.
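
Steps 2 and 3 can be illustrated with one of the simplest CRDTs, a grow-only counter (G-Counter). The sketch below is a generic textbook version, not the mechanism of any particular database; the class names and the region identifiers `us-east` and `eu-west` are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Grow-only counter CRDT: each node increments only its own slot. */
class GCounter {
    private final String nodeId;
    private final Map<String, Long> counts = new HashMap<>();

    GCounter(String nodeId) {
        this.nodeId = nodeId;
    }

    /** Records one local event, for example one like. */
    void increment() {
        counts.merge(nodeId, 1L, Long::sum);
    }

    /** The counter value is the sum of all per-node slots. */
    long value() {
        long total = 0;
        for (long c : counts.values()) total += c;
        return total;
    }

    /** Per-node maximum: commutative, associative, and idempotent. */
    void merge(GCounter other) {
        for (Map.Entry<String, Long> e : other.counts.entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Long::max);
        }
    }
}

class GCounterDemo {
    public static void main(String[] args) {
        GCounter us = new GCounter("us-east");
        GCounter eu = new GCounter("eu-west");
        // During a partition each region accepts likes independently.
        us.increment(); us.increment();
        eu.increment(); eu.increment(); eu.increment();
        // After the partition heals, the regions exchange state in any order.
        us.merge(eu);
        eu.merge(us);
        us.merge(eu); // a repeated merge changes nothing
        System.out.println(us.value() + " " + eu.value()); // prints "5 5"
    }
}
```

Because merge takes the per-node maximum, both regions converge on the same total without coordination. A counter that also supports decrements (unlikes) would need a different CRDT, such as a PN-Counter.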

    Informative example

    Cassandra QUORUM read/write — tunable consistency between ONE and ALL:

    java
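    // Requires the DataStax Java driver 4.x and Spring on the classpath.
    // In a real project each public class lives in its own file.
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;
    import java.net.InetSocketAddress;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.stereotype.Service;

    @Configuration
    public class CassandraConfig {
        private static final String KEYSPACE = "social";

        // Single CqlSession bean; extending AbstractCassandraConfiguration as well
        // would register a second session and make injection ambiguous.
        @Bean
        public CqlSession cqlSession() {
            return CqlSession.builder()
                .addContactPoint(new InetSocketAddress("cassandra", 9042))
                .withLocalDatacenter("dc1")
                .withKeyspace(KEYSPACE)
                .build();
        }
    }

    @Service
    public class LikeService {
        private final CqlSession session;

        public LikeService(CqlSession session) { this.session = session; }

        public void like(String postId, String userId) {
            // CQL has no "USING QUORUM" clause; the level is set on the statement.
            session.execute(SimpleStatement.newInstance(
                    "INSERT INTO likes (post_id, user_id, liked_at) "
                        + "VALUES (?, ?, toTimestamp(now()))",
                    postId, userId)
                .setConsistencyLevel(DefaultConsistencyLevel.QUORUM));
        }

        public long count(String postId) {
            // LOCAL_QUORUM needs a majority of replicas in the local datacenter only.
            var rs = session.execute(SimpleStatement.newInstance(
                    "SELECT COUNT(*) FROM likes WHERE post_id = ?", postId)
                .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM));
            return rs.one().getLong(0);
        }
    }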

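    With replication factor N, QUORUM requires floor(N/2) + 1 replicas. Using QUORUM for both reads and writes gives R + W > N, so each read contacts at least one replica that holds the latest acknowledged write. During a partition, a coordinator that cannot reach a quorum rejects the request (CP-leaning), while ONE keeps answering from any reachable replica and may return stale data (AP-leaning). Match the consistency level to business tolerance.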
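
The replica-overlap condition can be checked with simple arithmetic. The helper below is a standalone sketch; `QuorumMath` and its method names are made up for this example and are not part of any Cassandra driver.

```java
/** Hypothetical helper for reasoning about read/write replica overlap. */
class QuorumMath {
    /** Majority of replicas: floor(rf / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must intersect every write set (R + W > N). */
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    /** How many replicas can be unreachable before the level fails. */
    static int toleratedFailures(int requiredReplicas, int replicationFactor) {
        return replicationFactor - requiredReplicas;
    }
}

class QuorumDemo {
    public static void main(String[] args) {
        int rf = 3;
        int q = QuorumMath.quorum(rf); // 2
        System.out.println("QUORUM/QUORUM overlaps: " + QuorumMath.overlaps(q, q, rf)); // true
        System.out.println("ONE/ONE overlaps: " + QuorumMath.overlaps(1, 1, rf));       // false
        System.out.println("ALL read + ONE write overlaps: " + QuorumMath.overlaps(rf, 1, rf)); // true
        System.out.println("QUORUM tolerates " + QuorumMath.toleratedFailures(q, rf) + " down replica(s)"); // 1
    }
}
```

With a replication factor of 3, QUORUM on both paths survives one unreachable replica. ONE on both paths survives two, but gives up the overlap guarantee.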

    Real-world use

    Real-world use cases

    • Banking: CP for account balances; AP for marketing preference flags.
    • E-commerce inventory during a partition: reserve conservatively (CP) or accept oversell risk (AP).
    • OTT: AP for view counts; CP for DRM license tokens with short TTLs.
    • Healthcare: CP for critical alert routing; AP for analytics dashboards.

    Best practices

    • Classify features by consistency requirement before picking stores.
    • Document user-visible staleness SLOs (e.g., follower count may lag by up to 30 s).
    • Use consensus systems (Raft) for small, strongly consistent metadata.
    • Test partition behavior with chaos engineering (iptables, Toxiproxy).
    • Don't cite CAP to justify arbitrary inconsistency; be feature-specific.
    • For hybrid flows, combine an AP write path with synchronous reads from the primary.
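
One way to make the first two practices tangible is to write the per-feature decision down as data. This is only a sketch: `FeaturePolicy`, `PolicyCatalog`, and the feature keys are hypothetical, and the 30-second budget simply reuses the follower-count example above.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical record of what one feature needs during a partition. */
class FeaturePolicy {
    enum Choice { CP, AP }

    final Choice underPartition;
    final Duration maxStaleness; // user-visible staleness budget

    FeaturePolicy(Choice underPartition, Duration maxStaleness) {
        this.underPartition = underPartition;
        this.maxStaleness = maxStaleness;
    }

    /** True when the observed replication lag stays inside the budget. */
    boolean withinSlo(Duration observedLag) {
        return observedLag.compareTo(maxStaleness) <= 0;
    }
}

class PolicyCatalog {
    /** Example classifications; the values are assumptions for illustration. */
    static Map<String, FeaturePolicy> defaults() {
        Map<String, FeaturePolicy> policies = new LinkedHashMap<>();
        policies.put("account_balance",
            new FeaturePolicy(FeaturePolicy.Choice.CP, Duration.ZERO));
        policies.put("follower_count",
            new FeaturePolicy(FeaturePolicy.Choice.AP, Duration.ofSeconds(30)));
        return policies;
    }

    public static void main(String[] args) {
        Map<String, FeaturePolicy> policies = defaults();
        Duration observedLag = Duration.ofSeconds(12);
        for (Map.Entry<String, FeaturePolicy> e : policies.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().underPartition
                + " withinSlo=" + e.getValue().withinSlo(observedLag));
        }
    }
}
```

A table like this gives the store choice a reviewable input: any feature with a zero staleness budget is a candidate for a CP path, and the rest can be served from replicas.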

    Common mistakes

    • Claiming 'we are CAP compliant' without saying whether consistency or availability is sacrificed under partition.
    • Using an AP store for money without idempotency and reconciliation.
    • Dismissing network partitions as rare; they happen during deploys and AZ failures.
    • Confusing CAP consistency with ACID isolation levels.
    • Designing for a single datacenter and ignoring partition tolerance until multi-region support is requested.
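
The idempotency point in the second mistake is worth a small sketch. Assume each payment request carries a client-generated idempotency key; `IdempotentLedger` is a hypothetical in-memory stand-in for what would normally be a durable, transactional store.

```java
import java.util.HashSet;
import java.util.Set;

/** In-memory sketch of applying each credit at most once. */
class IdempotentLedger {
    private final Set<String> appliedKeys = new HashSet<>();
    private long balanceCents;

    /** Applies the credit once per key; returns false for a duplicate delivery. */
    boolean credit(String idempotencyKey, long amountCents) {
        if (!appliedKeys.add(idempotencyKey)) {
            return false; // already applied, so a retry must not credit again
        }
        balanceCents += amountCents;
        return true;
    }

    long balanceCents() {
        return balanceCents;
    }
}

class LedgerDemo {
    public static void main(String[] args) {
        IdempotentLedger ledger = new IdempotentLedger();
        ledger.credit("req-123", 5_000);
        // The client timed out and retried the same request.
        boolean appliedAgain = ledger.credit("req-123", 5_000);
        System.out.println(appliedAgain + " " + ledger.balanceCents()); // prints "false 5000"
    }
}
```

In a real system the key check and the balance update would have to commit atomically, and a reconciliation job would still compare replicas after a partition heals.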

    Advanced interview questions

    Q1 (Beginner): What does CAP stand for?
    Consistency, Availability, Partition tolerance. During a network partition, a system must favor either consistency or availability.
    Q2 (Beginner): Why is partition tolerance mandatory in distributed systems?
    Network failures are always possible; ignoring P means pretending the system is not distributed.
    Q3 (Intermediate): CP vs AP example?
    CP: bank transfer on a single leader; AP: Twitter like count with eventual sync.
    Q4 (Intermediate): What is PACELC?
    If Partition, choose Availability or Consistency; Else, choose Latency or Consistency. It captures the trade-offs that apply during normal operation.
    Q5 (Advanced): Design a multi-region product catalog.
    AP with version vectors or LWW for title updates, a conflict audit log, reads from the nearest replica, admin writes routed to the primary region, cache invalidation events, and RPO/RTO defined per conflict type.
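
For the version-vector part of the Q5 answer, the core operation is comparing two vectors to decide whether one update supersedes the other or whether they conflict. The sketch below is a generic version; `VersionVectors` and the region names are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Compares two version vectors (region -> update counter). */
class VersionVectors {
    enum Order { EQUAL, BEFORE, AFTER, CONCURRENT }

    /** Returns how vector a relates to vector b. */
    static Order compare(Map<String, Long> a, Map<String, Long> b) {
        boolean aAhead = false;
        boolean bAhead = false;
        Set<String> regions = new HashSet<>(a.keySet());
        regions.addAll(b.keySet());
        for (String region : regions) {
            long x = a.getOrDefault(region, 0L); // a missing entry counts as 0
            long y = b.getOrDefault(region, 0L);
            if (x > y) aAhead = true;
            if (y > x) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // true conflict: needs resolution
        if (aAhead) return Order.AFTER;                // a supersedes b
        if (bAhead) return Order.BEFORE;               // b supersedes a
        return Order.EQUAL;
    }
}

class VersionVectorDemo {
    public static void main(String[] args) {
        Map<String, Long> fromUs = Map.of("us", 2L, "eu", 1L);
        Map<String, Long> fromEu = Map.of("us", 1L, "eu", 2L);
        Map<String, Long> older = Map.of("us", 1L, "eu", 1L);
        System.out.println(VersionVectors.compare(fromUs, fromEu)); // CONCURRENT
        System.out.println(VersionVectors.compare(fromUs, older));  // AFTER
    }
}
```

Only the CONCURRENT case needs a conflict policy such as LWW or a merge function, and that is also the case worth writing to the conflict audit log.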

    Summary

    CAP forces a choice between consistency and availability during network partitions. Partition tolerance is non-negotiable in distributed HLD. Match CP/AP per feature, not per whole system. PACELC extends the trade-off to latency under normal conditions. Eventual consistency requires product-level acceptance of staleness. Redis and other caches layer their own consistency models on top of these CAP choices.
