
CAP Theorem & Tradeoffs

The precise CAP statement, why partition tolerance is mandatory, CP vs AP with real systems, quorums as a per-request dial, PACELC's latency tax, and the misreadings that wreck prod.

21 min read · updated 2026-06-28

CAP is the most-quoted theorem in distributed systems and the most misquoted. Engineers say “we’re an AP shop” the way someone names their star sign — as if it were a fixed trait of the company rather than a decision the system makes, per request, only during a network partition. The actual result is narrow and surgical: when the network between your nodes breaks, you must choose between answering every request and keeping every replica in agreement. You cannot do both. When the network is healthy, CAP is silent — and the healthy case is where you live more than 99.9% of the time.

That silence is the trap. People internalize CAP, pick “AP,” and never notice that the decision that actually shapes their p99 latency — how much cross-replica coordination to pay for on every normal request — is a tradeoff CAP doesn’t even mention. PACELC does, and we’ll get there. But the first job is to state CAP precisely enough that you stop making the three mistakes that show up in every shallow design review and a depressing number of real outages.

This is the long-form context article. It is the thing I wish someone had handed me before I watched a “highly available” cluster go completely dark during a fifteen-second switch reboot, and before I watched an “eventually consistent” store oversell a limited-edition product by four thousand units in ninety seconds. CAP is not a system you adopt; it is a lens you apply every time data lives on more than one machine. It builds directly on Consistency & Consensus, shapes how you read Database Replication lag, and explains the consistency knobs in Cassandra and DynamoDB. The coordination services that lean hard CP — ZooKeeper & etcd — are the cleanest place to watch the theorem bite.

The single sharpest thing to carry out of this article: C and A are not a global setting you flip in a config file. They are a per-operation choice that only matters when nodes can’t talk, and “partition tolerance” is not optional — the network is going to partition whether your architecture admits it or not.

A motivating failure

A retailer I worked alongside ran inventory on a multi-region Cassandra cluster, replication factor 3 per region, reads and writes at consistency level ONE because someone two years earlier had benchmarked QUORUM against ONE, seen the latency difference, and “made it fast.” Inventory decrements went through an UPDATE ... SET stock = stock - 1 path guarded by an application-level check. It had worked through two holiday seasons.

On the morning of a limited sneaker drop, a top-of-rack switch in one availability zone wedged itself for about ninety seconds — not down, just dropping most inter-AZ packets. The cluster didn’t fail. That’s the point. Every node kept answering, because ONE means “any single reachable replica responds.” Writes on the isolated side committed locally. Reads on both sides returned whatever their nearest replica happened to hold.

For ninety seconds, two halves of the cluster each believed they had the authoritative stock count, and each cheerfully decremented from its own stale view. The product had 1,200 units. We sold a little over 5,000. There were no errors. No alarms fired on the database — every dashboard was green, because from Cassandra’s perspective nothing was wrong: it was doing exactly what ONE promises, which is stay available and reconcile later. The reconciliation, when the partition healed, used last-write-wins, and the “winner” was whichever decrement carried the higher timestamp — which had nothing to do with which order was real.

The bug was not in Cassandra. It was in a belief: that a system marketed as “highly available” would also keep counts correct during a network event. CAP says you can’t have both. The team had unknowingly chosen availability, and the partition arrived to collect on a choice nobody remembered making.

The one-sentence mental model

When a network partition splits your nodes into groups that can’t reach each other, you can keep every node answering (availability) or keep every node agreeing (consistency) — never both — and the partition is a when, not an if.

flowchart TB
  P{Partition\nnow?}
  P -->|No| Both[C and A both\nfree here]
  P -->|Yes| Pick{Pick one}
  Pick -->|CP| Block[reject or block\nminority side]
  Pick -->|AP| Stale[serve stale\nor conflicting]

Unpack each word, because each is a precise technical claim and the imprecision is where bad answers come from:

  • Consistency here means linearizability — every read observes the most recent completed write, as if there were a single copy of the data and one global order of operations. This is not the “C” in ACID, which is about constraints and invariants. Conflating the two is the first classic mistake.
  • Availability means every request to a non-failing node gets a non-error response — not “the system is mostly up,” but a hard guarantee that a reachable node never replies with “I can’t serve you right now.” A 200 that returns stale data is available; a 503 is not.
  • Partition tolerance means the system keeps operating when the network drops or arbitrarily delays messages between nodes. This is the clause people try to drop, and it’s the one you don’t get to drop.

The load-bearing word is when. Switches reboot, links saturate, a NIC starts dropping 5% of packets, a JVM on a Kubernetes node hits a stop-the-world GC pause that makes the node look dead for eight seconds. In any real network, partitions are routine operational weather. You are not choosing whether to tolerate them. You are choosing, in advance, what your system does in the middle of one, and that choice is CP or AP.

How it actually works

Why partition tolerance isn’t on the menu

The “pick two of three” framing is where CAP goes to die in interviews. It invites the answer “we’ll choose CA — consistency and availability — and just not tolerate partitions.” For a distributed system, that option does not exist.

A single-node database is “CA” in a trivial, useless sense: there’s no network between replicas, so there’s nothing to partition. The instant your data lives on two machines connected by a network, you cannot legislate packet loss out of existence. The link will fail at some point. When it does, your system either keeps serving (giving up C) or stops serving on at least one side (giving up A). There is no third behavior where both sides stay correct and responsive while unable to communicate — that would require sending information faster than the network allows, which is to say, magic.

So the real menu has two items: CP and AP. “CA” describes a system that hasn’t met its first network failure yet.

A partition, frame by frame

Picture two replicas, a client talking to each, and the link between the replicas cut.

sequenceDiagram
  participant CA as Client A
  participant R1 as Replica 1
  participant R2 as Replica 2
  participant CB as Client B
  Note over R1,R2: link cut (partition)
  CA->>R1: write x=5
  R1--xR2: replicate fails
  R1-->>CA: CP reject / AP ack
  CB->>R2: read x
  R2-->>CB: CP refuse / AP stale x

  • CP behavior: Replica 1 can’t confirm that Replica 2 has seen the write, so it refuses the write (or Replica 2 refuses the read). The system stays linearizable by becoming unavailable on at least one side. Clients on the wrong side of the partition get errors, retries, or timeouts.
  • AP behavior: Both replicas keep serving from local state. Replica 1 accepts x=5 and acks. Replica 2, not knowing about it, returns the old x. The system stays available by allowing the two sides to diverge, with a promise to reconcile once the link heals.

Neither is “wrong.” They’re answers to different questions. CP answers “would you rather be down than wrong?” AP answers “would you rather be wrong than down?” The retailer above had answered the second question without realizing the first one was on the table.
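
The two behaviors can be sketched in a few lines of Python. This is a toy model, not how any real store is implemented: the `Replica` class, the `mode` switch, and `run_partition` are hypothetical names invented for illustration, and the CP branch refuses on both sides even though refusing on one side would be enough.

```python
class Unavailable(Exception):
    """A CP replica refusing to answer because it cannot reach its peer."""


class Replica:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode          # "CP" or "AP": hypothetical per-replica switch
        self.data = {"x": 0}      # x starts at 0 on both replicas
        self.peer = None
        self.link_up = True

    def write(self, key, value):
        if not self.link_up:
            if self.mode == "CP":
                raise Unavailable(f"{self.name}: cannot replicate {key}")
            self.data[key] = value        # AP: ack from local state only
            return "ack"
        self.data[key] = value
        self.peer.data[key] = value       # healthy link: replicate synchronously
        return "ack"

    def read(self, key):
        if not self.link_up and self.mode == "CP":
            raise Unavailable(f"{self.name}: cannot confirm {key} is current")
        return self.data[key]


def run_partition(mode):
    """Cut the link, then replay the diagram: write x=5 on R1, read x on R2."""
    r1, r2 = Replica("R1", mode), Replica("R2", mode)
    r1.peer, r2.peer = r2, r1
    r1.link_up = r2.link_up = False
    outcomes = []
    for action in (lambda: r1.write("x", 5), lambda: r2.read("x")):
        try:
            outcomes.append(action())
        except Unavailable:
            outcomes.append("unavailable")
    return outcomes


if __name__ == "__main__":
    print("CP:", run_partition("CP"))
    print("AP:", run_partition("AP"))
```

In the AP run the write is acknowledged and the read on the other side still returns the old value, which is exactly the divergence the system has promised to reconcile later.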

Quorums: the dial that turns CAP into a spectrum

Production systems don’t hardcode CP or AP for the whole cluster. They expose a quorum model and let you pick per request. With N replicas holding each key, a write waits for W acknowledgments and a read consults R replicas. The key inequality:

W + R > N   →   every read overlaps at least one replica
                that saw the latest write (strong-ish reads)

flowchart LR
  K[key] --> N[N=3 replicas]
  N --> W[W=2 write\nacks needed]
  N --> R[R=2 read\nreplicas asked]
  W --> Q{W + R > N ?}
  R --> Q
  Q -->|yes| Strong[overlap guarantees\nfresh read]
  Q -->|no| Weak[may read\nstale value]

Set N=3, W=2, R=2: 2+2 > 3, so any read intersects any write on at least one replica — you get linearizable-leaning reads, and a partition that isolates a single node still lets the majority side reach quorum. This is the CP-leaning posture. Lose two of three nodes and the majority is gone, so quorum operations fail — correct, but unavailable.

Set W=1, R=1 (Cassandra’s ONE): 1+1 = 2, not greater than 3. Any reachable replica answers immediately. This is AP-leaning — fast, always-up, and exactly the configuration that oversold the sneakers. The same cluster, same data, can serve a QUORUM request (CP) and a ONE request (AP) microseconds apart. CAP isn’t a property of the cluster; it’s a property of the operation. This is the same quorum machinery that underpins Database Replication durability math and the consistency levels in Cassandra and DynamoDB.
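
The overlap claim is small enough to check by brute force. The sketch below is illustrative only; `quorums_overlap`, `always_intersect`, and `can_serve` are made-up helper names, and replicas are modeled as plain integers.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """The inequality from the text: W + R > N."""
    return w + r > n


def always_intersect(n, w, r):
    """Brute force: does every possible write set share a replica with every read set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def can_serve(reachable_replicas, acks_needed):
    """A request succeeds only if enough replicas are reachable from this side."""
    return reachable_replicas >= acks_needed


if __name__ == "__main__":
    for w, r in [(2, 2), (1, 1)]:
        print(f"N=3 W={w} R={r}: overlap guaranteed = {always_intersect(3, w, r)}")
    # One node isolated from a three-node cluster: majority side sees 2, minority sees 1.
    print("QUORUM on majority side:", can_serve(2, 2))
    print("QUORUM on minority side:", can_serve(1, 2))
    print("ONE on minority side:", can_serve(1, 1))
```

The last three lines are the CP/AP split in miniature: the same partition leaves QUORUM requests failing on the minority side while ONE requests keep succeeding on both.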

What “agreement” actually costs

Keeping replicas linearizable requires them to coordinate — to establish a single order of operations everyone accepts. That’s a consensus problem, solved by protocols like Raft or Paxos, covered in Consistency & Consensus. The mechanical cost is round trips: a leader must hear back from a majority before it can commit. That cost is invisible in a diagram and very visible in a latency histogram, which is the whole point of PACELC below.

The tradeoffs that bite

These are the decisions that look like a one-line config change and turn into an incident review.

| Decision | The free-looking choice | What it actually costs |
| --- | --- | --- |
| C vs A under partition | “we’re highly available” | silent staleness / conflicting writes you must reconcile |
| Quorum size | ONE for low latency | no overlap → reads miss recent writes even with no partition |
| Conflict resolution (AP) | last-write-wins, it’s simple | clock-skew silently drops an acked write |
| Strong consistency cost | “just make it consistent” | a coordination round trip on every write, forever |
| Failure detection | aggressive timeouts | slow node mistaken for dead → needless failover or divergence |
| Global config | one consistency level cluster-wide | money and “likes” forced to share one wrong answer |

Two of these deserve a hard stare.

Last-write-wins is a data-loss strategy wearing a convenience costume. When two sides of an AP partition both accept a write to the same key, something has to break the tie at merge time. LWW picks the higher timestamp. But timestamps come from clocks, clocks skew, and “the write with the later wall-clock time” has no relationship to “the write the user actually intended to keep.” The losing write is discarded with no error, no log line, no trace. If the data matters, LWW is not a resolution strategy — it’s deferred corruption. The correct-but-expensive alternatives are vector clocks (track causality, surface conflicts to the app) or CRDTs (data types that merge deterministically — counters, sets, registers).
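
To see the failure concretely, here is a minimal sketch of an LWW merge next to a vector-clock comparison. Everything in it is assumed for illustration: the `Write` record, the 250 ms skew, the timestamps, and the helper names are invented, and `real_order` is ground truth that a real store never gets to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    wall_clock_ms: int   # timestamp stamped by the accepting node's clock
    real_order: int      # ground truth, invisible to the store


def lww_merge(a, b):
    """Last-write-wins: keep whichever write carries the higher timestamp."""
    return a if a.wall_clock_ms >= b.wall_clock_ms else b


def dominates(vc_a, vc_b):
    """True if vector clock vc_a has seen everything vc_b has."""
    return all(vc_a.get(k, 0) >= vc_b.get(k, 0) for k in vc_a.keys() | vc_b.keys())


def concurrent(vc_a, vc_b):
    """Neither clock dominates: a genuine conflict the app must resolve."""
    return not dominates(vc_a, vc_b) and not dominates(vc_b, vc_a)


# Node A's clock runs 250 ms fast (assumed skew). Its write really happened first.
first = Write("ship to old address", wall_clock_ms=1_000 + 250, real_order=1)
second = Write("ship to new address", wall_clock_ms=1_100, real_order=2)
winner = lww_merge(first, second)

if __name__ == "__main__":
    print("LWW kept:", winner.value)                        # the older write survives
    print("conflict visible:", concurrent({"A": 1}, {"B": 1}))
```

LWW returns a winner without any signal that something was lost, while the vector-clock check at least reports that the two writes were concurrent and leaves the decision to the application.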

Global consistency level is a tax you pay on the data that doesn’t need it. Inventory and identity want CP. A view counter and a presence indicator are happy with AP. Force one level across the cluster and you either over-pay coordination cost on the cheap data or under-protect the expensive data. The fix is to decide per data class, which the next sections make concrete.
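
One way to make the per-data-class decision explicit is a small policy table in application code. The sketch below is an assumption about how a team might structure this, not a feature of any driver: the data class names, the `POLICY` mapping, and `levels_for` are hypothetical, and only the level names ONE and QUORUM come from Cassandra.

```python
from enum import Enum


class Level(Enum):
    ONE = "ONE"
    QUORUM = "QUORUM"


# Hypothetical classification: (read level, write level) per data class.
POLICY = {
    "inventory": (Level.QUORUM, Level.QUORUM),
    "identity": (Level.QUORUM, Level.QUORUM),
    "view_counter": (Level.ONE, Level.ONE),
    "presence": (Level.ONE, Level.ONE),
}


def levels_for(data_class):
    """Look up consistency levels; refuse to guess for unclassified data."""
    try:
        return POLICY[data_class]
    except KeyError:
        raise ValueError(f"unclassified data class: {data_class!r}") from None


if __name__ == "__main__":
    read_level, write_level = levels_for("inventory")
    print("inventory reads at", read_level.value, "writes at", write_level.value)
```

Raising on an unknown data class is a deliberate choice in this sketch: a dataset that nobody classified should fail loudly in review rather than silently inherit a default.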

CP vs AP, with systems you’ll actually run

| System | Default lean | Behavior during a partition |
| --- | --- | --- |
| ZooKeeper / etcd | CP | minority side stops serving writes (no quorum) |
| Consul | CP | leader-side consistency via Raft |
| Spanner / HBase | CP | block rather than return inconsistent data |
| Cassandra QUORUM | CP-leaning (tunable) | minority can’t reach quorum → errors |
| Cassandra ONE / Dynamo-style | AP | any replica answers; reconcile after heal |
| Riak / CouchDB | AP | accept writes on both sides; resolve conflicts |
| PostgreSQL single-leader | CP | isolated followers can’t accept writes |

The coordination tax: PACELC

CAP describes behavior during a partition — a rare event. It says nothing about the common case, and the common case is where almost all of your latency budget gets spent. PACELC, from Daniel Abadi, closes the gap:

if Partition (P): choose Availability (A) or Consistency (C); Else (E): choose Latency (L) or Consistency (C).

The “else” clause is the one that actually governs your day-to-day p99. Even with a perfectly healthy network, a linearizable system has to coordinate replicas before it can answer, and coordination means physics. A single-region Raft commit needs a round trip to a majority of replicas, typically 1–3 ms inside one datacenter. Stretch that consensus group across availability zones and you add the inter-AZ RTT, often 1–2 ms each way. Stretch it across regions (say us-east-1 to eu-west-1) and every consistent write eats the speed of light: roughly 35–45 ms one way, so about 70–90 ms per round trip, and 150 ms+ once a commit needs more than one cross-region round trip or the regions sit farther apart. Spanner pays this openly: cross-region Paxos round trips, plus a TrueTime commit-wait that sits out clock uncertainty before a commit is acknowledged.
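
The round-trip arithmetic can be sketched as a one-function model. It assumes a leader-based protocol where the leader counts its own vote and waits for the fastest followers; the function name and all RTT figures are illustrative assumptions, and disk fsync and processing time are ignored.

```python
def quorum_commit_latency_ms(follower_rtts_ms):
    """Network time for a leader to collect a majority, given RTTs to each follower."""
    cluster_size = len(follower_rtts_ms) + 1      # followers plus the leader
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1                    # the leader already has its own vote
    if acks_needed == 0:
        return 0.0
    # The commit completes when the slowest of the needed fastest followers replies.
    return sorted(follower_rtts_ms)[acks_needed - 1]


if __name__ == "__main__":
    print("same datacenter:", quorum_commit_latency_ms([1.0, 1.5]))
    print("all followers remote:", quorum_commit_latency_ms([80.0, 85.0]))
    print("one local, one remote:", quorum_commit_latency_ms([1.0, 80.0]))
    print("five nodes, two local:", quorum_commit_latency_ms([1.0, 2.0, 80.0, 80.0]))
```

The third and fourth cases show why placement matters more than cluster size: as long as a majority sits close to the leader, the remote replicas do not sit on the commit path.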

flowchart LR
  S{Partition\nnow?}
  S -->|P yes| AC{A or C}
  S -->|E no| LC{L or C}
  AC -->|A| a1[up, risk stale]
  AC -->|C| c1[correct, risk error]
  LC -->|L| l1[answer local\nfast, maybe stale]
  LC -->|C| c2[coordinate\npay round trips]

PACELC gives you four quadrants and a far more useful vocabulary than “CP or AP”:

  • PA/EL — available under partition, latency-optimized otherwise. DynamoDB (eventually consistent reads), Cassandra at ONE. The default shape of “web scale.”
  • PC/EC — consistent always, paying coordination cost in both regimes. Spanner, ZooKeeper/etcd, Cassandra at QUORUM. The default shape of “control plane” and “money.”
  • PA/EC and PC/EL exist but are rarer; they describe systems that switch posture between the two regimes.

When someone asks you “CP or AP?” the senior answer names the PACELC quadrant, because it forces the question they forgot to ask: what are you paying for consistency when the network is fine? For a system answering a billion reads a day, the EL-vs-EC choice moves more money than the partition behavior ever will.

A concrete number to anchor it: I’ve seen a team flip a hot read path from a cross-region strongly-consistent read to a local eventually-consistent read and watch p99 drop from 180 ms to 4 ms — with zero partition involved. That entire win was the PACELC “else” clause, the part of the tradeoff CAP can’t see.

Failure modes

How CAP misjudgments actually page people. Symptom → root cause → prevention.

  • The “we’re CA” architecture. Symptom: during a switch failure, both sides of the cluster accept conflicting writes and there’s no merge plan; data is corrupt after heal. Root cause: the team assumed partitions wouldn’t happen and built no strategy for one. Prevention: accept that you are CP or AP, never CA; pick one explicitly per data class and write down the partition behavior.

  • Silent staleness on AP reads. Symptom: a user updates a setting, refreshes, and sees the old value — “I just changed this!” Root cause: the read hit a replica that hadn’t received the write; read-your-writes was never guaranteed. Prevention: route read-after-write paths to the primary or use a session-consistency token; treat replica reads as eventually consistent and design the UI around it. This is the same class of bug as Database Replication lag.

  • CP unavailability surprise. Symptom: a quorum system loses a majority (two of three nodes, or a region split) and goes completely unavailable for writes; on-call panics because “distributed” was supposed to mean “always up.” Root cause: CP correctly refuses to serve without quorum. Prevention: know that quorum loss = unavailability by design; rehearse it, set client retry/backoff, and size quorums so common failures stay within survivable bounds.

  • Last-write-wins data loss. Symptom: an acknowledged write vanishes with no error. Root cause: AP conflict resolution via LWW plus clock skew discarded it. Prevention: don’t use LWW for data you can’t lose; use vector clocks or CRDTs, and sync clocks with NTP/PTP while never trusting them for correctness.

  • The partition that wasn’t a partition. Symptom: a healthy-but-slow node gets fenced and failed over, causing a needless write outage; or an AP system starts diverging because it thinks a peer died. Root cause: a long GC pause or a saturated link is indistinguishable from a real partition to a naive failure detector. Prevention: tune timeouts to be generous-but-bounded, use phi-accrual or similar adaptive detectors, and use fencing tokens so a “resurrected” node can’t act on stale authority.
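
For the last failure mode, a phi-accrual style detector can be sketched as follows. This is a simplified version that assumes heartbeat intervals are exponentially distributed, which keeps the math to one line; the original formulation fits a normal distribution instead, and the class name, window size, and timestamps here are all illustrative.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Outputs a suspicion level (phi) instead of a binary alive/dead verdict."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # recent gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Exponential model: P(heartbeat still to come) = exp(-elapsed / mean),
        # and phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for t in (0.0, 1.0, 2.0, 3.0):
        detector.heartbeat(t)
    print("half a second late:", round(detector.phi(3.5), 2))
    print("eight second pause:", round(detector.phi(11.0), 2))
```

Because phi scales with the observed heartbeat rhythm, the caller picks a threshold once and the detector adapts to slow links on its own; an eight-second pause on a one-second heartbeat raises suspicion without forcing an immediate verdict.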

The most expensive misconception in this whole topic is treating C-versus-A as a permanent, global setting. It is a per-operation, partition-time decision. A bank ledger and a “like” counter can live in the same company with opposite answers — and a single Cassandra cluster can serve both correctly if you set the consistency level per query. The teams that get burned are the ones who picked an answer once, for the whole system, and forgot they’d picked it.

A few more misreadings worth killing on sight. “Eventually consistent” does not mean “consistent after a brief delay you can ignore.” It means there is no upper bound on the convergence window unless the system explicitly gives you one; under sustained write load and replication backpressure, “eventually” can stretch to seconds or minutes. And “strong consistency” is not one thing: linearizability (real-time ordering of single-object operations), sequential consistency (a single order, not necessarily real-time), and serializability (transaction isolation — the “I” in ACID) are distinct guarantees with very different costs. Conflate them and you either over-pay for coordination you didn’t need or assume an ordering guarantee you never actually bought. Consistency & Consensus takes these apart properly.

Scaling it

CAP doesn’t get “scaled” — but the way you apply it determines whether scale amplifies your good decisions or your bad ones.

  1. Decide per data class, not per company. Money, inventory, identity, and config want CP/EC. Feeds, counters, presence, recommendations, and analytics tolerate AP/EL. Imposing one global answer guarantees you’re wrong for half your data. This is the single highest-leverage move.

  2. Tune the quorum knob per workload. Use W + R > N for the paths that need fresh reads; relax to ONE/ONE for latency-critical, staleness-tolerant paths. Reconfigure per request wherever the datastore allows — that flexibility is the whole reason to run a tunable system.

  3. Keep consensus groups inside a low-latency domain. A synchronous quorum that spans regions pays the full inter-region RTT on every write (the PACELC EC tax, 150 ms+ cross-continent). Keep the consensus group within a single region and replicate asynchronously across regions, accepting an AP boundary between regions where the latency would otherwise be unacceptable. This is how most global systems are actually built: CP within a region, AP across them.

  4. Shrink the blast radius with partitioning. Smaller partitions mean a partition event (the network kind) affects fewer keys, and quorum loss is scoped to one slice of the keyspace rather than the entire dataset. A cluster of many small shards degrades gracefully where one giant shard fails wholesale.

  5. Plan reconciliation before you go AP, not after. Choose your conflict strategy — LWW (cheap, lossy), vector clocks (correct, complex), or CRDTs (correct for specific types) — at design time. “We’ll figure out conflicts later” means “we’ll discover corruption during an incident.”
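
As an example of the CRDT option in item 5, here is a minimal grow-only counter (G-Counter), the kind of type that suits view counts. The class and node names are illustrative. Note that it only counts upward, so it cannot enforce a lower bound such as "stock never goes below zero".

```python
class GCounter:
    """Grow-only counter: each node increments its own slot, merge takes the max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-node max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repeats.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


if __name__ == "__main__":
    east, west = GCounter("east"), GCounter("west")
    east.increment(3)      # accepted on one side of the partition
    west.increment(2)      # accepted on the other side
    east.merge(west)
    west.merge(east)
    print(east.value(), west.value())
```

Both sides accept writes during the partition and end at the same total after the heal, with no increment discarded. That guarantee comes from the shape of the data type, which is why CRDTs are correct only for the specific types they model.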

When to reach for it (and when not to)

CAP isn’t a tool you adopt; it’s the question you ask every time data spans more than one node. So “reaching for it” means deciding which side to land on for a given dataset.

Choose CP when correctness beats uptime: ledgers, inventory decrements, leader election, unique-constraint enforcement, and globally-agreed configuration. These are jobs for ZooKeeper & etcd-style coordination, a single-leader PostgreSQL primary, or a QUORUM-configured store. Accept that a partition that costs you quorum will cost you availability for those operations — that’s the deal, and for money it’s the right deal.

Choose AP when uptime beats freshness: shopping carts (reconcile at checkout), social feeds, view counts, presence, metrics, and recommendations. A few seconds of staleness is invisible to users; a 503 is not. Just make sure you’ve actually decided this rather than inheriting it from an old benchmark — the sneaker story is what happens when AP is a default nobody chose.

Don’t invoke CAP to excuse losing data that matters. “We’re AP” is not a justification for dropping a payment; it’s a statement that you put that payment in the wrong system. And don’t ever claim “CA” for a distributed system — if it spans a network, it’s CP or AP when the network breaks, full stop.

When to consider alternatives

CAP points you at the right sibling tool for each job:

  • Strong coordination, leader election, locks that must be correct → ZooKeeper & etcd (CP, consensus-backed).
  • Durable transactional source of truth → PostgreSQL (CP single-leader) or DynamoDB with strongly-consistent reads.
  • Write-heavy, always-on, tunable consistency → Cassandra (dial CP or AP per request).
  • Fast, rebuildable, lossy state → Redis (AP-flavored; never the system of record for data you can’t lose).
  • Ordered, durable event log to reconcile from → Kafka, which gives AP systems a replayable truth to converge against.
  • Read fan-out with explicit lag → Database Replication read replicas, where you choose freshness vs latency per query.

The recurring pattern: there is no system that escapes CAP, only systems that expose the choice at different granularities. The good ones let you choose per operation; the dangerous ones make the choice for you and don’t tell you which one.

Operational checklist

  • Classify every dataset as CP or AP explicitly, and write the classification down next to the schema — not in someone’s head.
  • For each AP dataset, define and test the conflict-resolution strategy (LWW / vector clocks / CRDT) before launch; prove it doesn’t silently drop writes.
  • For each CP dataset, rehearse the unavailability behavior: what error clients see, retry/backoff policy, and measured failover time.
  • Set quorum levels (W, R, N) per workload, not globally; alert when reachable replicas for a key drop below quorum.
  • Keep synchronous consensus groups inside one region/low-latency domain; replicate cross-region asynchronously and treat that boundary as AP.
  • Tune failure detectors (generous-but-bounded timeouts, phi-accrual) so GC pauses and slow links aren’t misread as partitions; use fencing tokens against stale nodes.
  • Inject partitions deliberately in a game day (drop inter-AZ packets, kill a majority) and confirm both sides behave exactly as designed.
  • Name the PACELC quadrant for each critical path and measure the EC/EL latency cost on the healthy network, where you actually pay it.
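
The fencing-token item in the checklist can be illustrated with a toy lock service and store. Both classes are hypothetical stand-ins: in practice the token would come from a coordination service and the check would live in the storage layer, but the rule is the same, reject any write carrying a token older than one already seen.

```python
class StaleToken(Exception):
    """Raised when a writer presents a token older than one already accepted."""


class LockService:
    """Hands out monotonically increasing tokens with each lock grant."""

    def __init__(self):
        self._next = 0

    def acquire(self):
        self._next += 1
        return self._next


class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


if __name__ == "__main__":
    locks, store = LockService(), FencedStore()
    old_leader = locks.acquire()     # then stalls in a long GC pause
    new_leader = locks.acquire()     # failover grants a newer token
    store.write(new_leader, "config", "v2")
    try:
        store.write(old_leader, "config", "v1")   # the old leader wakes up
    except StaleToken as exc:
        print("rejected:", exc)
    print(store.data)
```

The old leader never learns it was fenced until it tries to write, which is the point: correctness does not depend on the paused node noticing anything.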

Summary

CAP says one precise thing: during a network partition, a distributed system can stay available or stay linearizable, never both, and partitions are inevitable — so your real choice is CP or AP, never the mythical CA. The mistakes that hurt are treating that choice as a permanent global setting instead of a per-operation, partition-time decision; reading CAP’s “C” as ACID consistency when it means linearizability; and forgetting that the choice CAP can’t see — coordination latency on the healthy network, captured by PACELC’s EL-vs-EC — is the one you pay for on every request. Decide per data class, tune quorums per workload, keep consensus inside a region, pick a conflict strategy before you go AP, and rehearse the unavailability you signed up for if you go CP. Do that and CAP stops being the thing that surprises you at 3am and becomes the framework that told you exactly what would happen — because you decided it on purpose.

Appendix: a quick linearizability refresher

If the body leaned on terms you’d like restated cleanly:

  • Linearizability — operations appear to take effect instantaneously at some point between their start and finish, consistent with real time. There’s a single, real-time-respecting order. This is CAP’s “C.”
  • Sequential consistency — there’s a single order all nodes agree on, but it need not match real time. Weaker than linearizable, cheaper to provide.
  • Serializability — the “I” in ACID: concurrent transactions produce a result equivalent to some serial order. About transaction isolation, not single-object real-time ordering. Distinct from linearizability; a system can offer one without the other.
  • Eventual consistency — replicas converge to the same value if writes stop, with no bound on how long that takes unless the system specifies one. The default of AP systems.
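
The real-time clause in linearizability can be turned into a simple check on a recorded history. The sketch below tests only one necessary condition, that a read starting after a write has completed must not observe an older version. It is not a full linearizability checker, and it assumes a single register whose version numbers increase in real-time write order; the `Op` record and `stale_reads` are invented names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str       # "write" or "read"
    start: float    # when the client issued the operation
    end: float      # when the client got its response
    version: int    # version written, or version the read observed


def stale_reads(history):
    """Return reads that observed a version older than an already-completed write."""
    writes = [op for op in history if op.kind == "write"]
    stale = []
    for read in (op for op in history if op.kind == "read"):
        completed = [w.version for w in writes if w.end < read.start]
        if completed and read.version < max(completed):
            stale.append(read)
    return stale


if __name__ == "__main__":
    history = [
        Op("write", 0.0, 1.0, version=1),
        Op("write", 2.0, 3.0, version=2),
        Op("read", 2.5, 3.5, version=1),   # overlaps the second write: allowed
        Op("read", 4.0, 5.0, version=1),   # starts after it completed: stale
    ]
    for op in stale_reads(history):
        print("stale read:", op)
```

A sequentially consistent system could legally produce the second read, while a linearizable one could not, which is the practical difference between the first two definitions above.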

The reason these distinctions matter operationally: each stronger guarantee costs more coordination, and coordination costs latency and availability. Buying “strong consistency” without knowing which strong consistency means you’re either over-paying or under-protected. Consistency & Consensus is the deeper treatment.
