ZooKeeper & etcd

Coordination services aren't databases. What ZooKeeper and etcd actually do, why ZAB and Raft matter, and how watches, sessions, leases, and fencing tokens bite in production.

22 min readupdated 2026-06-28

On this page

People reach for ZooKeeper or etcd the first time they need “a place to keep a little shared state,” and then they treat it like a tiny, very reliable database. That instinct is precisely how you take down a cluster. A coordination service is not a database. It is a replicated state machine that trades throughput for the ability to make every node agree on the same truth — and the moment you push real data volume or write rate through it, it punishes you, slowly at first and then all at once.

The job these systems do is narrow and load-bearing: leader election, distributed locks, dynamic configuration, and membership. They are the part of your architecture that answers the question “who is in charge right now, and what is the current config?” with a single, consistent answer that every node trusts. ZooKeeper backs Kafka’s classic control plane, HBase, and a generation of Hadoop infrastructure; etcd backs Kubernetes and is, for most of us today, the coordination service you actually depend on whether you know it or not. They solve the same problem with different APIs and different consensus protocols (ZAB and Raft), and they fail in the same small handful of ways.

This article is the mental model for the systems that hold a cluster together. It assumes you have already been surprised by one in production. The patterns here connect directly to Consistency & Consensus for the theory, to the brokers in Apache Kafka and the control plane in Kubernetes for the consumers, and to the CAP theorem for why a coordination service deliberately chooses to stop serving writes when it loses quorum rather than risk disagreeing with itself. The one belief I want you to leave with: the consensus protocol is not where you get hurt. The API semantics are — one-shot watches, session expiry, lease renewal, and the fencing token you forgot to check.

A motivating failure

A team I worked with ran a fleet of stateful workers behind a ZooKeeper-based leader lock. Exactly one worker was supposed to own a billing reconciliation job at any instant. The lock recipe was textbook: each candidate created an ephemeral sequential znode under /billing/leader/, and the holder of the lowest sequence number ran the job. If the leader died, its ephemeral node vanished with its session, the next-lowest node won, and life continued. They had run it for two years.

One afternoon the leader process hit a brutal stop-the-world GC pause — a 14-second full collection under memory pressure. Its ZooKeeper session timeout was 10s. At second 10, the cluster declared the session dead and deleted the leader’s ephemeral node. The standby saw its watch fire, became leader, and started reconciling. At second 14, the original process woke up from GC, completely unaware that any time had passed, and kept running the reconciliation it thought it still owned.

For about four seconds, two processes both believed they were the sole leader. They both wrote settlement records. The reconciliation job was not idempotent — nobody had needed it to be, because “there is only ever one leader.” The result was a double batch of settlement entries that took finance three days and a forensic SQL session to unwind.

Nothing was broken. ZooKeeper did exactly what it promised: the session expired, the ephemeral node disappeared, a new leader was elected. The bug lived in an unwritten assumption — that “I hold the lock” and “I am still allowed to act” are the same statement. They are not. A coordination service can guarantee its own state never has two leaders. It cannot reach into a process that stopped listening and stop its hands. The only fix is fencing: every action the leader takes must carry the monotonic token from the lock, and the downstream system must reject any token older than the highest it has seen. They added the zxid as a fencing token, made the writer reject stale tokens, and the class of bug disappeared. It is the most important lesson in this entire article, and it is why I put it first.

The one-sentence mental model

A coordination service is a small, strongly-consistent, fully-replicated key-value store where every write goes through a consensus quorum, so it can tell every client the same truth about who is leader and what the config is — and it will refuse to answer at all rather than risk telling two clients different things.

Unpack each clause, because each one is an operational limit you will eventually hit:

Small — the entire dataset is expected to fit in memory on every node and stay in the low hundreds of MB. etcd’s default storage quota is 2GiB (--quota-backend-bytes); cross it and the cluster goes read-only with mvcc: database space exceeded. This is a feature, not a bug — it is the system refusing to let you misuse it.
Strongly-consistent — a successful write is durable on a majority and linearizable. There is no “eventually.” When the API returns success, every subsequent read across the cluster reflects it.
Fully-replicated — every node holds the whole dataset. You do not shard a coordination service. You scale it down, not out.
Consensus quorum — every write costs a round trip to a majority of nodes, each doing an fsync. Throughput is bounded by the slowest member of the quorum, not the fastest.

flowchart TB
  C1[Client write] --> L[Leader]
  C2[Client read\nlocal ok] --> F1[Follower 1]
  L -->|propose| F1
  L -->|propose| F2[Follower 2]
  F1 -->|ack fsync| L
  F2 -->|ack fsync| L
  L -->|commit on\nmajority ack| C1
  L -. heartbeat .-> F1
  L -. heartbeat .-> F2

The leader sequences every write and replicates it; a write commits only once a majority (quorum) has durably acknowledged it. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Even numbers buy you nothing — a 4-node cluster still needs 3 for quorum, so it tolerates only 1 failure while costing more to replicate and making every write wait on a larger set. Run odd numbers, always. This is the same quorum math that underpins database replication with synchronous commit, applied to a system whose entire purpose is agreement.

How it actually works

Two data models for the same job

ZooKeeper exposes a hierarchical namespace of znodes — a filesystem-like tree, e.g. /services/kafka/brokers/ids/1. Each znode holds a small payload (the jute.maxbuffer ceiling defaults to roughly 1MB per node, and you should stay orders of magnitude below it) and is one of three kinds:

Persistent — survives until explicitly deleted. Configuration, registries, durable metadata.
Ephemeral — tied to the client’s session; vanishes automatically when that session ends. This is the primitive behind membership and locks — “I exist as long as I’m connected.”
Sequential — appends a monotonic counter to the name (lock-0000000042), giving you globally ordered, unique nodes. Combine ephemeral + sequential and you have the building block for fair locks and queues.

etcd takes a different shape: a flat, MVCC key-value store with a single global, monotonically increasing revision. Every mutation bumps the revision, and old revisions are retained until compaction. Instead of ephemeral znodes, etcd has leases: you create a lease with a TTL, attach keys to it, and keep it alive with periodic KeepAlive calls. Stop renewing — because you crashed or got partitioned — and when the TTL elapses the lease expires and every key bound to it is deleted atomically. Same outcome as an ephemeral znode, different mechanism, and the atomic multi-key delete is genuinely useful.

Concept	ZooKeeper	etcd
Data model	Hierarchical znode tree	Flat MVCC key-value
Liveness primitive	Ephemeral node + session	Key + lease (TTL)
Change notification	One-shot watch	Streaming `Watch` from a revision
Ordering token	Sequential znode counter (`zxid`)	Global `revision`
Consensus	ZAB	Raft
Wire API	Custom binary (Jute)	gRPC / protobuf
Typical home	Kafka classic, HBase, Hadoop	Kubernetes, cloud-native stacks

Watches: how clients learn the truth changed

You do not poll a coordination service in a tight loop — that would turn a low-write system into a read storm and defeat the point. Instead you set a watch and get notified when the thing you care about changes. This is what makes leader election and config distribution cheap to read: thousands of clients sit idle until something actually moves, then get told.

sequenceDiagram
  participant A as Candidate A
  participant ZK as Coordination cluster
  participant B as Candidate B
  A->>ZK: create ephemeral+seq lock-
  ZK-->>A: lock-0000000001
  B->>ZK: create ephemeral+seq lock-
  ZK-->>B: lock-0000000002
  A->>ZK: lowest seq? yes
  Note over A: A is leader
  B->>ZK: watch lock-0000000001
  Note over A: A session dies
  ZK-->>B: watch fires lock gone
  B->>ZK: lowest seq now? yes
  Note over B: B is leader

The crucial detail: ZooKeeper watches are one-shot. After a watch fires once, it is gone, and you must re-register it. Any change that happens between the fire and your re-registration is something you only discover by re-reading state. This is not a flaw to route around — it is a design that forces you to treat the watch as an interrupt that says “go re-read,” never as a reliable change feed. etcd’s watches are streaming and resume from a given revision, which closes the gap between events, but they come with their own sharp edge: if your starting revision is older than what etcd still retains after compaction, you get a compacted error and must re-list and resume. Either way, the discipline is identical — the watch tells you that something changed, your re-read tells you what.

Leader election and fencing, end to end

The election recipe above is only half the story. The half that prevents the opening incident is what the winner does next. The lowest sequence number — or the etcd lease revision — is not just an election tiebreaker; it is a fencing token: a number that only ever increases, attached to every write the leader makes to the system it controls.

flowchart LR
  E[Win election\ntoken = 42] --> W1[Write payload\ntoken 42]
  W1 --> DS{Downstream\nhighest seen?}
  DS -->|42 >= 41| OK[accept\nstore 42]
  G[GC pause\nlost session] --> NL[New leader\ntoken = 43]
  NL --> W2[Write payload\ntoken 43]
  W2 --> DS
  DS -->|stale 42 < 43| REJ[reject old\nleader writes]
  style REJ fill:#e11d48,color:#fff

The downstream store remembers the highest token it has accepted and rejects anything lower. When the GC-paused old leader wakes up and tries to write with token 42, the store has already seen 43 and refuses it. The coordination service guaranteed correctness in its own state; the fencing token extends that guarantee into the system you actually care about. If you take one engineering practice from this page, it is this one.

ZAB vs Raft

ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast); etcd uses Raft. Both elect a single leader, both commit on majority quorum, both produce a totally-ordered log of state changes. The differences are mostly in framing, and they rarely matter to your application code.

Leader election. ZAB elects the node with the most up-to-date transaction log (zxid). Raft elects a candidate that wins a majority of votes within a term, and a node only votes for a candidate whose log is at least as current as its own. Both guarantee a freshly elected leader can never silently lose a committed write.
Log identity. ZAB’s zxid is a 64-bit value: the high bits are the leader epoch, the low bits a per-epoch counter. Raft uses (term, index). Same idea — a logical clock that makes a stale leader detectable the instant it tries to act.
Reads. Both can serve stale reads from followers for speed. etcd offers linearizable reads via a quorum check (ReadIndex) when you opt in; ZooKeeper reads are served locally by the connected server and can lag, which is why you call sync() before a read that absolutely must reflect the latest committed write. Defaulting to local reads without understanding this is how you get an election bug that only appears under partition.

The honest summary: pick the system that fits your ecosystem (etcd in Kubernetes-land, ZooKeeper in Kafka-classic/Hadoop-land), and spend your worry budget on session timeouts and fencing, not on which consensus paper is more elegant.

The tradeoffs that bite

These are the decisions that look free when you draw the architecture and bill you in production.

Tradeoff	The free-looking choice	What it actually costs
Consistency vs throughput	”It’s a KV store, just write to it”	Every write is a quorum `fsync` round trip; ~low thousands/sec ceiling
Read freshness vs latency	Reading from any follower	Stale reads; a leader-election bug that only shows under partition
Cluster size vs write latency	”More nodes = more reliable”	7 nodes commit slower than 3; you bought latency, not safety
Session TTL vs false death	A short timeout for fast detection	A GC pause or blip kills a healthy leader (the opening story)
Small vs convenient	Storing payloads “because it’s reliable”	Quota exceeded → whole cluster read-only
Lock vs correctness	Trusting the lock without a fence	Two leaders during a pause both act; corruption downstream

Two of these deserve emphasis. Consistency vs throughput is the one that catches the most teams: a coordination service typically handles low thousands of writes per second, which is three to four orders of magnitude below a real database. If your write rate scales with user traffic — per-request counters, per-session state — you have chosen the wrong tool and the cluster will tell you so under load. And session TTL vs false death has no correct universal value: a short TTL detects genuinely dead clients fast but converts a GC pause or network blip into a false death; a long TTL tolerates jitter but leaves dead members “alive” for seconds while nothing makes progress. You tune it for your GC behavior and network reality, and you write down the false-death window you accepted.

Performance: what’s fast, what’s slow, and the levers

The performance profile of a coordination service is unusual because the system is deliberately not built for speed. Understanding what is cheap versus catastrophic keeps you from designing a slow-motion outage.

What is fast. Watch notifications and local reads. A client sitting on a watch costs the server almost nothing until the watched node changes. Local (non-linearizable) reads are served straight from the connected node’s memory — sub-millisecond, and they scale with the number of nodes you can read from. This is why a coordination service is a fine place to read configuration thousands of times a second: the reads are local and the data is in memory.

What is slow. Every write. A write is a proposal, a replication round trip to a majority, an fsync on each of those members, and a commit — all serialized through the single leader. There is no batching away the quorum round trip’s latency floor, only amortizing it across concurrent proposals. Linearizable reads are also “slow” relative to local reads because they pay a quorum ReadIndex check to prove they aren’t stale.

The levers that actually move the needle:

Don’t write on the hot path. The single biggest win is architectural: writes should be rare and triggered by state changes (a leader died, a config was updated), not by user requests. If you find yourself writing per-request, redesign.
Fast, dedicated disk. Because every commit fsyncs, disk latency is write latency. etcd is notoriously sensitive; put the WAL on a fast, isolated device (--wal-dir), and never share the disk with a noisy neighbor. A degraded volume on one member drags the whole quorum down because the leader waits for a majority.
Keep the cluster small. Five voting members is the usual ceiling. Each additional voter makes every commit wait for a larger majority to fsync. You add nodes for fault tolerance, never for write speed.
Compact and defrag (etcd). MVCC retains every revision until you compact; without a compaction schedule the keyspace grows, reads slow as history accumulates, and you drift toward the quota. --auto-compaction-retention plus periodic defrag keeps it lean.
Prefer local reads, opt into linearizable only when correctness needs it. Most config reads tolerate a few milliseconds of staleness. Reserve linearizable reads (and ZooKeeper sync()) for the reads that gate a correctness decision.

Watch the right metrics, not the obvious ones. CPU is rarely the problem. Watch commit/fsync latency (etcd etcd_disk_wal_fsync_duration_seconds, etcd_disk_backend_commit_duration_seconds), leader changes (a flapping leader is an early warning of disk or network trouble), proposal failures, DB size against the quota, and watch/lease counts. The connection to broader observability is direct: these are the signals that tell you the coordination layer is about to take your whole platform with it.

Failure modes

The recurring outages, in rough order of how often they page someone. Each is symptom → root cause → prevention.

The herd effect on watches. Symptom: a single lock release or leader change causes a CPU spike and a burst of traffic across the whole cluster. Root cause: a naive lock or election has every waiter set a watch on the same node. When it changes, the server notifies all of them at once; they all wake, all re-read, all re-register, and hammer the cluster simultaneously — a self-inflicted thundering herd that gets worse with every added waiter. Prevention: make each waiter watch only the node immediately ahead of it in the sequential chain (the recipe in the election diagram), so a single release wakes exactly one client.

Session expiry under GC or network blip. Symptom: a healthy leader suddenly isn’t, and a standby takes over while the old process keeps running. Root cause: ephemeral nodes/leases vanish when the session expires, and a long stop-the-world GC pause or a transient partition can expire a session even though the process is perfectly alive (the opening story). Prevention: tune the TTL for your real GC behavior, and fence every action with the monotonic token so a returning zombie leader is rejected downstream.

Storage quota exceeded. Symptom: in etcd, mvcc: database space exceeded; the cluster reads but refuses all writes. In Kubernetes, the API server can read but not write — no new pods, no scaling, no recovery — until you intervene. Root cause: someone treated it like a database, or compaction was never scheduled, and the backend crossed --quota-backend-bytes. Prevention: alert on DB size well below quota, schedule compact + defrag, and store pointers, not payloads.

Disk latency stalls the quorum. Symptom: write latency climbs cluster-wide for no obvious CPU reason. Root cause: one member’s fsync got slow (a degraded volume, a noisy neighbor), and because commits need a majority, the leader waits on it. Prevention: fast dedicated disks, alert on fsync/commit latency per member, and pull a chronically slow member.

Loss of quorum. Symptom: the cluster stops accepting writes entirely and cannot elect a leader. Root cause: a majority of nodes is down — a 3-node cluster lost 2. Prevention: this is correct behavior (it prevents split-brain), so the prevention is operational: spread nodes across failure domains, run 5 nodes if losing 2 simultaneously is plausible, and have a documented disaster-recovery restore-from-snapshot runbook because once you’re below quorum, no amount of restarting the survivors brings writes back.

The single most expensive mistake is the split-second of two leaders. A client whose session expired during a GC pause may still believe it holds the lock and keep acting, while the cluster has already handed the lock to someone else. Coordination services prevent split-brain in their own state; they cannot stop a process that has stopped listening to them. Always fence with the monotonic token — the zxid, the sequential number, the lease revision — so the downstream system rejects the stale leader’s writes. A lock without a fence is a suggestion.

Scaling it

You do not scale a coordination service the way you scale a stateless tier or a database. The moves are different, and several of them are about scaling down.

flowchart TB
  W[Voting members\n3 or 5] -->|every write| Q[Quorum fsync]
  OBS[Observers /\nlearners] -.->|read only| R[Read fan-out]
  OBS -.->|no vote| Q
  W --> D[(Tiny dataset\nmetadata only)]
  D --> CMP[Compact +\ndefrag]
  PAY[Payloads,\ncounters,\nqueues] -.->|move out| ELSE[Real database\nor Redis/Kafka]

Cap the data; do not grow the cluster. Keep the dataset tiny. Store pointers and metadata — a service registration, a config version, the identity of a lock holder — not the payloads themselves. If a value is more than a few KB, it almost certainly belongs in object storage, a database, or Redis with a reference held here.
Separate read load with observers/learners. ZooKeeper observers and etcd learners replicate state and serve reads but do not vote in the quorum. They let you fan out reads to many clients without slowing every write by enlarging the voting set. This is the correct way to add read capacity — the same separation-of-concerns instinct as adding read replicas in database replication, but without touching the quorum.
Compact and defrag relentlessly (etcd). MVCC keeps every revision until you compact; compaction frees logical space and defrag returns it to the filesystem. Set --auto-compaction-retention and treat it as non-optional.
Protect the disk. Put the data/WAL directory on a fast, isolated device and monitor fsync and commit latency, not just CPU. The disk is the bottleneck the consensus protocol is built on.
The wall you hit is write throughput. When the thing you’re storing scales with user requests, no amount of tuning saves you — you are using a consensus quorum to do a database’s job. The answer is to move that data to a system built for it (a real database for durable rows, Redis for ephemeral counters, Kafka for a durable log) and keep only coordination metadata in the quorum. The skill is recognizing this before the quota alarm fires, not after.

When to reach for it (and when not to)

Reach for a coordination service when you need: leader election with real correctness guarantees, distributed locks that release automatically when a client dies, dynamic configuration that propagates with low latency to many watchers, service discovery and membership, or a small amount of metadata that must be linearizable and survive node failures. Choose etcd when you’re already in the Kubernetes/cloud-native ecosystem — you’re running it anyway. Choose ZooKeeper when you live in the Kafka-classic or Hadoop/HBase world where it’s the native fit.

Don’t reach for it when:

You need a database. Throughput, dataset size, and query flexibility are all wrong. Use PostgreSQL, DynamoDB, or a Kafka log for a durable, high-volume stream.
You only need a lock and you already run Redis. A Redis-based lock with a fencing token is often enough for non-safety-critical coordination and far cheaper to operate — just remember it inherits Redis’s async-failover loss window, so it is not a correctness guarantee on its own.
You’re tempted to store blobs, queues, or per-user state “because it’s reliable.” Reliable and small are a package deal here. Reliability at scale lives in systems designed for volume.

When to consider alternatives

A short map to sibling topics for the jobs a coordination service is the wrong tool for:

Durable rows, transactions, queryable state → PostgreSQL or DynamoDB.
A durable, high-throughput event log or streaming backbone → Kafka.
General task/message delivery with retries and acks → Message Queues or a task framework like Celery.
Ephemeral counters, rate-limit windows, hot-path locks → Redis, backed by a fencing token from a real source of truth.
Massive-scale, write-heavy, tunable-consistency storage → Cassandra.
Config you only read → a versioned object in object storage plus a cache, when you don’t need linearizable writes or watches.
The underlying theory → Consistency & Consensus and the CAP theorem for why these systems make the choices they do.

Operational checklist

Run an odd number of voting members (3 or 5, rarely 7). Never even. Document exactly which failures the cluster tolerates and spread members across failure domains.
Set and alert on the storage quota well below the limit: etcd --quota-backend-bytes and DB size; ZooKeeper dataset size and jute.maxbuffer headroom.
Schedule compaction + defrag (etcd --auto-compaction-retention) so the keyspace never drifts toward the quota.
Put the data/WAL directory on a fast, dedicated disk and alert on fsync/commit latency per member, not just CPU.
Tune session/lease TTL deliberately for your GC pauses and network reality; write down the false-death window you accepted.
Always fence lock and leader holders with the monotonic token (zxid / sequential number / lease revision) so a stale leader’s writes are rejected downstream. Treat an unfenced lock as a bug.
For watch-heavy workloads, watch the predecessor node, not the shared one — kill the herd effect before it kills the cluster.
Alert on quorum health and leader changes; a flapping leader is an early warning of disk or network trouble before it becomes an outage.
Keep a tested restore-from-snapshot runbook; once you drop below quorum, recovery is a restore, not a restart.

Summary

A coordination service is the smallest, most important database you will ever run, and the trap is thinking of it as a database at all. It is a replicated state machine that exists to give every node one consistent answer about who is leader and what the config is, and it deliberately trades throughput, dataset size, and availability-under-partition for that single guarantee. Keep the data tiny and store pointers not payloads; run an odd, small number of nodes on fast isolated disks; tune session timeouts for your real GC behavior; compact relentlessly; and — the lesson that pays for all the others — fence every lock holder with a monotonic token so the system can never be hurt by a process that stopped listening. Do that and ZooKeeper or etcd becomes the quiet, boring foundation everything else stands on. Forget the fence, or treat it like a database, and it becomes the single point that takes the whole platform down at once.

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

No incident deep-dives yet. See the roadmap for what's coming.