Apache Cassandra
Cassandra in production: the masterless ring, consistent hashing, tunable consistency (R+W>N), LSM/SSTables and compaction, tombstone outages, partition design, and repair.
On this page
Cassandra gets sold as “infinitely scalable NoSQL with no single point of failure.” Both halves are true, and both are traps. It scales linearly if you model your data around the queries first, and it has no leader because it gave up the consistency guarantees you have spent your career assuming. People bring a relational mindset to it and produce a slow, inconsistent, tombstone-choked cluster that does none of what the brochure promised.
The one load-bearing fact about Cassandra is this: it is a distributed log-structured store with no master, where you trade ad-hoc querying and default strong consistency for linear write throughput and tunable availability. You design the table for the read, not the read for the table. You decide, per query, how many replicas have to agree. And you accept that a DELETE does not delete — it writes a marker that you then have to garbage-collect on a deadline. Get any of those three backwards and nothing else you do will save the cluster.
This is the long-form context article — the thing I wish someone had handed me before my first Cassandra incident. It covers the ring and consistent hashing, tunable consistency and the R + W > N rule, the LSM/SSTable engine and compaction, why deletes cause outages, partition-key design, and where Cassandra beats and loses to DynamoDB and PostgreSQL. It leans on Consistent Hashing for the ring math, Sharding & Partitioning for the key-design discipline, and Consistency & Consensus plus the CAP Theorem for the guarantees it gives up.
A motivating failure
A payments startup stored its transaction event log in Cassandra: every authorization, capture, and refund as an append, RF=3 across three racks. The throughput numbers were gorgeous — millions of events a day, p99 writes under 5ms. Someone, chasing those numbers, had set the driver default to CL=ONE for both reads and writes. With one replica required, writes never waited on a slow node and reads always hit the nearest one. It looked like free performance for nine months.
Then a routine kernel upgrade triggered a rolling restart — one node down at a time, two minutes each. During the window where node B was down, writes for the partitions it owned landed on the two surviving replicas. Some of those writes were acknowledged at CL=ONE after touching only node A. When B came back and a read at CL=ONE happened to route to node C — which had missed those writes and not yet been repaired — the customer’s balance rolled backward. A refund that had been confirmed minutes earlier simply was not there.
Support tickets called it “money disappearing.” Engineering called it a replication bug. It was neither. CL=ONE write plus CL=ONE read gives 1 + 1 = 2, which is not greater than RF=3, so there was never any guarantee that a read overlapped the write. The cluster did exactly what it was told. The fix was one line — LOCAL_QUORUM on both sides — and a nodetool repair to heal the divergence the months of CL=ONE had quietly accumulated. The outage lived entirely in a misunderstanding of R + W > N.
The one-sentence mental model
Cassandra is a ring of equal nodes where the partition key decides which nodes own a row, any node can coordinate any request, and you choose per-query how many replicas must agree.
No clause is free. Equal nodes means there is no leader to hand you a global order or a single source of truth. Partition key decides ownership means a badly chosen key concentrates traffic on a few nodes while the rest idle. You choose how many agree means consistency is a dial you set on every statement — and therefore a dial you can set wrong, as the story above shows.
flowchart TB
C[Client] -->|write key=user42| Co[Coordinator\nany node]
Co -->|hash key| T{Token ring}
T --> N1[(Node 1\nreplica)]
T --> N2[(Node 2\nreplica)]
T --> N3[(Node 3\nreplica)]
Co -.RF=3 CL=QUORUM.-> N1
Co -.wait 2 acks.-> N2
Any node can act as coordinator for a request. It hashes the partition key to a token, looks up which nodes own that token range, forwards the operation to all RF replicas, and waits for as many acknowledgements as the consistency level demands. There is no special node and no metadata round trip — ownership is pure arithmetic on the key. That is the source of both the linear scaling and every sharp edge.
How it actually works
The ring and consistent hashing
Cassandra maps the 64-bit token space into a logical ring. Each node owns a set of token ranges, assigned through virtual nodes (vnodes, default num_tokens: 16 in modern releases) so ranges spread evenly and a failed node’s load redistributes across many peers rather than dumping onto one unlucky neighbor. The partition key is hashed with Murmur3 into a token; the token’s position on the ring selects the first replica, and the next RF - 1 nodes clockwise hold the remaining copies. This is textbook consistent hashing — adding a node only steals ranges from its immediate ring neighbors instead of reshuffling the entire dataset.
Because ownership is computable from the key alone, any node can route any request without consulting a coordinator service or a metadata table. That property is exactly why there is no master and why you scale the cluster by adding commodity nodes.
Tunable consistency and the R + W > N rule
Each keyspace has a replication factor (RF) — how many nodes hold a copy of every partition. On every individual read and write you choose a consistency level (CL) — how many of those replicas must respond before the coordinator returns:
ONE— a single replica. Fastest, weakest, and the trap from the opening story.QUORUM—floor(RF/2) + 1replicas across the whole cluster.LOCAL_QUORUM— a quorum within the local datacenter; the standard production choice for multi-DC because it avoids a cross-region round trip.ALL— every replica. Strongest, and the most brittle: one slow or down node stalls the entire query.
The guarantee that buys read-your-writes consistency is R + W > N, where N = RF, W is the write CL, and R is the read CL. With RF=3, writing at QUORUM (W=2) and reading at QUORUM (R=2) gives 2 + 2 = 4 > 3 — any read quorum and any write quorum must share at least one node, so a read is guaranteed to see the most recent acknowledged write. Drop either side to ONE and the overlap vanishes; stale reads become legal, which is precisely how money “disappeared.”
flowchart LR
subgraph Write QUORUM
A[(R1 acked)]
B[(R2 acked)]
end
subgraph Read QUORUM
B2[(R2 read)]
C[(R3 read)]
end
B --- B2
note[R2 overlaps\nlatest write seen]
B --- note
| Config (RF=3) | W | R | R+W>N? | Behavior |
|---|---|---|---|---|
| QUORUM / QUORUM | 2 | 2 | yes (4>3) | strong, balanced |
| ONE / ONE | 1 | 1 | no (2<3) | fast, may read stale |
| ALL / ONE | 3 | 1 | yes (4>3) | read-fast, write-fragile |
| LOCAL_QUORUM / LOCAL_QUORUM | 2 | 2 | yes per DC | strong in DC, async cross-DC |
This is the lever DynamoDB hides behind “eventually consistent” versus “strongly consistent.” Cassandra exposes it raw, on every statement, which is more power and more rope.
LSM tree, SSTables, and compaction
Cassandra’s write speed comes from a log-structured merge tree. A write goes to two places and returns immediately — it never reads the existing row, and it never seeks:
- An append to the commit log on disk (crash durability).
- An insert into the memtable in RAM (a sorted in-memory structure).
sequenceDiagram participant W as Write participant CL as Commit log participant MT as Memtable participant SS as SSTables W->>CL: append (durable) W->>MT: insert (sorted) MT-->>W: ack (no seek) Note over MT: memtable fills MT->>SS: flush new SSTable Note over SS: many SSTables pile up SS->>SS: compaction merges\ndrops obsolete
When the memtable fills, it flushes to an immutable SSTable on disk. Because SSTables never change after they are written, a single logical row’s history can be scattered across several of them — the original insert in one, a later update in another, a delete marker in a third.
Compaction is the background job that merges SSTables together, keeps the newest version of each cell, and discards superseded data. The strategy you pick is a real tuning decision:
SizeTieredCompactionStrategy(STCS) — merges similarly-sized tables; write-friendly but causes read amplification because a row can span many tiers.LeveledCompactionStrategy(LCS) — bounds how many SSTables a read touches (usually one per level); great for read-heavy and update-heavy tables, at the cost of more compaction I/O.TimeWindowCompactionStrategy(TWCS) — buckets data into time windows; the right answer for time-series with TTLs because whole windows expire and drop at once.
Reads must merge a partition’s data across every SSTable that holds part of it, so the more uncompacted tables a partition spans, the slower the read. Bloom filters and the per-SSTable partition index keep this bounded by skipping tables that cannot contain the key — but only if compaction is keeping up. When it falls behind, SSTables-per-read climbs and read latency follows.
The read path
flowchart TD R[Read key] --> RC[Row cache?] RC -->|hit| Done[return] RC -->|miss| BF[Bloom filter\nper SSTable] BF -->|maybe| PI[Partition index\nseek] BF -->|no| Skip[skip SSTable] PI --> Merge[merge cells\nnewest wins] Merge --> Tomb[apply tombstones] Tomb --> Done
A read consults each SSTable’s bloom filter to rule out tables that definitely lack the key, seeks the rest via the partition index, then merges the surviving cells by timestamp (last write wins per cell). Crucially, that merge step also has to walk tombstones — delete markers — which is where the most common Cassandra outage is born.
The tradeoffs that bite
The query-first data model is the tradeoff that ambushes everyone from SQL. There are no joins, no ad-hoc WHERE on arbitrary columns, and aggregations across partitions are dangerous. You design one table per query pattern and denormalize aggressively — the same fact lives in three tables shaped for three different reads. When a new access pattern appears, you build a new table and backfill it. This feels wrong coming from a relational background, and it is correct here.
ALLOW FILTERING is the trap door that tempts people out of that discipline. It lets you query on a non-key column by scanning every partition. It works beautifully in dev against 1,000 rows and melts the cluster in prod against a billion. Treat its appearance in any query as a data-modeling bug, not a feature.
LWT — lightweight transactions, the IF NOT EXISTS and IF column = ... clauses — give you compare-and-set through a Paxos round. They are correct but cost four round trips and run an order of magnitude slower than a normal write. Reserve them for the genuine cases (claiming a unique username, a one-time idempotency key), never as a default mutation path.
| Decision | Looks like | Actually costs |
|---|---|---|
CL=ONE everywhere | free throughput | stale reads, lost-looking writes |
ALLOW FILTERING | ”just this one query” | full-cluster scan, latency cliff |
DELETE rows | reclaiming space | tombstones, read timeouts |
LWT for safety | a transaction | 4x round trips, Paxos contention |
| One big partition | simple key | wide-partition heap pressure |
Write performance
Writes are Cassandra’s home turf. Because every write is an append to the commit log plus an in-memory memtable insert — no read-modify-write, no disk seek, no lock on the row — a single node ingests writes far faster than a B-tree database doing in-place updates. This is why Cassandra is the default for firehose workloads: telemetry, clickstream, message history, audit logs.
The levers that matter:
- Compaction throughput. If compaction can’t keep pace with flush, SSTables pile up and reads degrade. Watch
nodetool compactionstats; raisecompaction_throughput_mb_per_secif the queue grows under steady load. - Commit log placement. On spinning disks, a dedicated commit-log device removed seek contention; on NVMe it matters less, but commit-log sync mode (
periodicvsbatch) still trades durability window against latency. - Memtable sizing. Larger memtables flush less often and produce fewer, bigger SSTables, easing compaction — bounded by heap.
The one thing that ruins write performance is LWT sprinkled into the hot path, because each one serializes a Paxos round through the replicas for that partition.
Read performance
Reads are where Cassandra asks for respect. A read may have to merge a partition across multiple SSTables and walk tombstones, so latency depends heavily on data model and compaction health. The levers:
- Compaction strategy matched to the access pattern (LCS for read-heavy, TWCS for time-series) keeps
SSTables-per-readlow. - Bloom filter false-positive rate (
bloom_filter_fp_chance) trades memory for fewer wasted seeks; tighten it on read-heavy tables. - Key and row caches help repeated hot reads but are easy to oversize and starve the heap.
- Partition size discipline — a read of a 2 GB partition is slow no matter what, because the coordinator and replica have to materialize and merge a huge amount of data.
A healthy read-heavy table on LCS touches one or two SSTables and returns in single-digit milliseconds. A neglected one on STCS with deep tombstone debt can take seconds or time out entirely.
Failure modes
The signature Cassandra outage is tombstone-driven read timeouts, and it catches every team that treats deletes as free.
A DELETE cannot physically remove data — SSTables are immutable, and the value being deleted may live in an un-compacted table on another replica that hasn’t heard about the delete yet. So a delete writes a tombstone: a marker that says “this cell is dead as of timestamp T.” The tombstone must survive until every replica has compacted past it — gc_grace_seconds, default 864000 (10 days) — or a deleted value could resurrect from a replica that never saw the delete. Until then, every read of that partition must scan through the tombstones to figure out what is still live.
flowchart TD
D[DELETE rows] --> TS[write tombstones]
TS --> Acc[partition fills\nwith markers]
Acc --> Rd[read scans\nthrough tombstones]
Rd --> Thr{over\nthreshold?}
Thr -->|warn 1000| Slow[slow read]
Thr -->|fail 100000| Abort[query aborted]
A queue-shaped table — insert a row, process it, delete it — accumulates thousands of tombstones in a single partition. A read then has to skip
100,000tombstones to find10live rows, blows pasttombstone_failure_threshold(default100000), and the query is aborted with aTombstoneOverwhelmingException. Your “fast NoSQL” now times out reading a partition that looks empty.
The fix is modeling, not tuning: don’t build queues or mutable sets on Cassandra (use a real broker — see Message Queues or Kafka); use TTLs with TimeWindowCompactionStrategy so whole SSTables expire and drop without per-row markers; and never delete-then-reinsert inside a hot partition.
The other recurring production failures:
- Hot partition. A partition key with skewed access — a celebrity
user_id, acountry = 'US'bucket — sends all traffic to the sameRFnodes while the rest of the ring idles. This is the same disease as a DynamoDB hot partition, and adding nodes does not cure it. - Wide partition. An unbounded key (
partition by sensor_id, append forever) grows to gigabytes. Reads and compaction on it stall, and one node’s JVM heap suffers GC pauses that ripple into tail latency. Keep partitions under roughly100 MB/100krows. - Coordinator overload. A query at
CL=ALL, or a large multi-partitionIN, makes one coordinator fan out and wait on many nodes; the slowest replica sets your latency, and the coordinator’s heap absorbs the merge. - Repair neglect. Skip
nodetool repairinsidegc_grace_secondsand either deleted data resurrects or replicas silently diverge — exactly the latent corruption the opening story’s months ofCL=ONEhad been building.
Scaling it
Scaling Cassandra is genuinely close to linear when the data model cooperates. Double the nodes and you roughly double throughput, because ownership is just more token ranges on the ring. Adding a node streams only its share of ranges from neighboring nodes — no resharding event, no downtime, no global lock. This is the property that earns the “scales horizontally” reputation, and it is real.
The wall you hit is never “the cluster” — it is a single partition. A hot or wide partition cannot be relieved by adding nodes, because every copy of that one partition lives on the same RF replicas. The cure is always in the key. You add a bucketing component to spread the load:
-- wide and dangerous: one partition per sensor, forever
PRIMARY KEY (sensor_id, ts)
-- bounded: one partition per sensor per day
PRIMARY KEY ((sensor_id, day), ts)
-- celebrity-safe: salt the partition across N buckets
PRIMARY KEY ((user_id, bucket), ts) -- bucket in 0..N, scatter-gather on read
This is the same partitioning discipline every distributed store demands, and it has to be designed in from the start — re-bucketing a live table means a full backfill.
Multi-datacenter is a first-class feature, not a bolt-on. You set RF per datacenter (for example {'dc-east': 3, 'dc-west': 3}) and use LOCAL_QUORUM so reads and writes satisfy quorum within one region while replication crosses regions asynchronously. That gives you active-active geo-distribution without paying a cross-region coordination round trip on the hot path — at the cost of cross-DC eventual consistency, which you manage through the same R + W > N reasoning, scoped per DC.
When to reach for it (and when not to)
Reach for Cassandra when you have a write-heavy, high-volume workload with known query patterns: time-series, event logging, IoT telemetry, message and notification history, activity feeds. It shines when you need linear write scaling, multi-region active-active, no single point of failure, and you can live with eventual consistency tuned per query through CL.
It beats DynamoDB when you want to avoid vendor lock-in, run across your own hardware or multiple clouds, need fine-grained per-query consistency control, or your traffic is steady enough that DynamoDB’s pricing model stings. It loses to DynamoDB when you want zero operational burden — DynamoDB is fully managed, while running Cassandra means owning compaction tuning, repair scheduling, JVM/GC tuning, and capacity planning yourself.
It loses to PostgreSQL the moment you need joins, ad-hoc queries, multi-row transactions, or strong consistency by default — and for anything under a few terabytes that fits one well-tuned relational box with read replicas. The honest truth: most teams reaching for Cassandra at small scale would be better served by Postgres. Don’t use Cassandra for workloads needing multi-row ACID transactions, for frequently-deleted or heavily-mutated data (tombstones), or when the query patterns are still unknown — because in Cassandra the query pattern is the schema.
When to consider alternatives
- Fully-managed wide-column with the same model, no ops → DynamoDB.
- Relational data, joins, transactions, strong consistency by default → PostgreSQL.
- Durable, ordered event log / streaming backbone → Kafka.
- A real work queue with acks and retries → Message Queues or Celery.
- Search, relevance, and ad-hoc filtering over text → Elasticsearch.
- Sub-millisecond ephemeral state, counters, rate limits → Redis.
- The coordination/consensus that Cassandra deliberately avoids → ZooKeeper or a Raft-backed store.
Operational checklist
- Default the driver to
LOCAL_QUORUMreads and writes withRF=3per DC soR + W > Nholds; never shipCL=ONEon data you can’t afford to read stale. - Model one table per query; forbid
ALLOW FILTERINGin code review — its presence is a data-model bug, not a query. - Bound every partition with a bucketed key; keep partitions under roughly
100 MB/100krows and alert on partition size. - Never build queues or hot mutable sets; prefer TTLs plus
TimeWindowCompactionStrategyover explicitDELETEs. - Run
nodetool repairon every node insidegc_grace_seconds(default 10 days) to prevent data resurrection and replica drift. - Pick compaction per workload: STCS (write-heavy), LCS (read/update-heavy), TWCS (time-series with TTL).
- Alert on per-read tombstone counts, pending compactions, and
SSTables-per-read— not just node CPU and disk. - Reserve
LWTfor genuine compare-and-set needs; it is a Paxos round trip, not a default. - Watch heap and GC pauses; wide partitions and oversized caches are the usual cause of long pauses that spike tail latency.
Summary
Cassandra is a masterless, write-optimized, distributed store that scales linearly when — and only when — you obey three rules it does not enforce for you. Model the table around the query, because there are no joins and ALLOW FILTERING is a scan in disguise. Set consistency so R + W > N actually holds, because CL=ONE everywhere is silent data loss waiting for a rolling restart. And treat deletes as expensive, because tombstones turn empty-looking partitions into read timeouts. Add repair on a schedule and bounded partitions, and Cassandra delivers the firehose write throughput and zero-downtime growth it promises. Skip them and you get a slow, inconsistent, tombstone-choked cluster — and a 3am page that was never a bug, only a misunderstanding.
Appendix: wide-column vs relational, in one breath
A relational row is a fixed set of typed columns; you query by any column and the engine plans the access path. A wide-column row is a partition key plus an ordered set of clustering columns, and you can only query along that key’s natural order. The “wide” part means different rows in a partition can hold different columns, and a partition can hold millions of cells — a structure shaped for “give me this key’s data in this order, fast,” not “find me all rows where X.” RF is how many copies exist; CL is how many you wait for; the gap between them is where availability and staleness trade off. Everything else in this article is a consequence of those few facts.
Further reading
Incidents & deep-dives
Where this system breaks in production — and how it comes back.
Documenting next
- 🔒 Tombstone Hell: Read Timeouts From Deletesroadmap →