Practical rebuilds of these systems — real failovers & chaos drills — are in production onYouTube, soon.

Message Queues & Event Streaming

Queue vs log, the exactly-once myth, ordering, backpressure, consumer lag, dead-letter queues, and idempotency — how async messaging really behaves in production, and when to use which.

23 min readupdated 2026-06-28
On this page

The first async messaging diagram everyone draws is a box labeled “queue” with an arrow going in and an arrow coming out. That picture is a lie of omission. It hides every decision that actually decides whether your system works: whether a message survives a broker crash, whether it arrives once or five times, whether two events for the same order stay in order, and what happens the day your consumers can’t keep up with your producers. A queue doesn’t remove the hard parts of distributed work. It moves them out of synchronous error handling, where the compiler and the call stack help you, into delivery semantics and consumer design, where nothing helps you and the bugs are quiet for months.

I have watched more than one team add a queue to “decouple” two services, ship it, and declare victory — only to discover at the worst possible moment that “decoupled” also meant “the producer has no idea the consumer has been silently failing for six hours.” The message bus didn’t make the system more reliable. It made the unreliability invisible, which is strictly worse, because now you find out from a customer instead of from a stack trace.

This is the long-form context article for the whole category. It covers the one split that governs everything — a queue versus a log — the truth about “exactly-once” (there isn’t one, end to end), why ordering is more fragile than people think, and the failure modes that every event-driven system eventually meets: poison messages, unbounded consumer lag, duplicate side effects, and rebalance storms. It assumes you’ve operated something async in production and been burned by it at least once.

It connects to Apache Kafka for the log model in depth, Celery for task queues on top of a broker, Redis as a lightweight broker and the canonical “ack is not durability” story, API Design & Idempotency for the consumer-side correctness work, and Observability for the lag metrics that catch this class of outage before your customers do.

A motivating failure

A food-delivery company runs its order pipeline through Kafka. A web service publishes an order.placed event; a fleet of consumers in one consumer group reads the topic, calls the restaurant’s POS, assigns a courier, and sends the customer a confirmation. Topic retention is set to 24h — generous, they figured, since orders are processed in seconds.

On a busy Friday they deploy a new consumer build. The new code does a synchronous call to a slow third-party address-validation API on every message — a change nobody flagged because it looked harmless in review. Per-message processing time jumps from 40ms to 1.2 seconds. That alone wouldn’t be fatal. But the slow processing means consumers stop sending heartbeats inside the session.timeout.ms window, so the broker decides they’re dead and triggers a rebalance. The rebalance pauses the whole group, reassigns partitions, and the freshly-assigned consumers start slow again, miss the next heartbeat, and get kicked again. The group enters a rebalance loop: more time spent reassigning partitions than processing orders.

Meanwhile orders keep arriving. Consumer lag — the gap between the latest offset and the committed offset — climbs from a few hundred to four million. Nobody is watching lag; the dashboard everyone is watching shows the producer happily accepting orders and the broker at 30% CPU. Green across the board.

Twenty-six hours later, the oldest unprocessed orders cross the 24h retention boundary and Kafka deletes them. Not failed. Not dead-lettered. Deleted, unread. Thousands of paid orders that no courier was ever assigned to, gone with no error anywhere in the system. The first signal was a wave of “where is my food” support tickets.

Nothing here was a bug in Kafka. The producer’s “OK” never meant the order would be fulfilled — only that it had been durably appended. Everything that mattered happened downstream, in the consumer, where the team had no idea what to measure.

The one-sentence mental model

A message queue decouples a producer from a consumer in time, so the producer keeps working when the consumer is slow, down, or scaling — and the price of that decoupling is that every delivery becomes an at-least-once-or-at-most-once gamble, every backlog becomes invisible unless you measure it, and the producer’s success no longer means the work got done.

Unpack each clause, because each is an operational constraint you will eventually meet:

  • Decouples in time → the producer and consumer no longer fail together. That’s the whole benefit. It’s also why a dead consumer doesn’t surface as a failed request — it surfaces as a growing backlog nobody is looking at.
  • At-least-once-or-at-most-once → there is no free “exactly-once” across a network boundary. You pick which way the failure leans, and you design the consumer to absorb it.
  • Backlog becomes invisible → synchronous overload announces itself with timeouts and 503s. Async overload is silent; the producer feels fine while messages pile up toward a cliff (retention loss or broker rejection).
  • Producer success ≠ work done → a publish ack means “durably accepted,” not “processed.” The gap between those two facts is where every interesting incident lives.
flowchart LR
  P[Producer] -->|publish| B[(Broker)]
  B -->|deliver| C1[Consumer 1]
  B -->|deliver| C2[Consumer 2]
  C1 -->|ack| B
  C2 -->|nack| B
  B -. redeliver\nunacked .-> C1
  B -->|retries\nexhausted| DLQ[(Dead-letter\nqueue)]

The producer’s job ends at “durably accepted.” Everything after that arrow — delivery, retry, ordering, dead-lettering, dedup — is the broker’s and consumer’s problem. That’s the part the simple box-and-arrow diagram leaves out, and it’s where production systems live or die.

How it actually works

Queue vs log: the split that decides everything

There are two fundamentally different systems both casually called “message queues,” and conflating them is the root of most architecture mistakes I’ve cleaned up.

A queue — RabbitMQ, AWS SQS, Celery’s broker — treats a message as a unit of work to be consumed and destroyed. The broker tracks per-message state: delivered, in-flight, acknowledged, redelivered. Once a consumer acks a message, the broker forgets it. Multiple consumers compete for messages off the same queue (the competing-consumers pattern), which spreads load automatically but means no single consumer ever sees the whole stream. The queue is a to-do list that workers tear pages out of.

A log — Apache Kafka, AWS Kinesis, Redis Streams — treats messages as an immutable, ordered, append-only sequence that consumers read by advancing an offset. The broker does not delete on consume; it retains messages for a time or size window (retention.ms, retention.bytes) regardless of who has read them. Consumers are independent: each tracks its own offset, multiple consumer groups can read the same partition without affecting each other, and any consumer can rewind its offset to replay history. The log is a ledger that readers bookmark.

flowchart TB
  subgraph Queue
    QB[(Broker)] -->|removed\non ack| QC1[Worker A]
    QB -->|competing| QC2[Worker B]
  end
  subgraph Log
    LB[(Partition\nappend-only)] -->|offset 105| GA[Group A]
    LB -->|offset 12| GB[Group B]
  end
Queue (RabbitMQ / SQS)Log (Kafka / Kinesis)
Message after consumeDeleted on ackRetained for the window
Consumer modelCompeting consumersIndependent offset per group
Replay historyNo — it’s goneYes — rewind the offset
OrderingPer-queue, fragile under redeliveryStrict per-partition
Fan-out to N readersHard (mirror/duplicate)Native (N consumer groups)
Natural fitTask distribution, work queuesEvent streaming, audit, replay
Throughput ceilingHigh per queueVery high, scales with partitions

The decision rule is blunt. If you need to fan the same events out to several independent systems, or replay them after fixing a bug, or keep an audit trail, you want a log. If you need to hand discrete tasks to a pool of workers and never look at them again, you want a queue. Choosing a queue when you needed a log means rebuilding replay and fan-out by hand, badly. Choosing a log when you needed a simple work queue means operating Kafka — partitions, consumer groups, offset management — for a job that SQS would have done with zero servers. Both mistakes are common and both are expensive.

The write path and what “OK” actually means

sequenceDiagram
  participant Prod as Producer
  participant Lead as Partition leader
  participant Repl as Replicas
  Prod->>Lead: send(record)
  Lead->>Lead: append to log
  Lead->>Repl: replicate
  Repl-->>Lead: in-sync ack
  Lead-->>Prod: ack (acks=all)
  Note over Prod,Repl: durable once min ISR\nconfirms the write

When a producer sends a record, the broker appends it to the partition leader’s log and, depending on the acknowledgment setting, may wait for replicas to confirm before acking the producer. That acks setting is the durability dial, and it’s the same acknowledgment-versus-durability gap that bites Redis:

  • acks=0 — fire and forget. The producer never waits. Fastest, and you lose in-flight records on any leader hiccup. This is at-most-once at the producer edge.
  • acks=1 — the leader acks after its own append, before replicas confirm. Fast, but if the leader dies before replication, the record is gone even though the producer got an “OK.”
  • acks=all (with min.insync.replicas=2) — the leader waits until the in-sync replica set has the record. Durable across a single broker loss, slower by the replication round-trip.

The point is identical to the database lesson: the ack and the durability are separate events. A producer that got an “OK” at acks=1 and a producer that got an “OK” at acks=all have very different guarantees, and the difference only shows up when a broker fails at exactly the wrong moment.

Delivery semantics: the “exactly-once” myth

Every broker advertises a delivery guarantee. There are three, and only two are honest end-to-end:

  • At-most-once. Deliver and don’t track acks. Fast and lossy. A crash between delivery and processing drops the message with no trace. Fine for metrics, sampled telemetry, and anything you’d happily lose; never for money or orders.
  • At-least-once. Deliver, wait for an ack, redeliver if the ack doesn’t arrive in time. This is the real-world default, and what you should assume you have. Nothing is lost, but a consumer that does the work and then crashes before acking (or committing its offset) will see the same message again on restart. Duplicates are not a possibility — they are a guarantee, given enough time.
  • Exactly-once. What everyone asks for and almost nothing delivers across system boundaries. The network can always drop an ack, which forces a redelivery; the broker cannot tell “the consumer never got it” from “the consumer got it and the ack was lost.” Kafka offers exactly-once within its own boundary — an idempotent producer plus transactions that atomically commit topic writes and consumer offsets together. The instant your consumer writes to an external database or calls a third-party API, that transactional boundary ends and you’re back to at-least-once for the part that matters.

The honest engineering posture: assume at-least-once delivery, and make the consumer idempotent. Don’t chase exactly-once delivery; engineer exactly-once effect by deduplicating on the consumer side. A natural business key (order_id, payment_intent_id) plus an INSERT ... ON CONFLICT DO NOTHING or a dedup table turns “this might arrive twice” into “the second arrival is a no-op.” That’s the whole game, and it’s covered in depth in API Design & Idempotency.

Ordering: weaker than you think

Ordering is guaranteed only within a single partition (log) or a single queue drained by a single consumer (queue). The moment you add parallelism — multiple partitions, multiple competing consumers, or any redelivery — global ordering is gone, and there is no setting that brings it back without also killing your throughput.

The standard, correct move is to partition by an ordering key. Route every event for one entity (user_id, order_id, account_id) to the same partition via a hash of the key, so all events for that entity stay ordered relative to each other, while different entities process fully in parallel. You give up global ordering — which you almost never actually need — and keep per-key ordering, which is usually what the business requires (“apply this user’s balance changes in order,” not “apply every user’s changes in one global order”). The hashing here is the same idea as consistent hashing: a key deterministically maps to a partition.

The trap is choosing the key carelessly. Key everything to a single value and you’ve funneled all traffic onto one partition — one consumer, no parallelism, and a hot partition that no amount of scaling fixes.

The tradeoffs that bite

These are the decisions that look free on the whiteboard and bill you in production.

TradeoffThe free-looking choiceWhat it actually costs
Throughput vs ordering”Just add partitions”Global order is gone; only per-key order survives
Durability vs latencyAck on receipt (acks=1)In-flight messages lost on a broker/leader crash
At-least-once vs dup workTrusting deliveryGuaranteed duplicates → double-charges without dedup
Decoupling vs visibility”It’s async, it’s resilient”A dead consumer is invisible until the backlog cliff
Exactly-once vs realityEnabling the EOS flagHolds only inside the broker; external effects still dup
Replay power vs costChoosing a log everywhereOperating Kafka for jobs SQS would do serverless

Two of these deserve emphasis. Decoupling vs visibility is the silent killer from the opening story: synchronous overload screams (timeouts, 503s, angry dashboards), but async overload whispers. The producer side stays green while the backlog grows toward a cliff. The only defense is to measure the consumer’s health — lag and its trend — not the producer’s.

Exactly-once vs reality is the one architects most often get sold on. Kafka’s exactly-once semantics are real and useful for stream-processing topologies that read from Kafka and write back to Kafka. They do nothing for the consumer that charges a credit card. Treating the EOS flag as permission to skip idempotency is how you get a “this can’t happen” double-charge incident.

Throughput and lag: the numbers that matter

Async systems don’t have a single “performance” number; they have a balance. Producers produce at some rate, consumers consume at some rate, and the only sustainable steady state is consume-rate ≥ produce-rate. Everything interesting is about what happens when that inequality flips, and how fast you notice.

What’s fast. Appending to a log is close to sequential disk write speed, which is why a single Kafka broker handles hundreds of thousands of messages per second per partition for small records, and why batching (linger.ms, batch.size) and compression (compression.type=lz4) multiply that. SQS gives you effectively unbounded throughput with zero servers to run, at the cost of higher per-message latency (tens of milliseconds) and a 256 KB message-size limit. RabbitMQ pushes tens of thousands of messages per second per queue comfortably, more with multiple queues.

What’s slow, and what to watch. The bottleneck is almost never the broker — it’s the consumer. The single most important metric in any async system is consumer lag: how far behind the head the consumer’s committed offset is, measured in messages and in time. And the most important thing about lag is its trend, not its absolute value:

  • Lag flat and small → healthy, consumers keeping pace.
  • Lag flat but large → consumers keeping pace but permanently behind; you’re one traffic bump from falling further behind, and your end-to-end latency equals that backlog.
  • Lag growing → you are heading for the cliff. In a log, growing lag means messages will eventually age past retention and be deleted unread (the opening outage). In a queue, growing depth means you’ll eventually hit the broker’s limit and start rejecting publishes, pushing the failure back onto producers.
flowchart TD
  H[Head offset\nclimbing] --> G{Consume rate\n>= produce rate?}
  G -->|yes| OK[Lag stable\nhealthy]
  G -->|no| L[Lag grows]
  L --> R{Log or\nqueue?}
  R -->|log| AGE[Messages age out\nsilent data loss]
  R -->|queue| REJ[Depth limit hit\npublishes rejected]
  style AGE fill:#e11d48,color:#fff
  style REJ fill:#e11d48,color:#fff

Practical levers when consumers can’t keep up, in rough order of reach: add consumers (up to the partition count in a log — beyond that they sit idle); make per-message work cheaper (batch DB writes, drop synchronous third-party calls off the hot path — exactly the change that caused the opening incident); increase partition count ahead of need (raising it later rehomes keys and breaks in-flight per-key ordering); and as a last resort, shed or sample load deliberately rather than letting the backlog choose what to drop for you.

Alert on lag time (records-lag converted to “how many minutes behind”), set the threshold well under your retention window, and page on the derivative — sustained positive lag growth — because that predicts the outage while you can still add capacity. This is squarely an observability problem before it’s a capacity problem.

Failure modes

The recurring async outages, each as symptom → root cause → prevention.

Poison messages. Symptom: a consumer crashes or rejects the same message over and over; in a strict-ordered partition, everything behind it stalls; in a queue, it burns redeliveries forever. Root cause: one malformed or unprocessable message under at-least-once delivery, which keeps getting redelivered. Prevention: a dead-letter queue (DLQ). After N failed attempts (maxReceiveCount in SQS, a DLX in RabbitMQ, a sidelined topic in Kafka), the broker moves the message aside so the rest of the stream flows. Then alert on DLQ depth > 0 and triage out of band. A DLQ you don’t alert on is just a slower way to lose data.

Unbounded consumer lag. Symptom: producers look healthy, broker CPU is low, then messages silently vanish (log) or publishes start failing (queue). Root cause: consumers fell behind and nobody measured it — the invisible-backlog problem. Prevention: alert on lag trend, not depth; size retention with comfortable margin over your worst realistic recovery time; load-test the consumer path, not just the producer. (This is the same shape as the Celery broker-backpressure failure.)

Duplicate processing / double side effects. Symptom: two charges, two emails, two shipping labels for one event. Root cause: at-least-once redelivery plus a non-idempotent consumer — it did the work, crashed before committing the offset, restarted, and did the work again. Prevention: consumer-side idempotency keyed on a business identifier; commit offsets only after the side effect is durably recorded; treat the dedup check as part of the work, not an afterthought.

Rebalance churn. Symptom: a consumer group spends more time reassigning partitions than processing; throughput craters; lag climbs (the opening story). Root cause: processing slower than max.poll.interval.ms or heartbeats missing session.timeout.ms, so the broker repeatedly evicts and re-adds consumers. Prevention: keep per-message work fast or reduce max.poll.records; tune session/heartbeat timeouts to match real processing time; move slow synchronous calls off the consumer hot path. Use cooperative/incremental rebalancing where the client supports it so a rebalance doesn’t stop the whole group at once.

Head-of-line blocking. Symptom: later messages that are perfectly fine wait behind one slow or stuck message on the same partition. Root cause: strict per-partition ordering means the partition is a single-file line. Prevention: if strict order isn’t required for those messages, spread them across more partitions or process out of order with per-key sequencing; if order is required, accept that one slow message holds its lane and keep that lane’s work fast.

At-least-once delivery is not an edge case you might hit in production. It is the contract you signed the moment you put a broker in the path. If your consumer is not idempotent, you do not have a latent reliability risk that careful testing will surface — you have a guaranteed incident scheduled for the first lost ack. Design every consumer as if each message will arrive at least twice, and make the second arrival a no-op. Everything else about async messaging is tuning; this one is correctness.

Scaling it

What changes as volume grows by 10× and 100×.

Partitions are your concurrency unit (in a log). A consumer group can have at most one active consumer per partition — extra consumers sit idle. So partition count is your parallelism ceiling, and you must size it ahead of demand. Increasing partitions later is not free: it changes the hash-to-partition mapping for keys, so events for a given user_id that used to land on partition 3 may now land on partition 7, breaking per-key ordering for keys that are in flight across the change. Pick a partition count with years of headroom (over-partitioning is cheap; re-partitioning a live system is not).

Scale consumers horizontally, but know the wall. Add consumers up to the partition count (log) or freely (queue, competing consumers). The wall you hit is rarely “not enough consumers” — it’s a hot partition (all traffic keyed to one value) or a consumer doing slow synchronous work per message. Neither is solved by adding consumers. Fix the key distribution or make the per-message work cheaper and async. This is the same hot-key wall that limits Redis Cluster: sharding distributes keys, not load within a key.

Apply backpressure deliberately. An unbounded queue doesn’t prevent failure; it defers it to a worse moment with a bigger backlog. Bound the queue and decide explicitly what happens when it’s full: block the producer (propagate backpressure upstream), shed load (drop or sample), or route overflow to a DLQ. Choosing nothing means the system chooses for you — usually by aging out your oldest, most-aggrieved customers’ messages.

Replication and durability scale with the broker. Kafka replicates partitions across brokers with a configurable replication factor and in-sync-replica minimum; this is the same durability-versus-latency tradeoff as database replication. At 100× you care about broker placement across availability zones, min.insync.replicas to survive a zone loss, and the rebalance time when a broker dies. SQS and other managed brokers handle this for you, which is a large part of why “just use SQS” is the right answer more often than Kafka enthusiasts admit.

When to reach for it (and when not to)

Reach for a queue (RabbitMQ, SQS, Celery’s broker) when you’re distributing discrete tasks to a worker pool, smoothing bursty load so a spike doesn’t topple a slow downstream, or decoupling a slow operation (sending email, resizing images, generating PDFs) from a user request so the request returns fast. The mental test: would I be happy if a pool of workers each grabbed one task and I never thought about the others?

Reach for a log (Kafka, Kinesis, Redis Streams) when multiple independent systems need the same events, when you need replay or audit, or when you’re building an event-driven backbone that many services subscribe to. The test: do I need to read this stream more than once, by more than one reader, possibly from the past?

Don’t introduce async messaging at all when a synchronous call is simpler and the caller genuinely needs the result now. A queue in front of a request/response interaction adds a broker to operate, eventual consistency to reason about, and duplicate-handling to build — for no benefit if the caller just blocks waiting for the reply anyway. Plenty of systems are made worse by a queue that exists because “events are modern.” If you need the answer to continue, make the call, handle the error inline, and move on.

Don’t use a log when you needed a queue — you’ll run and tune Kafka (partitions, consumer groups, offset management, rebalancing) for a job a serverless queue does with no servers. Don’t use a queue when you needed a log — you’ll reimplement replay and fan-out on top of consume-and-delete semantics, and it will be worse than the thing you avoided.

When to consider alternatives

  • Lightweight broker, modest scale, already in your stackRedis lists or Streams. Great until you need real durability or replay at volume.
  • Durable, high-throughput event log with replay and fan-outApache Kafka (or Kinesis if you want it managed).
  • Task distribution to workers with retries and schedulingCelery on top of a queue broker.
  • Zero-ops queue at any scale → AWS SQS (with SNS for fan-out). The boring, correct default for most teams.
  • The thing you’re really queueing is durable state → it belongs in PostgreSQL or DynamoDB; the queue carries the event, not the source of truth.
  • Strong distributed coordination, leader election, locks that must be correctZooKeeper, not a queue’s redelivery semantics.

The pattern: a message bus is the transport and buffering layer. The moment your requirement becomes “and this must be the durable record” or “this must be processed exactly once with external side effects,” the correctness work moves to the consumer (idempotency) and the source of truth (a real database), with the queue carrying events between them.

Operational checklist

  • Choose queue vs log explicitly from replay and fan-out needs, and write the reason in the design doc — this is the decision you’ll regret silently.
  • Assume at-least-once; make every consumer idempotent with a business dedup key, and commit offsets only after the side effect is durable (see API Design & Idempotency).
  • Configure a dead-letter queue with a sane maxReceiveCount, and alert on DLQ depth > 0 — a silent DLQ is just slow data loss.
  • Alert on consumer lag trend (time-behind, and its derivative), not just current depth; set the threshold well under your retention window.
  • Set producer durability deliberately (acks=all + min.insync.replicas=2 for anything you can’t lose) and document the latency cost.
  • Partition or key by your ordering entity (user_id, order_id) so per-key order holds while you parallelize; never funnel everything to one key.
  • Size retention with comfortable margin over your worst realistic consumer recovery time, so a long backlog doesn’t delete unread data.
  • Keep per-message work fast; tune max.poll.records, session.timeout.ms, and max.poll.interval.ms so slow processing doesn’t trigger rebalance storms.
  • Bound queues and define the full-queue behavior (block, shed, or DLQ) before launch — never run unbounded.
  • Load-test the consumer path under sustained overload, not just the producer; the producer is never the part that fails quietly.

Summary

A message queue is a time machine for work: the producer hands off and moves on, the consumer catches up whenever it can. That decoupling is genuinely powerful, and it’s also where the complexity goes to hide. Almost every async incident traces to one of four facts: delivery is at-least-once (so duplicates are guaranteed and consumers must be idempotent), backlogs are invisible (so you must measure consumer lag and its trend, not the producer’s happy dashboard), ordering is only per-partition (so you partition by an ordering key and give up global order), and “exactly-once” stops at the broker’s edge (so external side effects still need dedup). Pick queue or log on purpose, make consumers idempotent, alert on lag before it hits the retention cliff, and bound everything. Do that and async messaging is the resilient backbone it promises to be. Forget the at-least-once contract and it’s a double-charge — or a wave of “where is my order” tickets — waiting for the first lost ack.

Further reading

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

No incident deep-dives yet. See the roadmap for what's coming.