
Rate Limiting

A production tour of rate limiting: the four algorithms, atomic counting in Redis, fail-open vs fail-closed, noisy neighbors, retry storms, clock skew, and where to enforce limits.

26 min read · updated 2026-06-28

Rate limiting gets sold as an anti-abuse feature: keep the bots and the scrapers out. That framing is the reason so many teams build one badly. A rate limiter is not a bouncer with a grudge — it is a capacity-allocation decision made in advance. When demand exceeds what you can serve, something gets rejected. A rate limiter is how you choose what gets rejected, deliberately and fairly, instead of letting an overloaded database make the choice for you by toppling over and taking everything with it.

The choice is always there, whether you make it or not. Without a limiter, your policy is “whoever hammers hardest wins, and everyone shares in the collapse.” That is a policy. It is just a bad one, chosen by omission. With a limiter you get to say, up front and in writing, who is allowed how much of a finite resource and what happens when they ask for more.

This is the long-form context article on rate limiting: enough to reason about it under real load, not just to pass an interview. It leans heavily on Redis for the shared-counter mechanics, sits next to API design & idempotency and the API gateway where most product limits actually live, and shares its retry-storm dynamics with load balancing. The consistency questions it raises — what happens when nodes disagree about the count — connect to Consistency & Consensus. The fairness and overload incidents have dedicated write-ups on the roadmap.

The single most common mistake is treating the rate limiter as a thing you bolt on at the edge and forget. It is on the critical path of every request it guards. That means its own latency, its own failure modes, and its own scaling wall become your latency, failure, and wall. A rate limiter you don’t understand is just a second outage waiting for the first one.

A motivating failure

A B2B SaaS company sells an API with a published quota: 1,000 requests per minute per API key. The limiter is a fixed-window counter in Redis — one INCR per request, a 60-second TTL on the key, reject when the count crosses 1,000. It is simple, it is fast, and it works in every test they write. It ships.

Six weeks later their biggest customer runs a nightly export job. The job is naive: it fires requests as fast as the socket allows, gets a 429, and retries immediately with no backoff. At 00:00:59.7 the customer’s count hits 1,000 and starts getting rejected. The window resets at 00:01:00.0. In the 300 milliseconds around that boundary, the client — now retrying in a hot loop — lands roughly 1,000 requests in the tail of one window and another 1,000 in the head of the next. Two thousand requests in well under a second, against a limit that was supposed to cap them at 1,000 per minute.

That burst alone wouldn’t have mattered. What mattered is what it hit: a report endpoint that runs an unindexed aggregation. Two thousand of those in a second saturated the database connection pool, every other tenant’s queries started queuing behind them, and p99 across the whole API went from 40ms to 9 seconds. The limiter was up the entire time. It was returning 429s. It just wasn’t returning them at the rate the team thought, because a fixed window plus a retrying client is a burst amplifier, not a brake.

The fix was three lines — switch to a sliding-window counter and make the client respect Retry-After. The lesson was bigger: the limiter had been protecting the wrong number (requests) at the wrong granularity (fixed window) with no defense against the most predictable client behavior there is (retry-on-reject). Every one of those was a design decision nobody knew they’d made.

The one-sentence mental model

A rate limiter is a counter checked against a quota over a time window, and every real-world design choice collapses into two questions: how do you count (the algorithm) and where does the count live (local memory versus shared store) — because a count that lags, races, or disagrees across nodes is a limit that does not actually hold.

Take the sentence apart, because each clause is an operational constraint you will meet in production:

  • A counter checked against a quota — you need state, and that state must be accurate under concurrency. The moment two requests can read the counter at the same time, a naive implementation leaks past the limit.
  • Over a time window — how you slice time (fixed buckets versus a rolling window) decides whether a client can double their effective rate by straddling a boundary. The opening story is this clause collecting its debt.
  • Where the count lives — one node or many. With one server, the count is a variable in memory and life is easy. With a fleet, the count has to live somewhere shared, or the real limit is N × your intended limit, where N is your server count.
  • A count that lags or disagrees — the central trade. Checking shared state on every request is correct but adds a network hop to your hot path. Caching it locally is fast but lets nodes drift out of agreement.
flowchart LR
  C[Client] --> GW[Gateway\nlimiter check]
  GW -->|under quota| S[Service]
  GW -->|over quota| R[429\nRetry-After]
  GW <-->|atomic\nINCR / EVAL| Store[(Redis\ncounter)]
  S --> DB[(Backend)]

The diagram looks trivial, and that is the trap. The interesting question is not “where does the arrow go” — it is what happens to that Redis box when it is slow, when it is down, when one key on it gets a million requests a second, and when two gateway nodes both think they own the truth. Hold that picture; the rest of this article is what lives inside it.

How rate limiting actually works

The algorithm is just a way of answering one question — “is this request within quota right now” — and the four common ones differ entirely in how they treat bursts and how much state they cost. Get the algorithm wrong and no amount of infrastructure saves you.

Token bucket: the workhorse

A bucket holds up to B tokens and refills at R tokens per second. Each request removes one token; if the bucket is empty, the request is rejected. The long-run average rate is capped at R, but a client who has been quiet can spend up to B tokens in one burst. That controlled-burst behavior is exactly what most APIs want — humans and healthy clients are bursty, and punishing a brief spike that the average can absorb just annoys good users.

The state is tiny: a token count and a last-refill timestamp per key. You don’t run a background timer to add tokens; you compute them lazily on each request from elapsed time:

now        = current_time
elapsed    = now - last_refill
tokens     = min(B, tokens + elapsed * R)
last_refill = now
if tokens >= 1:
    tokens -= 1   -> ALLOW
else:
    -> REJECT, retry_after = (1 - tokens) / R

AWS API Gateway, Stripe, and most cloud APIs ship a token-bucket variant because it is cheap, intuitive to expose (burst and rate map cleanly to a plan tier), and forgiving of normal traffic shapes.
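The lazy-refill pseudocode above can be sketched as runnable Python. This is a minimal single-process illustration, not any vendor's implementation: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example, and the object is not thread-safe.

```python
import time


class TokenBucket:
    """Token bucket with lazy refill: no background timer, tokens are
    recomputed from elapsed time on every check."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # B: maximum burst size
        self.refill_rate = float(refill_rate)  # R: tokens added per second
        self.clock = clock                     # injectable so tests can control time
        self.tokens = float(capacity)          # start with a full bucket
        self.last_refill = clock()

    def allow(self):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Credit the tokens earned since the last check, capped at B.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        # Time until the bucket holds one whole token again.
        return False, (1 - self.tokens) / self.refill_rate
```

A monotonic clock is used by default so that wall-clock adjustments cannot hand out or take away tokens.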

Leaky bucket: smoothing, not bursting

Leaky bucket inverts the goal. Requests enter a queue and drain at a fixed rate R; if the queue is full, new requests overflow and are rejected. The output is a perfectly smooth stream at R, regardless of how spiky the input was. That is what you want when the thing downstream cannot tolerate spikes at all — a payment processor with its own hard limit, a legacy mainframe, a third-party API that bills per burst.

The difference from token bucket is the whole point: token bucket lets a burst pass through (up to B), leaky bucket flattens it into a steady drip. Token bucket protects you while being nice to bursty clients; leaky bucket protects whatever is behind you that hates surprises.
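One way to sketch the queue form of a leaky bucket is to book each admitted request into the next free drain slot and reject it when too many slots are already booked. The names `LeakyBucket`, `drain_rate`, `capacity`, and `schedule` are hypothetical, and a real implementation would hold the request (or a real queue entry) rather than just return a delay.

```python
import time


class LeakyBucket:
    """Leaky bucket as a scheduler: admitted requests leave exactly
    1 / drain_rate seconds apart, however bursty the arrivals are."""

    def __init__(self, drain_rate, capacity, clock=time.monotonic):
        self.interval = 1.0 / drain_rate   # spacing between departures
        self.capacity = capacity           # drain slots that may be booked ahead of now
        self.clock = clock
        self.next_free = clock()           # earliest time the next request may leave

    def schedule(self):
        """Return the delay in seconds before this request may proceed,
        or None if the bucket is full and the request overflows."""
        now = self.clock()
        start = max(now, self.next_free)
        backlog = start - now              # how far ahead the queue is already booked
        if backlog >= self.capacity * self.interval:
            return None
        self.next_free = start + self.interval
        return backlog
```

The caller sleeps (or enqueues) for the returned delay, so the downstream sees a steady drip even when the input arrives in a spike.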

Fixed window: simple and subtly broken

Count requests per fixed clock interval — 1,000 per minute, counter resets on the minute boundary. One INCR with a TTL and you’re done. It is the first thing everyone builds and the thing that took down the company in the opening story.

The flaw is the boundary burst. The window has no memory of what happened just before it opened. A client can spend their full quota in the last instant of one window and their full quota again in the first instant of the next: up to twice the intended rate across the boundary, concentrated into a tiny slice of real time. For a generous limit on a cheap endpoint, who cares. For a tight limit on an expensive one, that is the difference between fine and on fire.
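To make the boundary burst concrete, here is a small in-memory simulation using the numbers from the opening story (1,000 per minute, a burst straddling the minute mark). The class and the timestamps are illustrative assumptions, not the company's code.

```python
class FixedWindowCounter:
    """Fixed-window limiter: one counter per clock-aligned window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.index = None   # which window the counter belongs to
        self.count = 0

    def allow(self, now):
        index = int(now // self.window)
        if index != self.index:
            # New window: the counter resets with no memory of the last one.
            self.index, self.count = index, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowCounter(limit=1000, window_seconds=60)
# 1,000 requests in the last 0.3 s of one minute, then 1,000 more
# in the first 0.3 s of the next.
tail = [59.7 + i * 0.0003 for i in range(1000)]
head = [60.0 + i * 0.0003 for i in range(1000)]
admitted = sum(1 for t in tail + head if limiter.allow(t))
```

Every one of the 2,000 requests is admitted in about 0.6 seconds of simulated time, even though no single window ever exceeds its limit.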

Sliding window: paying to fix the boundary

Two variants. Sliding-window log stores a timestamp for every request and counts how many fall within the trailing window (e.g. the last 60 seconds) on each new request, dropping older ones. It is exact — no boundary burst, ever — but its memory grows with request volume, which is unacceptable for a high-traffic key.
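A minimal in-memory sketch of the log variant follows. The names are hypothetical, and one detail is an assumption of this sketch: it records only admitted requests, so rejected retries do not extend the client's penalty.

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding window: keep one timestamp per admitted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()   # timestamps of admitted requests, oldest first

    def allow(self, now):
        # Drop timestamps that have aged out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

The deque holds up to `limit` entries per key, which is where the memory cost comes from once limits and key counts get large.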

Sliding-window counter is the production compromise. It keeps the current and previous fixed-window counts and weights them by how much the rolling window overlaps each:

weight     = (window_size - elapsed_in_current) / window_size
estimate   = current_count + previous_count * weight
if estimate < limit -> ALLOW else REJECT

It uses O(1) state (two counters), has no hard boundary cliff, and is slightly approximate at the margins — an error almost nobody cares about. This is what most real limiters ship, and switching the opening story from fixed window to this would have capped that customer at roughly their real limit through the boundary.
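The weighted estimate can be sketched in a few lines of Python. This is an in-memory illustration with hypothetical names; a shared deployment would keep the two counters in the store and run the same arithmetic atomically there.

```python
class SlidingWindowCounter:
    """Approximate sliding window from two fixed-window counters."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.index = 0       # index of the current fixed window
        self.current = 0     # admitted requests in the current window
        self.previous = 0    # admitted requests in the window before it

    def allow(self, now):
        index = int(now // self.window)
        if index != self.index:
            # Roll forward. The old count only carries over if it belongs
            # to the immediately preceding window.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        elapsed = now - index * self.window
        weight = (self.window - elapsed) / self.window
        estimate = self.current + self.previous * weight
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```

Replaying a boundary-straddling burst against this class (1,000 requests just before the minute mark, 1,000 just after, limit 1,000 per minute) admits the first thousand and then only a handful of the second, because the previous window's count still carries almost full weight.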

sequenceDiagram
  participant Client
  participant Limiter
  participant Redis
  Client->>Limiter: request key=user:42
  Limiter->>Redis: EVAL token-bucket Lua
  Redis-->>Limiter: allowed, remaining, retry_after
  alt under quota
    Limiter-->>Client: 200 X-RateLimit-Remaining 11
  else over quota
    Limiter-->>Client: 429 Retry-After 30
  end
  Note over Limiter,Redis: one round trip, atomic check and decrement

| Algorithm | Allows bursts? | State per key | Boundary burst risk | Best for |
| --- | --- | --- | --- | --- |
| Token bucket | yes, up to B | count + timestamp | none | general API quotas |
| Leaky bucket | no, smoothed | queue depth | none | fragile downstream |
| Fixed window | yes, accidental | one counter | yes, up to 2× | cheap, loose limits |
| Sliding-window log | mild | one timestamp per request | none | exact, low volume |
| Sliding-window counter | mild | two counters | minimal | most production limits |

The check has to be atomic

Here is the bug that survives every code review by people who haven’t been burned yet. The naive distributed limiter is count = GET key; if count < limit: INCR key. That is a read-modify-write race. Two requests both GET and see 999, both decide they’re under the limit of 1,000, both INCR, and now the counter reads 1,001 with two requests admitted that shouldn’t both have been. Under real contention on a hot key, that leak is not a rounding error — it can blow well past the limit.

The fix is to make check-and-decrement a single atomic operation. In Redis, that is either INCR followed by checking the return value (the increment and the read are one atomic step, and you set the TTL with EXPIRE or SET ... EX on first creation), or — for anything more complex than a fixed window — a Lua script via EVAL, which runs to completion with nothing else interleaved because Redis executes commands one at a time on a single thread. Token bucket and sliding-window counter both need multiple reads and writes per check, so they essentially require a Lua script to be correct. The atomicity is free, but only if you ask for it.
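As a hedged illustration of the scripted approach, here is the simplest case (a fixed window) written as a Lua script and called from Python. The script body, the function name `allow`, and the key and argument layout are assumptions for this sketch. The `client` is assumed to expose `eval(script, numkeys, *keys_and_args)`, which is the shape of the `eval` method on redis-py's client. Token bucket and sliding-window scripts follow the same pattern with more state.

```python
# Lua runs inside Redis with nothing interleaved, so the increment,
# the TTL, and the comparison happen as one step.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  -- First hit in this window: start the expiry clock.
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
  return 0
end
return 1
"""


def allow(client, key, limit, window_seconds):
    """Atomically count one request against `key`.

    Returns True if the request is within `limit` for the current window.
    `client` is any object with a redis-py style eval() method.
    """
    result = client.eval(FIXED_WINDOW_LUA, 1, key, limit, window_seconds)
    return result == 1
```

Setting the TTL inside the script also closes a smaller gap: a caller that crashes between a separate INCR and EXPIRE would otherwise leave a counter that never expires.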

The tradeoffs that bite

These are the decisions that look free when you design and bill you when you scale.

| Tradeoff | The free-looking choice | What it actually costs |
| --- | --- | --- |
| Accuracy vs latency | Check shared state every request | A network hop on your hot path, every time |
| Local vs distributed | Per-node in-memory counters | Real limit becomes N × intended |
| Atomic vs convenient | GET then INCR | Race leaks well past the limit under load |
| Sync vs async counting | Decrement before admitting | Serializes on the store; throughput ceiling |
| Global vs per-tenant | One limit for everyone | One noisy tenant starves all the rest |
| Requests vs cost | Count requests | A few expensive calls melt the backend |

Two deserve to be spelled out. Accuracy vs latency is the master trade. A purely local in-memory limiter answers in nanoseconds but only knows about traffic on its own node, so a fleet of 20 nodes enforces 20× your intended limit. A distributed limiter backed by Redis is globally accurate but puts a round-trip — typically 0.3–1ms on a healthy network — in front of every guarded request. The common resolution is a two-tier limiter: a generous local limit catches obvious floods for free, and a shared limit enforces the real global quota. The local tier sheds the cheap-to-reject traffic before it costs you a hop.
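The two-tier idea can be sketched as a thin wrapper. The names here are hypothetical, and both checks are passed in as plain callables so the example stays independent of any particular local algorithm or store client.

```python
class TwoTierLimiter:
    """Consult a cheap local limiter first, and pay for the shared
    check only when the local tier has not already said no."""

    def __init__(self, local_check, shared_check):
        self.local_check = local_check    # in-process, generous limit, no I/O
        self.shared_check = shared_check  # authoritative, costs a network round trip
        self.shared_calls = 0             # how many hops were actually spent

    def allow(self, key):
        if not self.local_check(key):
            return False                  # obvious flood: rejected without a hop
        self.shared_calls += 1
        return self.shared_check(key)
```

The local limit has to be looser than the per-node share of the global quota, otherwise the local tier starts rejecting traffic the shared tier would have admitted.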

Sync vs async counting is the trade people discover under load. Decrementing the shared counter synchronously before admitting is correct but serializes every request through the store, which becomes your throughput ceiling. Counting asynchronously — admit immediately, reconcile the counter a beat later — removes the hot-path hop but lets you overshoot the limit during the reconciliation lag. Whether that overshoot is acceptable depends entirely on what the limit protects: a 5% overshoot on a capacity limit is usually fine; a 5% overshoot on a billing limit is revenue you can’t invoice.

Performance: what it costs to say no

A rate limiter’s performance is dominated by one number — the cost of the check on the hot path — because that cost is paid by every request, including the ones you ultimately admit. Saying “no” is only valuable if saying “yes” stays cheap.

What is fast. A local in-memory token bucket is a few arithmetic operations and a map lookup: sub-microsecond, no allocation, no I/O. If you can enforce a limit locally, do — it is essentially free. A single Redis INCR or a small EVAL on a warm connection is sub-millisecond and a single node handles 100k+ checks per second comfortably. Pipelining or batching checks where the protocol allows pushes that higher.

What is slow, and what makes it slow:

  1. The network hop itself. A synchronous Redis call adds the round-trip to p50 and couples your tail latency to Redis’s tail latency. If Redis stalls for 50ms (it shouldn’t, but a slow fork during a background save, which shows up as latest_fork_usec in INFO, can do it), every guarded request wears that pause.
  2. A hot limit key. A single global counter hit by every request is a hot key on one Redis shard, and sharding the keyspace does nothing for it because it is one key. This is the limiter’s version of the hot-partition problem.
  3. An unbounded Lua script. A sliding-window-log implementation that scans a growing sorted set holds Redis’s single thread for the duration of the scan, stalling every other client. Keep scripts O(1) or O(log n), never O(n) over an unbounded structure.
  4. Connection churn. If each app instance opens a fresh Redis connection per request instead of pooling, you spend more time on TCP and auth than on counting.

The levers, in order of impact: enforce locally first and only consult the shared store for what local can’t decide; keep the per-check operation O(1) and atomic in one round-trip; pool connections; shard per-key counters across Redis nodes so no single shard carries the whole counting load; and measure rejection cost — a 429 should be the cheapest response your system produces, because under attack it is the response you produce most.

A rejected request that costs as much as a served one is not protecting anything; it just changes which resource you exhaust. Render the 429 at the edge, skip the body, skip the logging-per-request (sample it), and make sure the limiter check short-circuits everything expensive behind it.

Failure modes

Rate limiters fail in exactly two directions, and both are outages. Either they let too much through and the thing they were supposed to protect dies anyway, or they reject too much and the limiter becomes the outage. Every failure below is one of those two. Symptom → root cause → prevention.

The noisy-neighbor problem. Symptom: a wave of 429s hits tenants who sent almost no traffic, while the backend stayed up. Root cause: a single global limit, consumed entirely by one aggressive tenant — a runaway script, a misconfigured client in a retry loop — leaving nothing for everyone else. The limiter “worked” (the backend was protected) but fairness failed completely; your largest customer’s bug became your smallest customer’s outage. Prevention: per-tenant (or per-key, per-IP) limits plus a global ceiling, so one party can exhaust only their own share. Fairness is not a free byproduct of having a limiter; it is a separate design decision.

Synchronized retry storms after a 429. Symptom: rejection rate oscillates in sharp waves; the backend gets hit by synchronized pulses. Root cause: every rejected client retries on the same fixed delay, so they all come back at the same instant — a thundering herd against the limiter and whatever is behind it. Prevention: always return Retry-After, and require clients to add jitter to it. This is the same dynamic that drives the retry storms in load balancing; the cure is the same — randomize the backoff so the herd disperses across time instead of slamming back in lockstep.
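On the client side, jitter can be as small as one line. A minimal sketch, assuming the client has already parsed Retry-After into seconds; the function name and the `spread_seconds` default are illustrative choices, not a standard.

```python
import random


def jittered_delay(retry_after_seconds, spread_seconds=5.0, rng=random.uniform):
    """Wait at least as long as the server asked, plus a random extra,
    so clients rejected at the same moment do not return together."""
    return retry_after_seconds + rng(0.0, spread_seconds)
```

The jitter is only ever added, never subtracted: coming back earlier than Retry-After just earns another 429.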

The store as a single point of failure. Symptom: Redis has a bad few seconds and either all traffic stops (every request errors) or all traffic floods through (limits vanish). Root cause: a synchronous limiter call on every request means the limiter’s availability is your availability, and nobody decided what happens when the store is unreachable. Prevention: decide fail-open vs fail-closed per limiter, before the bad day. Fail-open (allow on error, ideally with a conservative local fallback limit) keeps an availability-critical API up at the risk of brief overload — the right default for most product traffic. Fail-closed (reject on error) is correct for security-critical limiters where letting traffic through is the actual danger.

A limiter that makes a synchronous network call on every request has put its store on the critical path of every request — including the ones it would have admitted. Decide its fail-open versus fail-closed behavior while you’re calm and writing a design doc, not at 3am while the store is flapping and the decision is being made for you by whatever the code happens to do on a timeout.
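Writing the policy down as code makes the decision explicit. In this sketch, `LimiterUnavailable` is a hypothetical exception standing in for whatever timeout or connection error a real store client raises, and the function and parameter names are assumptions.

```python
class LimiterUnavailable(Exception):
    """Hypothetical error: the shared store timed out or is unreachable."""


def guarded_allow(shared_check, key, fail_open, fallback_check=None):
    """Apply an explicit policy when the shared limiter cannot answer."""
    try:
        return shared_check(key)
    except LimiterUnavailable:
        if not fail_open:
            return False                 # fail-closed: no answer means reject
        if fallback_check is not None:
            return fallback_check(key)   # fail-open, still bounded by a local limit
        return True                      # fail-open with no fallback: admit
```

A security-critical limiter such as login attempts would pass `fail_open=False`, while an availability-critical product API would pass `fail_open=True` with a conservative local fallback.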

Clock skew across nodes. Symptom: the effective limit wobbles for no reason; windows seem to open and close at different real times depending on which node you hit. Root cause: a distributed limiter keying its time windows on each node’s local wall clock, and the clocks have drifted apart by tens or hundreds of milliseconds. Prevention: key windows on a single authority — the store’s clock (Redis TIME) or a logical clock — never the individual node’s now(). Clock skew is a quiet, intermittent bug that wastes days; centralize the time source and it disappears.

Counting the wrong thing. Symptom: you’re under the request limit but the backend is melting anyway. Root cause: the limiter counts requests, but the scarce resource is work — CPU, rows scanned, LLM tokens generated, bytes transferred. A handful of expensive requests sail under a generous request limit and exhaust the real bottleneck. The opening story was partly this: 2,000 requests would’ve been harmless against a cheap endpoint, but they hit an unindexed aggregation. Prevention: weight requests by their estimated cost (a heavy endpoint debits more tokens from the bucket), or limit the actual scarce resource directly. Limit what runs out, not what’s easy to count.
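Cost weighting is a small change to a token bucket: debit `cost` tokens instead of one. The endpoint names and cost values below are made up for illustration, and in practice the weights would come from measured work per endpoint.

```python
# Hypothetical per-endpoint weights: a heavy report debits 50x a status check.
ENDPOINT_COST = {"/status": 1, "/report": 50}


class CostWeightedBucket:
    """Token bucket where each request debits its estimated cost."""

    def __init__(self, capacity, refill_rate, start=0.0):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)   # tokens (cost units) per second
        self.tokens = float(capacity)
        self.last_refill = start

    def allow(self, now, cost=1):
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One caveat of this sketch: a request whose cost exceeds `capacity` can never be admitted, so the bucket size has to be at least the most expensive single call.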

Cardinality explosion in the store. Symptom: Redis memory climbs steadily and unexpectedly. Root cause: per-key limits keyed on something unbounded — per-IP from a botnet rotating through millions of addresses, or per-request-id — creating millions of short-lived counter keys. Prevention: always set a TTL so abandoned keys expire, cap the cardinality of what you key on, and for volumetric IP defense, prefer coarse edge enforcement over a unique Redis key per attacker IP.

Scaling it

At one node, the entire implementation is an in-memory token bucket. Every problem in this article appears the moment you have a second node.

From local to shared state. With N app servers each enforcing a local 100 req/min, your real limit is N × 100 and it drifts every time you autoscale. You move the counter to a shared store — Redis is the default — so all nodes decrement the same bucket. Now your limiter’s throughput is bounded by that store, and any single global limit key becomes a hot key on it. This is the step where the limiter stops being a local detail and becomes a distributed-systems problem with all the consistency baggage that implies.

Sharding the limiter state. At high request rates, partition limit keys across Redis nodes by key hash so no single node carries the whole counting load — the same consistent hashing idea that distributes any keyspace. Per-user and per-tenant keys shard cleanly because they’re naturally distributed. A single global counter does not shard, so a true global ceiling is usually approximated — each of S shards enforces limit / S — rather than counted exactly. That approximation is the price of removing the hot key, and it’s almost always worth it.
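A rough sketch of both halves: placing per-tenant keys on shards, and splitting a global ceiling across them. For brevity this uses plain hash-modulo placement, which is not a consistent-hash ring and would remap most keys when the shard count changes. The function names and the key format are assumptions.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a limit key to a shard index with a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to make every node agree on the placement.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def per_shard_ceiling(global_limit, shard_count):
    """Approximate a global ceiling by giving each shard an equal slice."""
    return global_limit // shard_count
```

With three shards and a global ceiling of 3,000, each shard enforces 1,000, and a tenant key such as `tenant:42` always lands on the same shard. The equal split undercounts available capacity when traffic is uneven across shards, which is the approximation described above.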

flowchart TD
  R[incoming request] --> K{key type?}
  K -->|per tenant| H[hash tenant id]
  K -->|global| G[approx ceiling\nlimit / shards]
  H --> S1[shard A]
  H --> S2[shard B]
  H --> S3[shard C]
  G --> S1
  G --> S2
  G --> S3

Edge enforcement. The cheapest request to rate-limit is one you reject before it enters your network. CDNs and edge platforms enforce coarse, per-IP, volumetric limits at hundreds of points of presence, each with local state synced loosely between them. Accuracy drops — the global view is eventually-consistent at best — but you shed abusive volume before it costs you a single origin request. Fine-grained exact limits stay deeper in the stack where you actually know who the caller is and what their request costs.

Where to enforce it is therefore a layered question, not a single chokepoint. The same request can pass through several limiters, each doing a different job:

  1. Edge / CDN — coarse, per-IP, volumetric. Absorbs DDoS-scale floods and obvious abuse before they touch your origin. Cheap, approximate, first line of defense.
  2. API gateway — per-API-key, per-route. The enforcement point for published quotas and tiered plans, living right next to auth. This is where most product rate limits belong; see API design & idempotency and the API gateway topic.
  3. Service level — per-tenant or per-dependency limits protecting one specific fragile downstream (a slow database, a paid third-party API). The most precise, the most expensive to run, and the closest to the resource that actually runs out.

The rule: enforce as early as possible for raw volume (cheapest rejection) and as deep as necessary for precision (where you actually know the cost). And whatever you build, wire it into observability — a limiter you can’t see rejecting traffic is one you’ll only learn about from an angry customer.

When to reach for it (and when not to)

Reach for rate limiting on any public API, any shared multi-tenant resource, any expensive or paid downstream dependency, and any endpoint where abuse has a direct cost — login, signup, password reset, anything that sends email, money, or SMS. If a stranger can call it and a flood of calls would hurt you, it needs a limit.

Reach for the right algorithm. Token bucket for general API quotas where controlled bursts are fine (most cases). Leaky bucket when the downstream genuinely cannot absorb spikes and you need a smooth output rate. Sliding-window counter anywhere a boundary burst would actually hurt and you want O(1) state — which, after the opening story, should be your default over fixed window for anything non-trivial.

Don’t lead with a hard limiter when the real problem is capacity or fairness that backpressure handles better. Sometimes the right tool is a queue with timeouts that sheds load gracefully, or autoscaling that adds capacity, and a hard 429 is a blunt instrument that just moves the pain to the client. Don’t rate-limit internal trusted traffic with the same aggression as public traffic — a legitimate batch job, a backfill, or a fan-out from another internal service will hit a public-sized limit and fail in confusing ways. Give internal callers separate, higher quotas or exempt them and rely on capacity planning instead. And don’t use a limiter where you need a lock — limiting concurrency is not the same as guaranteeing mutual exclusion, and a rate limiter will not save you from a correctness bug that needs a real consensus system.

When to consider alternatives

  • Smoothing load instead of rejecting it → a queue with backpressure and timeouts (Message Queues, Celery) absorbs bursts and processes them at a sustainable rate, rather than throwing work away.
  • The shared counter at scale → Redis is the default store; if you outgrow a single Redis, shard it with consistent hashing or push coarse limits to the edge.
  • True mutual exclusion / leader election → ZooKeeper or a fencing token from a real source of truth, not a rate limiter pretending to be a lock.
  • Adding capacity rather than capping demand → autoscaling on Kubernetes when the honest answer is “we’re just under-provisioned.”
  • Per-customer billing and metering → a metering system that counts exactly and durably; a rate limiter’s approximate, ephemeral counters are the wrong tool for anything you invoice.

The pattern: a rate limiter rejects excess demand. When the right move is to absorb, coordinate, provision, or bill it instead, reach for the tool built for that job.

Operational checklist

  • Make the check-and-decrement atomic: a Redis INCR whose return value you check, with the TTL set on first creation, or a Lua EVAL for token-bucket and sliding-window. Never GET-then-INCR.
  • Default to a sliding-window counter, not a fixed window, anywhere a boundary burst would hurt.
  • Always return 429 with Retry-After, plus X-RateLimit-Remaining and X-RateLimit-Reset, so well-behaved clients self-regulate instead of hammering.
  • Layer your limits: per-tenant/per-key and a global ceiling, so one noisy neighbor can’t starve everyone.
  • Decide and document fail-open vs fail-closed for each limiter before the store fails — fail-open with a local fallback for availability-critical paths, fail-closed for security-critical ones.
  • Key time windows on the store’s clock (Redis TIME) or a logical clock, never per-node wall time, to survive skew.
  • Weight by cost when request expense varies wildly — debit more tokens for heavy endpoints, or limit the scarce resource directly.
  • Always set a TTL on limit keys and cap the cardinality of what you key on, so per-IP or per-id limits can’t explode Redis memory.
  • Make the 429 the cheapest response you produce — render it early, skip the body, sample the logging.
  • Alert on the rejection rate and the rejected-vs-admitted ratio per tenant through observability; a spike is either an attack or a limit set too low, and you want to know which before the tickets arrive.

Summary

A rate limiter is a capacity-allocation decision you make on purpose so that an overloaded backend doesn’t make it for you by collapsing. Almost every sharp edge traces back to the same handful of facts: how you count decides whether bursts slip through (fixed window has a boundary hole; sliding-window counter closes it cheaply); where the count lives decides accuracy versus latency (local is free but per-node; shared is accurate but a hop on every request); the check must be atomic or concurrent requests leak past the limit; and the limiter is on the critical path, so its failure behavior, its hot keys, and its 429 cost are all your problems. Layer per-tenant limits under a global ceiling for fairness, return Retry-After with jitter to kill retry storms, decide fail-open versus fail-closed before the store has its bad day, and count the resource that actually runs out rather than the one that’s easy to tally. Do that and the limiter is a quiet, load-bearing part of your reliability story. Get one of those facts wrong and it’s the thing that turns a capacity problem into an outage — or becomes the outage itself.

Appendix: HTTP semantics for rate limits

If the body assumed the wire-level conventions, here is the quick version:

  • 429 Too Many Requests — the standard status for a rejected over-quota request (RFC 6585). Use it for client-attributable limits. Reserve 503 Service Unavailable for server-side overload shedding where the cause isn’t a specific client’s quota.
  • Retry-After — seconds (or an HTTP date) telling the client when to try again. Honoring it is what separates a well-behaved client from a retry-storm contributor. Always send it on a 429.
  • X-RateLimit-Limit / X-RateLimit-Remaining / X-RateLimit-Reset — de-facto-standard headers letting a client see its quota, how much is left, and when the window resets, so it can pace itself before getting rejected. Not formally standardized (the RateLimit header draft aims to fix that), but widely understood.
  • Idempotency matters under retries — if clients retry after a 429, non-idempotent operations risk double-execution. Pair rate limiting with idempotency keys; see API design & idempotency.
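A framework-agnostic sketch of a rejection response carrying these headers is below. The function name is hypothetical. It assumes X-RateLimit-Reset carries a Unix timestamp in seconds, which is one common convention; some APIs send seconds remaining instead.

```python
import math


def too_many_requests(limit, reset_epoch_seconds, now_epoch_seconds):
    """Build a minimal 429 as (status, headers, body)."""
    # Retry-After as whole seconds, rounded up so the client never
    # comes back before the window has actually reset.
    wait = max(1, math.ceil(reset_epoch_seconds - now_epoch_seconds))
    headers = {
        "Retry-After": str(wait),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch_seconds)),
    }
    return 429, headers, b""   # empty body keeps the rejection cheap
```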

The unifying idea: rate limiting is a cooperative protocol when clients are well-behaved and a defensive one when they aren’t. The headers exist so the cooperative case stays cheap; the atomic counter and fail-safe behavior exist so the defensive case doesn’t take you down.
