BreakerKit: Circuit Breaker Pattern
Overview
Every app eventually meets The Outside World.
The Outside World is where "fetch" becomes:
- an API that is slow sometimes
- a dependency that is down right now
- a vendor that rate limits you for breathing too loudly
- a database that is fine until it is not
When that happens, most systems fail in the dumbest possible way: they keep trying, they keep waiting, they pile up threads/promises/jobs, and then they fall over from the weight.
That is what the circuit breaker pattern is trying to prevent.
It does not make errors go away.
It does not magically heal outages.
It does one thing: it stops you from repeatedly doing the thing that is currently on fire.
What a circuit breaker actually does
A circuit breaker is basically a tiny state machine in front of a risky call.
It has three moods:
- closed: calls go through like normal
- open: calls do not go through, you fail fast
- half-open: you cautiously test the water to see if it recovered
That’s it. That’s the core idea.
The value is not just the fast failure, it is that it prevents cascading failure. If a dependency is dying, your app does not need to die with it.
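The state machine is small enough to sketch in full. This is a minimal illustrative implementation (not BreakerKit's), with the threshold, cooldown, and clock injectable so it can be tested without waiting on real time:

```typescript
type State = "closed" | "open" | "half-open";

class Breaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 3,
    private cooldownMs = 5000,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "half-open"; // cooldown elapsed: allow one cautious probe
    }
    try {
      const result = await fn();
      this.state = "closed"; // any success closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      // A failed probe, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Real implementations track more (rolling windows, success thresholds for half-open), but the three moods are the whole skeleton.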
Pitfall 1: treating "retry" and "circuit breaker" like the same tool
Retry and circuit breaker are cousins, not twins.
- Retry is betting that the next attempt might succeed.
- Circuit breaker is admitting “this is probably doomed” and refusing to keep paying the cost.
Retries are great for transient failures (packet loss, short blips). Retries are also a great way to DDoS your own dependencies if the outage is real.
A breaker is the thing that says: "Stop. We will try later."
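To make the contrast concrete, here is retry on its own: a minimal sketch with exponential backoff (the names and defaults are illustrative). Notice there is nothing in it that ever gives up across calls, which is exactly the gap a breaker fills:

```typescript
// Retry is a bet that the next attempt might succeed. Backoff spaces out the
// bets; it does not stop you from placing them during a real outage.
async function retry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // Exponential backoff: 100ms, 200ms, 400ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts failed
}
```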
Pitfall 2: counting the wrong failures
The breaker needs a signal. If your signal is garbage, your breaker becomes a random number generator.
Common mistakes:
- Counting validation errors as failures. That is not the dependency being down, that is you sending nonsense.
- Counting 404s as failures. If the resource does not exist, your breaker should not "protect" you from reality.
- Not separating rate limits from server errors. Rate limiting is often recoverable, but your strategy might be "slow down" not "trip open".
The core question is always:
"Does this failure mean the dependency is unhealthy, or does it mean my request is bad?"
If it is your request, do not teach your breaker to panic.
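One way to encode that question is a small classifier the breaker consults before counting anything. The buckets below are illustrative assumptions, not BreakerKit's behavior; tune them for your dependency:

```typescript
type Verdict = "success" | "breaker-failure" | "caller-error" | "slow-down";

// Map an HTTP status to what it means for the breaker.
function classify(status: number): Verdict {
  if (status >= 500) return "breaker-failure"; // the dependency is unhealthy
  if (status === 429) return "slow-down";      // rate limited: back off, don't trip
  if (status >= 400) return "caller-error";    // 404, 422: your request, not their health
  return "success";
}
```

Only `"breaker-failure"` (plus timeouts and connection errors) should feed the failure counter.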
Pitfall 3: no timeout means the breaker never gets a clean failure
This one is sneaky.
If requests hang, they do not fail. They just… sit there… forever… like a haunted loading spinner.
Your breaker needs calls to resolve so it can observe success or failure.
So you almost always want a timeout in the same neighborhood as the breaker.
Timeouts are the "this took too long, we are done here" boundary.
Breakers are the "this keeps happening, we are not doing it again for a bit" boundary.
They work best as a pair.
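A minimal sketch of the timeout half of that pair (illustrative, not BreakerKit's `withTimeout`): race the risky call against a timer so every attempt resolves within a bounded time. A production version should also cancel the underlying work, e.g. via `AbortController`, so it does not keep running in the background:

```typescript
// Turn "hanging forever" into a failure the breaker can actually count.
function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    fn().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```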
Pitfall 4: flappy breakers
If your cooldown is too short, you get this loop:
- dependency is sick
- breaker opens
- cooldown ends
- breaker goes half-open
- dependency is still sick
- breaker opens again
- repeat until everyone is exhausted
The breaker is doing its job, but you still get noisy logs, noisy alerts, and your system is spending its time poking a wound.
A good cooldown is long enough that the dependency has a chance to recover, and short enough that you are not stuck failing fast forever.
There is no universal number, but "10 seconds" is usually not a magic spell. It is just a number people type because it looks reasonable.
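One common damping technique, sketched here as an assumption (it is not something the text above prescribes): escalate the cooldown each consecutive trip and add jitter, so a sick dependency gets progressively more breathing room and a fleet of instances does not probe it in lockstep:

```typescript
// Cooldown grows with each consecutive trip, capped at maxMs, plus up to 20%
// random jitter to spread half-open probes across instances.
function nextCooldownMs(
  baseMs: number,
  consecutiveTrips: number,
  maxMs = 60_000,
): number {
  const backoff = Math.min(baseMs * 2 ** (consecutiveTrips - 1), maxMs);
  const jitter = Math.random() * 0.2 * backoff;
  return backoff + jitter;
}
```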
Pitfall 5: "we added a circuit breaker" (but you really didn’t)
This happens in serverless and horizontally scaled systems a lot.
If every instance has its own breaker state, and traffic is spread across many instances, you can end up with:
- no single breaker seeing enough consecutive failures to trip
- or half your instances tripped and half not, which is chaos with extra steps
The pattern still helps per instance, but you need to understand what you actually built:
a local safety mechanism, not a globally consistent guardian angel.
If you want shared breaker state, that becomes a different design problem.
Pitfall 6: no fallback plan
An open circuit is not a fix, it is a choice.
If you open the circuit and then just throw a 500, all you did was make your failure faster (which is still sometimes good, to be fair).
The real win is when you already know what to do during the open state:
- return cached data
- return partial data
- degrade the UI
- queue work for later
- serve a stale result with a warning
A circuit breaker without a fallback is like buying a fire extinguisher and storing it in a different building.
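A sketch of the "serve a stale result" option, assuming a plain in-memory `Map` as the cache; in practice it might be Redis, a CDN, or a stale database row:

```typescript
// On success, refresh the cache. On failure (including an open circuit),
// return the last known value, flagged as stale, instead of a 500.
async function withFallback<T>(
  key: string,
  cache: Map<string, T>,
  fn: () => Promise<T>,
): Promise<{ value: T; stale: boolean }> {
  try {
    const value = await fn();
    cache.set(key, value);
    return { value, stale: false };
  } catch (err) {
    const cached = cache.get(key);
    if (cached !== undefined) return { value: cached, stale: true };
    throw err; // nothing to fall back to: surface the failure
  }
}
```

The `stale` flag is what lets the UI show "data may be out of date" instead of silently lying.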
Solution
Think of these as three separate knobs:
- Timeout: "this single attempt has a max cost"
- Retry: "this failure might be transient"
- Circuit breaker: "this dependency might be unhealthy, stop hammering it"
They stack nicely, but only if they agree on what counts as "retryable" vs "breaker worthy".
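Here is an illustrative sketch of the stacking order, with all names as stand-ins rather than any specific library's API. The breaker check comes first so an open circuit costs nothing, the retry loop sits in the middle, and each individual attempt gets its own timeout:

```typescript
async function guardedCall<T>(
  fn: () => Promise<T>,
  opts: { circuitOpen: () => boolean; attempts: number; timeoutMs: number },
): Promise<T> {
  // Knob 3: breaker first, so an open circuit skips retries and timeouts entirely.
  if (opts.circuitOpen()) throw new Error("circuit open: failing fast");
  let lastErr: unknown;
  for (let i = 0; i < opts.attempts; i++) {       // Knob 2: retry per call
    try {
      return await Promise.race([                 // Knob 1: timeout per attempt
        fn(),
        new Promise<T>((_, reject) =>
          setTimeout(
            () => reject(new Error("attempt timed out")),
            opts.timeoutMs,
          ),
        ),
      ]);
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr; // exhausted: this is what the breaker should count
}
```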
BreakerKit
I wrote a library called BreakerKit mainly because I wanted these primitives to be small and composable:
withTimeout, retry, and circuitBreaker.
BreakerKit’s breaker defaults and options are straightforward: failureThreshold, successThreshold, cooldownMs, plus an onStateChange callback so you can log/alert when it opens.
That last part is important. If you do not observe state transitions, you will ship a breaker and then forget it exists until it ruins your graphs.
Final thoughts
The circuit breaker pattern is simple, but the consequences aren’t.
- It is not a bolt-on "reliability feature".
- It is a statement about how your system behaves under stress.
- If you count the wrong failures, you trip when you shouldn’t.
- If you skip timeouts, you never trip at all.
- If you retry everything, you amplify outages.
- If you add a breaker with no fallback, you just fail faster and call it architecture.
Still worth it. Just respect it a little.
Links
BreakerKit repo: https://github.com/Cr0wn-Gh0ul/BreakerKit
BreakerKit on npm: https://www.npmjs.com/package/breakerkit