Circuit Breaker & Resilience
Open Astra wraps every inference provider in a resilient client that combines retry with jittered exponential backoff, per-provider circuit breakers, and optional fallback providers. When a provider starts failing, the circuit breaker trips to prevent cascading failures and automatically recovers when the provider stabilizes.
Circuit breaker
Each provider gets its own circuit breaker instance, keyed by the provider label. The breaker tracks failures in a rolling time window and transitions through three states.
```
CLOSED ──── 5 failures in 60s ────→ OPEN
                                      │
                                 60s timeout
                                      │
                                      ↓
                                  HALF-OPEN
                                  /        \
                       2 successes          any failure
                            ↓                    ↓
                         CLOSED                OPEN
```
Default thresholds
```
// Per-provider circuit breaker thresholds
{
  failureThreshold: 5,  // failures in rolling window to trip
  successThreshold: 2,  // consecutive successes to recover
  timeoutMs: 60_000,    // ms to stay OPEN before testing (1 min)
  windowMs: 60_000,     // rolling window for failure counting (1 min)
}
```

| Parameter | Default | Effect |
|---|---|---|
| failureThreshold | 5 | Failures within the window to trip OPEN |
| successThreshold | 2 | Consecutive successes in HALF-OPEN to close |
| timeoutMs | 60,000 | Time to stay OPEN before testing (1 minute) |
| windowMs | 60,000 | Rolling window for failure counting (1 minute) |
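The state machine and thresholds above can be sketched as a minimal breaker. This is an illustrative implementation under the documented defaults, not Open Astra's actual code; the class and method names are assumptions.

```typescript
type State = "closed" | "open" | "half-open";

// Minimal per-provider circuit breaker sketch: rolling failure window,
// OPEN timeout, and consecutive-success recovery from HALF-OPEN.
class CircuitBreaker {
  private state: State = "closed";
  private failures: number[] = []; // failure timestamps inside the rolling window
  private successes = 0;           // consecutive successes while HALF-OPEN
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private successThreshold = 2,
    private timeoutMs = 60_000,
    private windowMs = 60_000,
  ) {}

  getState(now = Date.now()): State {
    // After timeoutMs in OPEN, allow trial requests (HALF-OPEN).
    if (this.state === "open" && now - this.openedAt >= this.timeoutMs) {
      this.state = "half-open";
      this.successes = 0;
    }
    return this.state;
  }

  recordSuccess(now = Date.now()): void {
    if (this.getState(now) === "half-open" && ++this.successes >= this.successThreshold) {
      this.state = "closed";
      this.failures = [];
    }
  }

  recordFailure(now = Date.now()): void {
    if (this.getState(now) === "half-open") {
      // Any failure in HALF-OPEN re-opens the circuit immediately.
      this.state = "open";
      this.openedAt = now;
      return;
    }
    // Only failures inside the rolling window count toward the threshold.
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```

Timestamps are passed explicitly here so the transitions are easy to test; a real breaker would read the clock internally.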
Resilient client
The createResilientClient wrapper adds retry logic and fallback support to any inference client.
```
import { createResilientClient } from './inference/resilient.js'

const client = createResilientClient({
  primary: claudeClient,
  fallback: openaiClient, // optional — used when primary circuit opens
  maxRetries: 2,          // default: 2 (3 total attempts)
  baseDelayMs: 1000,      // default: 1000ms
  label: 'claude',        // circuit breaker key (one per provider)
})
```
Retry logic
Retries use jittered exponential backoff. Only specific error patterns are retried — non-retryable errors (auth failures, validation errors) fail immediately.
```
// Jittered exponential backoff
delay = baseDelayMs × 2^attempt + random(0–500ms)
```
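As a sketch, the delay formula translates to a small helper. The injectable `rand` parameter is an assumption added here for testability, not part of Open Astra's API.

```typescript
// Jittered exponential backoff: baseDelayMs × 2^attempt plus 0–500ms of jitter.
function backoffDelayMs(
  attempt: number,                        // 0-based retry attempt
  baseDelayMs = 1000,
  rand: () => number = Math.random,       // returns a value in [0, 1)
): number {
  return baseDelayMs * 2 ** attempt + rand() * 500;
}

// attempt 0 → ~1000–1500ms, attempt 1 → ~2000–2500ms, attempt 2 → ~4000–4500ms
```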
```
// Retryable error patterns (case-insensitive):
429, rate limit, 500, 502, 503, 504,
ECONNRESET, ECONNREFUSED, ETIMEDOUT, fetch failed, network
```
Fallback behavior
When a fallback client is configured, the resilient client fails over automatically. The behavior differs slightly between streaming and non-streaming calls.
```
// chat() fallback behavior:
1. Check circuit breaker state
2. If OPEN → skip retries, go directly to fallback
3. If CLOSED/HALF-OPEN → try primary with retries
4. If all retries exhausted → try fallback with retries
5. If both fail → throw primary's last error
```
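The five steps above can be sketched as follows. The client and helper shapes (`ChatClient`, `tryWithRetries`, the `breakerOpen` flag) are assumptions for illustration, not Open Astra's actual types.

```typescript
interface ChatClient {
  chat(prompt: string): Promise<string>;
}

// Sketch of the chat() fallback flow: OPEN circuit skips straight to the
// fallback; otherwise the primary is tried first, the fallback second, and
// the primary's error is rethrown if both fail.
async function chatWithFallback(
  primary: ChatClient,
  fallback: ChatClient | undefined,
  breakerOpen: boolean,                                   // primary's circuit state
  tryWithRetries: (c: ChatClient, p: string) => Promise<string>,
  prompt: string,
): Promise<string> {
  // Steps 1–2: if the primary's circuit is OPEN, go directly to the fallback.
  if (breakerOpen && fallback) return tryWithRetries(fallback, prompt);

  try {
    // Step 3: CLOSED/HALF-OPEN — try the primary with retries.
    return await tryWithRetries(primary, prompt);
  } catch (primaryErr) {
    // Step 4: retries exhausted — try the fallback with retries.
    if (fallback) {
      try {
        return await tryWithRetries(fallback, prompt);
      } catch {
        // Step 5: both failed — fall through to rethrow the primary's error.
      }
    }
    throw primaryErr;
  }
}
```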
```
// chatStream() fallback behavior:
1. Try primary streaming
2. If primary fails → call fallback.chat() (non-streaming)
3. Emit response as single "content" delta + "done" event
```
When the primary stream fails, the resilient client falls back to a non-streaming chat() call and emits the response as a single delta. The user sees a slightly delayed complete response rather than a stream, but no error.

Monitoring circuit breakers
The diagnostics endpoint exposes real-time circuit breaker state for all providers.
```
// GET /diagnostics/circuit-breakers (Internal API key required)
{
  "claude": {
    "state": "closed",
    "totalRequests": 1240,
    "totalSuccesses": 1238,
    "lastFailureAt": "2026-02-28T12:30:00.000Z"
  },
  "openai": {
    "state": "closed",
    "totalRequests": 450,
    "totalSuccesses": 450,
    "lastFailureAt": null
  }
}
```
This endpoint requires the INTERNAL_API_KEY header. You can also see breaker state in the Diagnostics system.
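A consumer of this endpoint might flag degraded providers like so. The payload shape follows the example above; the `degradedProviders` helper is a sketch, not part of Open Astra.

```typescript
// Per-provider stats as returned by GET /diagnostics/circuit-breakers.
interface BreakerStats {
  state: "closed" | "open" | "half-open";
  totalRequests: number;
  totalSuccesses: number;
  lastFailureAt: string | null;
}

// Return the labels of providers whose circuit is not CLOSED.
function degradedProviders(payload: Record<string, BreakerStats>): string[] {
  return Object.entries(payload)
    .filter(([, stats]) => stats.state !== "closed")
    .map(([label]) => label);
}
```

Paired with the Local Router, a list like this is enough to steer traffic away from providers whose breakers have tripped.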
How it fits together
- The Inference layer creates a resilient client for each provider at startup
- Agent-level model config determines which provider is primary
- The Local Router can redirect traffic away from degraded providers based on circuit breaker state
- The Self-Healing system monitors consecutive agent failures that may be caused by provider outages