Circuit Breaker & Resilience

Open Astra wraps every inference provider in a resilient client that combines retry with jittered exponential backoff, per-provider circuit breakers, and optional fallback providers. When a provider starts failing, the circuit breaker trips to prevent cascading failures and automatically recovers when the provider stabilizes.

Circuit breaker

Each provider gets its own circuit breaker instance, keyed by the provider label. The breaker counts failures in a rolling time window and moves between three states: CLOSED, OPEN, and HALF-OPEN.

text
CLOSED ──── 5 failures in 60s ────→ OPEN
                                        │
                                   60s timeout
                                        │
                                        ↓
                                    HALF-OPEN
                                   /         \
                         2 successes       any failure
                              ↓                ↓
                           CLOSED            OPEN

Default thresholds

typescript
// Per-provider circuit breaker thresholds
{
  failureThreshold: 5,       // failures in rolling window to trip
  successThreshold: 2,       // consecutive successes to recover
  timeoutMs:        60_000,  // ms to stay OPEN before testing (1 min)
  windowMs:         60_000,  // rolling window for failure counting (1 min)
}

Parameter         Default  Effect
failureThreshold  5        Failures within the window to trip OPEN
successThreshold  2        Consecutive successes in HALF-OPEN to close
timeoutMs         60,000   Time to stay OPEN before testing (1 minute)
windowMs          60,000   Rolling window for failure counting (1 minute)
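
The state transitions and thresholds above can be sketched as a small state machine. This is a minimal illustration, not Open Astra's actual implementation: the class name CircuitBreakerSketch, its method names, and the injectable clock are assumptions made for the example.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

interface BreakerOptions {
  failureThreshold: number
  successThreshold: number
  timeoutMs: number
  windowMs: number
}

const DEFAULT_OPTIONS: BreakerOptions = {
  failureThreshold: 5,
  successThreshold: 2,
  timeoutMs: 60_000,
  windowMs: 60_000,
}

// Hypothetical class for illustration only.
class CircuitBreakerSketch {
  private state: BreakerState = 'closed'
  private failureTimes: number[] = [] // timestamps (ms) of failures inside the window
  private halfOpenSuccesses = 0
  private openedAt = 0
  private opts: BreakerOptions
  private now: () => number

  constructor(opts: BreakerOptions = DEFAULT_OPTIONS, now: () => number = () => Date.now()) {
    this.opts = opts
    this.now = now
  }

  // Returns the current state, moving OPEN -> HALF-OPEN once timeoutMs has elapsed.
  getState(): BreakerState {
    if (this.state === 'open' && this.now() - this.openedAt >= this.opts.timeoutMs) {
      this.state = 'half-open'
      this.halfOpenSuccesses = 0
    }
    return this.state
  }

  recordSuccess(): void {
    if (this.getState() !== 'half-open') return
    this.halfOpenSuccesses += 1
    if (this.halfOpenSuccesses >= this.opts.successThreshold) {
      this.state = 'closed'
      this.failureTimes = []
    }
  }

  recordFailure(): void {
    const t = this.now()
    const state = this.getState()
    if (state === 'open') return // already tripped; nothing to count
    if (state === 'half-open') {
      this.trip(t) // any failure while testing reopens the circuit
      return
    }
    // CLOSED: drop failures that fell out of the rolling window, then count this one.
    this.failureTimes = this.failureTimes.filter((ts) => t - ts < this.opts.windowMs)
    this.failureTimes.push(t)
    if (this.failureTimes.length >= this.opts.failureThreshold) this.trip(t)
  }

  private trip(at: number): void {
    this.state = 'open'
    this.openedAt = at
    this.failureTimes = []
  }
}
```

Passing the clock in as a function keeps the sketch testable without waiting a real minute for the OPEN timeout to elapse.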

Resilient client

The createResilientClient wrapper adds retry logic and fallback support to any inference client.

typescript
import { createResilientClient } from './inference/resilient.js'

const client = createResilientClient({
  primary: claudeClient,
  fallback: openaiClient,     // optional: used when the primary circuit is open or its retries are exhausted
  maxRetries: 2,              // default: 2 (3 total attempts)
  baseDelayMs: 1000,          // default: 1000ms
  label: 'claude',            // circuit breaker key (one per provider)
})

Retry logic

Retries use jittered exponential backoff. Only errors matching specific patterns are retried; non-retryable errors (such as auth failures and validation errors) fail immediately.

text
// Jittered exponential backoff
delay = baseDelayMs × 2^attempt + random(0–500ms)

// Retryable error patterns (case-insensitive):
429, rate limit, 500, 502, 503, 504,
ECONNRESET, ECONNREFUSED, ETIMEDOUT, fetch failed, network
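
The delay formula and pattern list above could look roughly like this in code. It is a sketch under stated assumptions: the function names are hypothetical, attempt is taken to be zero-based, and pattern matching is assumed to be a case-insensitive substring check on the error message, none of which the text above confirms.

```typescript
// Patterns from the list above, lowercased for case-insensitive matching.
const RETRYABLE_PATTERNS = [
  '429', 'rate limit', '500', '502', '503', '504',
  'econnreset', 'econnrefused', 'etimedout', 'fetch failed', 'network',
]

// Assumption: an error is retryable if its message contains any pattern.
function isRetryable(err: unknown): boolean {
  const message = (err instanceof Error ? err.message : String(err)).toLowerCase()
  return RETRYABLE_PATTERNS.some((pattern) => message.includes(pattern))
}

// delay = baseDelayMs * 2^attempt + random(0..500ms); attempt assumed zero-based.
function backoffDelayMs(
  attempt: number,
  baseDelayMs = 1000,
  random: () => number = Math.random,
): number {
  return baseDelayMs * 2 ** attempt + random() * 500
}

// Hypothetical retry loop: maxRetries = 2 means up to 3 total attempts.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 2,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // Stop on non-retryable errors or when no attempts remain.
      if (!isRetryable(err) || attempt === maxRetries) break
      const delay = backoffDelayMs(attempt, baseDelayMs)
      await new Promise<void>((resolve) => setTimeout(resolve, delay))
    }
  }
  throw lastError
}
```

With the defaults, the two waits between the three attempts are roughly 1.0 to 1.5 seconds and 2.0 to 2.5 seconds. Note that plain substring matching also treats a message such as "limit is 1500 tokens" as retryable, so a real implementation may match more carefully.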

Fallback behavior

When a fallback client is configured, the resilient client fails over automatically. The behavior differs slightly between streaming and non-streaming calls.

text
// chat() fallback behavior:
1. Check circuit breaker state
2. If OPEN → skip retries, go directly to fallback
3. If CLOSED/HALF-OPEN → try primary with retries
4. If all retries exhausted → try fallback with retries
5. If both fail → throw primary's last error

// chatStream() fallback behavior:
1. Try primary streaming
2. If primary fails → call fallback.chat() (non-streaming)
3. Emit response as single "content" delta + "done" event

Streaming fallback. When the primary provider's streaming call fails, the fallback uses a non-streaming chat() call and emits the response as a single delta. The user sees a slightly delayed complete response instead of a stream, but no error.
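
The two fallback flows above can be sketched as follows. The interfaces, event shape, and function names are assumptions for illustration and are not confirmed by this page. The error thrown when the circuit is open and the fallback also fails is a placeholder, and the sketch does not settle what should happen if the primary stream fails after it has already emitted some deltas.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

// Assumed minimal client shapes; the real client interface may differ.
interface ChatClient {
  chat(prompt: string): Promise<string>
}
type StreamEvent = { type: 'content'; delta: string } | { type: 'done' }
interface StreamingChatClient extends ChatClient {
  chatStream(prompt: string): AsyncIterable<StreamEvent>
}
type Retry = <T>(fn: () => Promise<T>) => Promise<T>

// chat(): skip the primary when its circuit is OPEN, otherwise retry it first.
async function chatWithFallback(
  prompt: string,
  primary: ChatClient,
  fallback: ChatClient | undefined,
  breakerState: () => BreakerState,
  retry: Retry,
): Promise<string> {
  // Placeholder error for the OPEN case (assumption, not from the docs).
  let primaryError: unknown = new Error('primary circuit is open')
  if (breakerState() !== 'open') {
    try {
      return await retry(() => primary.chat(prompt))
    } catch (err) {
      primaryError = err
    }
  }
  if (fallback === undefined) throw primaryError
  const fallbackClient: ChatClient = fallback
  try {
    return await retry(() => fallbackClient.chat(prompt))
  } catch {
    throw primaryError // both failed: surface the primary's last error
  }
}

// chatStream(): on primary failure, emit the fallback's full reply as one delta.
async function* chatStreamWithFallback(
  prompt: string,
  primary: StreamingChatClient,
  fallback: ChatClient | undefined,
): AsyncGenerator<StreamEvent> {
  if (fallback === undefined) {
    yield* primary.chatStream(prompt)
    return
  }
  try {
    yield* primary.chatStream(prompt)
    return
  } catch {
    // Fall through to the non-streaming fallback below.
  }
  const text = await fallback.chat(prompt)
  yield { type: 'content', delta: text }
  yield { type: 'done' }
}
```

A consumer can iterate the result with for await and does not need to know whether the events came from the primary stream or from the single-delta fallback.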

Monitoring circuit breakers

The diagnostics endpoint exposes real-time circuit breaker state for all providers.

json
// GET /diagnostics/circuit-breakers (Internal API key required)
{
  "claude": {
    "state": "closed",
    "totalRequests": 1240,
    "totalSuccesses": 1238,
    "lastFailureAt": "2026-02-28T12:30:00.000Z"
  },
  "openai": {
    "state": "closed",
    "totalRequests": 450,
    "totalSuccesses": 450,
    "lastFailureAt": null
  }
}

This endpoint requires the INTERNAL_API_KEY header. You can also see breaker state in the Diagnostics system.
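
As a rough illustration of consuming this response, the sketch below derives a success rate per provider from the fields shown in the sample. The type and function names are hypothetical, and only the "closed" state value appears in the sample above, so the field is typed as a plain string here.

```typescript
// Shape inferred from the sample response above (assumption).
interface BreakerDiagnostics {
  state: string
  totalRequests: number
  totalSuccesses: number
  lastFailureAt: string | null
}
type DiagnosticsResponse = Record<string, BreakerDiagnostics>

interface ProviderSummary {
  provider: string
  state: string
  successRate: number // 0..1; reported as 1 when there have been no requests
}

// Hypothetical helper: one summary row per provider label.
function summarizeBreakers(response: DiagnosticsResponse): ProviderSummary[] {
  return Object.entries(response).map(([provider, d]) => ({
    provider,
    state: d.state,
    successRate: d.totalRequests === 0 ? 1 : d.totalSuccesses / d.totalRequests,
  }))
}

// Providers whose breaker is anything other than closed.
function degradedProviders(response: DiagnosticsResponse): string[] {
  return summarizeBreakers(response)
    .filter((s) => s.state !== 'closed')
    .map((s) => s.provider)
}
```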

How it fits together

  • The Inference layer creates a resilient client for each provider at startup
  • Agent-level model config determines which provider is primary
  • The Local Router can redirect traffic away from degraded providers based on circuit breaker state
  • The Self-Healing system monitors consecutive agent failures that may be caused by provider outages