# Self-Healing
Open Astra includes built-in self-healing capabilities that automatically recover from agent failures, manage context overflow, and keep agents running without manual intervention.
## Which errors trigger retry
Not all errors are retried. The self-healing system classifies errors into retriable and non-retriable categories:
| Error type | Retried? | Reason |
|---|---|---|
| Provider network error (ECONNRESET, ETIMEDOUT) | Yes | Transient — network blip or provider restart |
| Provider 429 (rate limited) | Yes | Retried after backoff; respects Retry-After header if present |
| Provider 500 / 502 / 503 | Yes | Transient server error — provider may recover |
| Provider 504 (gateway timeout) | Yes | Slow response; retry with lower token count if possible |
| Tool execution error | Yes | Tool failure is reported back to the agent as a tool result; the agent decides whether to retry |
| Context length exceeded | Yes — after compaction | Triggers compaction first, then retries the turn |
| Provider 401 (auth error) | No | Bad API key — retrying will not help; emits alert immediately |
| Provider 400 (bad request) | No | Malformed prompt or invalid params — retrying would produce same error |
| Zod validation failure on tool output | No | Tool returned unexpected schema — flagged for investigation |
| Budget exceeded | No | Hard cap hit — turn is rejected, not retried |
| Ethical check rejection | No | Policy violation — retrying the same input would fail again |
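The table's split can be sketched as a small predicate. This is an illustrative example, assuming provider errors expose an HTTP status and network errors a Node-style `code`; `isRetriable` is not Open Astra's actual API.

```typescript
// Illustrative classifier for the table above; not Open Astra's actual API.
const RETRIABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

function isRetriable(err: { status?: number; code?: string }): boolean {
  if (err.code && RETRIABLE_CODES.has(err.code)) return true; // network blip
  if (err.status === 429) return true; // rate limited: retry after backoff
  if (err.status && err.status >= 500 && err.status <= 504) return true; // transient server error
  return false; // 400, 401, validation, budget, policy: retrying won't help
}
```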
## Automatic restart with exponential backoff

When a retriable error occurs, the self-healing system retries with exponential backoff. The initial delay is set by `restartDelayMs` and doubles on each attempt:
- 1st retry: 2 seconds
- 2nd retry: 4 seconds
- 3rd retry: 8 seconds
- After `maxConsecutiveFailures` failures: the agent is paused and an `agent.paused` event is emitted
The failure counter resets to zero after any successful turn.
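The retry loop above can be sketched as follows. This is a minimal sketch, assuming a hypothetical `runTurn` callback; pausing and event emission happen in the caller.

```typescript
// Retry with exponential backoff: 2 s, 4 s, 8 s, ... doubling each attempt.
// `runTurn` is a hypothetical stand-in for executing one agent turn.
async function runWithBackoff(
  runTurn: () => Promise<void>,
  restartDelayMs = 2000,
  maxConsecutiveFailures = 3,
): Promise<void> {
  let failures = 0;
  while (true) {
    try {
      await runTurn();
      failures = 0; // counter resets after any successful turn
      return;
    } catch (err) {
      failures++;
      if (failures >= maxConsecutiveFailures) {
        throw err; // caller pauses the agent and emits agent.paused
      }
      const delay = restartDelayMs * 2 ** (failures - 1); // doubles each retry
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```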
## Circuit breaker
The circuit breaker prevents a failing provider from being hammered with requests during an outage. It operates as a sliding-window counter with three states:
| State | Behavior | Transition |
|---|---|---|
| Closed | Normal operation — all requests go to provider | → Open when 5 failures occur within 60 s |
| Open | All requests fail immediately (no provider call) | → Half-open after 30 s |
| Half-open | One trial request is sent to the provider | → Closed on success; → Open on failure |
The default thresholds — 5 failures in a 60-second window — are set to distinguish a brief spike from a sustained outage. Adjust `failureThreshold` and `windowMs` to match your provider's SLA.
## Fallback provider routing
When an agent's primary provider is unavailable (circuit open or max retries reached), the self-healing system can automatically route to a fallback provider. Fallback providers are tried in order:
- The agent attempts inference with its configured provider.
- On failure, the system checks the `fallback.providers` list.
- Each fallback provider is tried in order, using the same model family if available, or the workspace default model for that provider.
- If all fallback providers fail, the turn is rejected and `agent.paused` is emitted.
Fallback routing is transparent to the agent — it sees the same tool schema and response format regardless of which provider served the request. The actual provider used is recorded in the turn's metadata for observability.
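The routing order can be sketched as a simple loop. This is an illustrative sketch; `callProvider` and the return shape are assumptions, not Open Astra's API.

```typescript
// Try the primary provider, then each fallback in order; names are illustrative.
async function inferWithFallback(
  prompt: string,
  providers: string[], // primary first, then fallback.providers in order
  callProvider: (provider: string, prompt: string) => Promise<string>,
): Promise<{ text: string; provider: string }> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      const text = await callProvider(provider, prompt);
      return { text, provider }; // provider used is recorded in turn metadata
    } catch (err) {
      lastError = err; // try the next provider in order
    }
  }
  throw lastError; // all providers failed: turn rejected, agent.paused emitted
}
```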
## Context compaction
When the assembled context approaches the model's token limit, the compaction system automatically summarizes old conversation messages to free up space. This allows agents to maintain continuous sessions indefinitely without losing important context.
Compaction is triggered when the context size exceeds `compactionThreshold` (expressed as a fraction of `maxContextTokens`). At that point:
- The oldest 50% of conversation messages are extracted
- A lightweight model call generates a concise summary of those messages
- The raw messages are replaced with the summary as a single `system` message
- Recent messages (the newer 50%) are kept verbatim
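The steps above can be sketched as follows, assuming hypothetical token-counting and summarizer helpers (the names are illustrative).

```typescript
// Compaction sketch: summarize the oldest half once the threshold is crossed.
interface Message { role: "system" | "user" | "assistant"; content: string; }

function maybeCompact(
  messages: Message[],
  countTokens: (msgs: Message[]) => number,  // assumed helper
  summarize: (msgs: Message[]) => string,    // assumed lightweight model call
  maxContextTokens = 128_000,
  compactionThreshold = 0.85,
): Message[] {
  if (countTokens(messages) < compactionThreshold * maxContextTokens) {
    return messages; // under threshold: nothing to do
  }
  const cut = Math.floor(messages.length / 2);
  const oldest = messages.slice(0, cut); // summarized away
  const recent = messages.slice(cut);    // kept verbatim
  const summary: Message = { role: "system", content: summarize(oldest) };
  return [summary, ...recent];
}
```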
## Configuration
```yaml
selfHealing:
  enabled: true
  maxConsecutiveFailures: 3   # Failures before agent is paused
  restartDelayMs: 2000        # Initial backoff (doubles on each retry)
  compactionThreshold: 0.85   # Compact at 85% of maxContextTokens
  circuitBreaker:
    enabled: true
    failureThreshold: 5       # trips after 5 failures in the window
    windowMs: 60000           # 60-second rolling window
    halfOpenAfterMs: 30000    # try one request after 30s in open state
  fallback:
    enabled: true
    providers:
      - anthropic             # try Anthropic if primary provider fails
      - groq                  # then Groq as final fallback
```

## Failure events
Every failure and recovery action emits a typed event on the event bus:
| Event | Trigger |
|---|---|
| `agent.failed` | An agent turn threw an unhandled error |
| `agent.restarting` | A restart is being attempted |
| `agent.paused` | Max consecutive failures reached |
| `agent.circuit_open` | Circuit breaker tripped for this agent's provider |
| `agent.fallback_used` | A fallback provider served the turn |
| `agent.compacted` | Context was compacted |
| `agent.resumed` | A paused agent was manually or automatically resumed |
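A discriminated union makes these events convenient to handle exhaustively in TypeScript. The payload shapes below are hypothetical — Open Astra's actual event payloads may differ.

```typescript
// Hypothetical payload shapes for the events above; the real payloads may differ.
type SelfHealingEvent =
  | { type: "agent.failed"; agentId: string; error: string }
  | { type: "agent.restarting"; agentId: string; attempt: number; delayMs: number }
  | { type: "agent.paused"; agentId: string; consecutiveFailures: number }
  | { type: "agent.circuit_open"; agentId: string; provider: string }
  | { type: "agent.fallback_used"; agentId: string; provider: string }
  | { type: "agent.compacted"; agentId: string }
  | { type: "agent.resumed"; agentId: string };

// Narrowing on `type` gives typed access to each variant's fields.
function describe(e: SelfHealingEvent): string {
  switch (e.type) {
    case "agent.paused":
      return `${e.agentId} paused after ${e.consecutiveFailures} failures`;
    case "agent.fallback_used":
      return `${e.agentId} served by fallback provider ${e.provider}`;
    default:
      return `${e.agentId}: ${e.type}`;
  }
}
```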
## Manual recovery
To resume a paused agent without restarting the gateway:
```bash
# Resume a paused agent via REST
curl -X POST http://localhost:3000/agents/research-agent/resume \
  -H "Authorization: Bearer $TOKEN"

# Or via CLI
npx astra agent resume research-agent
```

## Diagnostics integration
The `npx astra doctor` command checks self-healing status as part of its agent health category. It reports any agents that are currently paused, how many failures they have accumulated, and the time of the last failure.
```bash
npx astra doctor

# Example output:
# [agents] default-agent   ✓ healthy
# [agents] research-agent  ✗ paused (3 consecutive failures, last: 14m ago)
# [agents] code-agent      ✓ healthy (circuit: half-open, 1 trial pending)
```

Route `agent.paused` and `agent.circuit_open` events to a webhook to get real-time alerts when agents go down. See Webhooks.