# Self-Healing
Open Astra includes built-in self-healing capabilities that automatically recover from agent failures, manage context overflow, and keep agents running without manual intervention.
## Which errors trigger retry
Not all errors are retried. The self-healing system classifies errors into retriable and non-retriable categories:
| Error type | Retried? | Reason |
|---|---|---|
| Provider network error (ECONNRESET, ETIMEDOUT) | Yes | Transient — network blip or provider restart |
| Provider 429 (rate limited) | Yes | Retried after backoff; respects Retry-After header if present |
| Provider 500 / 502 / 503 | Yes | Transient server error — provider may recover |
| Provider 504 (gateway timeout) | Yes | Slow response; retry with lower token count if possible |
| Tool execution error | Yes | Tool failure is reported back to the agent as a tool result; the agent decides whether to retry |
| Context length exceeded | Yes — after compaction | Triggers compaction first, then retries the turn |
| Provider 401 (auth error) | No | Bad API key — retrying will not help; emits alert immediately |
| Provider 400 (bad request) | No | Malformed prompt or invalid params — retrying would produce same error |
| Zod validation failure on tool output | No | Tool returned unexpected schema — flagged for investigation |
| Budget exceeded | No | Hard cap hit — turn is rejected, not retried |
| Ethical check rejection | No | Policy violation — retrying the same input would fail again |
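The table's split can be sketched as a small predicate. This is an illustrative example, assuming provider errors expose an HTTP status and network errors a Node-style `code`; `isRetriable` is not Open Astra's actual API.

```typescript
// Illustrative classifier for the table above; not Open Astra's actual API.
const RETRIABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

function isRetriable(err: { status?: number; code?: string }): boolean {
  if (err.code && RETRIABLE_CODES.has(err.code)) return true; // network blip
  if (err.status === 429) return true; // rate limited: retry after backoff
  if (err.status && err.status >= 500 && err.status <= 504) return true; // transient server error
  return false; // 400, 401, validation, budget, policy: retrying won't help
}
```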
## Automatic restart with exponential backoff

When a retriable error occurs, the self-healing system retries with exponential backoff. The initial delay is set by `restartDelayMs` and doubles on each attempt:
- 1st retry: 2 seconds
- 2nd retry: 4 seconds
- 3rd retry: 8 seconds
- After `maxConsecutiveFailures` failures: the agent is paused and an `agent.paused` event is emitted
The failure counter resets to zero after any successful turn.
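The retry loop above can be sketched as follows. This is a minimal sketch, assuming a hypothetical `runTurn` callback; pausing and event emission happen in the caller.

```typescript
// Retry with exponential backoff: 2 s, 4 s, 8 s, ... doubling each attempt.
// `runTurn` is a hypothetical stand-in for executing one agent turn.
async function runWithBackoff(
  runTurn: () => Promise<void>,
  restartDelayMs = 2000,
  maxConsecutiveFailures = 3,
): Promise<void> {
  let failures = 0;
  while (true) {
    try {
      await runTurn();
      failures = 0; // counter resets after any successful turn
      return;
    } catch (err) {
      failures++;
      if (failures >= maxConsecutiveFailures) {
        throw err; // caller pauses the agent and emits agent.paused
      }
      const delay = restartDelayMs * 2 ** (failures - 1); // doubles each retry
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```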
## Circuit breaker
The circuit breaker prevents a failing provider from being hammered with requests during an outage. It operates as a sliding-window counter with three states:
| State | Behavior | Transition |
|---|---|---|
| Closed | Normal operation — all requests go to provider | → Open when 5 failures occur within 60 s |
| Open | All requests fail immediately (no provider call) | → Half-open after 30 s |
| Half-open | One trial request is sent to the provider | → Closed on success; → Open on failure |
The default thresholds — 5 failures in a 60-second window — are set to distinguish a brief spike from a sustained outage. Adjust `failureThreshold` and `windowMs` to match your provider's SLA.
## Fallback provider routing
When an agent's primary provider is unavailable (circuit open or max retries reached), the self-healing system can automatically route to a fallback provider. Fallback providers are tried in order:
- The agent attempts inference with its configured provider.
- On failure, the system checks the `fallback.providers` list.
- Each fallback provider is tried in order, using the same model family if available, or the workspace default model for that provider.
- If all fallback providers fail, the turn is rejected and `agent.paused` is emitted.
Fallback routing is transparent to the agent — it sees the same tool schema and response format regardless of which provider served the request. The actual provider used is recorded in the turn's metadata for observability.
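The routing order can be sketched as a simple loop. This is an illustrative sketch; `callProvider` and the return shape are assumptions, not Open Astra's API.

```typescript
// Try the primary provider, then each fallback in order; names are illustrative.
async function inferWithFallback(
  prompt: string,
  providers: string[], // primary first, then fallback.providers in order
  callProvider: (provider: string, prompt: string) => Promise<string>,
): Promise<{ text: string; provider: string }> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      const text = await callProvider(provider, prompt);
      return { text, provider }; // provider used is recorded in turn metadata
    } catch (err) {
      lastError = err; // try the next provider in order
    }
  }
  throw lastError; // all providers failed: turn rejected, agent.paused emitted
}
```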
## Context compaction
When the assembled context approaches the model's token limit, the compaction system automatically summarizes old conversation messages to free up space. This allows agents to maintain continuous sessions indefinitely without losing important context.
Compaction is triggered when the context size exceeds `compactionThreshold` (expressed as a fraction of `maxContextTokens`). At that point:
- The oldest 50% of conversation messages are extracted
- A lightweight model call generates a concise summary of those messages
- The raw messages are replaced with the summary as a single `system` message
- Recent messages (the newer 50%) are kept verbatim
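The steps above can be sketched as follows, assuming hypothetical token-counting and summarizer helpers (the names are illustrative).

```typescript
// Compaction sketch: summarize the oldest half once the threshold is crossed.
interface Message { role: "system" | "user" | "assistant"; content: string; }

function maybeCompact(
  messages: Message[],
  countTokens: (msgs: Message[]) => number,  // assumed helper
  summarize: (msgs: Message[]) => string,    // assumed lightweight model call
  maxContextTokens = 128_000,
  compactionThreshold = 0.85,
): Message[] {
  if (countTokens(messages) < compactionThreshold * maxContextTokens) {
    return messages; // under threshold: nothing to do
  }
  const cut = Math.floor(messages.length / 2);
  const oldest = messages.slice(0, cut); // summarized away
  const recent = messages.slice(cut);    // kept verbatim
  const summary: Message = { role: "system", content: summarize(oldest) };
  return [summary, ...recent];
}
```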
## Configuration
```yaml
selfHealing:
  enabled: true
  maxConsecutiveFailures: 3   # Failures before agent is paused
  restartDelayMs: 2000        # Initial backoff (doubles on each retry)
  compactionThreshold: 0.85   # Compact at 85% of maxContextTokens
  circuitBreaker:
    enabled: true
    failureThreshold: 5       # trips after 5 failures in the window
    windowMs: 60000           # 60-second rolling window
    halfOpenAfterMs: 30000    # try one request after 30s in open state
  fallback:
    enabled: true
    providers:
      - anthropic             # try Anthropic if primary provider fails
      - groq                  # then Groq as final fallback
```

## Failure events
Every failure and recovery action emits a typed event on the event bus:
| Event | Trigger |
|---|---|
| `agent.failed` | An agent turn threw an unhandled error |
| `agent.restarting` | A restart is being attempted |
| `agent.paused` | Max consecutive failures reached |
| `agent.circuit_open` | Circuit breaker tripped for this agent's provider |
| `agent.fallback_used` | A fallback provider served the turn |
| `agent.compacted` | Context was compacted |
| `agent.resumed` | A paused agent was manually or automatically resumed |
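A discriminated union makes these events convenient to handle exhaustively in TypeScript. The payload shapes below are hypothetical — Open Astra's actual event payloads may differ.

```typescript
// Hypothetical payload shapes for the events above; the real payloads may differ.
type SelfHealingEvent =
  | { type: "agent.failed"; agentId: string; error: string }
  | { type: "agent.restarting"; agentId: string; attempt: number; delayMs: number }
  | { type: "agent.paused"; agentId: string; consecutiveFailures: number }
  | { type: "agent.circuit_open"; agentId: string; provider: string }
  | { type: "agent.fallback_used"; agentId: string; provider: string }
  | { type: "agent.compacted"; agentId: string }
  | { type: "agent.resumed"; agentId: string };

// Narrowing on `type` gives typed access to each variant's fields.
function describe(e: SelfHealingEvent): string {
  switch (e.type) {
    case "agent.paused":
      return `${e.agentId} paused after ${e.consecutiveFailures} failures`;
    case "agent.fallback_used":
      return `${e.agentId} served by fallback provider ${e.provider}`;
    default:
      return `${e.agentId}: ${e.type}`;
  }
}
```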
## Manual recovery
To resume a paused agent without restarting the gateway:
```bash
# Resume a paused agent via REST
curl -X POST http://localhost:3000/agents/research-agent/resume \
  -H "Authorization: Bearer $TOKEN"

# Or via CLI
npx astra agent resume research-agent
```

## Diagnostics integration
The `npx astra doctor` command checks self-healing status as part of its agent health category. It reports any agents that are currently paused, how many failures they have accumulated, and the time of the last failure.
```bash
npx astra doctor

# Example output:
# [agents] default-agent   ✓ healthy
# [agents] research-agent  ✗ paused (3 consecutive failures, last: 14m ago)
# [agents] code-agent      ✓ healthy (circuit: half-open, 1 trial pending)
```

Route `agent.paused` and `agent.circuit_open` events to a webhook to get real-time alerts when agents go down. See Webhooks.