
Self-Healing

Open Astra includes built-in self-healing capabilities that automatically recover from agent failures, manage context overflow, and keep agents running without manual intervention.

Which errors trigger retry

Not all errors are retried. The self-healing system classifies errors into retriable and non-retriable categories:

| Error type | Retried? | Reason |
| --- | --- | --- |
| Provider network error (`ECONNRESET`, `ETIMEDOUT`) | Yes | Transient — network blip or provider restart |
| Provider 429 (rate limited) | Yes | Retried after backoff; respects the `Retry-After` header if present |
| Provider 500 / 502 / 503 | Yes | Transient server error — the provider may recover |
| Provider 504 (gateway timeout) | Yes | Slow response; retried with a lower token count if possible |
| Tool execution error | Yes | Tool failure is reported back to the agent as a tool result; the agent decides whether to retry |
| Context length exceeded | Yes — after compaction | Triggers compaction first, then retries the turn |
| Provider 401 (auth error) | No | Bad API key — retrying will not help; an alert is emitted immediately |
| Provider 400 (bad request) | No | Malformed prompt or invalid params — retrying would produce the same error |
| Zod validation failure on tool output | No | Tool returned an unexpected schema — flagged for investigation |
| Budget exceeded | No | Hard cap hit — the turn is rejected, not retried |
| Ethical check rejection | No | Policy violation — retrying the same input would fail again |
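
The classification above can be sketched as a single decision function. This is an illustrative sketch, not Open Astra's actual implementation; the names `ProviderError`, `classifyError`, and `RetryDecision` are assumptions:

```typescript
// Illustrative retry classification; names are assumptions, not Open Astra's API.
type RetryDecision = "retry" | "retry-after-compaction" | "reject";

interface ProviderError {
  status?: number;           // HTTP status, if the provider returned one
  code?: string;             // e.g. "ECONNRESET", "ETIMEDOUT"
  contextOverflow?: boolean; // provider signalled context length exceeded
}

function classifyError(err: ProviderError): RetryDecision {
  if (err.contextOverflow) return "retry-after-compaction"; // compact, then retry
  if (err.code === "ECONNRESET" || err.code === "ETIMEDOUT") return "retry";
  switch (err.status) {
    case 429: case 500: case 502: case 503: case 504:
      return "retry";  // transient provider-side errors
    case 400: case 401:
      return "reject"; // retrying cannot help
    default:
      return "reject"; // unknown errors are non-retriable in this sketch
  }
}
```

Tool execution errors never reach this path — they are surfaced to the agent as tool results rather than thrown.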

Automatic restart with exponential backoff

When a retriable error occurs, the self-healing system retries with exponential backoff. The initial delay is set by `restartDelayMs` and doubles on each attempt:

- 1st retry: 2 seconds
- 2nd retry: 4 seconds
- 3rd retry: 8 seconds
- After `maxConsecutiveFailures`: the agent is paused and an `agent.paused` event is emitted

The failure counter resets to zero after any successful turn.
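
The schedule above is a plain doubling of `restartDelayMs`. A one-line sketch (the function name is illustrative):

```typescript
// Delay before the Nth retry: restartDelayMs doubles on each attempt.
function backoffDelayMs(attempt: number, restartDelayMs = 2000): number {
  // attempt is 1-based: the 1st retry waits restartDelayMs, the 2nd twice that, etc.
  return restartDelayMs * 2 ** (attempt - 1);
}
```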

Circuit breaker

The circuit breaker prevents a failing provider from being hammered with requests during an outage. It operates as a sliding-window counter with three states:

| State | Behavior | Transition |
| --- | --- | --- |
| Closed | Normal operation — all requests go to the provider | → Open when 5 failures occur within 60 s |
| Open | All requests fail immediately (no provider call) | → Half-open after 30 s |
| Half-open | One trial request is sent to the provider | → Closed on success; → Open on failure |

The default thresholds — 5 failures in a 60-second window — are set to distinguish a brief spike from a sustained outage. Adjust `failureThreshold` and `windowMs` to match your provider's SLA.

The circuit breaker is per-agent, not per-provider. If multiple agents use the same provider and one trips, the others continue normally. This is intentional — a single misbehaving agent should not affect the rest of the workspace.
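
A minimal sliding-window breaker matching the defaults above might look like this. Class and method names are assumptions, not Open Astra's internals:

```typescript
// Sketch of a sliding-window circuit breaker with the documented defaults:
// 5 failures in 60 s trips it open; one trial request is allowed after 30 s.
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures: number[] = []; // timestamps of recent failures
  private openedAt = 0;
  state: State = "closed";

  constructor(
    private failureThreshold = 5,
    private windowMs = 60_000,
    private halfOpenAfterMs = 30_000,
  ) {}

  canRequest(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.halfOpenAfterMs) {
      this.state = "half-open"; // allow one trial request
    }
    return this.state !== "open";
  }

  onSuccess(): void {
    this.failures = [];
    this.state = "closed";
  }

  onFailure(now: number): void {
    if (this.state === "half-open") {
      this.state = "open"; // trial request failed: reopen
      this.openedAt = now;
      return;
    }
    // Drop failures that have aged out of the rolling window, then count.
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```

The sliding window is what distinguishes a spike from an outage: old failures age out, so five failures spread over ten minutes never trip the breaker.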

Fallback provider routing

When an agent's primary provider is unavailable (circuit open or max retries reached), the self-healing system can automatically route to a fallback provider. Fallback providers are tried in order:

  1. The agent attempts inference with its configured provider.
  2. On failure, the system checks the `fallback.providers` list.
  3. Each fallback provider is tried in order, using the same model family if available, or the workspace default model for that provider.
  4. If all fallback providers fail, the turn is rejected and `agent.paused` is emitted.

Fallback routing is transparent to the agent — it sees the same tool schema and response format regardless of which provider served the request. The actual provider used is recorded in the turn's metadata for observability.
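
The routing loop in steps 1–4 can be sketched as follows. `tryProvider` stands in for a real inference call, and the provider names in the usage below are illustrative:

```typescript
// Illustrative fallback loop: try the primary, then each fallback in order.
async function runTurn(
  primary: string,
  fallbacks: string[],
  tryProvider: (provider: string) => Promise<string>,
): Promise<{ provider: string; output: string }> {
  for (const provider of [primary, ...fallbacks]) {
    try {
      const output = await tryProvider(provider);
      return { provider, output }; // provider used goes into turn metadata
    } catch {
      continue; // circuit open or max retries reached: try the next provider
    }
  }
  throw new Error("all providers failed: turn rejected, agent paused");
}
```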

Context compaction

When the assembled context approaches the model's token limit, the compaction system automatically summarizes old conversation messages to free up space. This allows agents to maintain continuous sessions indefinitely without losing important context.

Compaction is triggered when the context size exceeds `compactionThreshold` (expressed as a fraction of `maxContextTokens`). At that point:

  1. The oldest 50% of conversation messages are extracted.
  2. A lightweight model call generates a concise summary of those messages.
  3. The raw messages are replaced with the summary as a single system message.
  4. Recent messages (the newer 50%) are kept verbatim.
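
Under the stated trigger rule, the check and the 50/50 split can be sketched as follows (function names are assumptions, and token counts are assumed to be available):

```typescript
// Compaction fires when context use crosses the configured fraction of the limit.
function shouldCompact(
  contextTokens: number,
  maxContextTokens: number,
  compactionThreshold = 0.85,
): boolean {
  return contextTokens > compactionThreshold * maxContextTokens;
}

// Split messages: the older half is summarized, the newer half kept verbatim.
function splitForCompaction<T>(messages: T[]): { toSummarize: T[]; toKeep: T[] } {
  const mid = Math.floor(messages.length / 2);
  return { toSummarize: messages.slice(0, mid), toKeep: messages.slice(mid) };
}
```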

Configuration

```yaml
selfHealing:
  enabled: true
  maxConsecutiveFailures: 3      # Failures before the agent is paused
  restartDelayMs: 2000           # Initial backoff (doubles on each retry)
  compactionThreshold: 0.85      # Compact at 85% of maxContextTokens
  circuitBreaker:
    enabled: true
    failureThreshold: 5          # Trips after 5 failures in the window
    windowMs: 60000              # 60-second rolling window
    halfOpenAfterMs: 30000       # Try one request after 30 s in the open state
  fallback:
    enabled: true
    providers:
      - anthropic                # Try Anthropic if the primary provider fails
      - groq                     # Then Groq as the final fallback
```

Failure events

Every failure and recovery action emits a typed event on the event bus:

| Event | Trigger |
| --- | --- |
| `agent.failed` | An agent turn threw an unhandled error |
| `agent.restarting` | A restart is being attempted |
| `agent.paused` | Max consecutive failures reached |
| `agent.circuit_open` | Circuit breaker tripped for this agent's provider |
| `agent.fallback_used` | A fallback provider served the turn |
| `agent.compacted` | Context was compacted |
| `agent.resumed` | A paused agent was manually or automatically resumed |
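
Subscribing to these events might look like the sketch below. The `on`/`emit` bus API and the payload shapes are assumptions for illustration, not Open Astra's documented interface:

```typescript
// Illustrative event types; payload shapes are assumptions.
type AgentEvent =
  | { type: "agent.paused"; agentId: string; failures: number }
  | { type: "agent.fallback_used"; agentId: string; provider: string };

// Tiny in-process bus standing in for the real event bus.
const handlers: ((e: AgentEvent) => void)[] = [];
const on = (h: (e: AgentEvent) => void) => handlers.push(h);
const emit = (e: AgentEvent) => handlers.forEach((h) => h(e));

// Collect pause alerts, e.g. to forward to a webhook.
const alerts: string[] = [];
on((e) => {
  if (e.type === "agent.paused") {
    alerts.push(`${e.agentId} paused after ${e.failures} failures`);
  }
});

emit({ type: "agent.paused", agentId: "research-agent", failures: 3 });
```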

Manual recovery

To resume a paused agent without restarting the gateway:

```bash
# Resume a paused agent via REST
curl -X POST http://localhost:3000/agents/research-agent/resume \
  -H "Authorization: Bearer $TOKEN"

# Or via the CLI
npx astra agent resume research-agent
```

Diagnostics integration

The `npx astra doctor` command checks self-healing status as part of its agent health category. It reports any agents that are currently paused, how many failures they have accumulated, and the time of the last failure.

```bash
npx astra doctor

# Example output:
# [agents] default-agent         ✓ healthy
# [agents] research-agent        ✗ paused (3 consecutive failures, last: 14m ago)
# [agents] code-agent            ✓ healthy (circuit: half-open, 1 trial pending)
```
💡 Wire `agent.paused` and `agent.circuit_open` events to a webhook to get real-time alerts when agents go down. See Webhooks.