# Inference

## Inference Providers

Open Astra supports 10 inference providers out of the box. All providers implement the same InferenceClient interface, so switching providers requires only a config change — no code changes.
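To illustrate the shape of a shared client interface, here is a minimal sketch. The type and method names below are illustrative assumptions, not Open Astra's actual definitions:

```typescript
// Hypothetical sketch of a shared inference-client interface.
interface InferenceRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  maxOutputTokens?: number;
  temperature?: number;
}

interface InferenceResponse {
  text: string;
  usage: { inputTokens: number; outputTokens: number };
}

interface InferenceClient {
  complete(req: InferenceRequest): Promise<InferenceResponse>;
  stream?(req: InferenceRequest): AsyncIterable<string>;
}

// A trivial stub implementation showing how any provider plugs in
// behind the same interface.
class EchoClient implements InferenceClient {
  async complete(req: InferenceRequest): Promise<InferenceResponse> {
    const last = req.messages[req.messages.length - 1];
    return { text: last.content, usage: { inputTokens: 0, outputTokens: 0 } };
  }
}
```

Because every provider implements the same contract, swapping `provider` in the config is enough to route requests to a different backend.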

### Provider reference

| Provider ID | Name | Required env var(s) | Prompt caching | Streaming |
| --- | --- | --- | --- | --- |
| `openai` | OpenAI | `OPENAI_API_KEY` | 50–90% | Yes |
| `claude` | Anthropic Claude | `ANTHROPIC_API_KEY` | 90% | Yes |
| `gemini` | Google Gemini | `GEMINI_API_KEY` | 90% | Yes |
| `grok` | xAI Grok | `GROK_API_KEY` | 75% | Yes |
| `groq` | Groq | `GROQ_API_KEY` | None | Yes |
| `mistral` | Mistral AI | `MISTRAL_API_KEY` | None | Yes |
| `openrouter` | OpenRouter | `OPENROUTER_API_KEY` | Varies | Yes |
| `ollama` | Ollama (local) | `OLLAMA_BASE_URL` | None | Yes |
| `vllm` | vLLM (self-hosted) | `VLLM_API_KEY` (optional) | None | Yes |
| `bedrock` | AWS Bedrock | `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` | None | Yes |
vLLM requires an endpoint field in the agent config pointing to your vLLM server (e.g. http://localhost:8000). There is no global env var — set it per-agent in astra.yml. VLLM_API_KEY is optional and defaults to a placeholder if omitted.
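A per-agent vLLM configuration might look like the following (the model ID is an illustrative example, not a required value):

```yaml
agents:
  - id: self-hosted
    model:
      provider: vllm
      endpoint: http://localhost:8000   # required: points at your vLLM server
      modelId: meta-llama/Llama-3.1-8B-Instruct
```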

### Configuring a provider for an agent

```yaml
agents:
  - id: my-agent
    model:
      provider: claude            # Provider ID from table above
      modelId: claude-opus-4-6    # Model ID as accepted by the provider's API
      maxContextTokens: 200000
      maxOutputTokens: 8192
      temperature: 0.5
```

### Provider client caching

The factory (inference/factory.ts) creates provider clients on demand and caches them by the key provider:modelId:endpoint. Creating a client is cheap but not free — the cache ensures that warm requests don't pay initialization costs.
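The caching scheme can be sketched as a map keyed by the composite string. The names below (`getClient`, `FakeClient`) are illustrative, not the factory's actual API:

```typescript
// Minimal sketch of client caching keyed by provider:modelId:endpoint.
class FakeClient {
  constructor(public provider: string, public modelId: string) {}
}

const cache = new Map<string, FakeClient>();

function getClient(provider: string, modelId: string, endpoint = ""): FakeClient {
  const key = `${provider}:${modelId}:${endpoint}`;
  let client = cache.get(key);
  if (!client) {
    // Cold path: pay the initialization cost once per unique key.
    client = new FakeClient(provider, modelId);
    cache.set(key, client);
  }
  return client;
}
```

Including `endpoint` in the key matters for providers like vLLM and Ollama, where the same model ID can point at different servers.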

## Resilience and fallback

Each provider client is wrapped in a resilient layer that adds:

- **Retry with backoff** — up to 2 retries with exponential backoff for 429, 500, 502, 503, 504, and network errors (`ECONNRESET`, `ETIMEDOUT`, etc.)
- **Circuit breaker** — each provider has a per-instance circuit breaker with three states: closed (normal), open (fast-failing after 5 failures in a 60s window), and half-open (testing recovery after 60s). When open, requests skip retries and go directly to the fallback provider
- **Fallback provider** — if configured, automatically routes to a secondary provider once retries and the circuit breaker are exhausted
- **Timeout** — the gateway-level request timeout applies to all inference calls (default 120 seconds)
For example, to fall back from OpenAI to Claude:

```yaml
agents:
  - id: reliable-agent
    model:
      provider: openai
      modelId: gpt-4o
    fallback:
      provider: claude
      modelId: claude-opus-4-6
```
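The retry behavior described above can be sketched as a small wrapper. The function and constant names here are assumptions for illustration, not the actual resilient layer:

```typescript
// Illustrative retry-with-backoff wrapper. Retries up to `maxRetries`
// additional times on retryable errors, doubling the delay each attempt.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 2,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const retryable =
        RETRYABLE_STATUSES.has(err?.status) ||
        err?.code === "ECONNRESET" ||
        err?.code === "ETIMEDOUT";
      if (!retryable || attempt >= maxRetries) throw err;
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

With the default of 2 retries, a call is attempted at most 3 times before the error propagates to the circuit breaker and fallback logic.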

## Ollama setup

To use Ollama for local inference, install Ollama and pull the models you want to use:

```bash
# Install Ollama (macOS)
brew install ollama

# Pull models
ollama pull llama3.1:8b
ollama pull llama3.2

# Start Ollama server (default port 11434)
ollama serve

# Set env var (optional — defaults to http://localhost:11434)
export OLLAMA_BASE_URL=http://localhost:11434
```

Then define an agent that uses Ollama in astra.yml:

```yaml
agents:
  - id: local
    displayName: Local Agent (Llama)
    tier: internal
    model:
      provider: ollama
      modelId: llama3.1:8b
      # endpoint: http://custom-host:11434  # override OLLAMA_BASE_URL per-agent
      maxContextTokens: 32768
      maxOutputTokens: 4096
      temperature: 0.7
    systemPrompt: |
      You are a helpful AI assistant running locally via Ollama.
    tools:
      allow:
        - memory-write
        - memory-search
      deny:
        - shell-execute
```