# Inference Providers
Open Astra supports 10 inference providers out of the box. All providers implement the same `InferenceClient` interface, so switching providers requires only a config change — no code changes.
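The interface itself is defined in the codebase; the sketch below only illustrates the idea of a shared surface. The `ChatMessage` type, the `complete` method, and its signature are assumptions for illustration, not the actual `InferenceClient` definition:

```typescript
// Hypothetical shape of a provider-agnostic inference client.
// The real InferenceClient lives in the Open Astra codebase; these
// members are illustrative assumptions only.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface InferenceClient {
  complete(messages: ChatMessage[], opts?: { maxOutputTokens?: number }): Promise<string>;
}

// Any provider just implements the same surface; here, a trivial stub:
const echoClient: InferenceClient = {
  async complete(messages) {
    return messages[messages.length - 1].content; // stub: echo the last message
  },
};
```

Because every provider presents this one surface, swapping `provider: openai` for `provider: claude` in config never touches calling code.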
## Provider reference
| Provider ID | Name | Required env var(s) | Prompt caching | Streaming |
|---|---|---|---|---|
| openai | OpenAI | OPENAI_API_KEY | 50–90% | Yes |
| claude | Anthropic Claude | ANTHROPIC_API_KEY | 90% | Yes |
| gemini | Google Gemini | GEMINI_API_KEY | 90% | Yes |
| grok | xAI Grok | GROK_API_KEY | 75% | Yes |
| groq | Groq | GROQ_API_KEY | None | Yes |
| mistral | Mistral AI | MISTRAL_API_KEY | None | Yes |
| openrouter | OpenRouter | OPENROUTER_API_KEY | Varies | Yes |
| ollama | Ollama (local) | OLLAMA_BASE_URL | None | Yes |
| vllm | vLLM (self-hosted) | VLLM_API_KEY (optional) | None | Yes |
| bedrock | AWS Bedrock | AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | None | Yes |
> ℹ️ vLLM requires an `endpoint` field in the agent config pointing to your vLLM server (e.g. `http://localhost:8000`). There is no global env var — set it per-agent in `astra.yml`. `VLLM_API_KEY` is optional and defaults to a placeholder if omitted.

## Configuring a provider for an agent
```yaml
agents:
  - id: my-agent
    model:
      provider: claude          # Provider ID from table above
      modelId: claude-opus-4-6  # Model ID as accepted by the provider's API
      maxContextTokens: 200000
      maxOutputTokens: 8192
      temperature: 0.5
```

## Provider client caching
The factory (`inference/factory.ts`) creates provider clients on demand and caches them by the key `provider:modelId:endpoint`. Creating a client is cheap but not free — the cache ensures that warm requests don't pay initialization costs.
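The memoized-factory pattern described above can be sketched as follows. `ProviderClient`, `createClient`, and `getClient` are illustrative names, not the real `inference/factory.ts` API; only the cache key format comes from the docs:

```typescript
// Hypothetical sketch of a memoized client factory keyed by provider:modelId:endpoint.
type ProviderClient = { provider: string; modelId: string; endpoint?: string };

const clientCache = new Map<string, ProviderClient>();

function createClient(provider: string, modelId: string, endpoint?: string): ProviderClient {
  // In the real factory, this is where the (non-free) initialization happens.
  return { provider, modelId, endpoint };
}

function getClient(provider: string, modelId: string, endpoint = ""): ProviderClient {
  const key = `${provider}:${modelId}:${endpoint}`; // cache key from the docs above
  let client = clientCache.get(key);
  if (!client) {
    client = createClient(provider, modelId, endpoint || undefined);
    clientCache.set(key, client);
  }
  return client;
}
```

A warm request for the same provider/model/endpoint triple returns the already-initialized instance; only the first (cold) request for a key pays the construction cost.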
## Resilience and fallback
Each provider client is wrapped in a resilient layer that adds:
- Retry with backoff — up to 2 retries with exponential backoff for 429, 500, 502, 503, 504, and network errors (`ECONNRESET`, `ETIMEDOUT`, etc.)
- Circuit breaker — each provider has a per-instance circuit breaker with three states: closed (normal), open (fast-failing after 5 failures in a 60s window), and half-open (testing recovery after 60s). When open, requests skip retries and go directly to the fallback provider
- Fallback provider — if configured, automatically routes to a secondary provider once retries and the circuit breaker are exhausted
- Timeout — the gateway-level request timeout applies to all inference calls (default 120 seconds)
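The circuit-breaker state machine above can be sketched like this. It is a simplified illustration with an injectable clock, not the actual resilient-layer code; the thresholds (5 failures, 60s window, 60s cooldown) are the ones documented above:

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Simplified circuit breaker: opens after `threshold` failures within `windowMs`,
// then probes recovery (half-open) once `cooldownMs` has elapsed.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures: number[] = []; // timestamps of recent failures
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private windowMs = 60_000,
    private cooldownMs = 60_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  currentState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow a single trial request through
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.currentState() !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = [];
  }

  recordFailure(): void {
    const t = this.now();
    // Keep only failures inside the sliding window.
    this.failures = this.failures.filter((f) => t - f < this.windowMs);
    this.failures.push(t);
    if (this.state === "half-open" || this.failures.length >= this.threshold) {
      this.state = "open";
      this.openedAt = t;
    }
  }
}
```

When `allowRequest()` returns `false` the caller would skip the retry loop entirely and route straight to the fallback provider, which is the fast-fail behavior the open state exists to provide.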
```yaml
agents:
  - id: reliable-agent
    model:
      provider: openai
      modelId: gpt-4o
      fallback:
        provider: claude
        modelId: claude-opus-4-6
```

## Ollama setup
To use Ollama for local inference, install Ollama and pull the models you want to use:
```bash
# Install Ollama (macOS)
brew install ollama

# Pull models
ollama pull llama3.1:8b
ollama pull llama3.2

# Start Ollama server (default port 11434)
ollama serve

# Set env var (optional — defaults to http://localhost:11434)
export OLLAMA_BASE_URL=http://localhost:11434
```

Then define an agent that uses Ollama in `astra.yml`:
```yaml
agents:
  local:
    displayName: Local Agent (Llama)
    tier: internal
    model:
      provider: ollama
      modelId: llama3.1:8b
      # endpoint: http://custom-host:11434  # override OLLAMA_BASE_URL per-agent
      maxContextTokens: 32768
      maxOutputTokens: 4096
      temperature: 0.7
    systemPrompt: |
      You are a helpful AI assistant running locally via Ollama.
    tools:
      allow:
        - memory-write
        - memory-search
      deny:
        - shell-execute