# Inference Providers
Open Astra supports 10 inference providers out of the box. All providers implement the same `InferenceClient` interface, so switching providers requires only a config change — no code changes.
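The interface itself is defined in the codebase; the sketch below only illustrates the idea of a shared surface. The `ChatMessage` type, the `complete` method, and its signature are assumptions for illustration, not the actual `InferenceClient` definition:

```typescript
// Hypothetical shape of a provider-agnostic inference client.
// The real InferenceClient lives in the Open Astra codebase; these
// members are illustrative assumptions only.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface InferenceClient {
  complete(messages: ChatMessage[], opts?: { maxOutputTokens?: number }): Promise<string>;
}

// Any provider just implements the same surface; here, a trivial stub:
const echoClient: InferenceClient = {
  async complete(messages) {
    return messages[messages.length - 1].content; // stub: echo the last message
  },
};
```

Because every provider presents this one surface, swapping `provider: openai` for `provider: claude` in config never touches calling code.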
## Provider reference
| Provider ID | Name | Required env var(s) | Prompt caching | Streaming |
|---|---|---|---|---|
| openai | OpenAI | OPENAI_API_KEY | 50–90% | Yes |
| claude | Anthropic Claude | ANTHROPIC_API_KEY | 90% | Yes |
| gemini | Google Gemini | GEMINI_API_KEY | 90% | Yes |
| grok | xAI Grok | GROK_API_KEY | 75% | Yes |
| groq | Groq | GROQ_API_KEY | None | Yes |
| mistral | Mistral AI | MISTRAL_API_KEY | None | Yes |
| openrouter | OpenRouter | OPENROUTER_API_KEY | Varies | Yes |
| ollama | Ollama (local) | OLLAMA_BASE_URL | None | Yes |
| vllm | vLLM (self-hosted) | VLLM_API_KEY (optional) | None | Yes |
| bedrock | AWS Bedrock | AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | None | Yes |
> ℹ️ vLLM requires an `endpoint` field in the agent config pointing to your vLLM server (e.g. `http://localhost:8000`). There is no global env var — set it per-agent in `astra.yml`. `VLLM_API_KEY` is optional and defaults to a placeholder if omitted.

## Configuring a provider for an agent
```yaml
agents:
  - id: my-agent
    model:
      provider: claude          # Provider ID from table above
      modelId: claude-opus-4-6  # Model ID as accepted by the provider's API
      maxContextTokens: 200000
      maxOutputTokens: 8192
      temperature: 0.5
```

## Provider client caching
The factory (`inference/factory.ts`) creates provider clients on demand and caches them by the key `provider:modelId:endpoint`. Creating a client is cheap but not free — the cache ensures that warm requests don't pay initialization costs.
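The memoized-factory pattern described above can be sketched as follows. `ProviderClient`, `createClient`, and `getClient` are illustrative names, not the real `inference/factory.ts` API; only the cache key format comes from the docs:

```typescript
// Hypothetical sketch of a memoized client factory keyed by provider:modelId:endpoint.
type ProviderClient = { provider: string; modelId: string; endpoint?: string };

const clientCache = new Map<string, ProviderClient>();

function createClient(provider: string, modelId: string, endpoint?: string): ProviderClient {
  // In the real factory, this is where the (non-free) initialization happens.
  return { provider, modelId, endpoint };
}

function getClient(provider: string, modelId: string, endpoint = ""): ProviderClient {
  const key = `${provider}:${modelId}:${endpoint}`; // cache key from the docs above
  let client = clientCache.get(key);
  if (!client) {
    client = createClient(provider, modelId, endpoint || undefined);
    clientCache.set(key, client);
  }
  return client;
}
```

A warm request for the same provider/model/endpoint triple returns the already-initialized instance; only the first (cold) request for a key pays the construction cost.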
## Resilience and fallback
Each provider client is wrapped in a resilient layer that adds:
- Retry with backoff — up to 2 retries with exponential backoff for 429, 500, 502, 503, 504, and network errors (`ECONNRESET`, `ETIMEDOUT`, etc.)
- Circuit breaker — each provider has a per-instance circuit breaker with three states: closed (normal), open (fast-failing after 5 failures in a 60s window), and half-open (testing recovery after 60s). When open, requests skip retries and go directly to the fallback provider
- Fallback provider — if configured, automatically routes to a secondary provider once retries and the circuit breaker are exhausted
- Timeout — the gateway-level request timeout applies to all inference calls (default 120 seconds)
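The circuit-breaker state machine above can be sketched like this. It is a simplified illustration with an injectable clock, not the actual resilient-layer code; the thresholds (5 failures, 60s window, 60s cooldown) are the ones documented above:

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Simplified circuit breaker: opens after `threshold` failures within `windowMs`,
// then probes recovery (half-open) once `cooldownMs` has elapsed.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures: number[] = []; // timestamps of recent failures
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private windowMs = 60_000,
    private cooldownMs = 60_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  currentState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow a single trial request through
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.currentState() !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = [];
  }

  recordFailure(): void {
    const t = this.now();
    // Keep only failures inside the sliding window.
    this.failures = this.failures.filter((f) => t - f < this.windowMs);
    this.failures.push(t);
    if (this.state === "half-open" || this.failures.length >= this.threshold) {
      this.state = "open";
      this.openedAt = t;
    }
  }
}
```

When `allowRequest()` returns `false` the caller would skip the retry loop entirely and route straight to the fallback provider, which is the fast-fail behavior the open state exists to provide.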
```yaml
agents:
  - id: reliable-agent
    model:
      provider: openai
      modelId: gpt-4o
      fallback:
        provider: claude
        modelId: claude-opus-4-6
```

## Ollama setup
To use Ollama for local inference, install Ollama and pull the models you want to use:
```bash
# Install Ollama (macOS)
brew install ollama

# Pull models
ollama pull llama3.1:8b
ollama pull llama3.2

# Start Ollama server (default port 11434)
ollama serve

# Set env var (optional — defaults to http://localhost:11434)
export OLLAMA_BASE_URL=http://localhost:11434
```

Then define an agent that uses Ollama in `astra.yml`:
```yaml
agents:
  local:
    displayName: Local Agent (Llama)
    tier: internal
    model:
      provider: ollama
      modelId: llama3.1:8b
      # endpoint: http://custom-host:11434  # override OLLAMA_BASE_URL per-agent
      maxContextTokens: 32768
      maxOutputTokens: 4096
      temperature: 0.7
    systemPrompt: |
      You are a helpful AI assistant running locally via Ollama.
    tools:
      allow:
        - memory-write
        - memory-search
      deny:
        - shell-execute