# Inference Overview
Open Astra abstracts LLM inference behind a unified provider interface. You configure one or more providers in `astra.yml`, and each agent selects its model by provider name and model ID. The gateway transparently handles retries, prompt caching, token counting, and local routing.
## Supported providers
| Provider | Models | Notes |
|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, o1, o3-mini, … | Default provider |
| Anthropic | claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5, … | Supports extended thinking |
| Google | gemini-2.0-flash, gemini-1.5-pro, … | Gemini embeddings also supported |
| Groq | llama-3.3-70b, mixtral-8x7b, … | Low-latency inference |
| Ollama | Any locally pulled model | Requires local Ollama server |
| Custom | Any OpenAI-compatible endpoint | See Adding a Provider |
## How routing works
Each inference call resolves a provider at runtime using the following priority:

1. Agent-level `model.provider` + `model.modelId` override
2. Workspace-level default model restriction (if set in Model Restrictions)
3. Global default provider from the `INFERENCE_DEFAULT_PROVIDER` env var
The Local Router can redirect traffic between providers based on load, cost, or availability without changing agent config.
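The resolution order above can be sketched in a few lines. The function signature and field names here are illustrative only, not Open Astra's actual API:

```python
import os

# Hypothetical sketch of the provider resolution priority described above.
# Names (resolve_provider, workspace_default, etc.) are illustrative.

def resolve_provider(agent_model=None, workspace_default=None):
    """Return the provider name for an inference call, highest priority first."""
    # 1. Agent-level model.provider override
    if agent_model and agent_model.get("provider"):
        return agent_model["provider"]
    # 2. Workspace-level default model restriction, if one is set
    if workspace_default:
        return workspace_default
    # 3. Global default provider from the environment
    return os.environ.get("INFERENCE_DEFAULT_PROVIDER", "openai")
```

The key property is that lower-priority defaults never shadow an explicit agent-level choice.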
## Prompt caching
Open Astra caches prompt prefixes for providers that support it (Anthropic, OpenAI). Cache hits reduce cost by up to 90% on repeated system prompts and context-heavy turns. See Prompt Caching for configuration.
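The core idea can be illustrated with a toy prefix cache keyed on a stable hash of the shared prompt prefix. The hashing scheme and names below are hypothetical, not how Open Astra or the providers implement it:

```python
import hashlib

# Toy prompt-prefix cache (illustrative, not Open Astra's implementation).
# Repeated turns that share the same system prompt and context hash to the
# same key, so the processed prefix can be reused instead of recomputed.

def prefix_cache_key(system_prompt: str, context: str) -> str:
    prefix = system_prompt + "\n" + context
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

cache: dict = {}

def lookup_prefix(system_prompt: str, context: str):
    """Return the cached prefix state for this prompt prefix, or None on a miss."""
    return cache.get(prefix_cache_key(system_prompt, context))
```

In practice the providers manage this server-side; the sketch only shows why identical system prompts and context-heavy turns hit the cache.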
The Semantic Cache extends this to near-duplicate user queries — if a semantically similar query was answered recently, the cached response is returned without an LLM call.
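A minimal sketch of the semantic-cache idea, assuming query embeddings and a cosine-similarity threshold. The embeddings here are plain vectors and the 0.95 threshold is an illustrative choice, not an Open Astra default:

```python
import math

# Toy semantic cache (illustrative, not Open Astra's implementation).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, embedding):
        # Return the cached response for the most similar past query,
        # but only if it clears the similarity threshold.
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A near-duplicate query embedding returns the stored response without an LLM call; a dissimilar one misses and falls through to inference.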
## Configuration example

```yaml
inference:
  defaultProvider: openai
  providers:
    openai:
      apiKey: ${OPENAI_API_KEY}
    anthropic:
      apiKey: ${ANTHROPIC_API_KEY}
    ollama:
      baseUrl: http://localhost:11434
```

## Next steps
- Providers — full list of supported providers and config
- Prompt Caching — reduce cost on repeated context
- Local Router — route between providers dynamically
- Adding a Provider — bring your own OpenAI-compatible endpoint