Inference

Inference Overview

Open Astra abstracts LLM inference behind a unified provider interface. You configure one or more providers in `astra.yml`, and each agent selects its model by provider name and model ID. The gateway handles retries, prompt caching, token counting, and local routing transparently.

Supported providers

| Provider | Models | Notes |
| --- | --- | --- |
| OpenAI | `gpt-4o`, `gpt-4o-mini`, `o1`, `o3-mini`, … | Default provider |
| Anthropic | `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5`, … | Supports extended thinking |
| Google | `gemini-2.0-flash`, `gemini-1.5-pro`, … | Gemini embeddings also supported |
| Groq | `llama-3.3-70b`, `mixtral-8x7b`, … | Low-latency inference |
| Ollama | Any locally pulled model | Requires local Ollama server |
| Custom | Any OpenAI-compatible endpoint | See Adding a Provider |

How routing works

Each inference call resolves its provider at runtime using the following priority:

  1. Agent-level `model.provider` + `model.modelId` override
  2. Workspace-level default model restriction (if set in Model Restrictions)
  3. Global default provider from the `INFERENCE_DEFAULT_PROVIDER` environment variable

The Local Router can redirect traffic between providers based on load, cost, or availability without changing agent config.

Prompt caching

Open Astra caches prompt prefixes for providers that support it (Anthropic, OpenAI). Cache hits reduce cost by up to 90% on repeated system prompts and context-heavy turns. See Prompt Caching for configuration.
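To see where the savings come from: providers that support prompt caching bill cached input tokens at a steep discount relative to the base input rate. The arithmetic can be sketched as below; the 90% discount and the $3/MTok price are illustrative assumptions, not Open Astra or provider pricing:

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_mtok: float, cache_discount: float = 0.90) -> float:
    """Input cost in dollars when `cached_tokens` of the prompt hit the cache.

    `cache_discount=0.90` assumes cache reads bill at 10% of the base
    rate — an illustrative figure; check your provider's pricing.
    """
    fresh = total_tokens - cached_tokens
    cached_rate = price_per_mtok * (1 - cache_discount)
    return (fresh * price_per_mtok + cached_tokens * cached_rate) / 1_000_000

# A 50k-token context where 45k tokens (system prompt + history) are cached:
full = input_cost(50_000, 0, 3.0)       # no cache hits:  $0.15
hit = input_cost(50_000, 45_000, 3.0)   # 90% cached:     $0.0285
```

With 90% of the prompt cached at a 90% discount, the turn costs 81% less than the uncached call, which is where the "up to 90%" figure comes from as the cached fraction approaches the whole prompt.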

The Semantic Cache extends this to near-duplicate user queries: if a semantically similar query was answered recently, the cached response is returned without an LLM call.

Configuration example

```yaml
inference:
  defaultProvider: openai
  providers:
    openai:
      apiKey: ${OPENAI_API_KEY}
    anthropic:
      apiKey: ${ANTHROPIC_API_KEY}
    ollama:
      baseUrl: http://localhost:11434
```

Next steps