Inference

Inference Overview

Open Astra abstracts LLM inference behind a unified provider interface. You configure one or more providers in `astra.yml`, and each agent selects its model by provider name and model ID. The gateway handles retries, prompt caching, token counting, and local routing transparently.

Supported providers

| Provider | Models | Notes |
| --- | --- | --- |
| OpenAI | `gpt-4o`, `gpt-4o-mini`, `o1`, `o3-mini`, … | Default provider |
| Anthropic | `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5`, … | Supports extended thinking |
| Google | `gemini-2.0-flash`, `gemini-1.5-pro`, … | Gemini embeddings also supported |
| Groq | `llama-3.3-70b`, `mixtral-8x7b`, … | Low-latency inference |
| Ollama | Any locally pulled model | Requires local Ollama server |
| Custom | Any OpenAI-compatible endpoint | See Adding a Provider |

How routing works

Each inference call resolves its provider at runtime using the following priority:

  1. Agent-level `model.provider` + `model.modelId` override
  2. Workspace-level default model restriction (if set in Model Restrictions)
  3. Global default provider from the `INFERENCE_DEFAULT_PROVIDER` environment variable

The Local Router can redirect traffic between providers based on load, cost, or availability without changing agent config.

Prompt caching

Open Astra caches prompt prefixes for providers that support it (Anthropic, OpenAI). Cache hits reduce cost by up to 90% on repeated system prompts and context-heavy turns. See Prompt Caching for configuration.
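To see where the savings come from: providers that support prompt caching bill cached input tokens at a steep discount relative to the base input rate. The arithmetic can be sketched as below; the 90% discount and the $3/MTok price are illustrative assumptions, not Open Astra or provider pricing:

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_mtok: float, cache_discount: float = 0.90) -> float:
    """Input cost in dollars when `cached_tokens` of the prompt hit the cache.

    `cache_discount=0.90` assumes cache reads bill at 10% of the base
    rate — an illustrative figure; check your provider's pricing.
    """
    fresh = total_tokens - cached_tokens
    cached_rate = price_per_mtok * (1 - cache_discount)
    return (fresh * price_per_mtok + cached_tokens * cached_rate) / 1_000_000

# A 50k-token context where 45k tokens (system prompt + history) are cached:
full = input_cost(50_000, 0, 3.0)       # no cache hits:  $0.15
hit = input_cost(50_000, 45_000, 3.0)   # 90% cached:     $0.0285
```

With 90% of the prompt cached at a 90% discount, the turn costs 81% less than the uncached call, which is where the "up to 90%" figure comes from as the cached fraction approaches the whole prompt.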

The Semantic Cache extends this to near-duplicate user queries: if a semantically similar query was answered recently, the cached response is returned without an LLM call.

Configuration example

```yaml
inference:
  defaultProvider: openai
  providers:
    openai:
      apiKey: ${OPENAI_API_KEY}
    anthropic:
      apiKey: ${ANTHROPIC_API_KEY}
    ollama:
      baseUrl: http://localhost:11434
```

Next steps