Inference

Prompt Caching

Open Astra is architected to maximize prompt cache hit rates across all supported providers. Prompt caching can reduce inference costs by 50–90% for agents with long system prompts or stable context.

How prompt caching works

All major providers (OpenAI, Anthropic, Google, xAI) implement server-side prompt caching. When you send the same prefix of tokens on successive calls, the provider recognizes it and charges a reduced rate (or nothing) for the cached portion. The key constraint is that the cached prefix must be identical and must appear at the very beginning of the prompt.
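As a minimal sketch of that constraint, the helpers below find the shared leading prefix between two requests and estimate billed input cost. The token arrays stand in for real tokenization, and the price and 90% cache discount are illustrative assumptions, not quoted rates.

```typescript
// How many leading tokens two successive requests share. Only this
// identical prefix is eligible for a provider-side cache hit.
function cachedPrefixLength(prev: string[], next: string[]): number {
  let i = 0;
  while (i < prev.length && i < next.length && prev[i] === next[i]) i++;
  return i;
}

// Estimated input cost when cached tokens are billed at a discount.
// cachedMultiplier = 0.1 models a 90% discount (illustrative).
function estimatedInputCost(
  totalTokens: number,
  cachedTokens: number,
  pricePerToken: number,
  cachedMultiplier = 0.1,
): number {
  const uncached = totalTokens - cachedTokens;
  return uncached * pricePerToken + cachedTokens * pricePerToken * cachedMultiplier;
}
```

Note that a single changed token early in the prompt drops the shared prefix to everything before it, which is why stable content must come first.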

Open Astra's context assembly order is designed around this constraint: SOUL.md (stable, never changes) → workspace files (changes infrequently) → system prompt (changes per agent) → memory → conversation history (changes every turn). The most stable content is always first.
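The assembly order above can be sketched as a stable-first concatenation. The section names follow this page; the interface and join format are assumptions for illustration.

```typescript
// Sketch of Open Astra's stable-first context assembly order.
interface ContextParts {
  soul: string;           // SOUL.md — loaded once, never changes
  workspaceFiles: string; // changes infrequently
  systemPrompt: string;   // changes per agent
  memory: string;
  history: string;        // changes every turn
}

function assembleContext(p: ContextParts): string {
  // Most stable content first, so the provider-side cache prefix
  // survives as long as possible across turns.
  return [p.soul, p.workspaceFiles, p.systemPrompt, p.memory, p.history].join("\n\n");
}
```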

Per-provider caching rates

| Provider | Cache savings | Min prefix length | Cache TTL |
| --- | --- | --- | --- |
| Anthropic (Claude) | Up to 90% on cached tokens | 1024 tokens | 5 minutes |
| Google (Gemini) | Up to 90% on cached tokens | 32K tokens | 60 minutes |
| OpenAI | 50–90% on cached tokens | 1024 tokens | 5–10 minutes |
| xAI (Grok) | Up to 75% on cached tokens | 1024 tokens | 5 minutes |
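One practical consequence of the minimum prefix lengths: a stable prefix only benefits from caching once it clears the provider's threshold. A sketch, with minimums copied from the table above (treat them as this page's documented values, not live pricing):

```typescript
// Minimum cacheable prefix lengths, per the table in this doc.
const MIN_PREFIX_TOKENS: Record<string, number> = {
  anthropic: 1024,
  gemini: 32_000,
  openai: 1024,
  grok: 1024,
};

// Whether a stable prefix of the given size can be cached at all.
function isPrefixCacheable(provider: string, stablePrefixTokens: number): boolean {
  const min = MIN_PREFIX_TOKENS[provider];
  return min !== undefined && stablePrefixTokens >= min;
}
```

A 5K-token prefix caches fine on Anthropic or OpenAI but falls short of Gemini's 32K floor.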

SOUL.md as a stable cache prefix

SOUL.md is the first content in every prompt sent by Open Astra. It is loaded once at startup and never changes during a process lifetime. This makes it the ideal cache prefix: every agent across every workspace benefits from the same SOUL.md cache hit.

Any change to SOUL.md invalidates the provider-side cache for all agents. If you modify SOUL.md frequently, you will not get caching benefits on that prefix. Keep it stable and treat changes as significant events.
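One way to treat SOUL.md changes as significant events is to fingerprint the file and flag when an edit will invalidate the prefix cache. A sketch using a content hash; the function names are hypothetical, not Open Astra APIs.

```typescript
import { createHash } from "node:crypto";

// Stable fingerprint of SOUL.md's content.
function contentFingerprint(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// True when an edit will invalidate the provider-side prefix cache
// for every agent that shares this SOUL.md.
function cachePrefixInvalidated(prevSoul: string, nextSoul: string): boolean {
  return contentFingerprint(prevSoul) !== contentFingerprint(nextSoul);
}
```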

Workspace files caching

Workspace files come immediately after SOUL.md in the context assembly order. If workspace files are stable (not changing every turn), they will typically be cached after SOUL.md. The 500ms hot-reload debounce means that workspace files only invalidate the cache when they actually change, not on every request.
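The debounce behavior described above can be sketched as a standard timer-based debounce: rapid successive file events collapse into one reload, and thus at most one cache invalidation. The 500ms default mirrors the documented window; the wrapper itself is illustrative, not Open Astra's implementation.

```typescript
// Collapse bursts of calls into a single trailing invocation.
function debounce<A extends unknown[]>(fn: (...args: A) => void, ms = 500) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}
```

Wiring a file watcher's change events through `debounce(reloadWorkspace, 500)` means five rapid saves trigger one reload, not five cache invalidations.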

Measuring cache hit rates

Cache hit data is available in the cost dashboard and the billing API:

```bash
# CLI dashboard
npx astra costs

# REST API
GET /costs/summary?period=day

# Per-agent breakdown
GET /agents/:id/usage?include=caching
```

The dashboard shows prompt tokens, completion tokens, cached tokens, and the effective cost saving per agent and per provider.
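From those metrics, the cache hit rate is simply cached tokens over prompt tokens. A sketch, assuming field names that mirror the metrics listed above (the usage API's exact JSON shape is not documented on this page):

```typescript
// Field names mirror the dashboard metrics; the real API shape may differ.
interface UsageSummary {
  promptTokens: number;
  completionTokens: number;
  cachedTokens: number;
}

// Fraction of prompt tokens served from the provider-side cache.
function cacheHitRate(u: UsageSummary): number {
  return u.promptTokens === 0 ? 0 : u.cachedTokens / u.promptTokens;
}
```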

Optimizing for cache hits

  • Keep SOUL.md and workspace files short and stable — every token they use is a potential cache hit
  • Put dynamic content (dates, user names, memory) as late as possible in the system prompt template, so the stable prefix is as long as possible
  • Use Gemini for long-context tasks: its 60-minute TTL is the longest of any provider, though its 32K-token minimum means caching only kicks in for workspaces with genuinely large workspace files
  • For Claude, break system prompts into a stable prefix followed by a dynamic suffix to maximize the cacheable portion
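The Claude tip above can be sketched with Anthropic's Messages API block format, where a `cache_control` marker on the stable block designates the end of the cacheable prefix. This shows only the system-block construction; building and sending the full request is out of scope.

```typescript
// Shape follows Anthropic's Messages API system blocks.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Stable prefix (cached) followed by a dynamic suffix (billed fresh).
function buildClaudeSystem(stablePrefix: string, dynamicSuffix: string): SystemBlock[] {
  return [
    { type: "text", text: stablePrefix, cache_control: { type: "ephemeral" } },
    { type: "text", text: dynamicSuffix },
  ];
}
```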