Prompt Caching
Open Astra is architected to maximize prompt cache hit rates across all supported providers. Prompt caching can reduce inference costs by 50–90% for agents with long system prompts or stable context.
How prompt caching works
All major providers (OpenAI, Anthropic, Google, xAI) implement server-side prompt caching. When successive calls share an identical token prefix, the provider recognizes it and charges a reduced rate (or nothing) for the cached portion. The key constraint is that matching is prefix-based: the cached content must be byte-for-byte identical and must appear at the very beginning of the prompt, and any change invalidates everything after it.
Open Astra's context assembly order is designed around this constraint: SOUL.md (stable, never changes) → workspace files (changes infrequently) → system prompt (changes per agent) → memory → conversation history (changes every turn). The most stable content is always first.
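The stable-first ordering can be sketched as a small helper (illustrative only, not Open Astra's actual implementation — the part names mirror the assembly order described above):

```typescript
// Context parts, ordered from most stable to most volatile. The longer the
// shared prefix between successive requests, the more the provider can cache.
interface ContextParts {
  soul: string;      // SOUL.md — loaded once, never changes
  workspace: string; // workspace files — change infrequently
  system: string;    // per-agent system prompt
  memory: string;    // agent memory
  history: string;   // conversation history — changes every turn
}

function assembleContext(p: ContextParts): string {
  return [p.soul, p.workspace, p.system, p.memory, p.history].join("\n\n");
}

// Length of the shared prefix between two assembled prompts — roughly the
// portion a provider can serve from its prompt cache.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}
```

Because only the trailing `history` part changes between turns, everything before it stays inside the cacheable prefix.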
Per-provider caching rates
| Provider | Cache savings | Min prefix length | Cache TTL |
|---|---|---|---|
| Anthropic (Claude) | Up to 90% on cached tokens | 1024 tokens | 5 minutes |
| Google (Gemini) | Up to 90% on cached tokens | 32K tokens | 60 minutes |
| OpenAI | 50–90% on cached tokens | 1024 tokens | 5–10 minutes |
| xAI (Grok) | Up to 75% on cached tokens | 1024 tokens | 5 minutes |
SOUL.md as a stable cache prefix
SOUL.md is the first content in every prompt sent by Open Astra. It is loaded once at startup and never changes during a process lifetime. This makes it the ideal cache prefix: every agent across every workspace benefits from the same SOUL.md cache hit.
Workspace files caching
Workspace files come immediately after SOUL.md in the context assembly order. If workspace files are stable (not changing every turn), they will typically be cached after SOUL.md. The 500ms hot-reload debounce means that workspace files only invalidate the cache when they actually change, not on every request.
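The debounce behavior can be sketched with a generic helper (illustrative; Open Astra's real implementation may differ). Rapid file-change events within the quiet window collapse into a single reload, so the cached prefix is invalidated once per edit burst rather than once per event:

```typescript
// Generic trailing-edge debounce: fn runs only after waitMs of quiet.
function debounce<T extends unknown[]>(
  fn: (...args: T) => void,
  waitMs: number
): (...args: T) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```

A hot-reload watcher would wrap its reload callback, e.g. `const reload = debounce(loadWorkspaceFiles, 500)` (where `loadWorkspaceFiles` is a hypothetical handler), so saving a file three times in a second triggers one reload and one cache invalidation.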
Measuring cache hit rates
Cache hit data is available in the cost dashboard and the billing API:
```shell
# CLI dashboard
npx astra costs

# REST API
GET /costs/summary?period=day

# Per-agent breakdown
GET /agents/:id/usage?include=caching
```

The dashboard shows prompt tokens, completion tokens, cached tokens, and the effective cost saving per agent and per provider.
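Assuming a summary response shape like the one below (the field names are assumptions — check your deployment's actual schema), the effective saving follows directly from the token counts:

```typescript
// Assumed response shape for GET /costs/summary — verify against your API.
interface UsageSummary {
  promptTokens: number;
  completionTokens: number;
  cachedTokens: number;
}

// Fraction of prompt-token spend avoided: cached share × provider discount.
// e.g. 80% of prompt tokens cached at a 90% discount → 72% saved.
function cacheSavings(u: UsageSummary, discount: number): number {
  if (u.promptTokens === 0) return 0;
  return (u.cachedTokens / u.promptTokens) * discount;
}

// Fetch the daily summary (baseUrl is deployment-specific).
async function fetchSummary(baseUrl: string): Promise<UsageSummary> {
  const res = await fetch(`${baseUrl}/costs/summary?period=day`);
  return (await res.json()) as UsageSummary;
}
```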
Optimizing for cache hits
- Keep SOUL.md and workspace files short and stable — every token they use is a potential cache hit
- Put dynamic content (dates, user names, memory) as late as possible in the system prompt template, so the stable prefix is as long as possible
- Use Gemini for long-context tasks — its 60-minute TTL is the longest of any provider, and workspaces with large workspace files easily clear its 32K-token minimum prefix
- For Claude, break system prompts into a stable prefix followed by a dynamic suffix to maximize the cacheable portion
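The prefix/suffix split for Claude can be sketched using Anthropic's `cache_control` breakpoint on system blocks (a hedged example of the Messages API payload shape; adapt it to your client library):

```typescript
// System blocks for Anthropic's Messages API. A cache_control marker on a
// block makes everything up to and including that block cacheable.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

function buildSystemBlocks(
  stablePrefix: string,
  dynamicSuffix: string
): SystemBlock[] {
  return [
    // Stable prefix ends at the cache breakpoint — this is the cached portion.
    { type: "text", text: stablePrefix, cache_control: { type: "ephemeral" } },
    // Dynamic content (dates, user names, memory) goes after the breakpoint,
    // so changing it does not invalidate the cached prefix.
    { type: "text", text: dynamicSuffix },
  ];
}
```

The returned array would be passed as the `system` field of a Messages API request; only the suffix block changes between turns.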