Prompt Caching
Open Astra is architected to maximize prompt cache hit rates across all supported providers. Prompt caching can reduce inference costs by 50–90% for agents with long system prompts or stable context.
How prompt caching works
All major providers (OpenAI, Claude, Gemini, Grok) implement server-side prompt caching. When you send the same prefix of tokens on successive calls, the provider recognizes it and charges a reduced rate (or nothing) for the cached portion. The key constraint is that the cached prefix must be identical and must appear at the very beginning of the prompt.
Open Astra's context assembly order is designed around this constraint: SOUL.md (stable, never changes) → workspace files (changes infrequently) → system prompt (changes per agent) → memory → conversation history (changes every turn). The most stable content is always first.
Per-provider caching rates
| Provider | Cache savings | Min prefix length | Cache TTL |
|---|---|---|---|
| Anthropic (Claude) | Up to 90% on cached tokens | 1024 tokens | 5 minutes |
| Google (Gemini) | Up to 90% on cached tokens | 32K tokens | 60 minutes |
| OpenAI | 50–90% on cached tokens | 1024 tokens | 5–10 minutes |
| xAI (Grok) | Up to 75% on cached tokens | 1024 tokens | 5 minutes |
SOUL.md as a stable cache prefix
SOUL.md is the first content in every prompt sent by Open Astra. It is loaded once at startup and never changes during a process lifetime. This makes it the ideal cache prefix: every agent across every workspace benefits from the same SOUL.md cache hit.
Workspace files caching
Workspace files come immediately after SOUL.md in the context assembly order. If workspace files are stable (not changing every turn), they will typically be cached after SOUL.md. The 500ms hot-reload debounce means that workspace files only invalidate the cache when they actually change, not on every request.
Measuring cache hit rates
Cache hit data is available in the cost dashboard and the billing API:
# CLI dashboard
npx astra costs
# REST API
GET /costs/summary?period=day
# Per-agent breakdown
GET /agents/:id/usage?include=cachingThe dashboard shows prompt tokens, completion tokens, cached tokens, and the effective cost saving per agent and per provider.
Optimizing for cache hits
- Keep SOUL.md and workspace files short and stable — every token they use is a potential cache hit
- Put dynamic content (dates, user names, memory) as late as possible in the system prompt template, so the stable prefix is as long as possible
- Use Gemini for long-context tasks — its 32K min prefix and 60-minute TTL are best for workspaces with large workspace files
- For Claude, break system prompts into a stable prefix followed by a dynamic suffix to maximize the cacheable portion
Provider cache warm-up
Set CACHE_WARMUP_ENABLED=true to fire one minimal request per configured provider at gateway startup. This primes the server-side prefix cache so the very first real agent request hits a warm cache rather than paying the cold-start penalty.
CACHE_WARMUP_ENABLED=trueWarm-up requests use the smallest available model and a single-token completion. Failures are non-fatal — if a provider is unavailable at startup the gateway continues without it.
Anthropic cache breakpoints
Open Astra inserts three cache_control: ephemeral breakpoints into every Claude request, maximising the cacheable prefix at each tier:
- Breakpoint 1 — System prompt: The last system block is marked ephemeral. SOUL.md + agent system prompt are cached together.
- Breakpoint 2 — Tool list: The last tool definition is marked ephemeral. Tool schemas are stable within a session and cache separately from the system prompt.
- Breakpoint 3 — Conversation history: The second-to-last user message is marked ephemeral. Older turns are cached; only the latest exchange is sent uncached.
Three breakpoints means three independently maintained cache slots per session, each contributing to the 90% cost reduction Claude offers on cached tokens.
Turn hash cache
Before any provider call, Open Astra hashes the full request (messages + tool names + generation params) with SHA-256. If an identical request was answered within the last 5 minutes, the cached response is returned immediately — the provider is never called.
This catches exact-replay scenarios: agent retries on transient errors, polling loops that re-send the same prompt, and test harnesses replaying fixed inputs. Only text-only responses are cached; tool-calling responses are always re-executed.
Stale-while-revalidate semantic cache
When a semantic cache entry has expired but is less than 5 minutes old, the stale response is returned immediately while a background job refreshes it. This eliminates the latency spike that would otherwise occur on the first request after TTL expiry.
| Cache state | Behaviour |
|---|---|
| Fresh (within TTL) | Return immediately, no model call |
| Stale (expired < 5 min ago) | Return stale immediately, refresh in background |
| Too stale (> 5 min ago) | Cache miss — call model, write fresh entry |
Cache metrics and health
Two endpoints expose cache state in real time:
# Unified stats across all cache layers (requires auth)
GET /cache/stats
# In-process cache sizes — no DB I/O, safe for k8s readiness probes (no auth)
GET /cache/healthGET /cache/stats returns semantic cache entry counts, provider cache hit rates (last 24 h), and embedding cache size. GET /cache/health returns instantaneous in-process stats from 7 layers: graph edge LRU, swarm L1, negative result cache, read-through cache, bloom filter, agent config cache, and turn hash cache.
Provider-level metrics are persisted to provider_cache_metrics after every inference call, enabling historical trend queries:
SELECT provider, AVG(hit_rate) AS avg_hit_rate, SUM(cached_tokens) AS total_saved
FROM provider_cache_metrics
WHERE recorded_at > NOW() - INTERVAL '7 days'
GROUP BY provider
ORDER BY avg_hit_rate DESC;