Prompt Caching

Open Astra is architected to maximize prompt cache hit rates across all supported providers. Prompt caching can reduce inference costs by 50–90% for agents with long system prompts or stable context.

How prompt caching works

The major providers, OpenAI, Anthropic (Claude), Google (Gemini), and xAI (Grok), all implement server-side prompt caching. When successive calls begin with the same prefix of tokens, the provider recognizes it and charges a reduced rate (or nothing) for the cached portion. The key constraint is that the cached prefix must be identical and must appear at the very beginning of the prompt.

Open Astra's context assembly order is designed around this constraint: SOUL.md (stable, fixed for the lifetime of the process) → workspace files (change infrequently) → system prompt (changes per agent) → memory → conversation history (changes every turn). The most stable content always comes first.
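The ordering above can be sketched as follows. This is a minimal illustration, not Open Astra's actual implementation: the `ContextParts` type and the `assemblePrompt` and `sharedPrefixLength` functions are hypothetical names introduced here to show why stable-first ordering preserves a long shared prefix between turns.

```typescript
// Hypothetical types and function names, for illustration only.
interface ContextParts {
  soul: string;             // SOUL.md contents (most stable)
  workspaceFiles: string[]; // change infrequently
  systemPrompt: string;     // changes per agent
  memory: string[];
  history: string[];        // changes every turn (least stable)
}

// Concatenate segments from most stable to least stable.
function assemblePrompt(parts: ContextParts): string[] {
  return [
    parts.soul,
    ...parts.workspaceFiles,
    parts.systemPrompt,
    ...parts.memory,
    ...parts.history,
  ];
}

// Count how many leading segments two assembled prompts have in common.
function sharedPrefixLength(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const base: ContextParts = {
  soul: "You are Astra.",
  workspaceFiles: ["README contents"],
  systemPrompt: "Agent: reviewer",
  memory: ["User prefers TypeScript"],
  history: ["hi"],
};

const turn1 = assemblePrompt(base);
const turn2 = assemblePrompt({ ...base, history: ["hi", "hello", "next question"] });

// All five segments of turn 1 are still a prefix of turn 2.
console.log(sharedPrefixLength(turn1, turn2)); // prints 5
```

Because new turns are only appended at the end, everything sent on the previous turn remains an identical prefix on the next one.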

Per-provider caching rates

| Provider | Cache savings | Min prefix length | Cache TTL |
| --- | --- | --- | --- |
| Anthropic (Claude) | Up to 90% on cached tokens | 1024 tokens | 5 minutes |
| Google (Gemini) | Up to 90% on cached tokens | 32K tokens | 60 minutes |
| OpenAI | 50–90% on cached tokens | 1024 tokens | 5–10 minutes |
| xAI (Grok) | Up to 75% on cached tokens | 1024 tokens | 5 minutes |
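To see how a per-token discount translates into an overall saving, here is a rough worked calculation. The function names and the flat per-token price are assumptions made for illustration; real provider pricing may also include cache-write surcharges, which this sketch ignores.

```typescript
// Hypothetical helpers; prices are in arbitrary units per token.
function effectiveInputCost(
  promptTokens: number,
  cachedTokens: number,
  pricePerToken: number,
  cacheDiscount: number, // e.g. 0.75 for "75% off cached tokens"
): number {
  if (cachedTokens < 0 || cachedTokens > promptTokens) {
    throw new RangeError("cachedTokens must be between 0 and promptTokens");
  }
  const uncachedTokens = promptTokens - cachedTokens;
  return (
    uncachedTokens * pricePerToken +
    cachedTokens * pricePerToken * (1 - cacheDiscount)
  );
}

// Fraction of the input bill saved: share of tokens cached times the discount.
function savingsFraction(
  promptTokens: number,
  cachedTokens: number,
  cacheDiscount: number,
): number {
  return (cachedTokens / promptTokens) * cacheDiscount;
}

// 10,000 prompt tokens, 8,000 of them served from cache at a 75% discount.
console.log(effectiveInputCost(10_000, 8_000, 1, 0.75)); // prints 4000
console.log(savingsFraction(10_000, 8_000, 0.75));
```

In this example the request costs 4,000 units instead of 10,000, a 60% reduction. The overall saving is always lower than the headline discount unless the entire prompt is cached.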

SOUL.md as a stable cache prefix

SOUL.md is the first content in every prompt sent by Open Astra. It is loaded once at startup and never changes during a process lifetime. This makes it the ideal cache prefix: every agent across every workspace benefits from the same SOUL.md cache hit.

Any change to SOUL.md invalidates the provider-side cache for all agents. If you modify SOUL.md frequently, you will not get caching benefits on that prefix. Keep it stable and treat changes as significant events.

Workspace files caching

Workspace files come immediately after SOUL.md in the context assembly order. If workspace files are stable (not changing every turn), they extend the cached prefix that begins with SOUL.md. The 500ms hot-reload debounce coalesces rapid edits into a single reload, so workspace files invalidate the cache only when their content actually changes, not on every request.

Measuring cache hit rates

Cache hit data is available in the cost dashboard and the billing API:

```bash
# CLI dashboard
npx astra costs

# REST API
GET /costs/summary?period=day

# Per-agent breakdown
GET /agents/:id/usage?include=caching
```

The dashboard shows prompt tokens, completion tokens, cached tokens, and the effective cost saving per agent and per provider.

Optimizing for cache hits

  • Keep SOUL.md and workspace files short and stable — every token they use is a potential cache hit
  • Put dynamic content (dates, user names, memory) as late as possible in the system prompt template, so the stable prefix is as long as possible
  • Use Gemini for long-context tasks: its 60-minute TTL suits workspaces with large files, as long as the stable prefix meets the 32K-token minimum
  • For Claude, break system prompts into a stable prefix followed by a dynamic suffix to maximize the cacheable portion
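As an illustration of the stable-prefix and dynamic-suffix split, the sketch below keeps all per-request values out of the leading text. The `PromptParts` type and the `buildSystemPrompt` and `render` functions are hypothetical names, not part of Open Astra's template system.

```typescript
// Hypothetical structure: the instructions never change between requests,
// while the date and user name are appended at the very end.
interface PromptParts {
  stablePrefix: string;
  dynamicSuffix: string;
}

function buildSystemPrompt(
  instructions: string,
  dynamic: { date: string; userName: string },
): PromptParts {
  return {
    stablePrefix: instructions,
    dynamicSuffix: `Current date: ${dynamic.date}\nUser: ${dynamic.userName}`,
  };
}

function render(parts: PromptParts): string {
  return `${parts.stablePrefix}\n\n${parts.dynamicSuffix}`;
}

const instructions = "You are a code review agent. Follow the team style guide.";
const monday = render(buildSystemPrompt(instructions, { date: "2025-01-06", userName: "ana" }));
const tuesday = render(buildSystemPrompt(instructions, { date: "2025-01-07", userName: "ben" }));

// Both prompts start with the same instructions, so that part stays cacheable.
console.log(monday.startsWith(instructions) && tuesday.startsWith(instructions)); // prints true
```

If the date were placed at the top of the template instead, the two prompts would diverge at the first line and nothing after it could be served from cache.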

Provider cache warm-up

Set CACHE_WARMUP_ENABLED=true to fire one minimal request per configured provider at gateway startup. This primes the server-side prefix cache so the very first real agent request hits a warm cache rather than paying the cold-start penalty.

```bash
CACHE_WARMUP_ENABLED=true
```

Warm-up requests use the smallest available model and a single-token completion. Failures are non-fatal — if a provider is unavailable at startup the gateway continues without it.
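The non-fatal behaviour described above could look roughly like this. It is a sketch under assumptions: `WarmupCall`, `WarmupResult`, and `warmUpProviders` are hypothetical names, and each provider is represented by a stub function rather than a real API client.

```typescript
// Hypothetical: each provider exposes a function that sends one minimal request.
type WarmupCall = () => Promise<void>;

interface WarmupResult {
  provider: string;
  ok: boolean;
  error?: string;
}

// Fire all warm-up calls concurrently. A failing provider is recorded,
// never rethrown, so startup can continue without it.
async function warmUpProviders(
  calls: Record<string, WarmupCall>,
): Promise<WarmupResult[]> {
  return Promise.all(
    Object.entries(calls).map(async ([provider, call]): Promise<WarmupResult> => {
      try {
        await call();
        return { provider, ok: true };
      } catch (err) {
        return { provider, ok: false, error: String(err) };
      }
    }),
  );
}

// Example with stub providers: one succeeds, one is unavailable.
warmUpProviders({
  anthropic: async () => {},
  openai: async () => {
    throw new Error("unavailable");
  },
}).then((results) => console.log(results));
```

Catching errors per provider, rather than around the whole batch, means one unavailable provider cannot prevent the others from being warmed.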

Anthropic cache breakpoints

Open Astra inserts three cache_control: ephemeral breakpoints into every Claude request, maximising the cacheable prefix at each tier:

  1. Breakpoint 1 — System prompt: The last system block is marked ephemeral. SOUL.md + agent system prompt are cached together.
  2. Breakpoint 2 — Tool list: The last tool definition is marked ephemeral. Tool schemas are stable within a session and cache separately from the system prompt.
  3. Breakpoint 3 — Conversation history: The second-to-last user message is marked ephemeral. Older turns are cached; only the latest exchange is sent uncached.

Three breakpoints give three independently maintained cache slots per session, each contributing to the up-to-90% cost reduction Claude offers on cached tokens.
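The three placements can be sketched as a function that annotates a request body. The `cache_control: { type: "ephemeral" }` field follows the shape used by Anthropic's Messages API. The `withBreakpoints` function and the simplified types are hypothetical and are not Open Astra's internal code.

```typescript
type CacheControl = { type: "ephemeral" };

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

interface Tool {
  name: string;
  description: string;
  input_schema: object;
  cache_control?: CacheControl;
}

interface Message {
  role: "user" | "assistant";
  content: TextBlock[];
}

const EPHEMERAL: CacheControl = { type: "ephemeral" };

// Hypothetical helper: marks the last system block, the last tool,
// and the second-to-last user message as cache breakpoints.
function withBreakpoints(system: TextBlock[], tools: Tool[], messages: Message[]) {
  const markLast = <T extends object>(items: T[]): T[] =>
    items.map((item, i) =>
      i === items.length - 1 ? { ...item, cache_control: EPHEMERAL } : item,
    );

  // Indices of user messages, in order of appearance.
  const userIndices = messages
    .map((m, i) => (m.role === "user" ? i : -1))
    .filter((i) => i >= 0);
  const target = userIndices.length >= 2 ? userIndices[userIndices.length - 2] : -1;

  return {
    system: markLast(system),
    tools: markLast(tools),
    messages: messages.map((m, i) =>
      i === target ? { ...m, content: markLast(m.content) } : m,
    ),
  };
}

const text = (t: string): TextBlock => ({ type: "text", text: t });

const request = withBreakpoints(
  [text("SOUL.md contents"), text("Agent system prompt")],
  [
    { name: "search", description: "Search the web", input_schema: { type: "object" } },
    { name: "calc", description: "Evaluate math", input_schema: { type: "object" } },
  ],
  [
    { role: "user", content: [text("first question")] },
    { role: "assistant", content: [text("first answer")] },
    { role: "user", content: [text("second question")] },
    { role: "assistant", content: [text("second answer")] },
    { role: "user", content: [text("latest question")] },
  ],
);

console.log(JSON.stringify(request, null, 2));
```

With five messages, the breakpoint lands on "second question", so the latest user message stays outside the cached prefix.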

Turn hash cache

Before any provider call, Open Astra hashes the full request (messages + tool names + generation params) with SHA-256. If an identical request was answered within the last 5 minutes, the cached response is returned immediately — the provider is never called.

This catches exact-replay scenarios: agent retries on transient errors, polling loops that re-send the same prompt, and test harnesses replaying fixed inputs. Only text-only responses are cached; tool-calling responses are always re-executed.
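A minimal sketch of this lookup, using Node's built-in `crypto` module for SHA-256. The `TurnRequest` shape, the canonicalisation step (sorting tool names and parameter keys), and the `TurnHashCache` class are assumptions made for illustration. Open Astra's actual hashing input may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical request shape covering the three hashed inputs.
interface TurnRequest {
  messages: { role: string; content: string }[];
  toolNames: string[];
  params: Record<string, string | number | boolean>;
}

const TTL_MS = 5 * 60 * 1000; // 5 minutes

// Assumption: tool names and param keys are sorted so that ordering
// differences do not produce different hashes.
function hashTurn(req: TurnRequest): string {
  const canonical = JSON.stringify({
    messages: req.messages.map((m) => [m.role, m.content]),
    toolNames: [...req.toolNames].sort(),
    params: Object.entries(req.params).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)),
  });
  return createHash("sha256").update(canonical).digest("hex");
}

class TurnHashCache {
  private entries = new Map<string, { response: string; storedAt: number }>();

  // Returns the cached response, or undefined on a miss or an expired entry.
  get(req: TurnRequest, now: number): string | undefined {
    const key = hashTurn(req);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (now - hit.storedAt > TTL_MS) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.response;
  }

  // Only text-only responses are stored; tool-calling responses are skipped.
  set(req: TurnRequest, response: string, hasToolCalls: boolean, now: number): void {
    if (hasToolCalls) return;
    this.entries.set(hashTurn(req), { response, storedAt: now });
  }
}

const req: TurnRequest = {
  messages: [{ role: "user", content: "ping" }],
  toolNames: ["search", "calc"],
  params: { temperature: 0, max_tokens: 16 },
};

const cache = new TurnHashCache();
cache.set(req, "pong", false, Date.now());
console.log(cache.get(req, Date.now())); // prints "pong"
```

The timestamp is passed in explicitly rather than read inside the class, which keeps expiry behaviour easy to test.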

Stale-while-revalidate semantic cache

When a semantic cache entry expired less than 5 minutes ago, the stale response is returned immediately while a background job refreshes it. This eliminates the latency spike that would otherwise occur on the first request after TTL expiry.

| Cache state | Behaviour |
| --- | --- |
| Fresh (within TTL) | Return immediately, no model call |
| Stale (expired < 5 min ago) | Return stale immediately, refresh in background |
| Too stale (> 5 min ago) | Cache miss: call model, write fresh entry |
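The three states in the table can be expressed as a small decision function. This is a sketch: `classify`, `lookup`, and the `Entry` type are hypothetical names. Treating an entry that expired exactly 5 minutes ago as too stale is an assumption, since the table does not specify that boundary.

```typescript
type CacheState = "fresh" | "stale" | "too_stale";

const STALE_WINDOW_MS = 5 * 60 * 1000; // 5 minutes past expiry

// Classify an entry by how long ago it expired.
function classify(expiresAt: number, now: number): CacheState {
  if (now <= expiresAt) return "fresh";
  return now - expiresAt < STALE_WINDOW_MS ? "stale" : "too_stale";
}

interface Entry {
  value: string;
  expiresAt: number;
}

// Returns a value to serve immediately, or undefined on a cache miss.
// `refresh` is invoked (without being awaited) only for stale entries.
function lookup(
  entry: Entry | undefined,
  now: number,
  refresh: () => void,
): string | undefined {
  if (!entry) return undefined;
  const state = classify(entry.expiresAt, now);
  if (state === "fresh") return entry.value;
  if (state === "stale") {
    refresh(); // background refresh; caller still gets the stale value
    return entry.value;
  }
  return undefined; // too stale: caller must call the model
}

const entry: Entry = { value: "cached answer", expiresAt: 1_000 };
console.log(classify(entry.expiresAt, 500));             // prints "fresh"
console.log(classify(entry.expiresAt, 1_000 + 60_000));  // prints "stale"
console.log(classify(entry.expiresAt, 1_000 + 300_000)); // prints "too_stale"
```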

Cache metrics and health

Two endpoints expose cache state in real time:

```bash
# Unified stats across all cache layers (requires auth)
GET /cache/stats

# In-process cache sizes — no DB I/O, safe for k8s readiness probes (no auth)
GET /cache/health
```

GET /cache/stats returns semantic cache entry counts, provider cache hit rates (last 24 h), and embedding cache size. GET /cache/health returns instantaneous in-process stats from 7 layers: graph edge LRU, swarm L1, negative result cache, read-through cache, bloom filter, agent config cache, and turn hash cache.

Provider-level metrics are persisted to provider_cache_metrics after every inference call, enabling historical trend queries:

```sql
SELECT provider, AVG(hit_rate) AS avg_hit_rate, SUM(cached_tokens) AS total_saved
FROM provider_cache_metrics
WHERE recorded_at > NOW() - INTERVAL '7 days'
GROUP BY provider
ORDER BY avg_hit_rate DESC;
```