Semantic Response Cache

Open Astra ships a 7-layer cache stack that progressively filters requests before they reach the model. Each layer is independently tunable and fails open — a cache error always falls through to the next layer or to a live model call.

Cache stack (outer → inner)

#	Layer	Mechanism	Latency saved
1	Bloom filter	Set<agentId> — skips embedding generation entirely for agents with no cache entries	Full embedding API + pgvector round-trip
2	Turn hash cache	SHA-256 of full request; exact-match 5-min in-process TTL	Provider inference call
3	Semantic SWR cache	pgvector cosine similarity ≥ 0.97; stale-while-revalidate up to 5 min after expiry	Provider inference call
4	Negative result cache	SHA-256 of query; 30 s TTL for queries that returned empty memory	Typesense + pgvector memory search
5	Graph edge LRU	1 000-entry in-process LRU; 2-min TTL for graph search results	pgvector graph traversal
6	Swarm L1 cache	Per-swarm shared Map; 5-min TTL, 200 entries; sub-agents share memory reads	Memory retrieval round-trips
7	Embedding cache	LRU(2048) + PostgreSQL SHA-256 keyed; batched writes via 100 ms coalesce window	Embedding API calls

Semantic SWR cache

The core semantic cache stores LLM responses in PostgreSQL keyed by cosine similarity of query embeddings. It supports stale-while-revalidate (SWR): when an entry has expired but is less than 5 minutes old, the stale response is returned immediately while the cache is refreshed in the background.

Bloom filter check — if agent has no entries, return null immediately (no DB I/O).
Embed the query via the configured embedding provider (hits embedding cache first).
pgvector cosine similarity search against semantic_cache scoped to the agent.
If fresh hit (≥ 0.97, within TTL): return cached response.
If stale hit (expired < 5 min ago): return stale response, schedule background refresh.
On miss: call model, write response to cache.

Enabling the cache

bash

SEMANTIC_CACHE_ENABLED=true

The semantic_cache table is created by migration 024-memory-improvements.sql. No other configuration is required.

Configuration

Setting	Default	Notes
Similarity threshold	0.97	Cosine similarity required for a cache hit
TTL	1 hour	Adaptive: hit-rate-driven scaling (log₂ formula), range 10 min – 2 hours
SWR grace period	5 minutes	Stale entries up to 5 min past TTL are served while refreshing
Scope	Per agent + workspace	Cache is isolated per `agentId` and `workspaceId`
Budget per workspace	50 MB	LRU-evicts oldest entries when workspace exceeds budget

Cache invalidation

Cache entries are invalidated automatically in several scenarios:

Memory write: Every memory_write tool call fires invalidateSemanticCacheForAgent() so stale responses don't persist after new information is stored.
System prompt change: Open Astra tracks a SHA-256[:16] version of each agent's system prompt. When it changes, all semantic cache entries for that agent are evicted automatically.
Cascade invalidation: A single call evicts all layers in order — graph edge LRU → swarm L1 → prefetch store → semantic DB — ensuring no layer serves stale data after a significant state change.
Budget enforcement: When a workspace exceeds its 50 MB cache budget, the oldest 25% of entries are LRU-evicted to stay under the limit.
Manual purge: DELETE /cache/stats/expired removes all expired entries immediately.

Bloom filter

A module-level Set<agentId> tracks which agents have at least one active cache entry. Before performing an embedding lookup, Open Astra checks this set. If the agent is absent, the entire pgvector query is skipped — no embedding generation, no DB round-trip.

The set is populated on every successful cache write and cleared when an agent's cache is fully invalidated. This is especially valuable for new agents or agents whose cache was recently purged.

Embedding cache

Embedding vectors are cached at two levels:

Layer	Capacity	Persistence	Write strategy
In-process LRU	2 048 entries	Lost on restart	Synchronous
PostgreSQL (`embedding_cache`)	Unlimited	Persists across restarts	Batched (100 ms coalesce window)

Embedding writes are coalesced: multiple insertions within the same 100 ms window are batched into a single multi-row INSERT, reducing Postgres round-trips under bursty traffic.

Observability

bash

# All cache layer stats (requires JWT auth)
GET /cache/stats

# In-process stats only — no DB, no auth (safe for k8s readiness probes)
GET /cache/health

/cache/health returns instantaneous counts from all 7 in-process cache layers. /cache/stats additionally queries the semantic cache table and provider metrics for a full picture.

Provider cache hit rates are persisted to provider_cache_metrics after every inference call, enabling historical queries:

sql

SELECT provider, AVG(hit_rate) AS avg_hit_rate, SUM(cached_tokens) AS total_saved
FROM provider_cache_metrics
WHERE recorded_at > NOW() - INTERVAL '7 days'
GROUP BY provider
ORDER BY avg_hit_rate DESC;

Failure safety

Every cache operation is best-effort: read errors return null (treated as a miss), write errors are logged as warnings and silently skipped. A cache failure never degrades agent reliability — the live model path is always available as a fallback.