Semantic Response Cache
Open Astra ships a 7-layer cache stack that progressively filters requests before they reach the model. Each layer is independently tunable and fails open — a cache error always falls through to the next layer or to a live model call.
Cache stack (outer → inner)
| # | Layer | Mechanism | Latency saved |
|---|---|---|---|
| 1 | Bloom filter | Set<agentId> — skips embedding generation entirely for agents with no cache entries | Full embedding API + pgvector round-trip |
| 2 | Turn hash cache | SHA-256 of full request; exact-match 5-min in-process TTL | Provider inference call |
| 3 | Semantic SWR cache | pgvector cosine similarity ≥ 0.97; stale-while-revalidate up to 5 min after expiry | Provider inference call |
| 4 | Negative result cache | SHA-256 of query; 30 s TTL for queries that returned empty memory | Typesense + pgvector memory search |
| 5 | Graph edge LRU | 1 000-entry in-process LRU; 2-min TTL for graph search results | pgvector graph traversal |
| 6 | Swarm L1 cache | Per-swarm shared Map; 5-min TTL, 200 entries; sub-agents share memory reads | Memory retrieval round-trips |
| 7 | Embedding cache | LRU(2048) + PostgreSQL SHA-256 keyed; batched writes via 100 ms coalesce window | Embedding API calls |
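As an illustration of layer 2, here is a minimal sketch of an exact-match turn hash cache. The `TurnRequest` shape, the class name, and the use of `JSON.stringify` as the serialization are assumptions made for the example; the actual request format and key derivation are not documented here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical request shape; the real one is not shown in these docs.
interface TurnRequest {
  agentId: string;
  messages: { role: string; content: string }[];
}

const TURN_TTL_MS = 5 * 60 * 1000; // 5-minute in-process TTL from the table

class TurnHashCache {
  private entries = new Map<string, { response: string; expiresAt: number }>();

  // SHA-256 over a JSON serialization of the full request.
  // Assumes callers build requests with a stable property order.
  static key(req: TurnRequest): string {
    return createHash("sha256").update(JSON.stringify(req)).digest("hex");
  }

  get(req: TurnRequest, now: number = Date.now()): string | null {
    const k = TurnHashCache.key(req);
    const hit = this.entries.get(k);
    if (!hit) return null;
    if (hit.expiresAt <= now) {
      this.entries.delete(k); // lazily drop expired entries
      return null;
    }
    return hit.response;
  }

  set(req: TurnRequest, response: string, now: number = Date.now()): void {
    this.entries.set(TurnHashCache.key(req), {
      response,
      expiresAt: now + TURN_TTL_MS,
    });
  }
}
```

Because the key covers the whole request, any change to the agent, the message history, or the wording produces a different hash and falls through to the semantic layer.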
Semantic SWR cache
The core semantic cache stores LLM responses in PostgreSQL and matches them by cosine similarity of query embeddings. It supports stale-while-revalidate (SWR): when an entry expired less than 5 minutes ago, the stale response is returned immediately while the cache is refreshed in the background.
- Bloom filter check — if agent has no entries, return null immediately (no DB I/O).
- Embed the query via the configured embedding provider (hits embedding cache first).
- pgvector cosine similarity search against semantic_cache scoped to the agent.
- If fresh hit (≥ 0.97, within TTL): return cached response.
- If stale hit (expired < 5 min ago): return stale response, schedule background refresh.
- On miss: call model, write response to cache.
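The fresh / stale / miss decision in the steps above can be sketched as a small pure function. The `Candidate` shape and function names are hypothetical, and the sketch assumes a stale hit must also meet the 0.97 threshold, which the steps do not state explicitly.

```typescript
type LookupResult = "fresh" | "stale" | "miss";

const SIMILARITY_THRESHOLD = 0.97;
const SWR_GRACE_MS = 5 * 60 * 1000;

// Hypothetical shape of the best pgvector match for a query.
interface Candidate {
  similarity: number; // cosine similarity to the query embedding
  expiresAt: number; // epoch ms at which the TTL runs out
  response: string;
}

function classify(c: Candidate | null, now: number): LookupResult {
  if (c === null || c.similarity < SIMILARITY_THRESHOLD) return "miss";
  if (now < c.expiresAt) return "fresh";
  // Expired, but still inside the stale-while-revalidate grace period.
  if (now - c.expiresAt < SWR_GRACE_MS) return "stale";
  return "miss";
}

// Returns the cached response, or null when the caller must call the model.
function lookup(
  c: Candidate | null,
  now: number,
  scheduleRefresh: () => void,
): string | null {
  switch (classify(c, now)) {
    case "fresh":
      return c!.response;
    case "stale":
      scheduleRefresh(); // refresh in the background, serve stale now
      return c!.response;
    default:
      return null;
  }
}
```

Keeping the classification separate from the side effect makes the time-based rules easy to unit test with a fixed `now`.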
Enabling the cache
SEMANTIC_CACHE_ENABLED=true
The semantic_cache table is created by migration 024-memory-improvements.sql. No other configuration is required.
Configuration
| Setting | Default | Notes |
|---|---|---|
| Similarity threshold | 0.97 | Cosine similarity required for a cache hit |
| TTL | 1 hour | Adaptive: hit-rate-driven scaling (log₂ formula), range 10 min – 2 hours |
| SWR grace period | 5 minutes | Stale entries up to 5 min past TTL are served while refreshing |
| Scope | Per agent + workspace | Cache is isolated per agentId and workspaceId |
| Budget per workspace | 50 MB | LRU-evicts oldest entries when workspace exceeds budget |
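The table mentions a log₂-based adaptive TTL but does not give the formula. The sketch below is one illustrative shape only, not the shipped formula: the factor of 3 and the function name are assumptions chosen so that a hit rate of 1/3 yields the 1-hour default and a hit rate of 1 yields the 2-hour ceiling.

```typescript
const MIN_TTL_MS = 10 * 60 * 1000; // 10 minutes (lower bound from the table)
const MAX_TTL_MS = 2 * 60 * 60 * 1000; // 2 hours (upper bound from the table)
const BASE_TTL_MS = 60 * 60 * 1000; // 1 hour default

// Illustrative only: scales the base TTL by log2(1 + 3 * hitRate),
// then clamps the result to the documented range.
function adaptiveTtlMs(hits: number, lookups: number): number {
  if (lookups === 0) return BASE_TTL_MS; // no data yet: use the default
  const hitRate = hits / lookups;
  const scaled = BASE_TTL_MS * Math.log2(1 + 3 * hitRate);
  return Math.min(MAX_TTL_MS, Math.max(MIN_TTL_MS, Math.round(scaled)));
}
```

Under this assumed curve, rarely hit entries expire quickly and frequently hit entries live longer, with diminishing returns as the hit rate climbs.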
Cache invalidation
Cache entries are invalidated automatically in several scenarios:
- Memory write: Every memory_write tool call fires invalidateSemanticCacheForAgent() so stale responses don't persist after new information is stored.
- System prompt change: Open Astra tracks a SHA-256[:16] version of each agent's system prompt. When it changes, all semantic cache entries for that agent are evicted automatically.
- Cascade invalidation: A single call evicts all layers in order — graph edge LRU → swarm L1 → prefetch store → semantic DB — ensuring no layer serves stale data after a significant state change.
- Budget enforcement: When a workspace exceeds its 50 MB cache budget, the oldest 25% of entries are LRU-evicted to stay under the limit.
- Manual purge: DELETE /cache/stats/expired removes all expired entries immediately.
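Two of the triggers above, prompt versioning and cascade invalidation, can be sketched as follows. All names are hypothetical. The sketch assumes SHA-256[:16] means the first 16 hex characters of the digest, and that a failing layer is skipped without stopping the cascade, in line with the best-effort behaviour described elsewhere on this page.

```typescript
import { createHash } from "node:crypto";

// Assumed reading of SHA-256[:16]: first 16 hex characters of the digest.
function promptVersion(systemPrompt: string): string {
  return createHash("sha256").update(systemPrompt).digest("hex").slice(0, 16);
}

type Evict = (agentId: string) => void;

// Evicts each layer in the given order and returns the names of layers
// whose eviction threw. Documented order:
// graph edge LRU, swarm L1, prefetch store, semantic DB.
function cascadeInvalidate(agentId: string, layers: [string, Evict][]): string[] {
  const failed: string[] = [];
  for (const [name, evict] of layers) {
    try {
      evict(agentId);
    } catch {
      failed.push(name); // best-effort: keep going
    }
  }
  return failed;
}

// Remembers the last seen prompt version per agent and reports changes.
class PromptWatcher {
  private versions = new Map<string, string>();

  // Returns true when the prompt differs from the previously seen one.
  observe(agentId: string, systemPrompt: string): boolean {
    const next = promptVersion(systemPrompt);
    const prev = this.versions.get(agentId);
    this.versions.set(agentId, next);
    return prev !== undefined && prev !== next;
  }
}
```

A caller would run `cascadeInvalidate` whenever `observe` returns true, so a prompt edit clears every layer for that agent in one step.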
Bloom filter
A module-level Set<agentId> tracks which agents have at least one active cache entry. Before performing an embedding lookup, Open Astra checks this set. If the agent is absent, the entire pgvector query is skipped — no embedding generation, no DB round-trip.
The set is populated on every successful cache write and cleared when an agent's cache is fully invalidated. This is especially valuable for new agents or agents whose cache was recently purged.
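A minimal sketch of this presence check follows. The function names and the synchronous `embed` and `search` callbacks are simplifications for illustration; the real embedding and pgvector calls would be asynchronous.

```typescript
// Module-level presence set, as described above. An exact Set has no false
// positives, unlike a probabilistic Bloom filter.
const agentsWithEntries = new Set<string>();

// Call after every successful cache write.
function onCacheWrite(agentId: string): void {
  agentsWithEntries.add(agentId);
}

// Call when an agent's cache has been fully invalidated.
function onFullInvalidation(agentId: string): void {
  agentsWithEntries.delete(agentId);
}

// Skips both embedding generation and the similarity search when the
// agent is known to have no cache entries.
function semanticLookup(
  agentId: string,
  query: string,
  embed: (q: string) => number[],
  search: (agentId: string, vector: number[]) => string | null,
): string | null {
  if (!agentsWithEntries.has(agentId)) return null;
  return search(agentId, embed(query));
}
```

The check costs a single hash lookup, so it is cheap to run on every request.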
Embedding cache
Embedding vectors are cached at two levels:
| Layer | Capacity | Persistence | Write strategy |
|---|---|---|---|
| In-process LRU | 2 048 entries | Lost on restart | Synchronous |
| PostgreSQL (embedding_cache) | Unlimited | Persists across restarts | Batched (100 ms coalesce window) |
Embedding writes are coalesced: multiple insertions within the same 100 ms window are batched into a single multi-row INSERT, reducing Postgres round-trips under bursty traffic.
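The coalescing behaviour might look roughly like the sketch below. The `WriteCoalescer` class, the `Row` shape, and the synchronous `writeBatch` callback are assumptions for illustration; a real implementation would issue an asynchronous multi-row INSERT in that callback.

```typescript
// Hypothetical row shape: SHA-256 key plus the embedding vector.
type Row = { hash: string; vector: number[] };

class WriteCoalescer {
  private pending: Row[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly writeBatch: (rows: Row[]) => void;
  private readonly windowMs: number;

  constructor(writeBatch: (rows: Row[]) => void, windowMs: number = 100) {
    this.writeBatch = writeBatch;
    this.windowMs = windowMs;
  }

  // Queues a row; the first row in a window starts the flush timer.
  add(row: Row): void {
    this.pending.push(row);
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  // Writes everything queued so far as one batch. Safe to call manually,
  // for example on shutdown.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    try {
      this.writeBatch(batch);
    } catch (err) {
      // Best-effort: log and drop the batch rather than fail the request.
      console.warn("embedding cache write skipped", err);
    }
  }
}
```

With this shape, a burst of rows arriving within one window results in a single call to `writeBatch` instead of one call per row.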
Observability
# All cache layer stats (requires JWT auth)
GET /cache/stats
# In-process stats only — no DB, no auth (safe for k8s readiness probes)
GET /cache/health
/cache/health returns instantaneous counts from all 7 in-process cache layers. /cache/stats additionally queries the semantic cache table and provider metrics for a full picture.
Provider cache hit rates are persisted to provider_cache_metrics after every inference call, enabling historical queries:
SELECT provider, AVG(hit_rate) AS avg_hit_rate, SUM(cached_tokens) AS total_saved
FROM provider_cache_metrics
WHERE recorded_at > NOW() - INTERVAL '7 days'
GROUP BY provider
ORDER BY avg_hit_rate DESC;
Failure safety
Every cache operation is best-effort: read errors return null (treated as a miss), and write errors are logged as warnings and then skipped. A cache failure never degrades agent reliability: the live model path is always available as a fallback.
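The fail-open contract can be sketched with two small wrappers. The helper names and the `respond` signature are hypothetical and stand in for whatever the real cache and model clients expose.

```typescript
type Warn = (message: string, err: unknown) => void;

// Read path: any error is treated as a cache miss.
async function readOrNull<T>(read: () => Promise<T | null>): Promise<T | null> {
  try {
    return await read();
  } catch {
    return null;
  }
}

// Write path: errors are logged as warnings and the write is skipped.
async function writeBestEffort(write: () => Promise<void>, warn: Warn): Promise<void> {
  try {
    await write();
  } catch (err) {
    warn("cache write failed; skipping", err);
  }
}

// Cache-aside flow in which the live model call is always reachable.
async function respond(
  key: string,
  cacheGet: (key: string) => Promise<string | null>,
  cacheSet: (key: string, value: string) => Promise<void>,
  callModel: (key: string) => Promise<string>,
  warn: Warn = (m, e) => console.warn(m, e),
): Promise<string> {
  const cached = await readOrNull(() => cacheGet(key));
  if (cached !== null) return cached;
  const fresh = await callModel(key);
  await writeBestEffort(() => cacheSet(key, fresh), warn);
  return fresh;
}
```

Only errors from `callModel` propagate to the caller; a broken cache backend costs latency, not correctness.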