Memory

Semantic Response Cache

Open Astra ships a 7-layer cache stack that progressively filters requests before they reach the model. Each layer is independently tunable and fails open — a cache error always falls through to the next layer or to a live model call.

Cache stack (outer → inner)

#LayerMechanismLatency saved
1Bloom filterSet<agentId> — skips embedding generation entirely for agents with no cache entriesFull embedding API + pgvector round-trip
2Turn hash cacheSHA-256 of full request; exact-match 5-min in-process TTLProvider inference call
3Semantic SWR cachepgvector cosine similarity ≥ 0.97; stale-while-revalidate up to 5 min after expiryProvider inference call
4Negative result cacheSHA-256 of query; 30 s TTL for queries that returned empty memoryTypesense + pgvector memory search
5Graph edge LRU1 000-entry in-process LRU; 2-min TTL for graph search resultspgvector graph traversal
6Swarm L1 cachePer-swarm shared Map; 5-min TTL, 200 entries; sub-agents share memory readsMemory retrieval round-trips
7Embedding cacheLRU(2048) + PostgreSQL SHA-256 keyed; batched writes via 100 ms coalesce windowEmbedding API calls

Semantic SWR cache

The core semantic cache stores LLM responses in PostgreSQL keyed by cosine similarity of query embeddings. It supports stale-while-revalidate (SWR): when an entry has expired but is less than 5 minutes old, the stale response is returned immediately while the cache is refreshed in the background.

  1. Bloom filter check — if agent has no entries, return null immediately (no DB I/O).
  2. Embed the query via the configured embedding provider (hits embedding cache first).
  3. pgvector cosine similarity search against semantic_cache scoped to the agent.
  4. If fresh hit (≥ 0.97, within TTL): return cached response.
  5. If stale hit (expired < 5 min ago): return stale response, schedule background refresh.
  6. On miss: call model, write response to cache.

Enabling the cache

bash
SEMANTIC_CACHE_ENABLED=true

The semantic_cache table is created by migration 024-memory-improvements.sql. No other configuration is required.

Configuration

SettingDefaultNotes
Similarity threshold0.97Cosine similarity required for a cache hit
TTL1 hourAdaptive: hit-rate-driven scaling (log₂ formula), range 10 min – 2 hours
SWR grace period5 minutesStale entries up to 5 min past TTL are served while refreshing
ScopePer agent + workspaceCache is isolated per agentId and workspaceId
Budget per workspace50 MBLRU-evicts oldest entries when workspace exceeds budget

Cache invalidation

Cache entries are invalidated automatically in several scenarios:

  • Memory write: Every memory_write tool call fires invalidateSemanticCacheForAgent() so stale responses don't persist after new information is stored.
  • System prompt change: Open Astra tracks a SHA-256[:16] version of each agent's system prompt. When it changes, all semantic cache entries for that agent are evicted automatically.
  • Cascade invalidation: A single call evicts all layers in order — graph edge LRU → swarm L1 → prefetch store → semantic DB — ensuring no layer serves stale data after a significant state change.
  • Budget enforcement: When a workspace exceeds its 50 MB cache budget, the oldest 25% of entries are LRU-evicted to stay under the limit.
  • Manual purge: DELETE /cache/stats/expired removes all expired entries immediately.

Bloom filter

A module-level Set<agentId> tracks which agents have at least one active cache entry. Before performing an embedding lookup, Open Astra checks this set. If the agent is absent, the entire pgvector query is skipped — no embedding generation, no DB round-trip.

The set is populated on every successful cache write and cleared when an agent's cache is fully invalidated. This is especially valuable for new agents or agents whose cache was recently purged.

Embedding cache

Embedding vectors are cached at two levels:

LayerCapacityPersistenceWrite strategy
In-process LRU2 048 entriesLost on restartSynchronous
PostgreSQL (embedding_cache)UnlimitedPersists across restartsBatched (100 ms coalesce window)

Embedding writes are coalesced: multiple insertions within the same 100 ms window are batched into a single multi-row INSERT, reducing Postgres round-trips under bursty traffic.

Observability

bash
# All cache layer stats (requires JWT auth)
GET /cache/stats

# In-process stats only — no DB, no auth (safe for k8s readiness probes)
GET /cache/health

/cache/health returns instantaneous counts from all 7 in-process cache layers. /cache/stats additionally queries the semantic cache table and provider metrics for a full picture.

Provider cache hit rates are persisted to provider_cache_metrics after every inference call, enabling historical queries:

sql
SELECT provider, AVG(hit_rate) AS avg_hit_rate, SUM(cached_tokens) AS total_saved
FROM provider_cache_metrics
WHERE recorded_at > NOW() - INTERVAL '7 days'
GROUP BY provider
ORDER BY avg_hit_rate DESC;

Failure safety

Every cache operation is best-effort: read errors return null (treated as a miss), write errors are logged as warnings and silently skipped. A cache failure never degrades agent reliability — the live model path is always available as a fallback.