Memory

Semantic Response Cache

The semantic cache stores LLM responses in PostgreSQL alongside their query embeddings, matched by cosine similarity. When an incoming question is semantically close enough to a previously answered one, the cached response is returned immediately and no model call is made.

How it works

  1. The agent loop embeds the user's query using the configured embedding provider.
  2. A pgvector cosine similarity search runs against the semantic_cache table, scoped to the current agent.
  3. If a cached entry exists with similarity ≥ 0.97 (and has not expired), its stored response is returned directly.
  4. On a cache miss, the model is called normally and the response is stored for future reuse.
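The hit/miss decision above can be sketched in a few lines. This is an illustrative model only, assuming embeddings are plain number arrays; the function and type names here are hypothetical, not the actual implementation (which runs the similarity search in pgvector):

```typescript
// Illustrative sketch of the semantic cache decision. In the real system the
// similarity comparison happens inside a pgvector query, not in application code.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_THRESHOLD = 0.97; // hits at cosine similarity >= 0.97

interface CacheEntry {
  embedding: number[];
  response: string;
  expiresAt: number; // epoch milliseconds
}

// Returns the cached response on a hit, or null to fall through to a model call.
function lookup(queryEmbedding: number[], entries: CacheEntry[], now: number): string | null {
  for (const entry of entries) {
    if (entry.expiresAt <= now) continue; // expired entries never hit
    if (cosineSimilarity(queryEmbedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }
  return null;
}
```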

Enabling the cache

Set the environment variable below and the cache activates on the next gateway start. No schema changes are needed: the semantic_cache table is created by migration 024-memory-improvements.sql.

```bash
SEMANTIC_CACHE_ENABLED=true
```

Configuration

| Setting | Default | Notes |
|---|---|---|
| Similarity threshold | 0.97 | Hardcoded (not configurable); the cache hits at or above this cosine similarity |
| TTL | 1 hour | Cached entries expire after 1 hour; expired entries are filtered by the expires_at column |
| Scope | Per agent | Cache is isolated per agentId; agents do not share cached responses |

Cache miss safety

The cache is implemented as best-effort: any read or write error is swallowed and logged as a warning. A lookup failure always falls through to a live model call. This ensures the semantic cache never degrades agent reliability.
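The fall-through behavior can be sketched as below. This is a hedged sketch (synchronous for brevity, names hypothetical): the point is only that a throwing cache lookup is logged and the query proceeds to the model:

```typescript
// Best-effort cache wrapper: any lookup error is swallowed and logged,
// so a cache failure always degrades to a normal live model call.
// Function names are illustrative, not the actual implementation.
function answerWithCache(
  query: string,
  cacheLookup: (q: string) => string | null,
  callModel: (q: string) => string,
): string {
  let cached: string | null = null;
  try {
    cached = cacheLookup(query);
  } catch (err) {
    console.warn("semantic cache lookup failed; falling through to model:", err);
  }
  return cached !== null ? cached : callModel(query);
}
```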

Embedding cache

Separate from the response cache, Open Astra also caches the raw embedding vectors for text inputs. An in-process LRU cache (2,048 slots) is backed by a PostgreSQL embedding_cache table keyed by the SHA-256 hash of the input text. This avoids redundant embedding API calls across both memory operations and semantic cache lookups.

| Layer | Capacity | Persistence |
|---|---|---|
| In-process LRU | 2,048 entries | Lost on restart |
| PostgreSQL (embedding_cache) | Unlimited | Persists across restarts |
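A minimal sketch of the in-process layer, assuming the SHA-256-keyed LRU described above. The class name and capacity handling are illustrative; the real cache is additionally backed by the embedding_cache table:

```typescript
import { createHash } from "crypto";

const LRU_CAPACITY = 2048; // slot count from the docs; eviction policy here is illustrative

// In-process LRU keyed by SHA-256 of the input text. A JavaScript Map
// preserves insertion order, so re-inserting on access gives LRU behavior.
class EmbeddingLru {
  private map = new Map<string, number[]>();

  private key(text: string): string {
    return createHash("sha256").update(text, "utf8").digest("hex");
  }

  get(text: string): number[] | undefined {
    const k = this.key(text);
    const hit = this.map.get(k);
    if (hit !== undefined) {
      // Move to the back of the Map to mark as most recently used.
      this.map.delete(k);
      this.map.set(k, hit);
    }
    return hit;
  }

  set(text: string, embedding: number[]): void {
    const k = this.key(text);
    this.map.delete(k);
    this.map.set(k, embedding);
    if (this.map.size > LRU_CAPACITY) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.map.keys().next().value as string;
      this.map.delete(oldest);
    }
  }
}
```

On a miss in this layer, the lookup would fall through to the embedding_cache table before calling the embedding API.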