# Semantic Response Cache
The semantic cache stores LLM responses in PostgreSQL, keyed by their query embeddings. When an incoming question's embedding is close enough (by cosine similarity) to that of a previously answered one, the cached response is returned immediately and no model call is made.
## How it works
- The agent loop embeds the user's query using the configured embedding provider.
- A pgvector cosine similarity search runs against the `semantic_cache` table, scoped to the current agent.
- If a cached entry exists with similarity ≥ 0.97 (and has not expired), its stored response is returned directly.
- On a cache miss, the model is called normally and the response is stored for future reuse.
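The loop above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `embed`, `lookup`, `store`, and `callModel` helpers (and the SQL in the comment) are hypothetical stand-ins for whatever the gateway actually uses.

```typescript
// Sketch of the semantic-cache lookup flow. All helper signatures here are
// hypothetical; only the threshold and the hit/miss behavior come from the docs.
const SIMILARITY_THRESHOLD = 0.97; // hardcoded threshold

interface CacheRow {
  response: string;
  similarity: number; // cosine similarity between cached and incoming query
}

// A candidate row is a usable hit only at or above the threshold.
function isCacheHit(row: CacheRow | undefined): row is CacheRow {
  return row !== undefined && row.similarity >= SIMILARITY_THRESHOLD;
}

async function answer(
  query: string,
  agentId: string,
  embed: (text: string) => Promise<number[]>,
  lookup: (embedding: number[], agentId: string) => Promise<CacheRow | undefined>,
  store: (embedding: number[], agentId: string, response: string) => Promise<void>,
  callModel: (query: string) => Promise<string>,
): Promise<string> {
  const embedding = await embed(query);
  // The lookup would be a pgvector query along these lines (illustrative):
  //   SELECT response, 1 - (embedding <=> $1) AS similarity
  //   FROM semantic_cache
  //   WHERE agent_id = $2 AND expires_at > now()
  //   ORDER BY embedding <=> $1 LIMIT 1
  const best = await lookup(embedding, agentId);
  if (isCacheHit(best)) return best.response; // hit: skip the model entirely
  const response = await callModel(query);    // miss: live model call
  await store(embedding, agentId, response);  // write back for future reuse
  return response;
}
```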
## Enabling the cache
Set the environment variable below and the cache activates on the next gateway start. No schema changes are needed: the `semantic_cache` table is created by migration `024-memory-improvements.sql`.

```
SEMANTIC_CACHE_ENABLED=true
```

## Configuration
| Setting | Default | Notes |
|---|---|---|
| Similarity threshold | 0.97 | Hardcoded in the lookup query; a cached entry hits at or above this cosine similarity |
| TTL | 1 hour | Cached entries expire after 1 hour; expired entries are filtered out via the `expires_at` column |
| Scope | Per agent | The cache is isolated per `agentId`; agents do not share cached responses |
## Cache miss safety
The cache is implemented as best-effort: any read or write error is swallowed and logged as a warning. A lookup failure always falls through to a live model call. This ensures the semantic cache never degrades agent reliability.
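The best-effort behavior can be illustrated with a wrapper of this shape (a sketch under assumed names; `tryCache`, `callModel`, and `warn` are hypothetical, not the actual functions):

```typescript
// Sketch: any cache error is caught and logged as a warning, never rethrown,
// so a failing cache can only ever produce a miss, not a failed request.
async function cachedOrLive(
  tryCache: () => Promise<string | undefined>,
  callModel: () => Promise<string>,
  warn: (msg: string) => void = console.warn,
): Promise<string> {
  let cached: string | undefined;
  try {
    cached = await tryCache();
  } catch (err) {
    warn(`semantic cache lookup failed: ${err}`); // swallowed, logged
  }
  if (cached !== undefined) return cached;
  return callModel(); // lookup failure always falls through to a live call
}
```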
## Embedding cache
Separately from the response cache, Open Astra also caches the raw embedding vectors for text inputs. A 2,048-slot in-process LRU cache is backed by a PostgreSQL `embedding_cache` table keyed by the SHA-256 hash of the input text. This eliminates redundant embedding API calls across both memory operations and semantic cache lookups.
| Layer | Capacity | Persistence |
|---|---|---|
| In-process LRU | 2,048 entries | Lost on restart |
| PostgreSQL (`embedding_cache`) | Unlimited | Persists across restarts |
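The two layers in the table above can be sketched like this. The `LruCache` class and the `dbGet`/`dbPut`/`embedApi` helpers are illustrative stand-ins; only the capacity, the SHA-256 key, and the layer order come from the docs.

```typescript
import { createHash } from "node:crypto";

// Sketch of the two-layer embedding cache: an in-process LRU in front of a
// persistent store keyed by SHA-256 of the input text.
const LRU_CAPACITY = 2048;

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private capacity: number) {}
  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Re-insert to mark this entry as most recently used.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }
  set(key: string, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Map iterates in insertion order, so the first key is least recent.
      this.map.delete(this.map.keys().next().value!);
    }
  }
}

const lru = new LruCache<number[]>(LRU_CAPACITY);

async function getEmbedding(
  text: string,
  dbGet: (key: string) => Promise<number[] | undefined>, // embedding_cache read
  dbPut: (key: string, v: number[]) => Promise<void>,    // embedding_cache write
  embedApi: (text: string) => Promise<number[]>,         // real embedding call
): Promise<number[]> {
  const key = sha256(text);
  const hit = lru.get(key);       // layer 1: in-process LRU
  if (hit) return hit;
  let vec = await dbGet(key);     // layer 2: persistent table
  if (!vec) {
    vec = await embedApi(text);   // only now pay for the API call
    await dbPut(key, vec);
  }
  lru.set(key, vec);
  return vec;
}
```

Keying on a content hash rather than the raw text keeps the persistent key a fixed size regardless of input length, which is the usual reason for this design.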