# Embedding Cache

The embedding cache is a two-layer cache that eliminates redundant embedding generation calls. The same text always produces the same embedding, so caching by content hash avoids re-calling the embedding provider (Gemini API or local n-gram fallback) for previously seen inputs.

## Two-layer architecture

```text
┌──────────────────┐     miss     ┌──────────────────┐     miss
│  L1: In-Process  │ ───────────→ │  L2: PostgreSQL  │ ──────────→  Generate
│  LRU Map (2048)  │              │  embedding_cache │              Embedding
└────────┬─────────┘              └────────┬─────────┘
         │ hit                             │ hit
         ↓                                 ↓
    Return cached                     Return cached
    (sub-microsecond)                 (single query)
```

| Layer | Storage | Capacity | Latency |
|-------|---------|----------|---------|
| L1 | In-process `Map` (LRU) | 2,048 entries | Sub-microsecond |
| L2 | PostgreSQL `embedding_cache` | Unlimited | Single query (~1–5 ms) |

## Cache key

The cache key is the SHA-256 hex digest of the raw input text: identical text always maps to the same key, independent of the caller or request context. Because the hash is computed over the exact bytes, inputs that differ in whitespace or encoding produce different keys, so any normalization must happen before hashing.

```typescript
import { createHash } from 'crypto'

function hashText(text: string): string {
  return createHash('sha256').update(text).digest('hex')
}
```

## L1: In-process LRU

The L1 cache uses JavaScript's `Map` insertion-order guarantee to implement LRU eviction. On access, entries are deleted and re-inserted to move them to the "most recently used" position. When the map exceeds 2,048 entries, the oldest (first-inserted) entry is evicted.

```typescript
// In-process LRU using Map (insertion-order iteration)
const LRU_SIZE = 2048                       // max entries (not bytes)
const cache = new Map<string, number[]>()   // key: SHA-256 hash → embedding

function set(key: string, value: number[]): void {
  if (cache.has(key)) cache.delete(key)     // re-insert to move to end
  cache.set(key, value)
  if (cache.size > LRU_SIZE) {
    const oldest = cache.keys().next().value!
    cache.delete(oldest)                    // evict oldest (least recently used)
  }
}

function get(key: string): number[] | undefined {
  const value = cache.get(key)
  if (value !== undefined) {
    cache.delete(key)                       // move to end (most recently used)
    cache.set(key, value)
  }
  return value
}
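The same pattern can be factored into a small reusable class. Here is a minimal sketch with a configurable capacity; the `LruCache` name and generic signature are illustrative, not part of the codebase:

```typescript
class LruCache<K, V> {
  private map = new Map<K, V>()
  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key)
    if (value !== undefined) {
      this.map.delete(key)              // re-insert to mark as most recently used
      this.map.set(key, value)
    }
    return value
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key)
    this.map.set(key, value)
    if (this.map.size > this.maxSize) {
      const oldest = this.map.keys().next().value as K
      this.map.delete(oldest)           // evict least recently used
    }
  }
}

// Usage: with capacity 2, touching 'a' saves it from eviction; 'b' goes instead
const lru = new LruCache<string, number>(2)
lru.set('a', 1)
lru.set('b', 2)
lru.get('a')        // 'a' is now most recently used
lru.set('c', 3)     // evicts 'b'
```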

## L2: PostgreSQL

The L2 cache stores embeddings as pgvector `vector` columns keyed by SHA-256 hash. Writes use `ON CONFLICT DO NOTHING` for idempotency. A `text_preview` column stores the first 200 characters for human inspection.

```sql
-- Table: embedding_cache
CREATE TABLE embedding_cache (
  text_hash     TEXT PRIMARY KEY,        -- SHA-256 hex
  text_preview  TEXT,                    -- first 200 chars (for inspection)
  embedding     vector NOT NULL          -- pgvector type
);
```
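The idempotent write path described above might look like the following sketch; `$1`–`$3` are positional parameters in the style of node-postgres, and the exact statement in the codebase may differ:

```sql
-- Fire-and-forget L2 write: a concurrent insert of the same hash is a no-op
INSERT INTO embedding_cache (text_hash, text_preview, embedding)
VALUES ($1, left($2, 200), $3)
ON CONFLICT (text_hash) DO NOTHING;
```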
**Persistent across restarts.** The L1 cache is lost on gateway restart, but L2 persists in PostgreSQL. On restart, frequently used embeddings are automatically promoted back to L1 as they're accessed.

## Error handling

The cache is designed never to block or break the embedding pipeline: read failures are silently absorbed, and write failures are logged but not propagated.

```text
// getCachedEmbedding: silently returns null on any error
// → Never throws, never blocks the caller

// setCachedEmbedding: logs a warning on Postgres failure
// → L1 write is synchronous and always succeeds
// → L2 write is fire-and-forget (ON CONFLICT DO NOTHING)
```
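A minimal sketch of this failure-absorbing pattern, assuming hypothetical `l2Get`/`l2Set` Postgres helpers (stubbed here with an in-memory map so the sketch is self-contained; the real versions would query PostgreSQL and can throw on connection errors):

```typescript
// Hypothetical L2 helpers — stand-ins for the real Postgres queries
const l2 = new Map<string, number[]>()
async function l2Get(hash: string): Promise<number[] | null> {
  return l2.get(hash) ?? null
}
async function l2Set(hash: string, emb: number[]): Promise<void> {
  l2.set(hash, emb)
}

const l1 = new Map<string, number[]>()

async function getCachedEmbedding(hash: string): Promise<number[] | null> {
  const hit = l1.get(hash)
  if (hit) return hit
  try {
    const emb = await l2Get(hash)
    if (emb) l1.set(hash, emb)   // promote L2 hit back into L1
    return emb
  } catch {
    return null                  // never throws, never blocks the caller
  }
}

function setCachedEmbedding(hash: string, emb: number[]): void {
  l1.set(hash, emb)              // L1 write: synchronous, always succeeds
  l2Set(hash, emb).catch((err) =>
    console.warn('embedding_cache L2 write failed:', err)  // fire-and-forget
  )
}
```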
## Related

- **Tiered Search** — uses embeddings for vector similarity across Typesense and pgvector
- **Semantic Cache** — caches full responses (not embeddings) by query similarity
- **Prompt Caching** — provider-side prompt prefix caching (different layer)