# Memory

## Jaccard Dedup

The memory consolidation pipeline removes near-duplicate daily memory entries using Jaccard similarity over word sets. Two entries whose filtered word sets have a Jaccard similarity above 0.75 are treated as duplicates; the older entry is deleted and the newer one is kept.

### Algorithm

```typescript
// Runs during consolidation over the last 30 days of daily memory.
// `a` and `b` are two daily memory entries being compared.
const wordsA = new Set(
  a.content.toLowerCase().split(/\W+/).filter(w => w.length > 4)
)
const wordsB = new Set(
  b.content.toLowerCase().split(/\W+/).filter(w => w.length > 4)
)
const intersection = [...wordsA].filter(w => wordsB.has(w)).length
const union = new Set([...wordsA, ...wordsB]).size
const jaccard = union > 0 ? intersection / union : 0

if (jaccard > 0.75) {
  // Remove the older entry, keep the newer one
}
```
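Applied to two concrete entries, the computation looks like the following minimal sketch. The sample sentences and the `jaccardOverlap` helper name are illustrative, not part of the actual pipeline:

```typescript
// Jaccard similarity over filtered word sets, as in the snippet above
function jaccardOverlap(a: string, b: string): number {
  const words = (text: string) =>
    new Set(text.toLowerCase().split(/\W+/).filter(w => w.length > 4))
  const [wordsA, wordsB] = [words(a), words(b)]
  const intersection = [...wordsA].filter(w => wordsB.has(w)).length
  const union = new Set([...wordsA, ...wordsB]).size
  return union > 0 ? intersection / union : 0
}

// Two near-duplicate entries differing by a single word (invented examples)
const older = "Alice prefers morning meetings because afternoons are reserved for focused coding."
const newer = "Alice usually prefers morning meetings because afternoons are reserved for focused coding."
const sim = jaccardOverlap(older, newer)  // 9 shared words / 10 total = 0.9
```

Since 0.9 exceeds the 0.75 threshold, the older entry would be deleted.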

### Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| Similarity threshold | 0.75 | Pairs above this are treated as duplicates |
| Word filter | length > 4 | Short/common words are excluded from comparison |
| Scope | Last 30 days | Only recent daily memory is checked |
| Resolution | Keep newer | Older entry is deleted when duplicates are found |
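Putting the parameters together, the whole pass could be sketched roughly as follows. The `DailyEntry` shape and function names are assumptions for illustration, and the real pipeline additionally restricts candidates to the last 30 days:

```typescript
interface DailyEntry { createdAt: number; content: string }  // illustrative shape

const tokens = (text: string): Set<string> =>
  new Set(text.toLowerCase().split(/\W+/).filter(w => w.length > 4))

const jaccard = (a: Set<string>, b: Set<string>): number => {
  const intersection = [...a].filter(w => b.has(w)).length
  const union = new Set([...a, ...b]).size
  return union > 0 ? intersection / union : 0
}

// Newest-first scan: a newer entry always survives against an older duplicate
function dedupe(entries: DailyEntry[], threshold = 0.75): DailyEntry[] {
  const newestFirst = [...entries].sort((x, y) => y.createdAt - x.createdAt)
  const kept: DailyEntry[] = []
  for (const entry of newestFirst) {
    const isDup = kept.some(k => jaccard(tokens(k.content), tokens(entry.content)) > threshold)
    if (!isDup) kept.push(entry)  // otherwise this older copy is dropped
  }
  return kept
}
```

Scanning newest-first makes the "keep newer" resolution fall out naturally: by the time an older duplicate is considered, its newer twin is already in the kept list.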

### Why Jaccard

Jaccard similarity over word sets is fast, language-agnostic, and robust to minor paraphrasing. It is a better fit for short memory entries (1-3 sentences) than embedding-based similarity, which can over-cluster semantically similar but factually distinct observations.
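To make the contrast concrete, here is a small example with invented sentences: two entries about the same event but with different facts share only their topic words, so they score well below the 0.75 threshold and both are kept:

```typescript
const words = (text: string): Set<string> =>
  new Set(text.toLowerCase().split(/\W+/).filter(w => w.length > 4))

const jaccardSim = (a: string, b: string): number => {
  const [wa, wb] = [words(a), words(b)]
  const intersection = [...wa].filter(w => wb.has(w)).length
  const union = new Set([...wa, ...wb]).size
  return union > 0 ? intersection / union : 0
}

// Topically similar but factually distinct observations
const sim = jaccardSim(
  "Alice scheduled the database migration for Friday evening",
  "Alice postponed the database migration until Monday morning"
)  // shared: alice, database, migration → 3 / 10 = 0.3
```

An embedding model would likely place these two sentences very close together, but word-set Jaccard keeps them comfortably apart.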

Jaccard dedup targets daily memory entries (Tier 2). It does not deduplicate knowledge-graph entities or user profile facts — those have their own consistency mechanisms.