# Jaccard Dedup
The memory consolidation pipeline removes near-duplicate daily memory entries using Jaccard similarity over word sets. Two entries whose Jaccard similarity exceeds 0.75 are considered duplicates; the older entry is deleted and the newer one is kept.
## Algorithm

```typescript
// Runs during consolidation over the last 30 days of daily memory
const wordsA = new Set(
  a.content.toLowerCase().split(/\W+/).filter(w => w.length > 4)
);
const wordsB = new Set(
  b.content.toLowerCase().split(/\W+/).filter(w => w.length > 4)
);

// Jaccard similarity: |A ∩ B| / |A ∪ B|
const intersection = [...wordsA].filter(w => wordsB.has(w)).length;
const union = new Set([...wordsA, ...wordsB]).size;
const jaccard = union > 0 ? intersection / union : 0;

if (jaccard > 0.75) {
  // Remove the older entry, keep the newer one
}
```

## Parameters
| Parameter | Value | Description |
|---|---|---|
| Similarity threshold | 0.75 | Pairs above this are treated as duplicates |
| Word filter | length > 4 | Words of four or fewer characters (mostly stopwords) are excluded from comparison |
| Scope | Last 30 days | Only recent daily memory is checked |
| Resolution | Keep newer | Older entry is deleted when duplicates are found |
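The tokenization and scoring steps above can be packaged as a standalone helper. This is a minimal sketch: the names `toWordSet` and `jaccardSimilarity` are illustrative, not taken from the pipeline.

```typescript
// Tokenize to a set of lowercase words longer than 4 characters,
// mirroring the word filter in the consolidation snippet.
function toWordSet(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/\W+/).filter(w => w.length > 4)
  );
}

// Jaccard similarity of two texts' word sets: |A ∩ B| / |A ∪ B|.
// Returns 0 when both sets are empty (e.g. only short words).
function jaccardSimilarity(a: string, b: string): number {
  const wordsA = toWordSet(a);
  const wordsB = toWordSet(b);
  const intersection = [...wordsA].filter(w => wordsB.has(w)).length;
  const union = new Set([...wordsA, ...wordsB]).size;
  return union > 0 ? intersection / union : 0;
}
```

Identical entries score 1.0, disjoint entries score 0, and a paraphrase that preserves most content words lands in between; only pairs above 0.75 are merged.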
## Why Jaccard
Jaccard similarity over word sets is fast, language-agnostic, and robust to minor paraphrasing. It is a better fit for short memory entries (1-3 sentences) than embedding-based similarity, which can over-cluster semantically similar but factually distinct observations.
Jaccard dedup targets daily memory entries (Tier 2). It does not deduplicate knowledge-graph entities or user profile facts — those have their own consistency mechanisms.
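The Tier 2 dedup pass itself can be sketched as follows. This is a self-contained illustration, not the pipeline's actual code: the `DailyEntry` shape (`content`, `createdAt`) and function names are assumptions.

```typescript
interface DailyEntry {
  content: string;
  createdAt: number; // epoch millis; larger means newer (field name is illustrative)
}

// Word-set Jaccard score, using the same tokenization as the algorithm above.
function similarity(a: string, b: string): number {
  const setOf = (t: string) =>
    new Set(t.toLowerCase().split(/\W+/).filter(w => w.length > 4));
  const A = setOf(a);
  const B = setOf(b);
  const intersection = [...A].filter(w => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return union > 0 ? intersection / union : 0;
}

// Dedup pass over a window of entries (e.g. the last 30 days).
// Scanning newest-first means that whenever a pair crosses the threshold,
// the newer entry is already kept and the older one is dropped.
function dedupDailyMemory(entries: DailyEntry[], threshold = 0.75): DailyEntry[] {
  const newestFirst = [...entries].sort((x, y) => y.createdAt - x.createdAt);
  const kept: DailyEntry[] = [];
  for (const entry of newestFirst) {
    const isDuplicate = kept.some(
      k => similarity(k.content, entry.content) > threshold
    );
    if (!isDuplicate) kept.push(entry);
  }
  return kept;
}
```

Scanning newest-first is one simple way to realize the "keep newer" resolution rule without explicit pairwise deletion bookkeeping.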