# Jaccard Dedup
The memory consolidation pipeline removes near-duplicate daily memory entries using Jaccard similarity over word sets. Two entries whose Jaccard similarity exceeds 0.75 are considered duplicates; the older entry is deleted and the newer one is kept.
## Algorithm

```typescript
// Runs during consolidation over the last 30 days of daily memory
const wordsA = new Set(
  a.content.toLowerCase().split(/\W+/).filter(w => w.length > 4)
);
const wordsB = new Set(
  b.content.toLowerCase().split(/\W+/).filter(w => w.length > 4)
);

// Jaccard similarity: |A ∩ B| / |A ∪ B|
const intersection = [...wordsA].filter(w => wordsB.has(w)).length;
const union = new Set([...wordsA, ...wordsB]).size;
const jaccard = union > 0 ? intersection / union : 0;

if (jaccard > 0.75) {
  // Remove the older entry, keep the newer one
}
```

## Parameters
| Parameter | Value | Description |
|---|---|---|
| Similarity threshold | 0.75 | Pairs above this are treated as duplicates |
| Word filter | length > 4 | Words of four or fewer characters (mostly stopwords) are excluded from comparison |
| Scope | Last 30 days | Only recent daily memory is checked |
| Resolution | Keep newer | Older entry is deleted when duplicates are found |
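The tokenization and scoring steps above can be packaged as a standalone helper. This is a minimal sketch: the names `toWordSet` and `jaccardSimilarity` are illustrative, not taken from the pipeline.

```typescript
// Tokenize to a set of lowercase words longer than 4 characters,
// mirroring the word filter in the consolidation snippet.
function toWordSet(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/\W+/).filter(w => w.length > 4)
  );
}

// Jaccard similarity of two texts' word sets: |A ∩ B| / |A ∪ B|.
// Returns 0 when both sets are empty (e.g. only short words).
function jaccardSimilarity(a: string, b: string): number {
  const wordsA = toWordSet(a);
  const wordsB = toWordSet(b);
  const intersection = [...wordsA].filter(w => wordsB.has(w)).length;
  const union = new Set([...wordsA, ...wordsB]).size;
  return union > 0 ? intersection / union : 0;
}
```

Identical entries score 1.0, disjoint entries score 0, and a paraphrase that preserves most content words lands in between; only pairs above 0.75 are merged.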
## Why Jaccard
Jaccard similarity over word sets is fast, language-agnostic, and robust to minor paraphrasing. It is a better fit for short memory entries (1-3 sentences) than embedding-based similarity, which can over-cluster semantically similar but factually distinct observations.
Jaccard dedup targets daily memory entries (Tier 2). It does not deduplicate knowledge-graph entities or user profile facts — those have their own consistency mechanisms.
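The Tier 2 dedup pass itself can be sketched as follows. This is a self-contained illustration, not the pipeline's actual code: the `DailyEntry` shape (`content`, `createdAt`) and function names are assumptions.

```typescript
interface DailyEntry {
  content: string;
  createdAt: number; // epoch millis; larger means newer (field name is illustrative)
}

// Word-set Jaccard score, using the same tokenization as the algorithm above.
function similarity(a: string, b: string): number {
  const setOf = (t: string) =>
    new Set(t.toLowerCase().split(/\W+/).filter(w => w.length > 4));
  const A = setOf(a);
  const B = setOf(b);
  const intersection = [...A].filter(w => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return union > 0 ? intersection / union : 0;
}

// Dedup pass over a window of entries (e.g. the last 30 days).
// Scanning newest-first means that whenever a pair crosses the threshold,
// the newer entry is already kept and the older one is dropped.
function dedupDailyMemory(entries: DailyEntry[], threshold = 0.75): DailyEntry[] {
  const newestFirst = [...entries].sort((x, y) => y.createdAt - x.createdAt);
  const kept: DailyEntry[] = [];
  for (const entry of newestFirst) {
    const isDuplicate = kept.some(
      k => similarity(k.content, entry.content) > threshold
    );
    if (!isDuplicate) kept.push(entry);
  }
  return kept;
}
```

Scanning newest-first is one simple way to realize the "keep newer" resolution rule without explicit pairwise deletion bookkeeping.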