dedupe-against-corpus

Name: dedupe-against-corpus
Author: lyndonkl/claude

$npx mdskill add lyndonkl/claude/dedupe-against-corpus

Prevents duplicate seeds by fingerprinting and clustering content.

Blocks exact matches and links near-duplicates to existing seeds.
Integrates with corpus storage and topic tagging systems.
Decides outcomes using SHA256 hashes and Jaccard similarity scores.
Returns SKIPPED for duplicates or creates related seed links.

SKILL.md

.github/skills/dedupe-against-corpusView on GitHub ↗

---
name: dedupe-against-corpus
description: Checks a candidate seed against the existing substacker corpus (seeds, drafts, published) for exact duplicates (sha256 fingerprint) and near-duplicates (title Jaccard, first-200-word Jaccard, shared topic cluster). Exact match exits as SKIPPED. Near-match links via related_seeds rather than creating a duplicate. Use after topic tagging and density scoring, before writing the seed. Trigger keywords: dedupe, duplicate, already thought, near-match, related seed, fingerprint.
---

# Dedupe Against Corpus

## Table of Contents

- [Three tiers of match](#three-tiers-of-match)
- [Workflow](#workflow)
- [Worked example](#worked-example)
- [Guardrails](#guardrails)

**Related skills:** Called by `ingest-inbox-item` step 4. Queried ad-hoc by `search-corpus`. Backlinks into matched seeds (the one place this skill writes outside new seeds).

## Three tiers of match

1. **Tier 1 — Exact fingerprint**: sha256 of body matches an existing seed's fingerprint. → `SKIPPED`.
2. **Tier 2 — Title similarity**: normalized Jaccard on title tokens ≥ 0.7. → `LINK` candidate.
3. **Tier 3 — Content similarity**: for seeds sharing ≥2 topic tags, first-200-word Jaccard ≥ 0.5. → `LINK` candidate.

Tier 2 and 3 candidates are unioned (up to 3 total LINK targets).

## Workflow

```
Dedupe one candidate seed:
- [ ] Step 1: Grep all corpus/**/*.md frontmatter for fingerprint match
- [ ] Step 2: If exact match, return SKIPPED
- [ ] Step 3: Normalized title Jaccard against all existing seeds
- [ ] Step 4: For seeds sharing ≥2 topic tags, first-200-word Jaccard
- [ ] Step 5: Union tier-2 and tier-3 candidates, cap at 3
- [ ] Step 6: If any LINK candidates, return LINK with related_seeds list; else CREATE
- [ ] Step 7: For LINK, Edit matched seeds' related_seeds to add this candidate's id
```

### Step 7: Backlink safely

The matched seeds get their `related_seeds` field extended (append, never replace) ONLY IF:
- `manual_edits: false` on the matched seed, OR
- The edit is strictly additive (appending an id to a list is always safe).

Do not touch any other field on the matched seed.

## Worked example

**Candidate**: new seed about "dropout as ensemble" with topics `[regularization, ensembling, dropout]`, first 200 words describing thinned-network averaging.

**Existing corpus**:
- `2026-03-11-l2-as-gaussian-prior` — topics `[regularization, bayesian]`. Title Jaccard = 0.1 (different). Shared tags: 1. Skip content-similarity check.
- `2026-02-08-bagging-in-deep-nets` — topics `[regularization, ensembling]`. Shared tags: 2. First-200-word Jaccard: 0.51 → LINK candidate.

**Output**: `{action: LINK, related_seeds: [2026-02-08-bagging-in-deep-nets]}`.

Side effect: `corpus/seeds/2026-02-08-bagging-in-deep-nets.md` gets its `related_seeds` appended with the new seed's id.

## Guardrails

1. Never merge, rewrite, or delete. Only link. Merging is the writer's call.
2. Never alter an existing seed's body. Only its `links.related_seeds` field.
3. Never cross-link into `corpus/dead/` — dead ideas stay dead.
4. Fingerprint is over the body only (not frontmatter). Retagging must not change fingerprint.
5. The 3-link cap prevents unbounded link chains.
6. Never propose LINK across `status: published` to `status: seed` — published posts are immutable except for typo fixes.

## Quick reference

- Returns one of: `SKIPPED`, `LINK`, `CREATE`.
- LINK can carry up to 3 related-seed ids.
- Side effect: backlinks into matched seeds (additive).