cluster-corpus-by-theme

$npx mdskill add lyndonkl/claude/cluster-corpus-by-theme

Cluster posts by theme using Braun & Clarke's six-phase analysis.

  • Reveals candidate sections from full post bodies without titles.
  • Depends on the substacker corpus of published posts.
  • Groups semantic codes via axial similarity and validates clusters.
  • Outputs YAML with candidate handles, membership, and cohesion.
SKILL.md
.github/skills/cluster-corpus-by-themeView on GitHub ↗
---
name: cluster-corpus-by-theme
description: Performs axial-coding-style thematic clustering over the substacker corpus of published posts to surface candidate sections. Uses Braun & Clarke's six-phase thematic analysis — familiarization, initial coding, searching for themes, reviewing themes, defining themes, naming. Reads full bodies, not titles. Use when re-opening the section question. Trigger keywords: cluster, theme, axial coding, thematic analysis, candidate sections.
---

# Cluster Corpus by Theme

## Workflow

```
Per Curator run:
- [ ] Step 1: Read every post in corpus/published/** end-to-end (not just titles)
- [ ] Step 2: Extract 3-5 codes per post (concepts, methods, domains)
- [ ] Step 3: Group codes across posts by semantic similarity (axial)
- [ ] Step 4: Validate clusters — split or merge where needed
- [ ] Step 5: Report candidate clusters with membership, cohesion, outliers
```

## Output

```yaml
cluster_1:
  candidate_handle: "kalshi-log"
  posts: [list of slugs]
  cohesion: high | medium | low
  centroid_codes: [top 5 codes]
  outlier_posts: [weakly-attached members]
rejected_clusters: [clusters with <3 posts]
```

## Heuristics

- Cohesion `high`: ≥5 posts, shared centroid, clear register.
- Cohesion `medium`: 3-4 posts or mixed register.
- Cohesion `low`: cluster exists but coherence is weak.
- Reject any cluster with <3 posts (below the 3-post floor for real sections).

## Guardrails

1. Read full post bodies. Titles are marketing; bodies are the beat.
2. Do not force every post into a cluster. Outliers are legitimate.
3. If >30% of corpus doesn't cluster coherently (all cohesion low), emit "corpus too heterogeneous" → Curator abandons section proposals this run.
4. Include rejected clusters in output — they feed `recommend-prune` and "watch" candidates.
5. Single-threaded. Don't race on the corpus.
More from lyndonkl/claude