evolve

Name: evolve
Author: SethGammon/Citadel
$npx mdskill add SethGammon/Citadel/evolve
Directs autonomous improvement cycles through hypothesis and validation.
Enables sustained quality advancement via causal hypothesis formation.
Validates theories with scout agents before deploying fleet attacks.
Extracts transferable patterns that propagate across skill sets.
Delivers results through persistent belief models and pattern libraries.
SKILL.md
.github/skills/evolveView on GitHub ↗
---
name: evolve
description: >-
  Research-driven multi-cycle improvement director. Forms causal hypotheses
  about why scores are low, validates them with scout agents before attacking,
  dispatches axis-parallel fleet attacks, extracts transferable patterns, and
  runs indefinitely within a budget envelope. Accumulates a persistent belief
  model and pattern library across sessions.
user-invocable: true
auto-trigger: false
last-updated: 2026-05-03
---

# /evolve — Improvement Director

## Orientation

**Use when:** You want sustained autonomous quality advancement — the director
forms hypotheses, scouts before attacking, and builds a belief model that
compounds across cycles. Runs until a natural ceiling, budget exhaustion, or you
say stop.

**Don't use when:** You want a single scored loop (`/improve`), a known axis
attacked directly (`/improve --axis`), or a one-time audit (`/improve --score-only`).

**Key difference from `/improve`:** `/improve` follows the rubric mechanically.
`/evolve` asks *why* scores are where they are, validates those theories before
spending fleet budget, and extracts cross-skill patterns that propagate to skills
never directly attacked.

## Invocation

```
/evolve {target}                  # run until ceiling, velocity drop, or budget
/evolve {target} --n={N}          # exactly N director cycles then stop
/evolve {target} --budget=${X}    # run until cumulative spend reaches $X
/evolve {target} --continue       # resume from saved director state
/evolve {target} --status         # show belief model, velocity, spend — no attack
/evolve {target} --axis={name}    # focus director on one axis (scout + attack only)
```

`target` maps to `.planning/rubrics/{target}.md`.
If no rubric exists, run `/improve {target}` Phase 0 first — `/evolve` requires
an approved rubric and will not auto-generate one.

---

## Campaign Artifacts

All findings are externalized incrementally — written after every phase, not
only at cycle end. A crashed or compacted session resumes with full context.

| Artifact | Path | Contents |
|---|---|---|
| Director state | `.planning/evolve/{target}/director-state.json` | cycle count, spend, velocity history, current phase, halt status |
| Belief model | `.planning/evolve/{target}/belief-model.jsonl` | one record per (axis, skill) per cycle: score, hypothesis, evidence, confidence |
| Experiment log | `.planning/evolve/{target}/experiment-log.jsonl` | every experiment: hypothesis → prediction → actual delta → mechanism confirmed |
| Pattern library | `.planning/evolve/{target}/pattern-library.md` | transferable patterns: what change to what axis class caused what delta in which skills |
| Cycle digest | `.planning/evolve/{target}/cycle-{n}-digest.md` | human-readable per-cycle summary for review |
| Global patterns | `.planning/research/patterns.md` | cross-target patterns written outside campaign scope; available to future sessions and other targets |
| Knowledge wiki | `.planning/wiki/` | compiled wiki pages from `/learn`; integrates evolve discoveries across sessions |

Create `.planning/evolve/{target}/` on first invocation. Create `.planning/research/` if absent.

---

## Cycle Digest Format

```markdown
# Cycle {n} — {target} | {date}

## Scores
| Axis | Prior | This Cycle | Delta |

## Hypotheses
| ID | Axis | Hypothesis | Scout Result | Confidence |

## What Was Attacked
| Axis | Skill | Delta | Mechanism Confirmed |

## Patterns Discovered This Cycle
- {pattern}: {evidence}

## Belief Model Updates
- {hypothesis confirmed / rejected / revised}

## Spend: ${cycle} this cycle | ${cumulative} cumulative | Velocity: {v}
```

---

## Director Cycle Protocol

### Phase 1: Survey

Run `/improve {target} --score-only`. Record scores to belief model with delta
from prior cycle (empty on cycle 1). Flag any axis that dropped since last cycle
as `regression-watch` — these are checked first in Phase 2.

### Phase 2: Hypothesize

For every axis below 8.0, generate one primary hypothesis in this form:

```
HYPOTHESIS: {axis} scores {n}/10 because {specific mechanism},
            not because {common misread}.
PREDICTION: Fixing {mechanism} will raise score ≥ {delta} across {N} skills.
FALSIFICATION: If we apply {change} and score does not rise > 0.5, hypothesis rejected.
```

Draw hypotheses from: evaluator justifications in Phase 1, prior evidence in
the belief model, and programmatic check failures. Do not hypothesize from score
alone — the number is the symptom.

Write each hypothesis to the experiment log as `{ id, status: "pending", ... }`.

Skip hypothesis generation for an axis if the belief model already has a
`confidence >= 0.8` confirmed hypothesis for it that has not yet been attacked.

### Phase 3: Scout

For axes below 7.0, or axes with unconfirmed hypotheses: dispatch one scout
agent per hypothesis. Scouts read — they do not modify files.

Each scout returns:
```json
{ "hypothesis_id": "...", "confirmed": true, "evidence": "...", "confidence": 0.85 }
```

**Scout confidence protocol**: Scouts read relevant files only — no edits, no test runs. Assign `confidence`:
- **0.9+**: mechanism is directly observable (explicit absence, missing section, wrong value in file)
- **0.7–0.89**: strong indirect evidence from 2+ corroborating observations
- **0.4–0.69**: single observation that supports the hypothesis; alternative explanations plausible
- **< 0.4**: no direct evidence found; hypothesis is speculative from this file set

Run scouts in parallel. Update experiment log:
- `confidence >= 0.7` → `confirmed`
- `confidence 0.4–0.69` → `needs-evidence` (do not attack; add to next cycle)
- `confidence < 0.4` → `rejected`

Skip Phase 3 for any hypothesis already `confirmed` at `confidence >= 0.8` in
the belief model from a prior cycle.

### Phase 4: Prioritize

For each confirmed hypothesis compute:

```
EV = (delta_estimate × axis_weight × confidence) / (effort_tier × collision_multiplier)
```

- `effort_tier`: low=1.0, medium=1.5, high=2.5
- `collision_multiplier`: 2.0 if axis shares primary files with another attack in this cycle

Select top K axes where K = min(confirmed count, 4). Document selection rationale
in cycle digest. If `--axis` was set, skip ranking — attack only that axis.

### Phase 5: Fleet Attack

Dispatch one agent per selected axis in an isolated worktree
(Agent tool, `isolation: "worktree"`). Each agent receives:
- The confirmed hypothesis and its falsification criterion
- The specific files to modify
- Verification oracle: `node scripts/run-with-timeout.js 300 node scripts/test-all.js`

Each agent returns a structured result:
```json
{
  "axis": "...", "skill": "...",
  "delta": 1.2,
  "mechanism_confirmed": true,
  "files_changed": ["..."],
  "approach": "..."
}
```

**Merge rules:**
- Non-conflicting worktrees: merge all
- Conflicting worktrees (same file): keep higher delta, discard lower
- Regression on any previously passing programmatic check: abort that worktree, do not merge
- `mechanism_confirmed: false` (score improved but not via predicted mechanism): record as `incidental_improvement`, mark hypothesis as `needs-revision`

Commit each merged worktree with a message citing the hypothesis ID.

### Phase 6: Synthesize

For each result:
1. Update belief model — append evidence record for (axis, skill)
2. Update experiment log — mark `verified` / `refuted` / `incidental`
3. Identify transferable patterns:

```
PATTERN: {axis_class} | Mechanism: {what caused improvement} | Delta: {avg} across {N} instances | Applies to: {skill list} | Confidence: high/medium/low
```

Write patterns to `.planning/evolve/{target}/pattern-library.md`.

**Compile into wiki:** After writing to the pattern library, call
`/learn --from-evolve {target} --cycle {n}`. This compiles cycle discoveries
into `.planning/wiki/` — integrating with findings from prior cycles and
campaigns rather than siloing them in the evolve directory. Skip if `/learn`
is not available in this session (log the skip, do not block the cycle).

### Phase 7: Cross-Pollinate

For each `confidence: high` pattern, or any pattern confirmed in 2+ skills:
apply to all other applicable skills as targeted single-file edits — without
running a full attack cycle.

Run verification oracle per cross-pollinated skill. Commit only if all
programmatic checks pass and no axis drops > 0.3. Revert on regression; mark
pattern as `context-dependent`.

Write patterns that apply beyond this target to `.planning/research/patterns.md`.

### Phase 8: Loop or Halt

Compute learning velocity:
```
velocity = Σ(delta across all attacked axes this cycle) / axes_attacked
```
Append to `director-state.json` velocity history.

**Halt conditions (check in order):**
1. `--n` cycles completed
2. `--budget` reached (cumulative cost ≥ limit)
3. All axes ≥ 9.0 across all scored skills
4. `velocity < 0.2` for 3 consecutive cycles AND no `needs-evidence` hypotheses remain
5. Level-up triggered (see below)
6. User says stop

**On velocity drop, before halting:** attempt one axis-class switch — attack the
highest-EV axis from a category not touched in the last 2 cycles. If velocity
is still < 0.2 after that cycle, halt.

**On level-up trigger** (no axis improved > 0.5 for 2 loops, ≥ 3 loops run,
no programmatic failures): write level-up proposals to
`.planning/rubrics/{target}-proposals.md`, set `status: level-up-pending` in
director state, halt. The campaign resumes only after the human approves and
edits the live rubric.

**On normal loop:** increment cycle, compress prior cycle findings to
continuation context, return to Phase 1.

---

## Unlimited Mode

No `--n` and no `--budget` = unlimited. Declare before starting:

```
/evolve running in unlimited mode.
Target: {target} | Exit: all axes ≥ 9.0 OR velocity < 0.2 for 3 cycles
Estimated cost: $12–18/cycle | Spend so far: $0
To halt after current cycle: type /stop or press Escape.
```

Every cycle, report:
```
Cycle {n} complete. Spend: ${cycle} | Cumulative: ${total} | Velocity: {v}
```

When context approaches compression territory (session duration > 30 min or
/compact recommended): write continuation checkpoint to director state,
surface the `--continue` command. The next session picks up exactly where this
one stopped.

For overnight / unattended runs: combine with `/daemon`. The director is
daemon-compatible — daemon calls `/evolve {target} --continue` each session.
Set `--budget` to cap total spend.

---

## Fringe Cases

- **`.planning/` does not exist**: error — run `/do setup` first to initialize the harness state directory, then retry.
- **No rubric**: error — run `/improve {target}` Phase 0 first. List available targets in `.planning/rubrics/` as hint.
- **No prior scores in belief model**: proceed from cycle 1; all deltas empty on first survey. Expected.
- **All scouts return `needs-evidence`**: attack the top-EV axis anyway under `low-confidence` flag; record as exploratory. Mark result regardless.
- **Scout agent hangs or times out** (dispatched scout never returns): After 10 minutes without a response, log the scout as `status: timed-out` in the experiment log with `confidence: 0`. Proceed with the remaining returned scouts. Never let a hung scout block the cycle — if all scouts time out, treat as "all scouts return needs-evidence" and attack the top-EV axis under `low-confidence` flag.
- **All axes collide** (every axis shares files): serialize top 2 axes; parallelize remainder. Log collision.
- **Cross-pollination causes regression**: revert that skill, mark pattern `context-dependent`, do not propagate further.
- **Level-up mid-campaign**: pause, write proposals, set `level-up-pending`. `/evolve --continue` after human approval resumes cycle numbering from where it stopped.
- **Budget overrun risk**: if projected spend for current cycle would exceed `--budget` by > 20%, warn and confirm before dispatching fleet.
- **`--continue` with no director state**: error — no campaign to resume. Suggest `/evolve {target}` to start fresh.
- **Pattern library > 50 entries**: consolidate — group by axis class, merge similar patterns, keep highest-confidence instance of each class. Log consolidation.
- **Zero skills match target rubric**: error with message listing all `.planning/rubrics/*.md` targets.

---

## Quality Gates

- Every hypothesis must have an explicit falsification criterion before Phase 3
- Scouts must run before fleet dispatch on any unconfirmed hypothesis
- Belief model written after every phase, not only end of cycle
- Cross-pollination requires passing verification oracle before commit
- Regression on any previously-passing axis aborts that worktree commit
- Pattern library and global patterns updated at every cycle end, even zero-improvement cycles
- Cycle digest written even on abort or no-change cycles

---

## Contextual Gates

**Disclosure:**
- State mode (unlimited / fixed / budget) and exit conditions before first cycle
- Estimate $12–18 per full cycle before starting
- Report spend and velocity at end of every cycle
- Confirm before continuing if cumulative spend exceeds $50

**Reversibility:** Red. Cross-pollination modifies many files across the repo;
level-up rewrites rubric anchors permanently. Each commit is individually
revertable; high volume. Range: `git revert {first}^..{last}`.

**Trust gates:**
- Novice (0-4 sessions): `--status` and `--n=1` only; unlimited blocked
- Familiar (5-19): up to `--n=5`; unlimited requires explicit `--budget` cap
- Trusted (20+): no cap; confirm if projected total > $100

---

## Exit Protocol

```
---HANDOFF---
- Target: {target} | Cycles: {n} | Spend: ${total} | Mode: {unlimited/n/budget}
- Axes improved: {list with deltas}
- Belief model: .planning/evolve/{target}/belief-model.jsonl ({N} confirmed, {M} rejected)
- Pattern library: .planning/evolve/{target}/pattern-library.md ({N} patterns)
- Global patterns: .planning/research/patterns.md
- Knowledge wiki: .planning/wiki/index.md (compiled via /learn --from-evolve after each cycle)
- Cycle digests: .planning/evolve/{target}/cycle-*-digest.md
- Halt reason: {ceiling/velocity/budget/n-complete/user-stop/level-up-pending}
- Level-up proposals: {path or N/A}
- Reversibility: red — {N} commits across {M} files; revert range: git revert {range}
- Recommended next: {level-up and re-run / new target / done}
---
```