aeon-skill-evals

Name: aeon-skill-evals
Author: BankrBot/skills

$npx mdskill add BankrBot/skills/aeon-skill-evals

Validate skill outputs against manifests to catch regressions.

Detects output failures by comparing against predefined assertion rules.
Generates starter manifests from recent successful runs automatically.
Reports regression status using NEW_FAIL, NEW_PASS, or CHRONIC tags.
Enforces quality gates like word counts, required sections, and citations.

SKILL.md

.github/skills/aeon-skill-evalsView on GitHub ↗

---
name: aeon-skill-evals
description: |
  Validate the output of any installed skill against an assertion manifest — word counts, required
  patterns, forbidden phrases, required sections, source citation. Detects regressions by diffing
  vs prior runs (NEW_FAIL / NEW_PASS / CHRONIC / STABLE_FAIL). Bootstrap mode generates a starter
  manifest from a skill's recent successful runs so manifests aren't written speculatively.
  Triggers: "evaluate this skill's output", "check skill X for regressions", "bootstrap evals
  for Y", "did this skill output pass quality gates".
---

# aeon-skill-evals

Quality net for installed skills. Each skill can declare an assertion manifest; outputs are checked against it; failing assertions surface regressions and route concrete fixes.

## Manifest format

```yaml
token-movers:
  min_words: 200
  required_patterns: ["Top movers", "24h"]
  forbidden_patterns: ["I cannot", "as an AI"]
  must_cite_source: true
  min_distinct_items: 5

narrative-tracker:
  min_words: 400
  required_sections: ["TRANSITIONS", "POSITIONS", "MAP"]
  forbidden_patterns: ["exciting", "consider"]
  must_have_position_call: true
```

Supported assertions: `min_words` / `max_words`, `required_patterns` / `forbidden_patterns`, `required_sections`, `must_cite_source`, `min_distinct_items`, `output_pattern` (regex), and per-skill-family custom binary checks.

## Operations

- `eval` — run every manifest-defined skill against its latest output.
- `eval --skill=NAME` — one skill.
- `bootstrap --skill=NAME` — generate a starter manifest from recent successful runs.

## Regression states

| State | Action |
|---|---|
| `NEW_FAIL` | Passing last run, failing now. Severity scales with pass streak. |
| `NEW_PASS` | Failing last run, passing now. Log the win. |
| `CHRONIC` | Failing > 3 consecutive runs. Recommend operator review. |
| `STABLE_FAIL` | Always failing. Manifest assertion mismatch — flag for review. |

State in local `evals-state.json`.

## Bootstrap mode

Samples last 5 successful runs of a skill. Computes:

- `min_words` at p25 of historical runs.
- Required patterns from common section headers.
- Forbidden patterns from default list (refusals, hedging filler).

Emits the proposed manifest for review. Never auto-commits — assertions need a human signoff.

## Rules

- Assertions are observations, not specifications. Bootstrap before writing speculatively.
- Forbidden patterns catch hallucination markers and refusals. Keep the list tight; don't lint stylistic choices.
- Chronic failures get a recommendation, not a re-file.
- Manifest changes are reviewed; never auto-edited by this skill.