aeon-skill-evals
$
npx mdskill add BankrBot/skills/aeon-skill-evalsValidate skill outputs against manifests to catch regressions.
- Detects output failures by comparing against predefined assertion rules.
- Generates starter manifests from recent successful runs automatically.
- Reports regression status using NEW_FAIL, NEW_PASS, or CHRONIC tags.
- Enforces quality gates like word counts, required sections, and citations.
SKILL.md
.github/skills/aeon-skill-evalsView on GitHub ↗
--- name: aeon-skill-evals description: | Validate the output of any installed skill against an assertion manifest — word counts, required patterns, forbidden phrases, required sections, source citation. Detects regressions by diffing vs prior runs (NEW_FAIL / NEW_PASS / CHRONIC / STABLE_FAIL). Bootstrap mode generates a starter manifest from a skill's recent successful runs so manifests aren't written speculatively. Triggers: "evaluate this skill's output", "check skill X for regressions", "bootstrap evals for Y", "did this skill output pass quality gates". --- # aeon-skill-evals Quality net for installed skills. Each skill can declare an assertion manifest; outputs are checked against it; failing assertions surface regressions and route concrete fixes. ## Manifest format ```yaml token-movers: min_words: 200 required_patterns: ["Top movers", "24h"] forbidden_patterns: ["I cannot", "as an AI"] must_cite_source: true min_distinct_items: 5 narrative-tracker: min_words: 400 required_sections: ["TRANSITIONS", "POSITIONS", "MAP"] forbidden_patterns: ["exciting", "consider"] must_have_position_call: true ``` Supported assertions: `min_words` / `max_words`, `required_patterns` / `forbidden_patterns`, `required_sections`, `must_cite_source`, `min_distinct_items`, `output_pattern` (regex), and per-skill-family custom binary checks. ## Operations - `eval` — run every manifest-defined skill against its latest output. - `eval --skill=NAME` — one skill. - `bootstrap --skill=NAME` — generate a starter manifest from recent successful runs. ## Regression states | State | Action | |---|---| | `NEW_FAIL` | Passing last run, failing now. Severity scales with pass streak. | | `NEW_PASS` | Failing last run, passing now. Log the win. | | `CHRONIC` | Failing > 3 consecutive runs. Recommend operator review. | | `STABLE_FAIL` | Always failing. Manifest assertion mismatch — flag for review. | State in local `evals-state.json`. ## Bootstrap mode Samples last 5 successful runs of a skill. Computes: - `min_words` at p25 of historical runs. - Required patterns from common section headers. - Forbidden patterns from default list (refusals, hedging filler). Emits the proposed manifest for review. Never auto-commits — assertions need a human signoff. ## Rules - Assertions are observations, not specifications. Bootstrap before writing speculatively. - Forbidden patterns catch hallucination markers and refusals. Keep the list tight; don't lint stylistic choices. - Chronic failures get a recommendation, not a re-file. - Manifest changes are reviewed; never auto-edited by this skill.
More from BankrBot/skills