skill-eval
$ npx mdskill add notque/vexjoy-agent/skill-eval
Evaluate and improve skills using testing, benchmarks, and comparisons
- Solves the problem of measuring skill quality and effectiveness
- Uses Read, Write, Bash, Grep, Glob, and Agent tools for evaluation
- Triggers actions based on commands like 'test skill' or 'bake-off'
- Delivers structured results through validation reports and comparative analysis
SKILL.md
.github/skills/skill-eval
---
name: skill-eval
description: "Evaluate skills: trigger testing, A/B benchmarks, structure validation, head-to-head bake-offs."
user-invocable: false
argument-hint: "<skill-name>"
allowed-tools:
- Read
- Write
- Bash
- Grep
- Glob
- Agent
routing:
  triggers:
    - improve skill
    - test skill
    - eval skill
    - benchmark skill
    - skill triggers
    - skill quality
    - self-improve skill
    - skill self-improvement
    - improve skill with variants
    - bake-off
    - bake off
    - head-to-head
    - head to head
    - compare implementations
    - grade two versions
    - which skill is better
  pairs_with:
    - agent-evaluation
    - verification-before-completion
  complexity: Medium-Complex
  category: meta
---
# Skill Evaluation & Improvement
Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption. Also covers head-to-head bake-offs of two peer implementations of the same artifact (Mode F).
## Reference Loading Table
| Signal | Load These Files | Why |
|---|---|---|
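| reading or writing `evals.json`, `grading.json`, or `benchmark.json` | `schemas.md` | Loads the JSON schemas for the eval set, grading results, and benchmark output. |
| "self-improve skill", "optimize skill", "improve skill with A/B" | `self-improve-loop.md` | Loads the self-improvement loop protocol: variant generation, blind A/B testing, promotion criteria. |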
| "bake-off", "head-to-head", "compare implementations", "grade two versions", "which Feynman skill is better" | `bake-off-methodology.md` | Loads the bake-off rubric, anti-rationalization gate, fold-filter, and worked Feynman example. |
## Instructions
### Phase 1: ASSESS — Determine what to evaluate
**Step 1: Identify the skill**
```bash
# Validate skill structure first
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
```
This checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
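The checks listed above can be approximated in a few lines. This is a minimal sketch, not the actual `quick_validate.py`: the function name `check_frontmatter` is hypothetical, the frontmatter is parsed with a naive `key: value` scan rather than a YAML library, and it assumes the angle-bracket rule applies to the description field.

```python
import re

# Lowercase words separated by single hyphens, e.g. "skill-eval".
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_frontmatter(skill_md: str) -> list[str]:
    """Return a list of problems found in a SKILL.md string (empty means OK)."""
    lines = [line.rstrip() for line in skill_md.splitlines()]
    if not lines or lines[0] != "---":
        return ["missing frontmatter: file must start with '---'"]
    try:
        end = lines.index("---", 1)
    except ValueError:
        return ["missing frontmatter: no closing '---'"]

    # Naive scan of top-level "key: value" pairs (assumption: no YAML library).
    fields = {}
    for line in lines[1:end]:
        if line[:1] in (" ", "\t", "-", "#") or ":" not in line:
            continue  # skip nested keys, list items, comments, blank lines
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip("\"'")

    problems = []
    name = fields.get("name", "")
    description = fields.get("description", "")
    if not name:
        problems.append("missing required field: name")
    elif not KEBAB_CASE.match(name):
        problems.append(f"name is not kebab-case: {name!r}")
    if not description:
        problems.append("missing required field: description")
    else:
        if len(description) >= 1024:
            problems.append("description must be under 1024 chars")
        if "<" in description or ">" in description:
            problems.append("description must not contain angle brackets")
    return problems
```

An empty list means the structural checks passed; anything else is a human-readable reason to fix before running the slower evaluations.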
**Step 2: Choose evaluation mode based on user intent**
| Intent | Mode | Script |
|--------|------|--------|
| "Test if description triggers correctly" | Trigger eval | `run_eval.py` |
| "Optimize/improve the description through autoresearch" | Route to `agent-comparison` | `optimize_loop.py` |
| "Compare skill vs no-skill output" | Output benchmark | Manual + `aggregate_benchmark.py` |
| "Validate skill structure" | Quick validate | `quick_validate.py` |
| "Self-improve skill" / "optimize skill" / "improve skill with A/B" | Self-improvement loop | `references/self-improve-loop.md` |
| "Bake-off" / "head-to-head grade these two" / "compare X vs Y implementation" | Head-to-head bake-off | `references/bake-off-methodology.md` |
**GATE**: Skill path confirmed, mode selected.
### Phase 2: EVALUATE — Run the appropriate evaluation
#### Mode A: Trigger Evaluation
Test whether a skill's description causes Claude to invoke it for the right queries.
**Step 1: Create eval set** (or use existing)
Create a JSON file with 8-20 test queries. **Eval set quality matters** — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills.
Example of good eval queries:
```json
[
{"query": "ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage", "should_trigger": true},
{"query": "Format this data", "should_trigger": false}
]
```
**Why**: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior. Overfitting descriptions to abstract test cases bloats the description and fails on real usage.
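Before spending model calls on an eval set, it may help to sanity-check the file. The sketch below is illustrative only: `check_eval_set` is a hypothetical helper, the 8-20 bound comes from the guidance above, and the requirement that both `should_trigger` values appear is an assumption rather than a documented rule.

```python
import json


def check_eval_set(raw: str, min_queries: int = 8, max_queries: int = 20) -> list[str]:
    """Return problems found in an eval-set JSON string (empty means OK)."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        return ["eval set must be a JSON array"]

    problems = []
    if not min_queries <= len(entries) <= max_queries:
        problems.append(
            f"expected {min_queries}-{max_queries} queries, found {len(entries)}"
        )

    labels = set()
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be an object")
            continue
        query = entry.get("query")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"entry {i}: 'query' must be a non-empty string")
        label = entry.get("should_trigger")
        if not isinstance(label, bool):
            problems.append(f"entry {i}: 'should_trigger' must be true or false")
        else:
            labels.add(label)

    # Assumption: a useful eval set has both positive and negative cases.
    if entries and labels != {True, False}:
        problems.append("eval set should contain both triggering and non-triggering queries")
    return problems
```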
**Step 2: Run evaluation**
```bash
python3 -m scripts.skill_eval.run_eval \
--eval-set evals.json \
--skill-path <path/to/skill> \
--runs-per-query 3 \
--verbose
```
This spawns `claude -p` for each query, checking whether it invokes the skill. Runs each query 3 times for reliability. Output includes pass/fail per query with trigger rates. Default 30s timeout; increase with `--timeout 60` if needed for complex queries.
**Constraints applied**:
- Always run baseline eval before making improvements
- 3 runs per query smooths out run-to-run variation in whether the skill triggers
- Verbose output shows per-query pass/fail during eval runs
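To make "pass/fail per query with trigger rates" concrete, here is one way the per-query scoring could work. It is a sketch under stated assumptions: `QueryResult` and `summarize` are hypothetical names, and the 0.5 threshold is an illustrative choice, not a value confirmed by `run_eval.py`.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    query: str
    should_trigger: bool
    triggered_runs: list[bool]  # one entry per run, True if the skill was invoked

    @property
    def trigger_rate(self) -> float:
        return sum(self.triggered_runs) / len(self.triggered_runs)

    def passed(self, threshold: float = 0.5) -> bool:
        # Assumption: the skill "fired" if it triggered in at least
        # `threshold` of the runs; a query passes when that matches the label.
        fired = self.trigger_rate >= threshold
        return fired == self.should_trigger


def summarize(results: list[QueryResult], threshold: float = 0.5) -> dict:
    passed = sum(r.passed(threshold) for r in results)
    return {
        "passed": passed,
        "failed": len(results) - passed,
        "pass_rate": passed / len(results),
    }
```

With 3 runs per query, a single stray trigger on a negative query (rate 1/3) still passes under this threshold, while two stray triggers do not.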
**GATE**: Eval results available. Proceed to improvement if failures found.
#### Mode B: Description Optimization
Automated loop that tests, improves, and re-tests descriptions using Claude with extended thinking.
```bash
python3 -m scripts.skill_eval.run_loop \
--eval-set evals.json \
--skill-path <path/to/skill> \
--max-iterations 5 \
--verbose
```
This will:
1. Split the eval set 60/40 into train/test (stratified by `should_trigger`) so a held-out test set can expose overfitting to the training queries
2. Evaluate current description on all queries (3 runs each for reliability)
3. Use `claude -p` to propose improvements based on training failures
4. Re-evaluate the new description
5. Repeat until all pass or max iterations reached
6. Select best description by **test** score (not train score — prevents overfitting)
7. Open an HTML report in the browser
**Why 60/40 split**: Improvements should generalize to prompts the optimizer never saw, not just the queries it was tuned on. Proposing changes from training failures and selecting on the held-out test set guards against overfitting.
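The split and the final selection step can be sketched as follows. This is an illustration of the idea, not the code in `run_loop.py`: the function names, the fixed seed, and the candidate dictionary keys (`train_score`, `test_score`) are all assumptions.

```python
import random


def stratified_split(entries: list[dict], train_fraction: float = 0.6, seed: int = 0):
    """Split entries into (train, test), preserving the should_trigger ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [e for e in entries if e["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def select_best(candidates: list[dict]) -> dict:
    """Pick the candidate description with the highest held-out test score."""
    # Train score is deliberately ignored here: it was used to propose the
    # candidate, so it is an optimistic estimate of real-world behavior.
    return max(candidates, key=lambda c: c["test_score"])
```

Stratifying per label keeps both triggering and non-triggering queries in each half, so the test score is not dominated by one class.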
**Why an HTML report**: A visual report makes it quick to see which queries improved, which regressed, and what the new description looks like.
**GATE**: Loop complete. Best description identified.
#### Mode C: Output Benchmark
Compare skill quality by running prompts with and without the skill.
**Step 1: Create test prompts** — 2-3 realistic user prompts
**Step 2: Run with-skill and without-skill** in parallel subagents:
For each test prompt, spawn two agents:
- **With skill**: Load the skill, run the prompt, save outputs
- **Without skill** (baseline): Same prompt, no skill, save outputs
**Why baseline matters**: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill. The delta is what matters.
**Step 3: Grade outputs**
Spawn a grader subagent using `agents/grader.md`. It evaluates assertions against the outputs.
**Step 4: Aggregate**
```bash
python3 -m scripts.skill_eval.aggregate_benchmark <workspace>/iteration-1 --skill-name <name>
```
Produces `benchmark.json` and `benchmark.md` with pass rates, timing, and token usage.
**Step 5: Analyze** (optional)
For blind comparison, use `agents/comparator.md` to judge outputs without knowing which skill produced them. Then use `agents/analyzer.md` to understand why the winner won.
**GATE**: Benchmark results available.
#### Mode D: Quick Validate
```bash
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
```
Checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
#### Mode E: Self-Improvement Loop
Automatically generate variants of a skill, A/B test them against the original, and promote winners. This is a closed-loop pipeline — baseline, hypothesize, generate, test, promote.
Read the full protocol: `${CLAUDE_SKILL_DIR}/references/self-improve-loop.md`
The loop runs 5 phases:
1. BASELINE: establish metrics with 3+ test cases
2. HYPOTHESIZE: 2-3 single-variable changes
3. GENERATE VARIANTS: minimal diffs
4. BLIND A/B TEST: paired comparison via `agents/comparator.md`
5. PROMOTE OR KEEP: 60%+ win rate required, no regressions
All outcomes, wins and losses alike, are recorded to the learning DB to prevent re-testing failed hypotheses.
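The PROMOTE OR KEEP rule can be expressed as a small predicate. This is a hedged sketch: `should_promote` is a hypothetical name, and counting ties against the variant (win rate taken over all comparisons) is an assumption; the reference file is the authority on how ties are actually handled.

```python
def should_promote(outcomes: list[str], regressions: int, min_win_rate: float = 0.6) -> bool:
    """Decide whether a variant replaces the original.

    outcomes: one entry per blind comparison, each "variant", "original", or "tie".
    regressions: number of test cases where the variant broke something
                 the original handled correctly.
    """
    if not outcomes:
        return False  # no evidence, keep the original
    wins = outcomes.count("variant")
    # Assumption: ties count against the variant, since the denominator
    # is every comparison that was run.
    win_rate = wins / len(outcomes)
    return regressions == 0 and win_rate >= min_win_rate
```

For example, three wins out of five comparisons meets the 60% bar, but a single regression still blocks promotion.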
**GATE**: Self-improvement protocol loaded from reference. Proceed through the 5 phases.
#### Mode F: Head-to-Head Bake-Off
Score two peer implementations of the same artifact (e.g., toolkit `voice-feynman` vs an external Feynman voice profile) on a numeric rubric and declare a decisive winner. Use when the user says "bake-off", "head-to-head", "compare implementations", "grade these two", or "which X is better".
Read the full protocol: `${CLAUDE_SKILL_DIR}/references/bake-off-methodology.md`
The protocol runs 5 phases:
1. PREPARE: read both artifacts in full, pick a verifier that built neither side
2. RUBRIC: define 5–12 criteria scored 0–10, pre-state the loser-of-each-criterion before reading evidence
3. GRADE: every score cites a path/line range or quote; build the matrix; apply the anti-rationalization gate
4. FOLD: filter loser-wins through `docs/PHILOSOPHY.md` before recommending any folds into the winner
5. REPORT: output to `tmp/<topic>-bakeoff-report.md`, gitignored
The Feynman bake-off (toolkit 86 vs external 74 across 11 criteria, 12-point margin) is the canonical worked example carried in the reference.
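A score matrix of this kind could be tallied as below. The sketch is illustrative only: the `Score` dataclass and `tally` function are hypothetical, the rubric bounds (5-12 criteria, 0-10 scores, mandatory evidence) are taken from the phase summary above, and it does not reproduce the Feynman scores.

```python
from dataclasses import dataclass


@dataclass
class Score:
    criterion: str
    side_a: int
    side_b: int
    evidence_a: str  # path/line range or quote backing side A's score
    evidence_b: str  # path/line range or quote backing side B's score


def tally(scores: list[Score]) -> dict:
    """Validate the rubric matrix and compute totals, margin, and fold candidates."""
    if not 5 <= len(scores) <= 12:
        raise ValueError("rubric needs 5-12 criteria")
    for s in scores:
        if not (0 <= s.side_a <= 10 and 0 <= s.side_b <= 10):
            raise ValueError(f"{s.criterion}: scores must be 0-10")
        if not s.evidence_a.strip() or not s.evidence_b.strip():
            raise ValueError(f"{s.criterion}: every score must cite evidence")

    total_a = sum(s.side_a for s in scores)
    total_b = sum(s.side_b for s in scores)
    if total_a == total_b:
        winner, loser_wins = "tie", []
    elif total_a > total_b:
        winner = "a"
        loser_wins = [s.criterion for s in scores if s.side_b > s.side_a]
    else:
        winner = "b"
        loser_wins = [s.criterion for s in scores if s.side_a > s.side_b]

    return {
        "total_a": total_a,
        "total_b": total_b,
        "margin": abs(total_a - total_b),
        "winner": winner,
        # Criteria the overall loser won: candidates for the FOLD phase.
        "loser_wins": loser_wins,
    }
```

The `loser_wins` list is only a starting point; per the protocol, each candidate still has to pass the philosophy filter before it is folded into the winner.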
**GATE**: Bake-off protocol loaded from reference. Proceed through the 5 phases.
### Phase 3: IMPROVE — Apply results
**Step 1: Review results**
For trigger eval / description optimization:
- Show the best description vs original
- Show per-query results (which queries improved, which regressed)
- Show train vs test scores
For output benchmark:
- Show pass rate delta (with-skill vs without-skill)
- Show timing and token cost delta
- Highlight assertions that only pass with the skill (value-add)
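The pass-rate delta and the value-add assertions can be computed directly from graded results. This is a sketch, assuming each run's grading is available as a mapping from assertion name to pass/fail; `compare_runs` and that input shape are hypothetical and may not match what `aggregate_benchmark.py` emits.

```python
def compare_runs(with_skill: dict[str, bool], without_skill: dict[str, bool]) -> dict:
    """Compare graded assertions from a with-skill run against the baseline."""

    def pass_rate(results: dict[str, bool]) -> float:
        return sum(results.values()) / len(results)

    # Assertions that pass only when the skill is loaded: the skill's value-add.
    value_add = sorted(
        name
        for name, ok in with_skill.items()
        if ok and not without_skill.get(name, False)
    )
    return {
        "with_skill": pass_rate(with_skill),
        "without_skill": pass_rate(without_skill),
        "delta": pass_rate(with_skill) - pass_rate(without_skill),
        "value_add": value_add,
    }
```

A delta near zero suggests the baseline already handles the task, which is the case the benchmark exists to detect.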
**Step 2: Apply changes** (with user confirmation)
If description optimization found a better description:
1. Show before/after with scores
2. Ask user to confirm
3. Update the skill's SKILL.md frontmatter
4. Re-run quick_validate to confirm the update is valid
**Constraint**: Always show results before/after with metrics. This enables informed decisions.
**GATE**: Changes applied and validated, or user chose to keep original.
---
## Error Handling
### Error: "No SKILL.md found"
**Cause**: Skill path doesn't point to a valid skill directory
**Solution**: Verify path contains a `SKILL.md` file. Skills must follow the `skill-name/SKILL.md` structure.
### Error: "claude: command not found"
**Cause**: Claude CLI not available for trigger evaluation
**Solution**: Install Claude Code CLI. Trigger eval requires `claude -p` to test skill invocation.
### Error: "legacy SDK dependency"
**Cause**: Outdated instructions or an old checkout still expects a direct SDK client
**Solution**: Update to the current scripts. Description optimization now runs through `claude -p`.
### Error: "CLAUDECODE environment variable"
**Cause**: Running eval from inside a Claude Code session blocks nested instances
**Solution**: The scripts automatically strip the `CLAUDECODE` env var. If issues persist, run from a separate terminal.
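For readers wiring up their own harness, stripping the variable for a child process might look like the following. This is a minimal sketch, not the scripts' actual code: `run_without_claudecode` is a hypothetical helper, and the 30-second default simply mirrors the timeout mentioned in this document.

```python
import os
import subprocess
import sys


def run_without_claudecode(cmd: list[str], timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run cmd with CLAUDECODE removed from the child's environment only."""
    # Copy the parent environment minus CLAUDECODE; os.environ is not modified.
    env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
    # Raises subprocess.TimeoutExpired if the child exceeds `timeout` seconds.
    return subprocess.run(cmd, env=env, capture_output=True, text=True, timeout=timeout)


if __name__ == "__main__":
    # Demo with a Python child instead of the real CLI, so it runs anywhere.
    # A real harness would pass something like ["claude", "-p", query].
    probe = [sys.executable, "-c", "import os; print('CLAUDECODE' in os.environ)"]
    print(run_without_claudecode(probe).stdout.strip())
```

Building a filtered copy rather than deleting the key from `os.environ` keeps the parent session untouched, which matters when the harness itself runs inside Claude Code.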
### Error: "All queries timeout"
**Cause**: Default 30s timeout too short for complex queries
**Solution**: Increase with `--timeout 60`. Simple trigger queries should complete in <15s.
---
## References
### Scripts (in `scripts/skill_eval/`)
- `run_eval.py` — Trigger evaluation: tests description against query set
- `run_loop.py` — Eval+improve loop: automated description optimization
- `improve_description.py` — Single-shot description improvement via `claude -p`
- `generate_report.py` — HTML report from loop output
- `aggregate_benchmark.py` — Benchmark aggregation from grading results
- `quick_validate.py` — Structural validation of SKILL.md
### Bundled Agents (in `skills/meta/skill-eval/agents/`)
- `grader.md` — Evaluates assertions against execution outputs
- `comparator.md` — Blind A/B comparison of two outputs
- `analyzer.md` — Post-hoc analysis of why one version beat another
### Reference Files
- `${CLAUDE_SKILL_DIR}/references/schemas.md` — JSON schemas for evals.json, grading.json, benchmark.json
- `${CLAUDE_SKILL_DIR}/references/self-improve-loop.md` — Self-improvement loop protocol: variant generation, blind A/B testing, promotion criteria
- `${CLAUDE_SKILL_DIR}/references/bake-off-methodology.md` — Head-to-head bake-off protocol: rubric construction, anti-rationalization gate, philosophy-filtered fold-list, Feynman worked example