ai-reliability-eval
$
npx mdskill add arcasilesgroup/ai-engineering/ai-reliability-evalDefine pass/fail criteria to measure AI reliability over time.
- Helps developers track regression and define success criteria before coding.
- Depends on pass@k metrics to quantify task completion consistency.
- Executes capability checks against pre-agreed pass/fail thresholds.
- Reports status changes and identifies when reliability degrades.
SKILL.md
.github/skills/ai-reliability-evalView on GitHub ↗
---
name: ai-reliability-eval
description: "Measures AI system reliability over time by defining pass/fail criteria before implementation, running capability checks, and tracking regression via pass@k metrics. Trigger for 'how reliable is this', 'did my changes break anything', 'measure AI performance', 'define success criteria', 'eval this feature', 'check skill regression'. Not for code correctness; use /ai-test instead. Not for quality gates; use /ai-verify instead — evals measure AI task completion consistency."
effort: mid
argument-hint: "define|check|report|regression|--skill-set [feature]"
mode: agent
tags: [quality, evals, improvement]
model_tier: sonnet
mirror_family: copilot-skills
generated_by: ai-eng sync
canonical_source: .claude/skills/ai-reliability-eval/SKILL.md
edit_policy: generated-do-not-edit
---
# Reliability Eval
## Purpose
Eval-Driven Development (EDD) treats evals as the unit tests of AI development. Define pass/fail criteria before writing code. Measure AI reliability with pass@k metrics. Track regressions across prompt, agent, and model changes. Evals answer the question: "Can the AI do this reliably?"
**Key distinction**: `ai-verify` checks current code quality (linting, coverage, security). `ai-reliability-eval` measures AI reliability over time (can the agent complete this task consistently?).
## When to Use
- `define`: defining pass/fail criteria before implementation (EDD principle)
- `check`: running current evals and reporting status mid-implementation
- `report`: generating full eval report after implementation
- `regression`: ensuring changes to prompts, agents, or models don't break existing capabilities
- `--skill-set`: skill-set mode — runs the optimizer over each skill's eval corpus under `.ai-engineering/evals/<skill>.jsonl` and gates pass@1 vs `.ai-engineering/evals/baseline.json`. Combine with `--regression` to fail on >5 pp pass@1 drop (sub-007 M6, D-127-07). Wired into `.github/workflows/skill-evals.yml` on PRs touching `.github/skills/**`.
## Process
### Mode: define (Before Coding)
1. Identify the capability being built or changed
2. Write capability evals (can the AI do this new thing?)
3. Write regression evals (do existing things still work?)
4. Set success metrics (pass@k targets)
5. Store eval definition at `.ai-engineering/evals/<feature-name>.md`
```markdown
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```
### Mode: check (During Implementation)
1. Read the eval definition from `.ai-engineering/evals/<feature-name>.md`
2. Run each capability eval, record PASS/FAIL
3. Run regression evals via existing test suites
4. Report current status with pass@k counts
5. Identify which evals still fail and why
### Mode: report (After Implementation)
Run all capability + regression evals (per Mode: check), calculate pass@k and pass^k metrics, then generate the structured report:
```markdown
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
```
Store the rendered report at `.ai-engineering/evals/<feature-name>.log`.
### Mode: regression
Baseline is created automatically on the first `report` run. If `baseline.json` does not exist, the current run becomes the initial baseline.
1. Load baseline from `.ai-engineering/evals/baseline.json`
2. Run all regression evals against current state
3. Compare against baseline results
4. Flag any degradation
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
## Quick Reference
### Grader Types
| Grader | How It Works | When to Use | Example |
|--------|-------------|-------------|---------|
| Code | Deterministic checks (grep, test runners, build) | Verifiable outputs, structured results | `grep -q "export function handleAuth" src/auth.ts && echo "PASS"` |
| Model | Claude evaluates open-ended output (score 1-5) | Prose quality, code style, creative output | Prompt: "Does it solve the stated problem? Score 1-5" |
| Human | Flag for manual review with risk level | Security decisions, UX judgment, ambiguous cases | `[HUMAN REVIEW REQUIRED] Risk Level: HIGH` |
### Metrics
| Metric | Definition | Typical Target |
|--------|-----------|----------------|
| pass@1 | First attempt success rate | Varies by difficulty |
| pass@3 | At least one success in 3 attempts | > 90% |
| pass@k | At least one success in k attempts | Depends on criticality |
| pass^3 | All 3 trials succeed | 100% for critical paths |
| pass^k | All k trials succeed | Use for regression evals |
## Storage
```
.ai-engineering/
evals/
<feature-name>.md # Eval definition
<feature-name>.log # Eval run history
baseline.json # Regression baselines
```
## Best Practices
1. **Use code graders when possible** -- deterministic > probabilistic
2. **Human review for security** -- never fully automate security checks
3. **Keep evals fast** -- slow evals don't get run
4. **Version evals with code** -- evals are first-class artifacts
## Examples
### Example 1 — define evals before implementing
User: "I'm about to add a new auth flow. Define evals first."
```
/ai-reliability-eval define auth-flow
```
Walks through capability evals (can-create-account, can-validate-email, can-hash-password) and regression evals (login-still-works), sets pass@3 targets, writes `.ai-engineering/evals/auth-flow.md`.
### Example 2 — regression check after model change
User: "I bumped the model version, did anything regress?"
```
/ai-reliability-eval regression
```
Loads `.ai-engineering/evals/baseline.json`, runs all regression evals against the new model, flags degradation.
## Integration
Called by: user directly, `/ai-build`, `/ai-verify` (regression mode). Calls: test runners (code graders), the model (model graders), stack-specific tools. See also: `/ai-test` (code correctness), `/ai-verify` (current quality gates).
## Common Mistakes
- Skipping the define phase and writing evals after implementation (same anti-pattern as tests-after)
- Using only model graders when code graders would be deterministic and faster
- Conflating evals with tests (tests verify code, evals verify AI capability)
- Setting pass@1 targets too high for genuinely hard tasks (use pass@3 instead)
$ARGUMENTS
More from arcasilesgroup/ai-engineering
- ai-adviseProactive governance advisor — checks standards, decisions, and quality trends during development. Always advisory, NEVER blocks. Three modes: `advise` (post-edit), `gate` (pre-dispatch), `drift` (on-demand decision audit). Trigger for 'governance check', 'advise on this change', 'check for drift', 'is this aligned with active decisions', 'shift-left advisory'. Not for blocking gates — use /ai-verify. Not for narrative code review — use /ai-review.
- ai-analyze-permissionsUse when Claude Code keeps asking to approve commands you have already approved, when settings.local.json has grown large, or when you want to consolidate permission grants into wildcard patterns. Trigger for 'too many permission prompts', 'clean up permissions', 'audit my settings', 'consolidate allow rules'. Claude Code only — not available in GitHub Copilot, Antigravity, or Codex.
- ai-animationDesigns motion, transitions, and micro-interactions for UI components: spring animations, gestures, easing, staggers — taste-driven detail compounding. Trigger for 'animate this', 'add transitions', 'micro-interactions for', 'gesture design', 'swipe to dismiss', 'easing for this', 'stagger the'. Not for design systems; use /ai-design instead. Not for visual art; use /ai-visual instead. Not for testing animation code; use /ai-test instead.
- ai-autopilotDelivers large multi-concern specs and backlog runs autonomously: decomposes specs into sub-specs (or normalizes work items into a backlog DAG), deep-plans with parallel agents, builds a dependency DAG, implements in waves, runs a single final quality loop with one bounded quality-remediation pass (verify+guard+review on full changeset), delivers via PR. Trigger for 'implement spec-NNN end to end', 'autopilot this', 'autonomous delivery', 'decompose and ship', 'run the backlog', 'execute these GitHub issues', 'process the sprint backlog'. Invocation is the approval gate. Not for small or single-concern tasks; use /ai-build instead. Not for ambiguous requirements; use /ai-brainstorm first.
- ai-boardOperates the project board (GitHub Projects v2 or Azure DevOps): discovers configuration after install (fields, state mappings, process templates) and syncs work-item state at lifecycle transitions. Trigger for 'set up the board', 'configure our ADO board', 'discover board fields', 'move this issue to in-review', 'update the board', 'mark as in progress', 'sync the work item state'. Two subcommands: `discover` (post-install configuration write) and `sync` (lifecycle state transitions). Auto-invoked via `sync` by /ai-brainstorm, /ai-build, and /ai-pr; fail-open. Not for backlog execution; use /ai-autopilot --backlog instead.
- ai-brainstormForces rigorous design interrogation BEFORE any code: explores approaches, surfaces ambiguity, gathers evidence, produces an approved spec that becomes the contract for /ai-plan. Trigger for 'lets add X', 'how should we handle Y', 'whats the best approach', 'I am thinking about', 'what should we build for'. Not for existing approved specs; use /ai-plan instead. Not for execution; use /ai-build instead.
- ai-branch-cleanupCleans branches safely: switches to the default branch, prunes merged and squash-merged branches, syncs to remote, sweeps stale specs, rotates `.ai-engineering/runtime/` per retention policy. Trigger for 'tidy up', 'tidy branches', 'sync to main', 'delete old branches', 'start fresh', 'rotate runtime'. Auto-invoked by /ai-pr after merge. Not for committing changes; use /ai-commit instead. Not for code-level dead-code removal; use /ai-simplify instead.
- ai-buildCanonical implementation gateway: reads approved plan.md, resolves stack from manifest, deterministic-routes each task to its adapter, dispatches the build agent in an isolated worktree, runs TDD self-validation per task, then a single final quality loop with one bounded quality-remediation pass on the full changeset before /ai-pr. Trigger for 'go', 'start building', 'execute the plan', 'implement it', 'lets do this', 'build the plan', 'resume', 'continue'. Not without an approved plan; run /ai-plan first. Not for multi-concern specs needing decomposition; use /ai-autopilot instead. Not for a single function or subcomponent; use /ai-code.
- ai-codeWrites production code that satisfies stack-context standards on the first pass: interface-first design, backward-compatibility checks, lightweight self-review. Trigger for 'implement this', 'write the code for', 'add X to Y', 'build this function', 'make this work'. Not for tests; use /ai-test instead. Not for debugging; use /ai-debug instead. Not for refactoring; use /ai-simplify instead. Not for executing an approved plan end-to-end; use /ai-build (the gateway).
- ai-commitRuns the governed commit pipeline: auto-branches from protected, stages selectively, formats and lints, scans for secrets, gates docs, composes a conventional message, pushes. Trigger for 'commit my changes', 'save my work', 'push this to remote', 'stage these files', 'ship it'. Not for opening a PR; use /ai-pr instead. Not for branch hygiene; use /ai-branch-cleanup instead.