ai-reliability-eval

$npx mdskill add arcasilesgroup/ai-engineering/ai-reliability-eval

Define pass/fail criteria to measure AI reliability over time.

  • Helps developers track regression and define success criteria before coding.
  • Depends on pass@k metrics to quantify task completion consistency.
  • Executes capability checks against pre-agreed pass/fail thresholds.
  • Reports status changes and identifies when reliability degrades.
SKILL.md
.github/skills/ai-reliability-evalView on GitHub ↗
---
name: ai-reliability-eval
description: "Measures AI system reliability over time by defining pass/fail criteria before implementation, running capability checks, and tracking regression via pass@k metrics. Trigger for 'how reliable is this', 'did my changes break anything', 'measure AI performance', 'define success criteria', 'eval this feature', 'check skill regression'. Not for code correctness; use /ai-test instead. Not for quality gates; use /ai-verify instead — evals measure AI task completion consistency."
effort: mid
argument-hint: "define|check|report|regression|--skill-set [feature]"
mode: agent
tags: [quality, evals, improvement]
model_tier: sonnet
mirror_family: copilot-skills
generated_by: ai-eng sync
canonical_source: .claude/skills/ai-reliability-eval/SKILL.md
edit_policy: generated-do-not-edit
---



# Reliability Eval

## Purpose

Eval-Driven Development (EDD) treats evals as the unit tests of AI development. Define pass/fail criteria before writing code. Measure AI reliability with pass@k metrics. Track regressions across prompt, agent, and model changes. Evals answer the question: "Can the AI do this reliably?"

**Key distinction**: `ai-verify` checks current code quality (linting, coverage, security). `ai-reliability-eval` measures AI reliability over time (can the agent complete this task consistently?).

## When to Use

- `define`: defining pass/fail criteria before implementation (EDD principle)
- `check`: running current evals and reporting status mid-implementation
- `report`: generating full eval report after implementation
- `regression`: ensuring changes to prompts, agents, or models don't break existing capabilities
- `--skill-set`: skill-set mode — runs the optimizer over each skill's eval corpus under `.ai-engineering/evals/<skill>.jsonl` and gates pass@1 vs `.ai-engineering/evals/baseline.json`. Combine with `--regression` to fail on >5 pp pass@1 drop (sub-007 M6, D-127-07). Wired into `.github/workflows/skill-evals.yml` on PRs touching `.github/skills/**`.

## Process

### Mode: define (Before Coding)

1. Identify the capability being built or changed
2. Write capability evals (can the AI do this new thing?)
3. Write regression evals (do existing things still work?)
4. Set success metrics (pass@k targets)
5. Store eval definition at `.ai-engineering/evals/<feature-name>.md`

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

### Mode: check (During Implementation)

1. Read the eval definition from `.ai-engineering/evals/<feature-name>.md`
2. Run each capability eval, record PASS/FAIL
3. Run regression evals via existing test suites
4. Report current status with pass@k counts
5. Identify which evals still fail and why

### Mode: report (After Implementation)

Run all capability + regression evals (per Mode: check), calculate pass@k and pass^k metrics, then generate the structured report:

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user:     PASS (pass@1)
  validate-email:  PASS (pass@2)
  hash-password:   PASS (pass@1)
  Overall:         3/3 passed

Regression Evals:
  login-flow:      PASS
  session-mgmt:    PASS
  logout-flow:     PASS
  Overall:         3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

Store the rendered report at `.ai-engineering/evals/<feature-name>.log`.

### Mode: regression

Baseline is created automatically on the first `report` run. If `baseline.json` does not exist, the current run becomes the initial baseline.

1. Load baseline from `.ai-engineering/evals/baseline.json`
2. Run all regression evals against current state
3. Compare against baseline results
4. Flag any degradation

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

## Quick Reference

### Grader Types

| Grader | How It Works | When to Use | Example |
|--------|-------------|-------------|---------|
| Code | Deterministic checks (grep, test runners, build) | Verifiable outputs, structured results | `grep -q "export function handleAuth" src/auth.ts && echo "PASS"` |
| Model | Claude evaluates open-ended output (score 1-5) | Prose quality, code style, creative output | Prompt: "Does it solve the stated problem? Score 1-5" |
| Human | Flag for manual review with risk level | Security decisions, UX judgment, ambiguous cases | `[HUMAN REVIEW REQUIRED] Risk Level: HIGH` |

### Metrics

| Metric | Definition | Typical Target |
|--------|-----------|----------------|
| pass@1 | First attempt success rate | Varies by difficulty |
| pass@3 | At least one success in 3 attempts | > 90% |
| pass@k | At least one success in k attempts | Depends on criticality |
| pass^3 | All 3 trials succeed | 100% for critical paths |
| pass^k | All k trials succeed | Use for regression evals |

## Storage

```
.ai-engineering/
  evals/
    <feature-name>.md      # Eval definition
    <feature-name>.log     # Eval run history
    baseline.json           # Regression baselines
```

## Best Practices

1. **Use code graders when possible** -- deterministic > probabilistic
2. **Human review for security** -- never fully automate security checks
3. **Keep evals fast** -- slow evals don't get run
4. **Version evals with code** -- evals are first-class artifacts

## Examples

### Example 1 — define evals before implementing

User: "I'm about to add a new auth flow. Define evals first."

```
/ai-reliability-eval define auth-flow
```

Walks through capability evals (can-create-account, can-validate-email, can-hash-password) and regression evals (login-still-works), sets pass@3 targets, writes `.ai-engineering/evals/auth-flow.md`.

### Example 2 — regression check after model change

User: "I bumped the model version, did anything regress?"

```
/ai-reliability-eval regression
```

Loads `.ai-engineering/evals/baseline.json`, runs all regression evals against the new model, flags degradation.

## Integration

Called by: user directly, `/ai-build`, `/ai-verify` (regression mode). Calls: test runners (code graders), the model (model graders), stack-specific tools. See also: `/ai-test` (code correctness), `/ai-verify` (current quality gates).

## Common Mistakes

- Skipping the define phase and writing evals after implementation (same anti-pattern as tests-after)
- Using only model graders when code graders would be deterministic and faster
- Conflating evals with tests (tests verify code, evals verify AI capability)
- Setting pass@1 targets too high for genuinely hard tasks (use pass@3 instead)

$ARGUMENTS
More from arcasilesgroup/ai-engineering