scenario

$npx mdskill add boshu2/agentops/scenario

Manage holdout scenarios for behavioral validation

  • Isolates scenarios to test system behavior against unseen inputs
  • Uses property-based testing, snapshot testing, and LLM evaluation frameworks
  • Generates measurable acceptance vectors and satisfaction scoring
  • Stores artifacts in .agents/holdout/ for evaluators and humans
SKILL.md
.github/skills/scenarioView on GitHub ↗
---
name: scenario
description: Manage holdout scenarios.
practices:
- property-based-testing
- snapshot-testing
- llm-eval-harness
hexagonal_role: supporting
consumes: []
produces:
- result.json
context_rel: []
skill_api_version: 1
metadata:
  tier: execution
  stability: experimental
output_contract: Scenario artifacts in .agents/holdout/*.json
---
# Scenario Skill

Author and manage holdout scenarios for behavioral validation. Scenarios
define **what** the system should do in narrative form, with measurable
acceptance vectors and satisfaction scoring. They live in `.agents/holdout/`
so implementing agents cannot see them during development.

## Quick Start

```bash
# Initialize holdout directory
/scenario init

# Add a scenario from a description
/scenario add "user can authenticate with valid credentials"

# List all active scenarios
/scenario list

# Validate scenarios against the schema
/scenario validate
```

## Execution Steps

### Step 1: Initialize Holdout Directory

```bash
ao scenario init
```

Creates `.agents/holdout/` with a `README.md` explaining holdout isolation
rules. If the directory already exists, this is a no-op.

The README makes clear:
- Implementing agents MUST NOT read `.agents/holdout/`
- Only evaluator agents and humans should author scenarios
- Hook enforcement prevents implementing agents from accessing holdout files

### Step 2: Author Scenarios

Provide a narrative description and the skill generates a schema-compliant
JSON scenario file.

```bash
ao scenario add "user can authenticate with valid credentials"
```

The skill will:
1. Generate an ID (`s-YYYY-MM-DD-NNN`)
2. Prompt for or infer the narrative, expected outcome, and acceptance vectors
3. Set default satisfaction threshold (0.8)
4. Write to `.agents/holdout/s-YYYY-MM-DD-NNN.json`

You can also author scenarios manually by writing JSON that conforms to
`schemas/scenario.v1.schema.json`. See [Scenario Schema Reference](references/scenario-schema.md).

### Step 3: Validate Scenarios

```bash
ao scenario validate
```

Validates every `.json` file in `.agents/holdout/` against
`schemas/scenario.v1.schema.json`. Reports:
- Schema violations (missing fields, wrong types)
- Duplicate IDs
- Stale scenarios (status = "active" but date > 90 days old)
- Acceptance vectors with no check command

### Step 4: List Scenarios

```bash
ao scenario list
```

Displays all scenarios with:
- ID, goal, status, source, date
- Satisfaction threshold
- Count of acceptance vectors

Filter options:
```bash
ao scenario list --status active
ao scenario list --status draft
ao scenario list --status retired
```

### Linking scenarios to GOALS.md directives

A scenario linked to a GOALS.md directive becomes part of the executable
spec. `ao goals scenarios --create "<goal>" --directive N` scaffolds a
promoted spec scenario and links it bidirectionally; `ao goals scenarios`
lists each directive's linked scenarios; `ao goals scenarios --lint` checks
the link graph. Ad hoc holdout scenarios authored with `ao scenario add`
stay unlinked until promoted. See the `/goals` skill and `docs/adr/ADR-0003`.

Once linked, a scenario's pass/fail feeds the directive's fitness:
`ao goals measure` rolls linked scenario results into a per-directive
`scenario_satisfaction` ratio (RED below threshold), and `ao goals trace`
renders the directive → scenario → bead → verdict → learning lineage and
audits it for orphans. See the `/goals` skill for both surfaces.

### Step 5: Integration with Validation

Scenarios are consumed by **STEP 1.8** in the `/validation` skill. During
validation, the evaluator agent:
1. Loads all active scenarios from `.agents/holdout/`
2. Runs each acceptance vector's check command
3. Computes a satisfaction score per scenario (0.0-1.0)
4. Aggregates into an overall holdout score
5. Fails the validation gate if any scenario falls below its threshold

## Key Rules

### Holdout Isolation

Scenarios are **holdout data**. The implementing agent must never see them.
This prevents the agent from overfitting to specific test cases instead of
building correct general behavior.

- Scenarios live in `.agents/holdout/`, which is outside the codebase
- A hook enforces that implementing agents cannot read holdout files
- Only evaluator agents, humans, or the `/validation` skill access scenarios

### Satisfaction Scoring

Scenarios use continuous satisfaction scoring (0.0-1.0), not boolean
pass/fail. This enables:
- Partial credit for incomplete implementations
- Trend tracking across iterations
- Threshold tuning per scenario based on criticality

Each acceptance vector produces a score, and the scenario's overall score
is the weighted average across all vectors.

### Authorship Rules

- Scenarios should be written by **humans** or by **evaluator agents**
- The implementing agent MUST NOT author its own scenarios
- The `source` field tracks provenance: `human`, `agent`, or `prod-telemetry`
- When an evaluator agent writes scenarios, it should operate in a separate
  session with no access to implementation details

### Scenario Lifecycle

| Status | Meaning |
|--------|---------|
| `active` | Scenario is evaluated during validation |
| `retired` | Scenario passed consistently; kept for reference |
| `blocked` | Scenario cannot be evaluated (missing dependency) |
| `draft` | Scenario is incomplete; not yet evaluated |

## Reference Documents

- [Scenario Schema Reference](references/scenario-schema.md) -- full field
  documentation and example JSON for the scenario schema
- [references/scenario.feature](references/scenario.feature) — Executable spec: author holdout scenarios, schema-validate, list + link to GOALS directives, feed /validation (soc-qk4b)

## Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| `validate` reports missing fields | Schema version mismatch | Check `version` field matches schema expectation |
| Scenario not picked up by validation | Status is not `active` | Set `"status": "active"` in the JSON |
| Implementing agent read holdout | Hook not installed | Run `ao scenario init` to verify hook setup |
| Duplicate ID error | Two scenarios share an ID | Rename one using `s-YYYY-MM-DD-NNN` format |
| Stale scenario warning | Active scenario older than 90 days | Review and retire or refresh the scenario |
| Score always 0.0 | Check command returns non-zero | Debug the check command independently |

## See Also

- `/validation` -- consumes scenarios at STEP 1.8 for holdout evaluation
- `/council` -- multi-model review can generate scenario suggestions
- `/vibe` -- code quality validation (complementary to behavioral scenarios)
More from boshu2/agentops