scenario
$ npx mdskill add boshu2/agentops/scenario

Manage holdout scenarios for behavioral validation
- Isolates scenarios to test system behavior against unseen inputs
- Uses property-based testing, snapshot testing, and LLM evaluation frameworks
- Generates measurable acceptance vectors and satisfaction scoring
- Stores artifacts in .agents/holdout/ for evaluators and humans
SKILL.md
.github/skills/scenario
---
name: scenario
description: Manage holdout scenarios.
practices:
  - property-based-testing
  - snapshot-testing
  - llm-eval-harness
hexagonal_role: supporting
consumes: []
produces:
  - result.json
context_rel: []
skill_api_version: 1
metadata:
  tier: execution
  stability: experimental
output_contract: Scenario artifacts in .agents/holdout/*.json
---

# Scenario Skill

Author and manage holdout scenarios for behavioral validation. Scenarios define **what** the system should do in narrative form, with measurable acceptance vectors and satisfaction scoring. They live in `.agents/holdout/` so implementing agents cannot see them during development.

## Quick Start

```bash
# Initialize holdout directory
/scenario init

# Add a scenario from a description
/scenario add "user can authenticate with valid credentials"

# List all active scenarios
/scenario list

# Validate scenarios against the schema
/scenario validate
```

## Execution Steps

### Step 1: Initialize Holdout Directory

```bash
ao scenario init
```

Creates `.agents/holdout/` with a `README.md` explaining holdout isolation rules. If the directory already exists, this is a no-op.

The README makes clear:

- Implementing agents MUST NOT read `.agents/holdout/`
- Only evaluator agents and humans should author scenarios
- Hook enforcement prevents implementing agents from accessing holdout files

### Step 2: Author Scenarios

Provide a narrative description and the skill generates a schema-compliant JSON scenario file.

```bash
ao scenario add "user can authenticate with valid credentials"
```

The skill will:

1. Generate an ID (`s-YYYY-MM-DD-NNN`)
2. Prompt for or infer the narrative, expected outcome, and acceptance vectors
3. Set default satisfaction threshold (0.8)
4. Write to `.agents/holdout/s-YYYY-MM-DD-NNN.json`

You can also author scenarios manually by writing JSON that conforms to `schemas/scenario.v1.schema.json`. See [Scenario Schema Reference](references/scenario-schema.md).
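For readers authoring scenarios by hand, the ID and file-naming steps above can be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: the helper names (`next_scenario_id`, `draft_scenario`) and the JSON keys (`satisfaction_threshold`, `acceptance_vectors`, and so on) are hypothetical, and numbering from `001` is an assumption. The authoritative field list is `schemas/scenario.v1.schema.json`.

```python
import json
import re
from datetime import date
from pathlib import Path

# Matches IDs of the form s-YYYY-MM-DD-NNN.
ID_PATTERN = re.compile(r"^s-(\d{4}-\d{2}-\d{2})-(\d{3})$")


def next_scenario_id(holdout_dir: Path, today: date) -> str:
    """Return the next unused s-YYYY-MM-DD-NNN ID for the given day."""
    day = today.isoformat()
    used = []
    for path in holdout_dir.glob("s-*.json"):
        match = ID_PATTERN.match(path.stem)
        if match and match.group(1) == day:
            used.append(int(match.group(2)))
    # Assumption: sequence numbers start at 001 and increase by one.
    next_number = max(used, default=0) + 1
    return f"s-{day}-{next_number:03d}"


def draft_scenario(holdout_dir: Path, goal: str, today: date) -> Path:
    """Write a skeleton scenario file and return its path."""
    holdout_dir.mkdir(parents=True, exist_ok=True)
    scenario_id = next_scenario_id(holdout_dir, today)
    scenario = {
        # Hypothetical key names; check them against the real schema.
        "id": scenario_id,
        "goal": goal,
        "status": "draft",
        "source": "human",
        "date": today.isoformat(),
        "satisfaction_threshold": 0.8,  # default threshold from Step 2
        "acceptance_vectors": [],
    }
    path = holdout_dir / f"{scenario_id}.json"
    path.write_text(json.dumps(scenario, indent=2) + "\n", encoding="utf-8")
    return path
```

Calling `draft_scenario(Path(".agents/holdout"), "user can authenticate with valid credentials", date.today())` would write one skeleton file; running `ao scenario validate` afterwards is still the way to confirm it matches the real schema.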
### Step 3: Validate Scenarios

```bash
ao scenario validate
```

Validates every `.json` file in `.agents/holdout/` against `schemas/scenario.v1.schema.json`. Reports:

- Schema violations (missing fields, wrong types)
- Duplicate IDs
- Stale scenarios (status = "active" but date > 90 days old)
- Acceptance vectors with no check command

### Step 4: List Scenarios

```bash
ao scenario list
```

Displays all scenarios with:

- ID, goal, status, source, date
- Satisfaction threshold
- Count of acceptance vectors

Filter options:

```bash
ao scenario list --status active
ao scenario list --status draft
ao scenario list --status retired
```

### Linking scenarios to GOALS.md directives

A scenario linked to a GOALS.md directive becomes part of the executable spec. `ao goals scenarios --create "<goal>" --directive N` scaffolds a promoted spec scenario and links it bidirectionally; `ao goals scenarios` lists each directive's linked scenarios; `ao goals scenarios --lint` checks the link graph. Ad hoc holdout scenarios authored with `ao scenario add` stay unlinked until promoted. See the `/goals` skill and `docs/adr/ADR-0003`.

Once linked, a scenario's pass/fail feeds the directive's fitness: `ao goals measure` rolls linked scenario results into a per-directive `scenario_satisfaction` ratio (RED below threshold), and `ao goals trace` renders the directive → scenario → bead → verdict → learning lineage and audits it for orphans. See the `/goals` skill for both surfaces.

### Step 5: Integration with Validation

Scenarios are consumed by **STEP 1.8** in the `/validation` skill. During validation, the evaluator agent:

1. Loads all active scenarios from `.agents/holdout/`
2. Runs each acceptance vector's check command
3. Computes a satisfaction score per scenario (0.0-1.0)
4. Aggregates into an overall holdout score
5. Fails the validation gate if any scenario falls below its threshold

## Key Rules

### Holdout Isolation

Scenarios are **holdout data**.
The implementing agent must never see them. This prevents the agent from overfitting to specific test cases instead of building correct general behavior.

- Scenarios live in `.agents/holdout/`, which is outside the codebase
- A hook enforces that implementing agents cannot read holdout files
- Only evaluator agents, humans, or the `/validation` skill access scenarios

### Satisfaction Scoring

Scenarios use continuous satisfaction scoring (0.0-1.0), not boolean pass/fail. This enables:

- Partial credit for incomplete implementations
- Trend tracking across iterations
- Threshold tuning per scenario based on criticality

Each acceptance vector produces a score, and the scenario's overall score is the weighted average across all vectors.

### Authorship Rules

- Scenarios should be written by **humans** or by **evaluator agents**
- The implementing agent MUST NOT author its own scenarios
- The `source` field tracks provenance: `human`, `agent`, or `prod-telemetry`
- When an evaluator agent writes scenarios, it should operate in a separate session with no access to implementation details

### Scenario Lifecycle

| Status | Meaning |
|--------|---------|
| `active` | Scenario is evaluated during validation |
| `retired` | Scenario passed consistently; kept for reference |
| `blocked` | Scenario cannot be evaluated (missing dependency) |
| `draft` | Scenario is incomplete; not yet evaluated |

## Reference Documents

- [Scenario Schema Reference](references/scenario-schema.md) -- full field documentation and example JSON for the scenario schema
- [references/scenario.feature](references/scenario.feature) -- Executable spec: author holdout scenarios, schema-validate, list + link to GOALS directives, feed /validation (soc-qk4b)

## Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| `validate` reports missing fields | Schema version mismatch | Check `version` field matches schema expectation |
| Scenario not picked up by validation | Status is not `active` | Set `"status": "active"` in the JSON |
| Implementing agent read holdout | Hook not installed | Run `ao scenario init` to verify hook setup |
| Duplicate ID error | Two scenarios share an ID | Rename one using `s-YYYY-MM-DD-NNN` format |
| Stale scenario warning | Active scenario older than 90 days | Review and retire or refresh the scenario |
| Score always 0.0 | Check command returns non-zero | Debug the check command independently |

## See Also

- `/validation` -- consumes scenarios at STEP 1.8 for holdout evaluation
- `/council` -- multi-model review can generate scenario suggestions
- `/vibe` -- code quality validation (complementary to behavioral scenarios)
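As a rough illustration of the Satisfaction Scoring rule and the Step 5 gate described earlier, the arithmetic might look like the sketch below. It is not the `/validation` skill's code: `VectorResult`, its `weight` field, and the `gate` helper are hypothetical names, and treating a scenario with no weighted vectors as 0.0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VectorResult:
    """Outcome of one acceptance vector's check (hypothetical shape)."""
    score: float         # satisfaction in the range 0.0-1.0
    weight: float = 1.0  # assumed per-vector weight; equal by default


def scenario_score(vectors: list[VectorResult]) -> float:
    """Weighted average of vector scores for one scenario."""
    total_weight = sum(v.weight for v in vectors)
    if total_weight <= 0:
        # Assumption: a scenario with no weighted vectors scores 0.0.
        return 0.0
    return sum(v.score * v.weight for v in vectors) / total_weight


def gate(
    scenarios: dict[str, tuple[list[VectorResult], float]],
) -> tuple[bool, dict[str, float]]:
    """Return (passed, per-scenario scores).

    `scenarios` maps a scenario ID to (vector results, threshold).
    The gate fails if any scenario scores below its own threshold.
    """
    scores = {sid: scenario_score(vs) for sid, (vs, _) in scenarios.items()}
    passed = all(
        scores[sid] >= threshold for sid, (_, threshold) in scenarios.items()
    )
    return passed, scores
```

For example, a scenario with one vector scoring 1.0 at weight 3 and another scoring 0.0 at weight 1 gets 0.75, which falls below the default 0.8 threshold and would fail the gate.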