e2e-scenario-testing

$npx mdskill add obra/dotfiles/e2e-scenario-testing

Verify that a *running* application does what it claims, by driving its real interface the way a user would. The unit of work is a **scenario card**: a short markdown test written for an agent to execute — not a Playwright/expect script. Cards are high-level enough that a small UI shuffle doesn't invalidate them, but precise enough that two agents running the same card reach the same verdict.

SKILL.md
.github/skills/e2e-scenario-testingView on GitHub ↗
---
name: e2e-scenario-testing
description: Use when verifying a running application end-to-end through its real interface — a web UI, a CLI, or a TUI — by writing and executing agent-run "scenario cards" against a freshly built instance with falsifiable assertions. Trigger on "test it end to end", "prove the UI actually works", "write/run a scenario", or after a change touches a user-facing surface that unit tests can't fully cover. Not for unit tests, pure code review, or API-only checks.
---

# End-to-end scenario testing

Verify that a *running* application does what it claims, by driving its real
interface the way a user would. The unit of work is a **scenario card**: a
short markdown test written for an agent to execute — not a Playwright/expect
script. Cards are high-level enough that a small UI shuffle doesn't invalidate
them, but precise enough that two agents running the same card reach the same
verdict.

A green unit test proves the wiring in isolation. A scenario proves the wiring
*as assembled and rendered*. They catch different bugs — write the card even
when the unit tests pass.

## When to use this

- A feature touches a user-facing surface (button, palette command, status
  indicator, keybinding, rendered message) and you want proof it works live.
- The user asks to "test it end to end" / "prove the UI works" / "run a scenario."
- You changed a layer (projection, capability gate, renderer) whose effect is
  only observable in the assembled UI.

Don't use it for logic with no UI surface (unit-test that), or when a
production gate makes the live path unreachable (see *Over-specification* below).

## The card format

One card = one `.md` file. Keep these sections; collapse any to one line when
the scenario is simple. Don't pad.

```markdown
# <area>-<behavior>: one-line title

**What this covers**: the feature + the specific commits/IDs it exercises.
If something else breaks this, it should be caught here.

## Pre-state
What must be true before starting: a freshly built instance running, auth/creds
in place, a clean workdir. Give the exact commands to reach it.

## Steps
Numbered actions described by **intent**, each with the concrete command or
tool call and a real UI label (prefer labels the user sees over brittle
selectors like `#nav > li:nth-child(3)`).

## Expected
For each step, what you should observe — and the **falsification condition**:
"if you see X instead, the test fails." Silence is not success.

## Cleanup
Idempotent teardown so reruns are hermetic. Never touch state you didn't create.

## Sharp edges
Footguns, timing/ordering caveats, nondeterminism noted while recording.
```

## Running a card

1. **Build fresh from the code under test.** The single most common mistake is
   testing a stale binary. Rebuild every layer your change touches (server,
   client, embedded assets) and confirm the running instance is the new one,
   not a process someone left up yesterday.
2. **Isolate.** Run in a hermetic workdir. If the app holds a host-level
   singleton (a lock, a fixed port, a shared state dir), point the test
   instance at its own copy — e.g. override `$HOME`/state-dir/port — so it can
   neither collide with nor pollute (nor be polluted by) a real instance.
   Symlink shared read-only inputs (creds, tokens); keep mutable state separate.
3. **Drive the surface** (recipes below).
4. **Assert against the authoritative record, not just the pixels.** The UI can
   lie or lag; the on-disk state / log / database is ground truth. Cross-check
   the rendered claim against it when an assertion is ambiguous.
5. **Capture evidence** — a screenshot, the captured pane, the on-disk artifact.
6. **Clean up** — shut down what you spawned, remove scratch dirs, leave
   pre-existing instances running and untouched.

## Driving a web UI (browser)

Use a Chrome/CDP browser tool. After authenticated navigation, drive the page
through `eval` against the app's own JS entry points rather than synthesizing
clicks where possible — it's more robust to layout change.

- **Optimistic-vs-settled** assertions: fire the action but *don't await it*,
  take a synchronous DOM snapshot (the pending placeholder is there *now*),
  then await and snapshot again. Without the no-await capture you can't tell
  "rendered then reconciled" from "never rendered."
- Return a **plain string** from `eval` (join your findings with `\n`); some
  bridges stringify a returned object as `[object Object]`.
- Inspect internal state via the app's singleton (`window.<App>?.state`, etc.)
  when the DOM is ambiguous.

## Driving a CLI / TUI (tmux)

Each scenario gets its own named tmux session (cleanup needs a deterministic
name). Fix the size for deterministic capture; prefer the app's plain-text/inline
mode if it has one.

```bash
tmux new-session -d -s <name> -x 200 -y 50 "<cmd> 2>/tmp/<name>-stderr.log"
tmux send-keys -t <name> -l "literal text"   # -l = no key-name parsing (paths, slashes)
tmux send-keys -t <name> Enter
tmux capture-pane -t <name> -p                # -p = plain text; add -e only for styling
```

- Always `-l` for user-typed strings; without it `/foo/bar` parses as escapes.
- Poll `capture-pane` for a state string; grep the **glyph/word**, not the color.
- Redirect stderr to a file — panics and debug probes land there, not the pane.

## Hard-won principles

- **Falsification, always.** Every assertion states what failure looks like. A
  step that can't fail proves nothing. When watching for an outcome, make sure
  your check would fire on the failure path, not just the happy path.
- **Verify the *right* surface.** The same concept often exists at several
  layers (an internal capability vs. the REST projection of it; a model field
  vs. the rendered chip). Confirm your assertion reads the surface that actually
  carries the signal — a "missing" value is often present one layer over.
- **Present but not visible ≠ absent.** Scrollable bodies, virtualized lists,
  and auto-scroll-to-bottom routinely push a real element out of the capture
  window. Before concluding something didn't render, scroll/expand to where it
  should be. Confirm via a sibling read (a status command reading the same
  state) when the visual is hard to capture.
- **Executing the card tests the card.** Expect to find bugs in your own
  scenario — a wrong selector, a wrong layer, an assertion the UI can't show.
  Fix the card as you go; a card that "passes" because its check was vacuous is
  worse than none.
- **Over-specification trap.** A card can describe a path that production gating
  prevents (e.g. a keybind that's a no-op in the current mode). Confirm the gate
  in the source rather than fighting it through the UI; verify the underlying
  behavior with a unit test and note the gate in the card.
- **Cleanup is part of the test.** A half-shutdown fleet makes the next run's
  polling return false positives. Make teardown idempotent and scoped to what
  you created.

## Finishing

Report each assertion as pass/fail with the concrete observation (the rendered
text, the on-disk value), not "looks good." If a card fails, capture the
evidence and either fix the bug or file it; don't soften the verdict.
More from obra/dotfiles