run-evals

Name: run-evals
Author: different-ai/openwork

$ npx mdskill add different-ai/openwork/run-evals

Runs OpenWork UI evals on Daytona sandboxes or local Electron instances

Executes end-to-end UI tests for onboarding, session, and settings flows
Uses Daytona CLI, Docker, and CDP browser tools for automation
Chooses a fresh Daytona sandbox by default and falls back to local Electron
Returns eval results via logs and sandbox URLs for inspection

SKILL.md

.github/skills/run-evals

---
name: run-evals
description: Run OpenWork UI evals on a Daytona sandbox or local Electron instance. Handles sandbox creation, service startup, and eval execution via CDP browser tools.
---

# Skill: Run Evals

Run the OpenWork UI evaluation flows against a real Electron app. Prefer a fresh Daytona sandbox for each run, with a local test fallback when Daytona is unavailable.

## When to use

- User says "run evals on Daytona" or "run this flow on Daytona"
- User wants to verify a UI change end-to-end
- User wants to test the onboarding, session, or settings flows

## Prerequisites

- `daytona` CLI installed and logged in (`daytona login`)
- Using the "Different AI" org (`daytona organization use "Different AI"`)
- The `.devcontainer/` files exist in the repo
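
The last prerequisite can be checked mechanically before any sandbox is created. The following is a minimal sketch, assuming the two files that Step 1 passes to `daytona create` are the ones that matter; the function name `check_devcontainer_files` is hypothetical and not part of the repo.

```bash
#!/usr/bin/env bash
# Preflight sketch (hypothetical helper, not part of the OpenWork repo).

# Return 0 if the devcontainer files used in Step 1 exist under the repo root.
# Prints each missing file to stderr and returns 1 otherwise.
check_devcontainer_files() {
  local repo_root="${1:-.}" missing=0 f
  for f in Dockerfile.daytona-vnc start-daytona-vnc.sh; do
    if [ ! -f "$repo_root/.devcontainer/$f" ]; then
      echo "missing: .devcontainer/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_devcontainer_files .` from the repo root before Step 1 surfaces a missing file early, rather than partway through the sandbox build.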

## Workflow

### Step 1: Create sandbox

Create a new Daytona sandbox for each eval run. Avoid reusing old sandboxes unless the user explicitly asks to debug existing state.

Before creating the sandbox, explicitly point Daytona at the Different AI org:

```bash
daytona organization use "Different AI"
```

Pick a unique name:

```bash
SANDBOX="openwork-eval-$(date +%Y%m%d-%H%M%S)"
```

Create it from the Daytona VNC-capable devcontainer:

```bash
daytona create \
  --name "$SANDBOX" \
  --dockerfile .devcontainer/Dockerfile.daytona-vnc \
  --context .devcontainer/Dockerfile.daytona-vnc \
  --context .devcontainer/start-daytona-vnc.sh \
  --class large \
  --memory 8 \
  --disk 10 \
  --auto-stop 60 \
  --public \
  --target us
```

Use `--disk 10`; the default Daytona disk can fill up during dependency and
sidecar work.

If Daytona is unavailable, skip to the local fallback and still run the closest possible test.

### Step 2: Prepare repo

Use a clean checkout inside the sandbox. The devcontainer Dockerfile normally clones the repo into `/workspace`; if that is missing, clone it there. Then fetch and check out the branch or commit under test.

```bash
daytona exec "$SANDBOX" 'test -d /workspace/.git || git clone https://github.com/different-ai/openwork.git /workspace'
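# The pull is best-effort (it fails on a detached commit checkout); fetch and checkout failures still surface.
daytona exec "$SANDBOX" 'git -C /workspace fetch --all --prune && git -C /workspace checkout <branch-or-commit> && { git -C /workspace pull --ff-only || true; }'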
```

Install dependencies before starting services:

```bash
daytona exec "$SANDBOX" 'cd /workspace && pnpm install'
```

### Step 3: Start services

```bash
daytona exec "$SANDBOX" 'bash -lc "cd /workspace && nohup bash .devcontainer/start-daytona-vnc.sh > /tmp/start-vnc.log 2>&1 &"'
daytona exec "$SANDBOX" 'bash -lc "cd /workspace/apps/app && nohup env OPENWORK_DEV_MODE=1 pnpm exec vite --host 0.0.0.0 --port 5173 > /tmp/vite.log 2>&1 &"'
daytona exec "$SANDBOX" 'bash -lc "cd /workspace && nohup env DISPLAY=:99 ELECTRON_DISABLE_SANDBOX=1 OPENWORK_REACT_DEVTOOLS=0 OPENWORK_ELECTRON_REMOTE_DEBUG_PORT=9825 OPENWORK_DEV_MODE=1 pnpm --filter @openwork/desktop dev:electron > /tmp/electron.log 2>&1 &"'
```

Wait ~35-60s for XFCE/noVNC, Vite, Electron, and opencode to start.
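
Instead of a fixed sleep, the wait can be expressed as a polling loop. This is a sketch only: `retry_until` is a hypothetical helper, it accepts whole-second intervals, and it assumes the polled command signals readiness through its exit status.

```bash
# Poll a command until it succeeds or the timeout (in seconds) elapses.
# Usage: retry_until <timeout_s> <interval_s> <command> [args...]
# Hypothetical helper; both durations must be whole seconds.
retry_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # gave up: the command never succeeded in time
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

One possible use is `retry_until 60 5 daytona exec "$SANDBOX" 'grep -qF "DevTools listening on" /tmp/electron.log'`, which polls for the marker described in Step 4. This assumes `daytona exec` passes the remote command's exit status through, which is worth confirming before relying on it.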

### Step 4: Verify

```bash
# Get CDP URL
daytona preview-url "$SANDBOX" -p 9825
```

Then use the browser tools to verify:

```
browser_list({ browser_url: "<CDP_URL>" })
→ should show "OpenWork" page target
```

If `browser_list` fails, inspect `/tmp/electron.log`. The real CDP success
marker is Chromium's `DevTools listening on ws://127.0.0.1:9825/...`, not just
OpenWork's `Electron CDP exposed` line.
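
The distinction between the two log lines can be captured in a small check. The sketch below assumes the log has the shape quoted above; `cdp_ready` is a hypothetical name, and the function reads a file on the machine where it runs, so it would need to run inside the sandbox or against a copied log.

```bash
# Return 0 only if the log contains Chromium's DevTools marker for the port.
# OpenWork's own "Electron CDP exposed" line is deliberately not accepted.
# Hypothetical helper; the port defaults to 9825 as in Step 3.
cdp_ready() {
  local log_file="$1" port="${2:-9825}"
  grep -qF "DevTools listening on ws://127.0.0.1:${port}/" "$log_file" 2>/dev/null
}
```

Running `cdp_ready /tmp/electron.log || tail -n 50 /tmp/electron.log` prints the end of the log only when the marker is absent.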

### Step 5: Create a workspace

If the app shows the Welcome page, create a workspace:

1. Create directory on sandbox:
   ```bash
   daytona exec "$SANDBOX" 'mkdir -p /workspace/hello'
   ```

2. Follow the workspace creation flow from `evals/daytona-flows.md` Flow 1:
   - Click "Get started" → "Local workspace"
   - Inject path via React fiber dispatch: `{ key: "selectedFolder", value: "/workspace/hello" }`
   - Click "Create Workspace"
   - Wait 10s for opencode sidecar to boot

### Step 6: Run the requested eval

Read the eval file from `evals/` and execute each step using the browser tools.

For each step:
1. Execute the `browser_evaluate` / `browser_click` / `browser_screenshot` call
2. Verify the expected outcome
3. Report pass/fail
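
The per-step loop above can be tallied with a few lines of shell. This is an illustrative sketch rather than part of the eval tooling; `record_step`, `summarize`, and the step names in the usage note are hypothetical.

```bash
PASSED=0
FAILED=0

# Record one step outcome: record_step <step name> <pass|fail>
# Anything other than "pass" is counted as a failure.
record_step() {
  local name="$1" outcome="$2"
  if [ "$outcome" = "pass" ]; then
    PASSED=$((PASSED + 1))
    echo "PASS  $name"
  else
    FAILED=$((FAILED + 1))
    echo "FAIL  $name"
  fi
}

# Print the totals and return non-zero if any step failed.
summarize() {
  echo "passed=$PASSED failed=$FAILED"
  [ "$FAILED" -eq 0 ]
}
```

For example, `record_step "open settings" pass` followed by `summarize` prints one line per step and then the totals. Call `record_step` directly rather than inside `$(...)`, since a subshell would discard the counters.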

### Key techniques

**Clicking buttons:**
```
browser_evaluate({ browser_url: URL, expression: "(function() { var btns = document.querySelectorAll('button'); for (var i = 0; i < btns.length; i++) { if (btns[i].textContent.indexOf('BUTTON_TEXT') !== -1) { btns[i].click(); return 'clicked'; } } return 'not found'; })()" })
```

**Typing in Lexical editors:**
```
browser_evaluate({ browser_url: URL, expression: "(function() { var e = document.querySelector('[contenteditable=true]'); if (!e) { return 'editor not found'; } e.focus(); document.execCommand('insertText', false, 'YOUR TEXT'); return 'typed'; })()" })
```

**Injecting folder path (bypass native picker):**
Use the `__reactFiber$` → `CreateWorkspaceModal` reducer dispatch with `{ key: "selectedFolder", value: "/path" }`. Full code in `evals/daytona-flows.md` Flow 1 Step 5.

**Checking page state:**
```
browser_evaluate({ browser_url: URL, expression: "document.body.innerText.substring(0, 500)" })
```

**Screenshots:**
```
browser_screenshot({ browser_url: URL })
```

### Local fallback

Always include a local fallback in the result. Use it when Daytona is down, quota-limited, or the sandbox cannot expose CDP. At minimum, run the closest local verification commands and report that the Daytona path was unavailable.

```bash
pnpm install
pnpm --filter @openwork/app typecheck
pnpm --filter @openwork/app build
```

For UI flow verification, start the local app and attach browser tools to the local Electron or Chrome DevTools endpoint, then run the same eval steps from `evals/`.

```bash
pnpm dev
```

Report clearly whether the result came from Daytona or the local fallback.
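
A rough sketch of the target choice and the labelled report follows. It only checks whether the CLI is on `PATH`, so outages, quota limits, and CDP exposure failures still need to be detected separately. `pick_eval_target`, `report_result`, and the `DAYTONA_BIN` override are hypothetical names introduced for illustration.

```bash
# Echo "daytona" when the CLI is on PATH, otherwise "local".
# DAYTONA_BIN is a hypothetical override, mainly useful for dry runs.
pick_eval_target() {
  if command -v "${DAYTONA_BIN:-daytona}" >/dev/null 2>&1; then
    echo "daytona"
  else
    echo "local"
  fi
}

# Prefix a result line with the environment that produced it.
report_result() {
  local target="$1" status="$2"
  echo "[source: $target] eval result: $status"
}
```

A run could set `TARGET="$(pick_eval_target)"` once at the start and pass it to `report_result` at the end, so the source of the result always appears in the output.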

### Teardown

```bash
daytona delete "$SANDBOX"
```
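
To make teardown harder to forget, the delete can be wrapped in a function and registered as an exit trap. This is a sketch under assumptions: `teardown_sandbox` and the `DAYTONA_BIN` override are hypothetical, and the function does nothing when `SANDBOX` is unset, which also covers the local fallback.

```bash
# Delete the sandbox named in $SANDBOX, if one was created.
# DAYTONA_BIN is a hypothetical override so the function can be dry-run.
teardown_sandbox() {
  if [ -n "${SANDBOX:-}" ]; then
    "${DAYTONA_BIN:-daytona}" delete "$SANDBOX"
  fi
}
```

Registering it with `trap teardown_sandbox EXIT` right after `SANDBOX` is set in Step 1 deletes the sandbox even if a later step fails. Skip the trap when the user has asked to keep the sandbox around for debugging.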