eval-writer

$npx mdskill add langchain-ai/deepagentsjs/eval-writer

Build and manage comprehensive evaluation suites for the deepagentsjs monorepo.

  • Creates new evaluation packages, scaffolds test cases, and designs datasets for agent testing.
  • Integrates with vitest for execution, LangSmith for tracking, and the internal eval-harness.
  • Triggers when asked to benchmark, write evaluations, or add specific test scenarios.
  • Delivers functional, runnable test suites and reports structured performance metrics.

SKILL.md

.github/skills/eval-writerView on GitHub ↗
---
name: eval-writer
description: "Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'."
---

# Eval Writer

Create new eval suites for the `deepagentsjs` monorepo. Each eval is an
independent workspace package under `evals/` that uses the `@deepagents/evals`
harness, runs via vitest, and reports results to LangSmith.

## Before you start

Read the existing eval infrastructure to understand current patterns:

```
internal/eval-harness/src/index.ts    # EvalRunner, RunAgentParams, matchers
internal/eval-harness/src/deepagent.ts # DeepAgentEvalRunner, extend()
internal/eval-harness/src/setup.ts     # Registered runners
evals/README.md                        # User-facing docs
internal/eval-harness/README.md        # Harness internals
```

Scan existing evals for conventions:

```
evals/basic/index.test.ts     # Simple: system prompt, reasoning
evals/files/index.test.ts     # File ops: read, write, edit, glob, grep
evals/subagents/index.test.ts # Delegation: task tool, named subagents
```

## Workflow

### 1. Understand the eval requirements

Clarify with the user:

- **What capability** is being evaluated? (file ops, tool use, multi-turn reasoning, memory, code generation, etc.)
- **Where do test cases come from?** Options:
  - **Inline** — hardcoded in the test file (simple, good for <20 cases)
  - **JSON/JSONL fixture** — checked into the eval package (good for 20-200 cases)
  - **External dataset** — downloaded at setup time (good for published benchmarks)
  - **LangSmith dataset** — pulled from LangSmith API (good for collaborative curation)
- **How should results be scored?** Options:
  - **Trajectory matchers** — step count, tool calls, final text (built-in)
  - **Exact/fuzzy match** — compare output to reference (simple)
  - **LLM-as-judge** — use a model to grade the output (complex evals)
  - **Code execution** — run generated code and check results (SWE-bench style)
  - **Custom evaluator** — domain-specific scoring function
- **Does the agent need special configuration?** (custom tools, subagents, system prompt, initial files)

### 2. Create the eval package

Every eval is a workspace package under `evals/<name>/`.

#### Directory structure

```
evals/<name>/
├── package.json
├── vitest.config.ts
├── index.test.ts
├── README.md
└── (optional) fixtures/       # JSON/JSONL test data
└── (optional) vitest.setup.ts  # Dataset loading, custom setup
└── (optional) evaluators.ts   # Custom scoring functions
```

#### package.json

```json
{
  "name": "@deepagents/eval-<name>",
  "private": true,
  "type": "module",
  "scripts": {
    "test:eval": "vitest run"
  },
  "dependencies": {
    "@deepagents/evals": "workspace:*",
    "deepagents": "workspace:*",
    "langsmith": "^0.5.4",
    "vitest": "^4.0.18"
  }
}
```

Add extra dependencies as needed (e.g. `zod` for tool schemas, `langchain`
for `tool()` helper, dataset-specific packages).

#### vitest.config.ts

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "node",
    globals: false,
    testTimeout: 120_000,
    hookTimeout: 60_000,
    teardownTimeout: 60_000,
    include: ["**/*.test.ts"],
    setupFiles: ["@deepagents/evals/setup"],
    reporters: ["default", "langsmith/vitest/reporter"],
  },
});
```

Adjust `testTimeout` for long-running evals (multi-turn, code execution).
Add `"./vitest.setup.ts"` to `setupFiles` if the eval needs custom setup (dataset loading, etc.).

#### README.md

```markdown
# <name>

<One-line description of what this eval tests.>
```

#### Verify workspace registration

Check that `pnpm-workspace.yaml` includes `"evals/*"`. It should already
be there — if not, add it.

### 3. Design test cases

#### Pattern A: Inline test cases (simple evals)

Best for small, hand-crafted test suites. Each test is an `ls.test()` call
with `inputs` and optional `referenceOutputs`.

```ts
ls.test(
  "descriptive test name",
  {
    inputs: { query: "What is 2+2?" },
    referenceOutputs: { expectedAnswer: "4" },
  },
  async ({ inputs, referenceOutputs }) => {
    const result = await runner.run({ query: inputs.query });
    // assertions...
  },
);
```

#### Pattern B: Data-driven with ls.test.each (medium evals)

Best for 10-200 cases from a fixture file. Load the data and iterate:

```ts
import testCases from "./fixtures/cases.json";

// testCases = [{ inputs: { query: "..." }, referenceOutputs: { answer: "..." } }, ...]

ls.test.each(testCases)(
  "case: ${inputs.query}",
  async ({ inputs, referenceOutputs }) => {
    const result = await runner.run({ query: inputs.query });
    // assertions using referenceOutputs...
  },
);
```

The fixture JSON must be an array of objects with at minimum `{ inputs: {...} }`.
Optional fields: `referenceOutputs`, `id`, `metadata`, `split`.

#### Pattern C: External dataset (published benchmarks)

For published benchmarks (oolong, AgentBench, SWE-bench, etc.), download
and cache the dataset in a setup file.

Create `vitest.setup.ts`:

```ts
import { existsSync, mkdirSync, writeFileSync, readFileSync } from "fs";
import { join } from "path";

const CACHE_DIR = join(import.meta.dirname, ".cache");
const DATA_PATH = join(CACHE_DIR, "dataset.json");

export async function loadDataset(): Promise<TestCase[]> {
  if (existsSync(DATA_PATH)) {
    return JSON.parse(readFileSync(DATA_PATH, "utf-8"));
  }

  mkdirSync(CACHE_DIR, { recursive: true });

  // Download from source — adapt to the specific benchmark
  const response = await fetch("https://example.com/dataset.json");
  const data = await response.json();

  // Transform into eval format
  const cases = data.map((item: any) => ({
    inputs: { query: item.question },
    referenceOutputs: { answer: item.gold_answer },
    metadata: { source: item.id, category: item.category },
  }));

  writeFileSync(DATA_PATH, JSON.stringify(cases, null, 2));
  return cases;
}
```

Add `.cache/` to `.gitignore` in the eval package.

Then register it as a vitest setup file in `vitest.config.ts`:

```ts
setupFiles: ["@deepagents/evals/setup", "./vitest.setup.ts"],
```

And in the test file:

```ts
import { loadDataset } from "./vitest.setup.js";

const dataset = await loadDataset();

ls.describe(runner.name, () => {
  ls.test.each(dataset)(
    "${metadata.source}: ${inputs.query}",
    async ({ inputs, referenceOutputs }) => {
      // ...
    },
  );
}, { projectName: "deepagents-js-<name>", upsert: true });
```

#### Pattern D: LangSmith dataset

Pull test cases from a LangSmith dataset. Useful for collaborative curation
where non-engineers add examples via the LangSmith UI.

```ts
import { Client } from "langsmith";

const client = new Client();

export async function loadDataset(): Promise<TestCase[]> {
  const examples = [];
  for await (const example of client.listExamples({
    datasetName: "my-dataset-name",
  })) {
    examples.push({
      id: example.id,
      inputs: example.inputs,
      referenceOutputs: example.outputs ?? {},
    });
  }
  return examples;
}
```

### 4. Write scoring logic

#### Built-in trajectory matchers

The harness provides vitest matchers that also log LangSmith feedback scores.
Use these as the primary building blocks:

```ts
// Exact step count
expect(result).toHaveAgentSteps(3);

// Exact tool-call count across all steps
expect(result).toHaveToolCallRequests(2);

// Check a specific tool call in step N (1-indexed)
expect(result).toHaveToolCallInStep(1, {
  name: "write_file",
  argsContains: { file_path: "/out.txt" },  // partial match
  argsEquals: { file_path: "/out.txt" },     // exact match
});

// Final response text
expect(result).toHaveFinalTextContaining("hello", true /* caseInsensitive */);

// Extract final text for custom assertions
import { getFinalText } from "@deepagents/evals";
const text = getFinalText(result);
expect(text.trim()).toBe("4");

// File system assertions
expect(result.files["/output.md"]).toContain("expected content");
expect(Object.keys(result.files)).toHaveLength(3);
```

#### Custom feedback logging

Log additional LangSmith feedback scores beyond what matchers provide:

```ts
import * as ls from "langsmith/vitest";

// Numeric score
ls.logFeedback({ key: "accuracy", score: 0.95 });

// Boolean score
ls.logFeedback({ key: "correct", score: 1 });

// With comment
ls.logFeedback({ key: "quality", score: 0.8, comment: "Minor formatting issue" });
```

#### LLM-as-judge evaluators

For subjective quality, use `ls.wrapEvaluator()` to create a traced evaluator
that logs feedback automatically:

```ts
import * as ls from "langsmith/vitest";
import { ChatAnthropic } from "@langchain/anthropic";

const judge = new ChatAnthropic({ model: "claude-sonnet-4-5-20250929" });

const evaluateHelpfulness = ls.wrapEvaluator(
  async ({ inputs, outputs, referenceOutputs }) => {
    const response = await judge.invoke([
      {
        role: "system",
        content: `Rate the helpfulness of the assistant's response on a scale of 0-1.
          Respond with JSON: { "score": <number>, "reasoning": "<explanation>" }`,
      },
      {
        role: "user",
        content: `Question: ${inputs.query}\nExpected: ${referenceOutputs.answer}\nActual: ${outputs.response}`,
      },
    ]);

    const parsed = JSON.parse(response.content as string);
    return {
      key: "helpfulness",
      score: parsed.score,
      comment: parsed.reasoning,
    };
  },
);

// In a test:
const result = await runner.run({ query: inputs.query });
const text = getFinalText(result);
await evaluateHelpfulness({
  inputs: { query: inputs.query },
  outputs: { response: text },
  referenceOutputs: referenceOutputs ?? {},
});
```

### 5. Wire up the test file

#### Minimal template

```ts
import * as ls from "langsmith/vitest";
import { expect } from "vitest";
import { getDefaultRunner, getFinalText } from "@deepagents/evals";

const runner = getDefaultRunner();

ls.describe(
  runner.name,
  () => {
    ls.test(
      "test name",
      {
        inputs: { query: "..." },
        referenceOutputs: { answer: "..." },
      },
      async ({ inputs, referenceOutputs }) => {
        const result = await runner.run({ query: inputs.query });

        expect(result).toHaveAgentSteps(1);
        expect(result).toHaveFinalTextContaining(referenceOutputs.answer);
      },
    );
  },
  { projectName: "deepagents-js-<name>", upsert: true },
);
```

#### Key conventions

- **`getDefaultRunner()`** — reads `EVAL_RUNNER` env var. Throws if not set.
- **`runner.name`** — used as `ls.describe` name → becomes the LangSmith dataset name.
- **`runner.run({ query, initialFiles? })`** — pure invocation. Returns `AgentTrajectory`.
- **`runner.extend({ systemPrompt?, tools?, subagents?, ... })`** — returns a new runner with agent config overrides. Use for tests that need custom agent setup.
- **`projectName`** in `ls.describe` config — sets the LangSmith project for tracing. Convention: `"deepagents-js-<eval-name>"`.
- **`upsert: true`** — reuse existing dataset/project instead of creating new ones each run.
- **Always import `expect` from `vitest`** — the harness extends it with custom matchers at import time.
- **`ls.logOutputs()`** is called inside the runner — do NOT call it in test code.

#### Using extend() for custom agent config

```ts
// Custom system prompt
const result = await runner
  .extend({ systemPrompt: "You are a code reviewer." })
  .run({ query: inputs.query });

// Custom tools
const result = await runner
  .extend({ tools: [myCustomTool] })
  .run({ query: inputs.query });

// Custom subagents
const result = await runner
  .extend({
    subagents: [{
      name: "researcher",
      description: "Research assistant",
      systemPrompt: "You help with research.",
      tools: [searchTool],
    }],
  })
  .run({ query: inputs.query });
```

#### Using initialFiles for seeded state

```ts
const result = await runner.run({
  query: "Read /data.csv and count the rows.",
  initialFiles: {
    "/data.csv": "name,age\nAlice,30\nBob,25\n",
  },
});
```

#### Sandbox-backed evals (containerized execution)

The default eval runners use the in-memory `StateBackend` — the agent can
read/write files but cannot execute shell commands, install packages, or
interact with a real OS. This is fine for testing tool selection, reasoning,
and file operations.

For evals that need real execution (SWE-bench, code generation, agentic
benchmarks), the agent must run against a sandbox backend. Available
sandbox providers:

| Provider | Package | Use case |
| --- | --- | --- |
| Modal | `@deepagents/modal` | Remote containers, GPU support |
| Daytona | `@deepagents/daytona` | Cloud dev environments |
| Deno | `@deepagents/deno` | Lightweight local sandboxes |
| Node VFS | `@deepagents/node-vfs` | In-process virtual filesystem + shell |

Pass the sandbox via `extend({ backend })`. Manage its lifecycle with
`beforeAll` / `afterAll` (suite-level) or `beforeEach` / `afterEach`
(per-test isolation):

```ts
import * as ls from "langsmith/vitest";
import { expect, beforeAll, afterAll } from "vitest";
import { getDefaultRunner, getFinalText } from "@deepagents/evals";
import { ModalSandbox } from "@deepagents/modal";

const runner = getDefaultRunner();
let sandbox: ModalSandbox;

beforeAll(async () => {
  sandbox = await ModalSandbox.create({
    image: "python:3.12-slim",
    timeout: 600,
  });
});

afterAll(async () => {
  await sandbox?.terminate();
});

ls.describe(
  runner.name,
  () => {
    ls.test(
      "agent can run python",
      { inputs: { query: "Write a Python script that prints 'hello' and run it." } },
      async ({ inputs }) => {
        const result = await runner
          .extend({ backend: sandbox })
          .run({ query: inputs.query });

        expect(result).toHaveFinalTextContaining("hello");
      },
    );
  },
  { projectName: "deepagents-js-sandbox-eval", upsert: true },
);
```

For per-test isolation (each test gets a fresh sandbox):

```ts
import { beforeEach, afterEach } from "vitest";

let sandbox: ModalSandbox;

beforeEach(async () => {
  sandbox = await ModalSandbox.create({ image: "python:3.12-slim" });
});

afterEach(async () => {
  await sandbox?.terminate();
});
```

**When to containerize:**
- The eval requires `execute()` (shell commands)
- The eval involves installing packages or modifying system state
- The eval runs untrusted or generated code
- Tests need filesystem isolation from each other

**When in-memory is fine:**
- Testing tool selection, reasoning, or response quality
- File read/write/edit operations (handled by `StateBackend` + `initialFiles`)
- System prompt adherence, subagent routing

Add the sandbox provider to `package.json` dependencies:

```json
{
  "dependencies": {
    "@deepagents/modal": "workspace:*"
  }
}
```

And increase `testTimeout` in `vitest.config.ts` — sandbox creation adds
overhead:

```ts
testTimeout: 300_000,  // 5 minutes for sandbox evals
hookTimeout: 120_000,  // sandbox setup/teardown
```

### 6. Install and verify

```bash
# From repo root
pnpm install

# Build the harness (if you changed it)
cd internal/eval-harness && pnpm build && cd ../..

# Run the new eval
EVAL_RUNNER=sonnet-4-5 pnpm --filter @deepagents/eval-<name> test:eval
```

### 7. Update documentation

Add the new eval to `evals/README.md` in the "Available eval suites" table:

```markdown
| [`<name>/`](./<name>/) | <one-line description> |
```

## Parity with Python deepagents evals

The Python `deepagents` package has eval suites in
`libs/deepagents/tests/evals/`. The JS evals should maintain parity.
When creating a new eval, check the Python source at
`https://github.com/langchain-ai/deepagents/blob/v0.5/libs/deepagents/tests/evals/`
for the reference implementation.

### Current parity status

| Python eval | JS eval | Status |
| --- | --- | --- |
| `test_system_prompt.py` | `evals/basic/` | ✅ Covered |
| `test_file_operations.py` | `evals/files/` | ✅ Covered |
| `test_subagents.py` | `evals/subagents/` | ✅ Covered |
| `test_memory.py` | `evals/memory/` | ✅ Covered |
| `test_hitl.py` | `evals/hitl/` | ✅ Covered |
| `test_skills.py` | `evals/skills/` | ✅ Covered |
| `test_summarization.py` | `evals/summarization/` | ❌ **Missing** |

### Notes on HITL evals

HITL evals require multi-step invocation (invoke → check interrupts → resume
with `Command`). The eval runner's `run()` does a single invocation, so HITL
tests construct agents directly via `createDeepAgent()` with a `checkpointer`
and `interruptOn` config. See `evals/hitl/index.test.ts` for the pattern.

### Notes on summarization evals

Summarization evals need `SummarizationMiddleware` with low token thresholds,
a checkpointer, a real/virtual filesystem backend, and multi-turn invocations.
These tests would bypass the standard `EvalRunner` and construct agents
directly, similar to HITL.

## Reference: ls.test.each API

`ls.test.each` is the most powerful pattern for data-driven evals. The table
must be an array of objects with at least `{ inputs }`:

```ts
ls.test.each([
  { inputs: { query: "Q1" }, referenceOutputs: { answer: "A1" } },
  { inputs: { query: "Q2" }, referenceOutputs: { answer: "A2" } },
  // Optional additional fields: id, metadata, split
  { id: "custom-id", inputs: { query: "Q3" }, referenceOutputs: { answer: "A3" }, split: "hard" },
])(
  "case: ${inputs.query}",  // Name template — interpolates from row
  async ({ inputs, referenceOutputs, testMetadata }) => {
    // testMetadata.exampleId, testMetadata.datasetId, etc.
    const result = await runner.run({ query: inputs.query });
    // ...
  },
);
```

## Reference: LangSmith integration

### How datasets map

| Concept | LangSmith entity |
| --- | --- |
| `ls.describe(name, ...)` | Dataset (name = dataset name) |
| `ls.test(name, { inputs, referenceOutputs }, fn)` | Example in dataset |
| Running the test suite | Experiment on the dataset |
| `ls.logFeedback(...)` | Feedback on the experiment run |
| `ls.logOutputs(...)` | Experiment output (called by runner) |

### Environment variables

| Variable | Purpose |
| --- | --- |
| `EVAL_RUNNER` | Which model runner to use (e.g. `sonnet-4-5`) |
| `LANGSMITH_API_KEY` | LangSmith auth |
| `LANGSMITH_PROJECT` | Override tracing project (normally set via `projectName`) |
| `LANGSMITH_TEST_TRACKING` | Set to `"false"` to disable LangSmith reporting |
| `ANTHROPIC_API_KEY` | For Anthropic model runners |
| `OPENAI_API_KEY` | For OpenAI model runners |

### Available runners

Defined in `internal/eval-harness/src/setup.ts`:

| Runner name | Model |
| --- | --- |
| `sonnet-4-5` | Claude Sonnet 4.5 |
| `sonnet-4-5-thinking` | Claude Sonnet 4.5 with extended thinking |
| `opus-4-6` | Claude Opus 4.6 |
| `gpt-4.1` | GPT-4.1 |
| `gpt-4.1-mini` | GPT-4.1 Mini |
| `o3-mini` | o3-mini |

## Reference: Implementing published benchmarks

When implementing an existing benchmark, follow these attribution and
methodology guidelines.

### Attribution

Always credit the original benchmark authors. In the eval's `README.md`:

```markdown
# <benchmark-name>

Implementation of [<Benchmark Name>](<paper-or-repo-url>) by <authors> (<year>).

> <brief description from the paper abstract>

## Citation

\`\`\`bibtex
@article{...}
\`\`\`

## Adaptations

<Describe any differences from the original benchmark methodology:>
- <e.g. "Subset of N cases selected for cost efficiency">
- <e.g. "Adapted for agentic tool-use evaluation rather than direct QA">
- <e.g. "Uses LLM-as-judge instead of human annotation">
```

### Common benchmark patterns

#### QA / Factual benchmarks (e.g. MMLU, TrivialQA)

- Test cases: question + gold answer
- Scoring: exact match or fuzzy match on final text
- Pattern: `ls.test.each` with fixture JSON

#### Multi-turn / Conversational (e.g. MT-Bench)

- Test cases: conversation turns that build on each other
- Scoring: LLM-as-judge on each turn
- Pattern: sequential `runner.run()` calls sharing state, or multi-message queries

#### Tool-use benchmarks (e.g. ToolBench, API-Bank)

- Test cases: task requiring specific tool calls
- Scoring: trajectory matchers (correct tools called, correct args)
- Pattern: `runner.extend({ tools: [...] })` with custom tools that return canned responses

#### Code generation (e.g. HumanEval, SWE-bench)

- Test cases: problem description + test suite
- Scoring: execute generated code, check test results
- **Requires sandbox**: yes — agent must run code and observe output
- Pattern: `runner.extend({ backend: sandbox })`, write code to file, execute tests via sandbox

#### Memory / Long-context (e.g. oolong, needle-in-haystack)

- Test cases: large context + retrieval question
- Scoring: whether the agent finds the target information
- **Requires sandbox**: no — seed files via `initialFiles`, check final text
- Pattern: in-memory `StateBackend` is sufficient

#### Agentic benchmarks (e.g. AgentBench, WebArena)

- Test cases: multi-step tasks requiring planning
- Scoring: combination of trajectory analysis + final state checking
- **Requires sandbox**: yes — agent needs shell access, package installs, environment interaction
- Pattern: `runner.extend({ backend: sandbox })` with per-test sandbox isolation

### Handling large datasets

For benchmarks with thousands of cases:

1. **Subset selection** — pick a representative subset (e.g. 100 cases per category). Document the selection criteria.
2. **Split support** — use `split` field in test cases to categorise (e.g. `"easy"`, `"hard"`). Run subsets via vitest filtering.
3. **Caching** — download once, cache in `.cache/` (gitignored).
4. **Cost awareness** — estimate API cost before running. Log it in the README. Consider a `--dry-run` that validates fixtures without calling the LLM.

More from langchain-ai/deepagentsjs