configure-axis

$ npx mdskill add netlify/axis/configure-axis

---
name: configure-axis
description: Author AXIS (Agent Experience Index Score) scenarios and axis.config.json for a project. Use when the user asks to set up AXIS, add a scenario, write or edit axis.config.json, or evaluate an AI agent with AXIS.
---

# Configure AXIS

AXIS (Agent Experience Index Score) is a synthetic testing framework for AI agents. This skill teaches you to author the two files an AXIS user maintains:

1. **Scenarios** under `scenarios/` (or any path the config points at): one JSON file per task the agent will be asked to perform.
2. **`axis.config.json`** at the project root: which agents to run, where scenarios live, and what is shared across them.

## When to use this skill

Trigger phrases include "set up AXIS", "add an AXIS scenario", "write an axis.config.json", "evaluate my agent with AXIS", "score my agent on X".

Before authoring, do this:

1. Look for an existing `axis.config.json` (or `axis.config.{js,ts}`) at the project root. If one exists, read it; do not overwrite without confirmation.
2. List the existing scenarios directory if present. Match its naming and style.
3. If no config exists, suggest running `npx @netlify/axis init` first, or scaffold one yourself using the patterns below.

## Conceptual model

For each scenario, AXIS runs every configured agent against the same prompt in an isolated workspace, then scores each run on four dimensions and produces an HTML + JSON report.

- **Goal achievement** (default weight 0.4): did the agent satisfy the judge's checks?
- **Environment** (0.2): did filesystem / shell / network operations succeed reliably?
- **Service** (0.2): did external services (APIs, MCP servers) respond reliably?
- **Agent** (0.2): were the agent's decisions sound across every tool call?

Refer to the framework's output as the **AXIS Result**. The acronym expands to **Agent Experience Index Score**.

## Authoring a scenario

A scenario is a JSON file under the scenarios directory. The file path (without extension, relative to that directory) becomes the scenario's `key`. Only `name`, `prompt`, and `judge` are required.

> The annotated examples below are labeled `jsonc` for documentation only. They contain `// comments` and trailing commas. Real `.json` files do NOT support either. When copying these examples into output, strip every `//` comment line and every trailing comma. Plain JSON examples (in the Recipes section) are safe to copy verbatim.

Full annotated shape:

```jsonc
{
  // Display name shown in reports.
  "name": "Refactor utility module",

  // Set true to exclude this scenario from runs without deleting it.
  "skip": false,

  // Run before the agent starts. Two action types: run_script and copy.
  "setup": [
    { "action": "run_script", "command": "git init -q && git add -A && git commit -q -m init" },
    { "action": "copy", "match": "fixtures/sample-repo/**", "destination": "." },
  ],

  // The task. Be specific and verifiable.
  "prompt": "Refactor src/utils.js to split it into two files: src/strings.js and src/numbers.js. Update all import sites and ensure `npm test` still passes.",

  // Judge: either a single string OR an array of weighted checks.
  // Use the array form for multi-criterion scoring. Weight is optional;
  // remaining weight is distributed equally across unweighted entries.
  "judge": [
    { "check": "src/strings.js exists and exports the string utilities", "weight": 0.3 },
    { "check": "src/numbers.js exists and exports the numeric utilities", "weight": 0.3 },
    { "check": "All import sites are updated and reference the new files", "weight": 0.2 },
    { "check": "`npm test` passes after the change", "weight": 0.2 },
  ],

  // Run after the agent finishes, before scoring.
  "teardown": [{ "action": "run_script", "command": "rm -rf node_modules" }],

  // Only these agents run this scenario. Overrides the top-level agents list.
  "agents": ["claude-code"],

  // Skills passed to the agent under test. Each entry is a local path,
  // GitHub shorthand (owner/repo), or a full GitHub URL.
  "skills": ["./skills/repo-conventions", "anthropics/skills"],

  // MCP servers available to the agent for this scenario only. Merged
  // with the top-level mcp_servers from axis.config.json.
  "mcp_servers": {
    "filesystem": { "type": "stdio", "command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "."] },
  },

  // Per-scenario time/token limits. Defaults: 15 minutes, no token cap.
  "limits": { "time_minutes": 10, "tokens": 200000 },

  // Glob patterns (relative to the workspace) of files to capture into the
  // report after teardown. Merged with the top-level artifacts list.
  "artifacts": ["src/**/*.js", "test-output.log"],
}
```

### Judge: string vs weighted array

- **String** when one statement captures the whole pass/fail bar: `"The agent should have written summary.md with at least three sentences."`
- **Weighted array** when there are multiple checks the agent could partially satisfy. Each entry has a `check` (the statement) and an optional `weight` (sum your weights to 1.0; unweighted entries split the remainder evenly).

Write checks as **observable facts a third party could verify**, not vibes. "Agent wrote a file named `summary.md` with at least three sentences" beats "Agent did a good job summarizing".
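The weighting rule above can be sketched as follows. This is a minimal illustration, not AXIS's own code: the `JudgeCheck` type and the `resolveWeights` helper are hypothetical names, and rejecting explicit weights that sum above 1.0 is an assumption based on the guidance in this document.

```typescript
// Hypothetical type for illustration; the real schema lives in the AXIS repo.
type JudgeCheck = { check: string; weight?: number };

// Keep explicit weights as written, then split whatever is left of 1.0
// evenly across the entries that omit `weight`.
function resolveWeights(judge: JudgeCheck[]): number[] {
  const explicitTotal = judge.reduce((sum, c) => sum + (c.weight ?? 0), 0);
  if (explicitTotal > 1 + 1e-9) {
    // Assumed behavior: treat an over-allocated judge as an authoring error.
    throw new Error(`Explicit weights sum to ${explicitTotal}, which exceeds 1.0`);
  }
  const unweighted = judge.filter((c) => c.weight === undefined).length;
  const share = unweighted > 0 ? (1 - explicitTotal) / unweighted : 0;
  return judge.map((c) => c.weight ?? share);
}

const weights = resolveWeights([
  { check: "summary.md exists", weight: 0.5 },
  { check: "summary.md has at least three sentences" },
  { check: "summary.md mentions every file in docs/" },
]);
console.log(weights); // [0.5, 0.25, 0.25]
```

Running a check like this over a judge array before saving the scenario catches weights that exceed 1.0 early.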

### Lifecycle actions

Two action types are allowed in `setup` and `teardown`:

- `{ "action": "run_script", "command": "<shell command>" }`: runs with the agent's workspace as cwd. Available env vars include `AXIS_PHASE` (`setup`/`teardown`), `AXIS_WORKSPACE`, and `AXIS_OUTPUT` (a file path where the script can append markdown that will surface in the report).
- `{ "action": "copy", "match": "<glob>", "destination": "<workspace-relative path>" }`: copies files matching `match` (resolved relative to the config file) into `destination` (relative to the workspace). The path of each matched file relative to the longest non-glob prefix of `match` is preserved under `destination`.
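The path-preservation rule for `copy` can be sketched as follows. This is an illustration under stated assumptions, not the AXIS implementation: `nonGlobPrefix` and `destinationFor` are hypothetical helpers, the set of characters treated as glob syntax is assumed, and a `match` with no glob at all is assumed to use its parent directory as the prefix.

```typescript
// Characters that mark a path segment as a glob pattern (assumed set).
const GLOB_CHARS = /[*?[\]{}]/;

// Longest leading run of path segments in `match` with no glob characters.
function nonGlobPrefix(match: string): string {
  const segments = match.split("/");
  const firstGlob = segments.findIndex((s) => GLOB_CHARS.test(s));
  // Assumption: a literal path (no glob) uses its parent directory as prefix.
  const kept = firstGlob === -1 ? segments.slice(0, -1) : segments.slice(0, firstGlob);
  return kept.join("/");
}

// Where a matched file would land inside the workspace.
function destinationFor(match: string, matchedFile: string, destination: string): string {
  const prefix = nonGlobPrefix(match);
  const relative =
    prefix !== "" && matchedFile.startsWith(prefix + "/")
      ? matchedFile.slice(prefix.length + 1)
      : matchedFile;
  const base =
    destination === "." || destination === "" ? "" : destination.replace(/\/+$/, "") + "/";
  return base + relative;
}

console.log(destinationFor("fixtures/sample-repo/**", "fixtures/sample-repo/src/utils.js", "."));
// src/utils.js
console.log(destinationFor("fixtures/*/config.json", "fixtures/a/config.json", "configs"));
// configs/a/config.json
```

In the second call the glob starts at the second segment, so the `a/` directory survives under `configs/`.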

### Variants

When a scenario should run multiple times with small differences, define `variants`. The parent does not run by itself; each variant inherits all parent fields and may override `prompt`, `judge`, `setup`, `teardown`, `agents`, `skills`, `mcp_servers`, `limits`, `artifacts`, or `skip`.

```jsonc
{
  "name": "Summarize the docs",
  "prompt": "Summarize the contents of docs/ into summary.md.",
  "judge": "summary.md exists and is at least 200 words.",
  "variants": [
    { "name": "baseline" },
    { "name": "concise", "prompt": "Summarize the contents of docs/ into summary.md in fewer than 100 words." },
  ],
}
```

Each variant's key is `{scenarioKey}@{variantName}`. Variant names must match `/^[a-zA-Z0-9_-]+$/`.
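How variants expand into runs can be sketched as follows. The types are trimmed to `prompt` and `judge` for brevity, and `expandVariants` is a hypothetical helper that illustrates the inheritance and key rules described here; it is not the AXIS loader.

```typescript
// Hypothetical, trimmed-down types; the real schema has more fields.
type ScenarioFields = { prompt: string; judge: string };
type Variant = { name: string } & Partial<ScenarioFields>;
type Scenario = ScenarioFields & { name: string; variants?: Variant[] };
type ExpandedRun = ScenarioFields & { key: string };

const VARIANT_NAME = /^[a-zA-Z0-9_-]+$/;

function expandVariants(scenarioKey: string, scenario: Scenario): ExpandedRun[] {
  const parent: ScenarioFields = { prompt: scenario.prompt, judge: scenario.judge };
  // Without variants, the scenario itself is the only run.
  if (!scenario.variants || scenario.variants.length === 0) {
    return [{ key: scenarioKey, ...parent }];
  }
  // With variants, the parent does not run; each variant inherits and overrides.
  return scenario.variants.map((variant) => {
    if (!VARIANT_NAME.test(variant.name)) {
      throw new Error(`Invalid variant name: ${variant.name}`);
    }
    return {
      key: `${scenarioKey}@${variant.name}`,
      prompt: variant.prompt ?? parent.prompt,
      judge: variant.judge ?? parent.judge,
    };
  });
}

const runs = expandVariants("summarize-docs", {
  name: "Summarize the docs",
  prompt: "Summarize the contents of docs/ into summary.md.",
  judge: "summary.md exists and is at least 200 words.",
  variants: [
    { name: "baseline" },
    { name: "concise", prompt: "Summarize the contents of docs/ into summary.md in fewer than 100 words." },
  ],
});
console.log(runs.map((r) => r.key)); // ["summarize-docs@baseline", "summarize-docs@concise"]
```

Note that the `concise` variant in this example inherits the parent's 200-word judge while asking for fewer than 100 words, so a real scenario would override `judge` for that variant as well.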

## Authoring `axis.config.json`

Sits at the project root. Minimum viable file:

```json
{
  "scenarios": "./scenarios",
  "agents": ["claude-code"]
}
```

Full annotated shape:

```jsonc
{
  // Shown in report headers.
  "name": "My Project",

  // Where scenarios come from. Four forms:
  //  - "./path" (a directory, walked recursively for *.json scenarios)
  //  - ["./path1", "./scenarios/special.json"] (mix of dirs and single files)
  //  - Mixed with inline scenarios (objects with a required "key" field) when
  //    authoring axis.config.{js,ts} programmatically.
  //  - Git URLs ("https://github.com/owner/repo") cloned into .axis/remotes/
  //    and merged from their own axis.config.
  // Defaults to "./scenarios" when omitted.
  "scenarios": ["./scenarios", "https://github.com/netlify/agent-runner-orchestrator"],

  // Agents to evaluate. Strings are shorthand for { "agent": "<name>" }.
  "agents": [
    "claude-code",
    { "agent": "claude-code", "model": "claude-opus-4-6" },
    { "agent": "codex", "model": "gpt-5-codex" },
    {
      "agent": "echo", // custom adapter (see "adapters" below)
      "command": "./bin/my-agent", // CLI override for custom adapters
      "scenarios": ["hello-world"], // restrict this agent to a subset (scenario keys)
      "skills": ["./skills/my-conventions"], // per-agent skills
      "flags": { "debug": true, "max-turns": "5" },
    },
  ],

  // Agents used to score runs. When omitted, each agent judges itself
  // (it scores its own transcript). Otherwise the first entry whose adapter
  // name differs from the run's own agent is picked; if every entry matches,
  // the first entry is used. Order is precedence.
  "judging": {
    "agents": [{ "agent": "claude-code", "model": "claude-opus-4-7" }, "codex"],
  },

  // MCP servers shared across all agents and all scenarios. Two types:
  // stdio (local subprocess) and http (remote endpoint).
  "mcp_servers": {
    "filesystem": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."],
      "env": { "DEBUG": "1" },
    },
    "search": {
      "type": "http",
      "url": "https://mcp.example.com/search",
      "headers": { "Authorization": "Bearer ${SEARCH_TOKEN}" },
    },
  },

  // Skills shared across all agents. Each entry is one of:
  //  - local path: "./skills/my-skill"
  //  - GitHub shorthand: "anthropics/skills"
  //  - GitHub URL: "https://github.com/anthropics/skills"
  "skills": ["./skills/repo-conventions"],

  // Glob patterns (relative to each scenario's workspace) of files captured
  // into the report after teardown. Merged with per-scenario artifacts.
  "artifacts": ["./*.md", "src/**/*.js"],

  // Custom adapters. Keys are adapter names referenced by "agents" entries;
  // values are paths (relative to this config) to JS/TS modules that export
  // an AgentAdapter as default or as `adapter`.
  "adapters": { "echo": "./adapters/echo.ts" },

  // Extra env vars allowed through to agent processes. PATH, HOME, and the
  // default agent API keys (ANTHROPIC_API_KEY, CODEX_API_KEY, GEMINI_API_KEY)
  // pass through automatically; entries here are merged on top.
  "env": ["GITHUB_TOKEN", "MY_FEATURE_FLAG"],

  // Run-wide settings.
  "settings": {
    "concurrency": 8,
    "limits": {
      "run": { "time_minutes": 60, "tokens": 5000000 },
      "scenario": { "time_minutes": 10, "tokens": 200000 },
    },
    "remotes": { "maxDepth": 1 },
  },

  // CLI-only lifecycle hooks. beforeAll runs once before any scenarios start;
  // afterAll runs once after every scenario is scored. Scripts execute with
  // cwd set to the config directory. afterAll receives AXIS_REPORT_DIR,
  // AXIS_TOTAL, AXIS_COMPLETED, AXIS_FAILED, AXIS_DURATION_MS as env vars.
  // These are not fired by the programmatic run() API.
  "beforeAll": [{ "action": "run_script", "command": "npm install --silent" }],
  "afterAll": [{ "action": "run_script", "command": "echo Report at $AXIS_REPORT_DIR" }],
}
```

### Built-in adapters

- `claude-code`: Anthropic Claude Code CLI. Requires `ANTHROPIC_API_KEY`.
- `codex`: OpenAI Codex CLI. Requires `CODEX_API_KEY`.
- `gemini`: Google Gemini CLI. Requires `GEMINI_API_KEY`.

Any other name in `agents[].agent` must be declared in the `adapters` map and point to a module that exports an `AgentAdapter`.

### Judging precedence

For each run, AXIS finds a judge by scanning `judging.agents` in order and picking the first entry whose adapter name differs from the agent being scored. If every entry matches, the first entry is used. When `judging` is omitted, the run's own agent judges itself.
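The selection rule above can be sketched as follows. `AgentRef`, `adapterName`, and `pickJudge` are hypothetical names used only to illustrate the precedence; the model names are taken from the examples in this document.

```typescript
// An agent entry is either a shorthand string or an object with an `agent` name.
type AgentRef = string | { agent: string; model?: string };

const adapterName = (ref: AgentRef): string => (typeof ref === "string" ? ref : ref.agent);

function pickJudge(runAgent: AgentRef, judgingAgents?: AgentRef[]): AgentRef {
  // No judging config: the run's own agent judges itself.
  if (!judgingAgents || judgingAgents.length === 0) return runAgent;
  const own = adapterName(runAgent);
  // First entry with a different adapter wins; otherwise fall back to the first entry.
  return judgingAgents.find((j) => adapterName(j) !== own) ?? judgingAgents[0];
}

const judges: AgentRef[] = [{ agent: "claude-code", model: "claude-opus-4-7" }, "codex"];

console.log(pickJudge("claude-code", judges)); // "codex"
console.log(pickJudge("codex", judges)); // { agent: "claude-code", model: "claude-opus-4-7" }
console.log(pickJudge("gemini")); // "gemini"
```

Only the adapter name is compared, so two `claude-code` entries with different models still count as the same agent for this purpose.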

## Recipes

### Minimal scenario

```json
{
  "name": "Hello world",
  "prompt": "Say hello.",
  "judge": "The agent should say hello."
}
```

### Realistic scenario with setup, weighted judge, and teardown

```json
{
  "name": "Debug and fix a broken script",
  "setup": [
    {
      "action": "run_script",
      "command": "mkdir -p /tmp/demo && echo 'function add(a, b) { return a - b; }' > /tmp/demo/add.js"
    }
  ],
  "prompt": "There is a JavaScript file at /tmp/demo/add.js with a bug. Run it, fix it, and verify the fix.",
  "judge": [
    { "check": "Agent ran the script and observed the wrong output", "weight": 0.25 },
    { "check": "Agent identified the subtraction-instead-of-addition bug", "weight": 0.25 },
    { "check": "Agent fixed the bug in the file", "weight": 0.25 },
    { "check": "Agent re-ran the script and confirmed correct output", "weight": 0.25 }
  ],
  "teardown": [{ "action": "run_script", "command": "rm -rf /tmp/demo" }]
}
```

### Multi-agent comparison config

```json
{
  "scenarios": "./scenarios",
  "agents": [
    { "agent": "claude-code", "model": "claude-sonnet-4-6" },
    { "agent": "claude-code", "model": "claude-opus-4-6" },
    "codex"
  ],
  "judging": {
    "agents": [{ "agent": "claude-code", "model": "claude-opus-4-7" }]
  }
}
```

### Custom adapter wiring

Module `adapters/echo.ts`:

```ts
import { createAgentAdapter } from "@netlify/axis";

export default createAgentAdapter<{ stdout: string }>({
  name: "echo",
  resolveCommand: () => ({ command: "echo", prefixArgs: [] }),
  buildArgs: (input) => [input.prompt],
  initialState: () => ({ stdout: "" }),
  streamConfig: {
    mode: "aggregate",
    onChunk: (chunk, ctx) => {
      ctx.state.stdout += chunk;
    },
  },
  getResult: (ctx) => ({ result: ctx.state.stdout.trim() || null }),
});
```

`axis.config.json`:

```json
{
  "adapters": { "echo": "./adapters/echo.ts" },
  "scenarios": "./scenarios",
  "agents": [{ "agent": "echo" }]
}
```

## Reading AXIS reports

When asked to interpret a report, compare reports, identify regressions, or explain a score, navigate this structure:

- `.axis/reports/{reportId}/report.json` is the run manifest. Top-level fields: `version`, `reportId`, `timestamp`, `durationMs`, `summary`, `results[]`.
- `.axis/reports/{reportId}/scenarios/{scenarioKey}/{agentName}.json` is the per-run detail.
- `.axis/baselines/{name}.json` is a saved baseline (default name: `main`). Entries contain `axisScore`, `goalAchievement`, `environment`, `service`, `agent`.

### Per-result shape (inside `results[]`)

- `scenarioKey`, `scenarioName`, `agentName`
- `durationMs`, `exitCode`, `tokenUsage: { input, output, cacheReadInput }`
- `score.axisScore` (0-100, composite)
- `score.goalAchievement.{score, criteria[]}` where each criterion has `check`, `weight`, `score`, `rationale`
- `score.environment.{score, dimensions, audits[]}` with `dimensions: { success, speed, weight, relevance, necessity }`
- `score.service.{score, dimensions, audits[]}` (same shape)
- `score.agent.{score, dimensions, audits[]}` (same shape, but the agent dimension audits every interaction, not just agent-tagged ones)

### What each dimension measures

- **Goal achievement** (default weight 0.4): LLM judge scores the agent against the scenario's `judge` checks. The only dimension that depends on the prompt; the other three are intrinsic.
- **Environment** (0.2): execution reliability of filesystem, shell, and network operations. Did `ls` / `cat` / `bash` / `fetch` calls succeed cleanly? NOT about whether the output was useful.
- **Service** (0.2): execution reliability of external service interactions (MCP servers, APIs). Same idea: did the service respond cleanly?
- **Agent** (0.2): decision quality across every interaction (every tool call is an agent choice, including a plain `ls`). Sub-dimensions: success (0.1), speed (0.1), weight (0.2), relevance (0.2), necessity (0.4). Necessity is the biggest signal: redundant calls are penalized hard.

Speed is always heuristic (threshold buckets per category), never LLM-evaluated. All other sub-dimensions use an LLM judge.

### Composite formula

`axisScore = goalAchievement * w_goal + environment * w_env + service * w_svc + agent * w_agent`

Default weights `{ goal_achievement: 0.4, environment: 0.2, service: 0.2, agent: 0.2 }`. Override in `axis.config.json` under `settings.scoring_weights`.
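The composite formula can be sketched as follows. `axisScore` here is a hypothetical helper that simply applies the formula and default weights stated above; the input scores are made-up numbers for illustration.

```typescript
type DimensionScores = { goalAchievement: number; environment: number; service: number; agent: number };
type ScoringWeights = { goal_achievement: number; environment: number; service: number; agent: number };

// Default weights as documented above.
const DEFAULT_WEIGHTS: ScoringWeights = { goal_achievement: 0.4, environment: 0.2, service: 0.2, agent: 0.2 };

function axisScore(scores: DimensionScores, weights: ScoringWeights = DEFAULT_WEIGHTS): number {
  return (
    scores.goalAchievement * weights.goal_achievement +
    scores.environment * weights.environment +
    scores.service * weights.service +
    scores.agent * weights.agent
  );
}

// Illustrative inputs: 90*0.4 + 80*0.2 + 70*0.2 + 60*0.2 = 36 + 16 + 14 + 12
const composite = axisScore({ goalAchievement: 90, environment: 80, service: 70, agent: 60 });
console.log(composite.toFixed(1)); // "78.0"
```

Because goal achievement carries twice the weight of any other dimension, a 10-point drop there costs 4 composite points, versus 2 for the same drop elsewhere.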

### Calibration

All categories use log-normal CDF mapping with median=0.5, sigma=0.4:

- raw 0.5 → 50
- raw 0.8 → 88
- raw 0.985 → 96

So even "perfect" runs typically cap around 95-99. You will not see a clean 100 unless every interaction is flawless across every sub-dimension; treat 95+ as a top-band result.
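The mapping can be approximated with a few lines of code. This is a sketch under assumptions: it uses the textbook log-normal CDF with the stated median and sigma, plus a standard polynomial approximation of `erf`. The function names are hypothetical, and the actual AXIS implementation may round or clamp differently.

```typescript
// Abramowitz and Stegun approximation of erf (formula 7.1.26), error below 2e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Log-normal CDF scaled to 0-100: Phi((ln(raw) - ln(median)) / sigma) * 100.
function calibrate(raw: number, median = 0.5, sigma = 0.4): number {
  if (raw <= 0) return 0;
  const z = (Math.log(raw) - Math.log(median)) / sigma;
  return 100 * 0.5 * (1 + erf(z / Math.SQRT2));
}

console.log(calibrate(0.5).toFixed(1)); // ≈ 50
console.log(calibrate(0.8).toFixed(1)); // ≈ 88
console.log(calibrate(0.985).toFixed(1)); // ≈ 95.5
```

Under this sketch the curve is steep around the median and flattens near the top, so small raw improvements matter much more in the middle of the range than above 0.9.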

### Finding a regression vs a baseline

1. Match on `scenarioKey` (variants like `foo@bar` count as distinct scenarios).
2. Subtract: `delta = report.axisScore - baseline.axisScore`.
3. The dimension that dropped the most indicates the failure mode:
   - Goal dropped: the agent did not satisfy the judge checks. Read `criteria[].rationale` to see which checks failed.
   - Environment dropped: shell, filesystem, or network operations failed. Read `environment.audits[]` and look for `success < 1`.
   - Service dropped: external service interactions flaked.
   - Agent dropped: redundant, unnecessary, or low-relevance tool calls. Look at `agent.dimensions.necessity` first.
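The comparison steps above can be sketched as follows. The flat `ScoreSummary` shape is an assumption (the baseline entry fields are listed above, but their exact layout is not), `findRegression` is a hypothetical helper, and the numbers are illustrative rather than taken from a real report.

```typescript
type DimensionKey = "goalAchievement" | "environment" | "service" | "agent";
// Assumed flat shape: one number per dimension plus the composite.
type ScoreSummary = { axisScore: number } & Record<DimensionKey, number>;

const DIMENSIONS: DimensionKey[] = ["goalAchievement", "environment", "service", "agent"];

function findRegression(baseline: ScoreSummary, current: ScoreSummary) {
  const delta = current.axisScore - baseline.axisScore;
  // Find the dimension with the most negative change.
  let worst: DimensionKey = DIMENSIONS[0];
  for (const d of DIMENSIONS) {
    if (current[d] - baseline[d] < current[worst] - baseline[worst]) worst = d;
  }
  return { delta, worstDimension: worst, worstDelta: current[worst] - baseline[worst] };
}

// Illustrative values for one scenarioKey, not real report data.
const baselineEntry: ScoreSummary = { axisScore: 84, goalAchievement: 90, environment: 85, service: 80, agent: 75 };
const currentEntry: ScoreSummary = { axisScore: 53, goalAchievement: 60, environment: 80, service: 20, agent: 45 };

console.log(findRegression(baselineEntry, currentEntry));
// { delta: -31, worstDimension: "service", worstDelta: -60 }
```

Once the worst dimension is known, open the per-run detail file for that scenario and agent and read the matching `audits[]` or `criteria[]` entries.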

### Signals to cite when writing analyses

Cite numbers from the actual report file. Do not paraphrase or guess.

- "AXIS Result dropped from 84 to 53 (a 31-point regression)"
- "Service success collapsed from 0.95 to 0.30"
- "Agent necessity was 0.32: roughly 68% of tool calls were judged unnecessary"
- "5 of 7 service interactions returned errors per the audits[] entries"

If a value is not in the file, do not invent it. Open the file and read it.

## Rules you must follow

1. Refer to the score as the **AXIS Result**, never "AXIS Score" (which reads as "score score"). The acronym is **Agent Experience Index Score**, never "eXperience".
2. Do not use em dashes in any prose, comment, or judge check you author. Use a comma, semicolon, colon, parenthesis, or a new sentence instead.
3. Do not invent fields. The authoritative schemas live at `src/types/scenario.ts` and `src/types/config.ts` in the netlify/axis repo. If you are unsure whether a field exists, omit it and tell the user where to look.
4. Judge checks must be specific and verifiable. Prefer "Agent wrote a file named X with property Y" over "Agent did well at the task".
5. The default per-scenario time limit is 15 minutes. Only override it when the task warrants a different ceiling.
6. For `skills` entries, pick the simplest form that works: prefer a local path during development, GitHub shorthand (`owner/repo`) for public skills, full URLs only when needed.
7. When `judge` is a weighted array, sum your weights to 1.0 (or leave some unweighted to split the remainder; do not exceed 1.0).
8. Variant names match `/^[a-zA-Z0-9_-]+$/`. Scenario keys are derived from the file path; do not set `key` for file-based scenarios.
9. `beforeAll` and `afterAll` only fire from the CLI. Do not rely on them when the user runs AXIS programmatically via `run()`.
10. In an isolated AXIS scenario workspace, do NOT run verification commands like `tsc`, `node -e "require(...)"`, `git diff`, `git status`, or `npm install` to check your authored files. The workspace is intentionally minimal: no `node_modules`, no git history. These commands fail with environment errors that tank the environment dimension. Write the file correctly against the schema. The AXIS judge inspects your output directly.
11. When asked to make a targeted edit (add a field, fix a single bug), edit ONLY what the prompt specifies. Do not reorganize, reformat, or add unrelated fields. Preserve every field the prompt did not name. The judge often checks "original X and Y fields are preserved unchanged".
12. Minimize unnecessary tool calls. Every tool call is evaluated as an agent decision; a redundant `ls`, a repeated `cat` of the same file, or an exploratory `find` that you do not act on will tank the agent dimension via the `necessity` sub-dimension. Read each file you need once. Write each edit once. Stop when the task is done.
13. Field-name discipline. AXIS uses snake_case in all JSON config fields: `mcp_servers` not `mcpServers`, `time_minutes` not `timeMinutes`, `run_script` not `runScript` or `shell`. The deprecated alias `rubric` exists for backwards compat; prefer `judge`. Other commonly-invented names that are WRONG: `criteria`, `success_criteria`, `expected`, `tasks`, `evaluators`, `models`, `timeout`, `maxTokens`, `tokenLimit`, `timeoutMinutes`.
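
The field-name rule can be turned into a quick self-check before writing a file. This is a hypothetical sketch: `lintKeys` is not part of AXIS, it inspects only the keys of the object it is given, and its lists contain only the names called out in rule 13. The validator in the netlify/axis repo remains the source of truth.

```typescript
// Names rule 13 maps to a documented field.
const RENAMES = new Map<string, string>([
  ["mcpServers", "mcp_servers"],
  ["timeMinutes", "time_minutes"],
  ["rubric", "judge"], // deprecated alias, still accepted
]);

// Names rule 13 lists as commonly invented and wrong.
const INVENTED = new Set([
  "criteria", "success_criteria", "expected", "tasks", "evaluators",
  "models", "timeout", "maxTokens", "tokenLimit", "timeoutMinutes",
]);

// Shallow check: call it on the scenario object and again on nested objects such as `limits`.
function lintKeys(obj: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(obj)) {
    const preferred = RENAMES.get(key);
    if (preferred !== undefined) {
      problems.push(`"${key}": prefer "${preferred}"`);
    } else if (INVENTED.has(key)) {
      problems.push(`"${key}": not a documented AXIS field`);
    }
  }
  return problems;
}

const draft = { name: "Hello world", prompt: "Say hello.", rubric: "The agent should say hello.", timeout: 5 };
console.log(lintKeys(draft));
// ['"rubric": prefer "judge"', '"timeout": not a documented AXIS field']
```

A clean result from this sketch does not prove the file is valid; it only rules out the specific mistakes named in rule 13.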

## Validation

After authoring or editing, ask the user to run:

```
npx @netlify/axis run --help
```

to confirm the CLI is installed, then:

```
npx @netlify/axis run
```

The config loader validates the file on load and prints actionable errors (missing required fields, unknown adapter names, malformed limits, invalid skill sources, etc.). Fix any reported errors before declaring the work done.

## Reference

- Documentation site: https://axis.run
- Scenario schema: `src/types/scenario.ts` in the netlify/axis repo
- Config schema: `src/types/config.ts` in the netlify/axis repo
- Validator (source of truth for accepted shapes): `src/config/validator.ts`

## Installing this skill

To make this skill available to your AI tool in a project, drop it under `.claude/skills/`:

```
mkdir -p .claude/skills/configure-axis
curl -fsSL https://raw.githubusercontent.com/netlify/axis/main/skills/configure-axis/SKILL.md \
  -o .claude/skills/configure-axis/SKILL.md
```