rewardkit

Name: rewardkit
Author: harbor-framework/harbor
$npx mdskill add harbor-framework/harbor/rewardkit
Write Harbor task verifiers using Reward Kit to define grading criteria and reward scores
Solve the problem of creating test criteria for agent tasks with programmatic checks and judges
Depends on Reward Kit, a Python package, and TOML/Python for criteria definitions
Evaluates task outputs using functions in checks.py and judge.toml for LLM/agent scoring
Generates a reward.json file with scores and logs for integration into Harbor workflows
SKILL.md
.github/skills/rewardkitView on GitHub ↗
---
name: rewardkit
description: Write Harbor task verifiers using Reward Kit. Use when creating or editing a 
  task's tests/ directory, adding grading criteria, setting up LLM/agent judges, or designing 
  verifiers that produce a reward score.
---

Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python 
package that turns a directory of criteria files into a reward score. Each criterion is a 
Python function call or a TOML judge file; folders become separate rewards.

## Setup in a Harbor task

Put criteria alongside `test.sh` in the task's `tests/` directory:

```
tests/
├── test.sh
├── checks.py         # programmatic criteria
└── judge.toml        # optional LLM/agent judge
```

`tests/test.sh`:
```bash
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests
```

This runs all criteria in `/tests/` against the workspace at `/app` and writes 
`/logs/verifier/reward.json`. Defaults match Harbor's conventions — no extra config needed.

If judge criteria need API keys, pass them through `task.toml`:
```toml
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```

Ask whether Reward Kit should run in the agent's shared environment or in a
separate verifier environment. Prefer a separate verifier environment when judge
prompts, grading dependencies, API keys, or clean-room checks should not be
available to the agent:

```toml
[verifier]
environment_mode = "separate"

[verifier.environment]
docker_image = "python:3.12-slim"
allow_internet = true
```

In separate mode, `tests/` is the verifier image build context and must provide
`/tests/test.sh` at runtime; Harbor does not upload `tests/` into the running
verifier container.

## Programmatic criteria

Call built-ins from any `.py` file in `tests/`:

```python
import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")
```

All criteria accept `weight` (default `1.0`) and `isolated` (default `False`, runs in 
overlayfs so side effects don't leak).

### Available built-ins

- **Files**: `file_exists`, `file_not_exists`, `file_contains`, `file_contains_regex`, 
  `file_matches`, `files_equal`, `diff_ratio`
- **Commands**: `command_succeeds`, `command_output_contains`, `command_output_matches`, 
  `command_output_matches_regex` (30s default timeout, optional `cwd`)
- **Data**: `json_key_equals`, `json_path_equals`, `csv_cell_equals`, `xlsx_cell_equals` 
  (needs `[office]` extra), `sqlite_query_equals`
- **HTTP**: `http_status_equals`, `http_response_contains`
- **Images**: `image_similarity`, `image_size_equals` (needs `[image]` extra)
- **Trajectory**: `trajectory_tool_used`, `trajectory_tool_not_used`, `trajectory_turn_count`

For extras, install with `uv tool install harbor-rewardkit[all]`.

## Custom criteria

Use the `@criterion` decorator. First parameter is always `workspace: Path`. Returns 
`bool` or `float`:

```python
from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    return (workspace / "output.txt").read_text().strip() != ""
```

Zero-parameter criteria auto-register. Criteria with extra args must be called via `rk`:

```python
@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    return len((workspace / "output.txt").read_text().splitlines()) >= n

rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
```

For criteria shared across reward subdirs, define with `shared=True` in a root-level file 
and call from subdirs.

## Judge criteria (LLM or agent-as-a-judge)

For subjective checks (quality, readability, edge cases), create a TOML file:

```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"   # LiteLLM model string
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0
```

Criterion types:
- `binary` — yes/no → 1.0 or 0.0
- `likert` — 1..points, normalized to [0, 1]
- `numeric` — min..max, normalized to [0, 1]

### Agent judges

Agent judges shell out to a CLI and can explore the filesystem:

```toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"
```

Slower and more expensive than LLM judges, but they can run commands and inspect files.

### Useful `[judge]` options

`timeout` (default 300), `reasoning_effort` (`low`|`medium`|`high`), `reference` (path to 
reference solution), `atif-trajectory` (evaluate the agent's trajectory), `weight`, 
`prompt_template` (custom prompt with `{criteria}` placeholder).

### Scoring aggregation

```toml
[scoring]
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7             # only for threshold
```

Only affects aggregation *within* this TOML file.

## Multi-reward tasks

Put criteria in subdirectories — each becomes a separate reward:

```
tests/
├── test.sh
├── correctness/
│   └── check.py
├── structure/
│   └── files_exist.py
└── quality/
    └── quality.toml
```

Produces:
```json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
```

## Output files

- `/logs/verifier/reward.json` — per-reward scores
- `/logs/verifier/reward-details.json` — per-criterion results, judge reasoning, errors

## Multi-step tasks

In a multi-step task, each step has its own `tests/` under
`steps/{name}/tests/`, and the verifier runs once per step. Reward Kit behaves
the same as in a single-step task: for each step it reads `/tests`, runs the
criteria against `/app`, and writes `/logs/verifier/reward.json` for that step.
Harbor then aggregates per-step results into a trial-level reward via
`multi_step_reward_strategy` in `task.toml` — aggregation happens *outside*
Reward Kit, so don't try to encode cross-step logic in your criteria.

A task-level `tests/` directory (at the task root) is uploaded to `/tests`
first, then the step's own `tests/` is layered on top (same-name files win).
Put shared helpers (common `checks.py` functions with `shared=True`, fixture
files, a fallback `test.sh`) at the task level, and step-specific criteria
under each step.

Multi-reward subdirectories still work *within* a step: `steps/foo/tests/`
can contain `correctness/`, `structure/`, `quality/` — each produces a
separate reward key for that step, and `multi_step_reward_strategy = "mean"`
averages each key across steps. Use `"final"` when the last step is an
end-to-end check whose rewards already represent the full task.

## When to reach for what

- **Use built-ins** for file existence, string matches, command output, JSON/CSV checks, 
  HTTP probes.
- **Use `@criterion`** when logic is task-specific but still programmatic.
- **Use LLM judges** for subjective quality dimensions (readability, correctness of prose).
- **Use agent judges** when the rubric requires exploring the filesystem or running code 
  (e.g. "does the test suite actually pass?").
- **Use subdirectories** when you want separate scores (correctness vs structure vs 
  quality) rather than one blended number.
- **Use `isolated=True`** for any criterion that runs mutating commands, so it doesn't 
  corrupt the workspace for other criteria.

## Working example

See `examples/tasks/reward-kit-example/` in the Harbor repo.