check-trajectories-workflow

Name: check-trajectories-workflow
Author: UKGovernmentBEIS/inspect_evals

$npx mdskill add UKGovernmentBEIS/inspect_evals/check-trajectories-workflow

Analyze agent trajectories to detect failures and ethical issues.

Identify external failures, formatting errors, and reward hacking.
Integrates Inspect Scout with custom scanner configurations.
Selects scanners based on user-specified log files.
Outputs structured analysis reports to the evaluation folder.

SKILL.md

.github/skills/check-trajectories-workflow

---
name: check-trajectories-workflow
description: Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
---

# Check Agent Trajectories

This workflow uses Inspect Scout to automatically analyze agent trajectories. It's faster than manual analysis but may miss nuanced issues.

## Workflow Steps

1. If the user has not indicated which log file to use, ask which log file or files they want the workflow run on. You can show them the most recent log file in `logs/` and ask whether that is the one they want.
2. Determine the name of the evaluation that produced the log file, and create a folder named `agent_artefacts/trajectory_analysis/<eval_name>`. If the folder already exists, append a number to the name.
3. Read `agent_artefacts/trajectory_analysis/inspect_scout/scanners.py` to see what the default scanners check for:
   a. **outcome_summary**: Brief summary of why the agent succeeded or failed.
   b. **external_failure**: The agent failed because of CAPTCHAs, rate limiting, network issues, or missing dependencies.
   c. **formatting_failure**: The agent failed because of incorrect answer formatting despite reaching the correct answer.
   d. **reward_hacking_success**: The agent succeeded through reward hacking or other unintended means.
   e. **ethical_refusal**: The agent failed because it refused on ethical or safety grounds.

   Additionally, if `agent_artefacts/trajectory_analysis/<eval_name><version>` already exists, check the latest version for any custom scanners and include those.
4. Tell the user what will be checked by default and ask if they want to check for anything else.
5. If the user wants additional checks, create an `eval_scanners.py` file under `agent_artefacts/trajectory_analysis/<eval_name>/`, and add Inspect Scout scanners that check for their requirements. Use the existing scanners in `agent_artefacts/trajectory_analysis/inspect_scout/scanners.py` and the [Inspect Scout documentation](https://meridianlabs-ai.github.io/inspect_scout/) as references. Also copy across any scanners found in an existing `<eval_name>` folder at the end of step 3.

   Each custom scanner should be wrapped in an `InspectEvalScanner` object and added to a `SCANNERS` list:

   ```python
   from scanners import InspectEvalScanner

   SCANNERS = [
       InspectEvalScanner(
           name="my_scanner",
           scanner_factory=my_scanner_function,
           invalidates_success=False,  # Set True if flagging invalidates a success
           invalidates_failure=True,  # Set True if flagging invalidates a failure
       ),
   ]
   ```

   Ask the user whether each scanner should invalidate successes, failures, both, or neither.
6. Check how many samples are in the log file with `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> --dry-run`. If more than 100 samples would be analysed, ask the user if they want to run all samples or a subset. Tell them that Inspect Evals guidance is to run at least 100 samples, and ask them if they want to run any more than that.
7. Provide the user the command they can use to run the scanners. If the user asks you to run it, you can do so.
   a. If there are no custom scanners: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results`, with an optional `--limit <num>` if only a subset is run.
   b. If custom scanners were created in step 5, add `-n <eval_name>` like so: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -n <eval_name> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results`

   The `-n <eval_name>` option automatically loads scanners from `agent_artefacts/trajectory_analysis/<eval_name>/eval_scanners.py`. Duplicate scanner names are detected and skipped with a warning.
8. Once the command is done, extract the results: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/extract_results.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
9. Analyze sample validity: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/analyze_validity.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
10. Review the extracted results and write a summary in `<eval_name>_<model_name>_ANALYSIS.md`. If the file already exists, ask whether the user wants it overwritten or a new, dated file created. For a dated file, use `<eval_name>_<model_name>_<date>_ANALYSIS.md`.
11. Tell the user the task is done.
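
Steps 2, 7, and 10 are mostly path and command bookkeeping, which a small helper can make concrete. The following is a minimal sketch, not part of the repository: the function names, the choice to start folder numbering at 2, and the ISO date format are all assumptions. Only the script path and the `-n`, `-o`, and `--limit` flags come from the steps above.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path

ROOT = Path("agent_artefacts/trajectory_analysis")
RUNNER = ROOT / "inspect_scout" / "run_all_scanners.py"


def analysis_dir(eval_name: str, root: Path = ROOT) -> Path:
    """Step 2: return <eval_name>, or <eval_name>2, <eval_name>3, ... if taken."""
    candidate = root / eval_name
    suffix = 2  # assumption: the first numbered folder is <eval_name>2
    while candidate.exists():
        candidate = root / f"{eval_name}{suffix}"
        suffix += 1
    return candidate


def scanner_command(
    log_file: str,
    out_dir: Path,
    eval_name: str | None = None,
    limit: int | None = None,
) -> list[str]:
    """Step 7: build the run_all_scanners.py command as an argument list."""
    cmd = ["uv", "run", "python", RUNNER.as_posix(), log_file]
    if eval_name is not None:
        # Only needed when eval_scanners.py exists for this evaluation.
        cmd += ["-n", eval_name]
    cmd += ["-o", (out_dir / "scout_results").as_posix()]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


def report_name(
    eval_name: str,
    model_name: str,
    dated: bool = False,
    today: date | None = None,
) -> str:
    """Step 10: name the summary file, optionally with a date component."""
    if not dated:
        return f"{eval_name}_{model_name}_ANALYSIS.md"
    stamp = (today or date.today()).isoformat()  # assumption: YYYY-MM-DD
    return f"{eval_name}_{model_name}_{stamp}_ANALYSIS.md"
```

Returning the command as a list keeps it easy to show to the user (joined with spaces) or to pass to `subprocess.run` if they ask for it to be executed.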
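
The `invalidates_success` and `invalidates_failure` flags from step 5 suggest how the validity analysis in step 9 might treat a flagged sample. The sketch below is one plausible reading, not the actual logic of `analyze_validity.py`: `ScannerFlags` is a hypothetical stand-in for `InspectEvalScanner`, and the flag values assigned to the default scanner names are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScannerFlags:
    """Hypothetical stand-in holding only the two invalidation flags."""

    name: str
    invalidates_success: bool = False
    invalidates_failure: bool = False


def is_valid(succeeded: bool, flagged: set[str], scanners: list[ScannerFlags]) -> bool:
    """Return False if any flagged scanner invalidates this sample's outcome."""
    for s in scanners:
        if s.name not in flagged:
            continue
        if succeeded and s.invalidates_success:
            return False  # e.g. a success achieved through reward hacking
        if not succeeded and s.invalidates_failure:
            return False  # e.g. a failure caused by a network outage
    return True


# Assumed flag values, for illustration only.
SCANNERS = [
    ScannerFlags("reward_hacking_success", invalidates_success=True),
    ScannerFlags("external_failure", invalidates_failure=True),
    ScannerFlags("outcome_summary"),  # informational, invalidates nothing
]
```

Under this reading, a scanner that sets neither flag only annotates a sample, which is why it helps to ask the user about both flags for every custom scanner.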
