check-trajectories-workflow
$ npx mdskill add UKGovernmentBEIS/inspect_evals/check-trajectories-workflow
Analyze agent trajectories to detect failures and ethical issues.
- Identifies external failures, formatting errors, and reward hacking.
- Integrates Inspect Scout with custom scanner configurations.
- Selects scanners based on user-specified log files.
- Outputs structured analysis reports to the evaluation folder.
SKILL.md
.github/skills/check-trajectories-workflow
---
name: check-trajectories-workflow
description: Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
---
# Check Agent Trajectories
This workflow uses Inspect Scout to automatically analyze agent trajectories. It's faster than manual analysis but may miss nuanced issues.
## Workflow Steps
1. If the user gives no indication of which log file to use, ask which log file or files they want the workflow run on. You can show them the most recent log file in `logs/` and ask whether that is the one they want.
2. Determine the name of the evaluation the log file belongs to, and create a folder named `agent_artefacts/trajectory_analysis/<eval_name>`. Add a number to the end if the folder already exists.
3. Read `agent_artefacts/trajectory_analysis/inspect_scout/scanners.py` to see what the default scanners check for:
a. **outcome_summary**: Brief summary of why the agent succeeded or failed.
b. **external_failure**: Failed due to CAPTCHAs, rate limiting, network issues, or missing dependencies.
c. **formatting_failure**: Failed due to incorrect answer formatting despite a correct answer.
d. **reward_hacking_success**: Succeeded through reward hacking or unintended means.
e. **ethical_refusal**: Failed because the agent refused on ethical or safety grounds.
Additionally, if `agent_artefacts/trajectory_analysis/<eval_name><version>` exists already, check for any scanners contained in the latest version and include those.
4. Tell the user what will be checked by default and ask if they want to check for anything else.
5. If the user wants additional checks, create an `eval_scanners.py` file under `agent_artefacts/trajectory_analysis/<eval_name>/`, and add Inspect Scout scanners that check for their requirements. Use the existing scanners in `agent_artefacts/trajectory_analysis/inspect_scout/scanners.py` and the [Inspect Scout documentation](https://meridianlabs-ai.github.io/inspect_scout/) as references. Also copy across any scanners found in the existing `<eval_name>` folder at the end of Step 3.
Each custom scanner should be wrapped in an `InspectEvalScanner` object and added to a `SCANNERS` list:
```python
from scanners import InspectEvalScanner

# `my_scanner_function` is a placeholder for a scanner you define yourself.
SCANNERS = [
    InspectEvalScanner(
        name="my_scanner",
        scanner_factory=my_scanner_function,
        invalidates_success=False,  # Set True if flagging invalidates a success
        invalidates_failure=True,  # Set True if flagging invalidates a failure
    ),
]
```
Ask the user whether each scanner should invalidate successes, failures, both, or neither.
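To illustrate what the two flags express, the sketch below shows one way flagged scanners could be combined into a per-sample validity verdict. `ScannerFlags` and `sample_is_valid` are hypothetical stand-ins, not the real `InspectEvalScanner` class or the logic in `analyze_validity.py`. Only the two flag names come from the example above, and the flag values given to the two default scanner names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScannerFlags:
    """Hypothetical stand-in that carries only the two invalidation flags."""

    name: str
    invalidates_success: bool
    invalidates_failure: bool


def sample_is_valid(succeeded: bool, flagged: set, scanners: list) -> bool:
    """Return False if any flagged scanner invalidates this sample's outcome."""
    for scanner in scanners:
        if scanner.name not in flagged:
            continue  # this scanner did not flag the sample
        if succeeded and scanner.invalidates_success:
            return False
        if not succeeded and scanner.invalidates_failure:
            return False
    return True


# Assumed flag settings, for illustration only.
scanners = [
    ScannerFlags("reward_hacking_success", invalidates_success=True, invalidates_failure=False),
    ScannerFlags("external_failure", invalidates_success=False, invalidates_failure=True),
]

# A success flagged for reward hacking no longer counts as a valid success.
hacked_success = sample_is_valid(True, {"reward_hacking_success"}, scanners)
# A failure flagged for an external problem no longer counts as a valid failure.
blocked_failure = sample_is_valid(False, {"external_failure"}, scanners)
# A success flagged only by a failure-invalidating scanner stays valid.
untouched_success = sample_is_valid(True, {"external_failure"}, scanners)
print(hacked_success, blocked_failure, untouched_success)
```

Under this reading, a scanner with both flags set to `False` is purely informational: it can flag a sample without changing whether the sample counts.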
6. Check how many samples are in the log file with `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> --dry-run`. If more than 100 samples would be analysed, ask the user if they want to run all samples or a subset. Tell them that Inspect Evals guidance is to run at least 100 samples, and ask them if they want to run any more than that.
7. Give the user the command they can use to run the scanners. If the user asks you to run it, you can do so.
a. If no custom scanners: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results` with an optional `--limit <num>` if a subset is run.
b. If custom scanners were created in step 5, add `-n <eval_name>` like so: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -n <eval_name> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
The `-n <eval_name>` option automatically loads scanners from `agent_artefacts/trajectory_analysis/<eval_name>/eval_scanners.py`. Duplicate scanner names are detected and skipped with a warning.
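The two command variants differ only in the optional `-n` and `--limit` arguments, so they can be assembled from one helper. The following is a minimal sketch: `build_scan_command` is a hypothetical helper, the log file and eval name are placeholders, and combining `-n` with `--limit` in one invocation is an assumption, since the text above only shows `--limit` on the variant without custom scanners.

```python
import shlex
from typing import List, Optional

RUNNER = "agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py"


def build_scan_command(
    log_file: str,
    eval_name: str,
    custom_scanners: bool = False,
    limit: Optional[int] = None,
) -> List[str]:
    """Assemble the run_all_scanners.py invocation as an argument list."""
    output_dir = f"agent_artefacts/trajectory_analysis/{eval_name}/scout_results"
    cmd = ["uv", "run", "python", RUNNER, log_file]
    if custom_scanners:
        # -n makes the runner load <eval_name>/eval_scanners.py as well.
        cmd += ["-n", eval_name]
    cmd += ["-o", output_dir]
    if limit is not None:
        # Only added when the user chose to run a subset of samples.
        cmd += ["--limit", str(limit)]
    return cmd


# Placeholder log file and eval name, for illustration only.
default_cmd = build_scan_command("logs/example.eval", "my_eval")
custom_cmd = build_scan_command("logs/example.eval", "my_eval", custom_scanners=True, limit=100)
print(shlex.join(default_cmd))
print(shlex.join(custom_cmd))
```

Building the command as a list and joining it with `shlex.join` keeps log file paths that contain spaces correctly quoted when the command is shown to the user.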
8. Once the command is done, extract the results: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/extract_results.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
9. Analyze sample validity: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/analyze_validity.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
10. Review the extracted results and create a summary in `<eval_name>_<model_name>_ANALYSIS.md`. If the file already exists, check to see if the user wants it overwritten or a new file generated by date. If the user wants it generated by date, use `<eval_name>_<model_name>_<date>_ANALYSIS.md`.
11. Tell the user the task is done.
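The folder naming in step 2 and the report naming in step 10 can be sketched as two small helpers. This is an illustration under stated assumptions, not part of the repository: `next_analysis_dir` and `report_name` are hypothetical names, the numeric suffix is assumed to start at 2 with no separator, and the date is assumed to be written in ISO format.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


def next_analysis_dir(root: Path, eval_name: str) -> Path:
    """Return root/<eval_name>, or root/<eval_name><n> for the first unused n."""
    candidate = root / eval_name
    n = 2  # assumed starting suffix; the workflow only says "add a number"
    while candidate.exists():
        candidate = root / f"{eval_name}{n}"
        n += 1
    return candidate


def report_name(eval_name: str, model_name: str, on: Optional[date] = None) -> str:
    """Build the report file name, inserting a date only when one is given."""
    parts = [eval_name, model_name]
    if on is not None:
        parts.append(on.isoformat())  # assumed date format
    return "_".join(parts) + "_ANALYSIS.md"


# Demonstrate on a temporary directory rather than the real repository tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = next_analysis_dir(root, "my_eval")
    first.mkdir(parents=True)
    second = next_analysis_dir(root, "my_eval")

print(first.name, second.name)
print(report_name("my_eval", "my_model"))
print(report_name("my_eval", "my_model", on=date(2025, 1, 2)))
```

Passing the date in explicitly, rather than reading the clock inside `report_name`, keeps the helper deterministic and leaves the choice between overwriting and a dated file to the caller, as step 10 requires.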